Article

Dynamic Graph Transformer with Spatio-Temporal Attention for Streamflow Forecasting

1 College of Water Resources and Hydropower, Sichuan University, Chengdu 610065, China
2 College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
3 Xinjiang Association for Science and Technology, Urumqi 830054, China
4 Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi 830011, China
* Author to whom correspondence should be addressed.
Hydrology 2025, 12(12), 322; https://doi.org/10.3390/hydrology12120322
Submission received: 24 October 2025 / Revised: 30 November 2025 / Accepted: 2 December 2025 / Published: 8 December 2025

Abstract

Accurate streamflow forecasting is crucial for water resources management and flood mitigation, yet it remains challenging due to the complex dynamics of hydrological systems. Conventional data-driven approaches often struggle to effectively capture spatio-temporal evolution characteristics, particularly the dynamic interdependencies among streamflow gauges. This study proposes a novel deep learning architecture, termed DynaSTG-Former. It employs a multi-channel dynamic graph constructor to adaptively integrate three spatial dependency patterns: physical topology, statistical correlation, and trend similarity. A dual-stream temporal predictor is designed to collaboratively model long-range dependencies and local transient features. In an empirical study within the Delaware River Basin, the model demonstrated exceptional performance in multi-step-ahead forecasting (12, 36, and 72 h). It achieved basin-scale Kling–Gupta Efficiency (KGE) values of 0.961, 0.956, and 0.855, significantly outperforming baseline models such as LSTM, GRU, and Transformer. Ablation studies confirmed the core contribution of the dynamic graph module, with the Pearson correlation graph playing a dominant role in error reduction. The results indicate that DynaSTG-Former effectively enhances the accuracy and stability of streamflow forecasts and demonstrates strong robustness at the basin scale. It thus provides a reliable tool for precision water management.

1. Introduction

As a keystone component of the hydrological cycle, streamflow is critical for a wide range of human endeavors, including agricultural production, public water supply, and urban infrastructure [1,2]. Consequently, streamflow forecasting is indispensable for water resources management, flood disaster monitoring, and water-related risk assessment [3]. However, achieving accurate streamflow forecasts remains challenging. This is primarily due to the inherent complexity of hydrological systems, whose dynamics are influenced by a combination of factors such as precipitation patterns, reservoir operations, land-use changes, and climate variability [4].
Contemporary streamflow forecasting employs two methodological frameworks: physically based hydrological models and data-driven models. Physically based models simulate water cycle processes through mathematical representations of water movement. Their strength lies in their physical interpretability; however, they typically demand extensive input data and inevitably incorporate empirical approximations that can introduce uncertainty [5]. In contrast, data-driven models establish forecasting relationships directly from historical data sequences. While they do not explicitly represent physical processes, they can achieve high forecasting accuracy through statistical learning. The proliferation of large-scale sensor networks globally provides the foundational infrastructure that empowers data-driven hydrological models. This infrastructure, exemplified by the United States’ USGS National Water Information System (NWIS) encompassing water resources data collected at approximately 1.9 million sites in all 50 states (https://waterdata.usgs.gov/nwis/, accessed on 16 July 2025), Portugal’s over-20-year operational SNIRH network (https://data.europa.eu/en/, accessed on 3 December 2025), and China’s integrated observational data control system (IODCS) [6], enables continuous, high-frequency streamflow monitoring, delivering sustained data streams essential for robust model training and operational deployment. Under specific conditions, data-driven approaches can outperform purely physics-based models in both forecasting accuracy and computational efficiency. This is particularly true in basins with adequate observational data, where complex nonlinear processes pose significant challenges to traditional parametrization methods [7,8].
The advancement of data-driven streamflow forecasting is centered on innovations in time-series modeling methodologies and the enhancement of adaptive capabilities. Research in this field originated in the 1970s, when Box et al. established a systematic framework for linear time-series forecasting with the ARIMA model [9]. Thompstone et al. later applied this model to seasonal streamflow forecasting for Canadian rivers in 1985 [10]. During the 1990s, machine learning (ML) techniques began to develop, prized for their strong nonlinear mapping and feature learning abilities [11]. Karunanithi et al. pioneered the application of Artificial Neural Networks (ANNs) for streamflow forecasting in the Huron River in 1994, demonstrating performance superior to traditional analytical models [12]. This work promoted the widespread application of ANNs in hydrological forecasting. Subsequent studies, such as those by Jain et al. at the Upper Indravati reservoir in India [13] and Zealand et al. in the Winnipeg River Basin in Canada [14,15], further confirmed that ANNs exhibit stronger nonlinear fitting capabilities for short-term streamflow forecasting, achieving this without the need to model the internal basin structure. Concurrently, Support Vector Machines (SVMs) gained attention for their effectiveness in learning from small sample sizes [16,17,18].
With advancements in computational power, deep learning (DL) techniques gradually became mainstream. Two variants of Recurrent Neural Networks (RNNs) [19], Long Short-Term Memory (LSTM) [20] and the Gated Recurrent Unit (GRU), significantly enhanced the capacity for hydrological time-series modeling. Large-scale validation by Kratzert et al. across 671 catchments in the United States demonstrated that LSTM could achieve accuracy comparable to specialized hydrological models using meteorological data alone [21]. This finding was further validated in flood simulations in China’s Fen River Basin [22] and monthly runoff forecasts in the Yellow River Basin [23]. The integration of Convolutional Neural Networks (CNNs) with LSTM pioneered new pathways for spatio-temporal modeling. For instance, research by Ghimire et al. on the Brisbane River and Teewah Creek in Australia significantly improved forecasting performance, markedly outperforming single models [24,25].
The Transformer architecture [26] marked a significant breakthrough in deep learning, effectively overcoming limitations of LSTM models—such as inadequate long-range dependency capture, gradient vanishing, and overfitting—through its multi-head attention mechanism [27,28]. This has enhanced the capacity to model complex hydrological system behaviors [29,30], demonstrating promising performance in multi-basin modeling, transfer learning scenarios, and for predicting multiple hydrological variables [30,31,32]. To address specific forecasting challenges, improved architectures like Informer have been developed to optimize long-sequence predictions [33,34].
Given the significant spatio-temporal variability in streamflow time-series [35], integrating physical mechanisms and spatio-temporal topology into forecasting remains crucial. Recent approaches combine Graph Convolutional Networks (GCNs) with temporal models to jointly learn spatial and temporal features [36,37], while studies incorporating Graph Attention Networks (GATs) or other GNNs have improved spatial information extraction [38,39,40]. These spatio-temporal deep learning methods better reflect basin-specific hydrological characteristics, mitigate data sparsity issues, and enhance overall forecast performance.
However, despite these advances, the current technological architecture still faces two key challenges. On one hand, the synergistic mechanism between spatio-temporal graph topology and forecasting modules remains suboptimal. Most existing methods rely on static graphs based on physical spatial relationships or simple weighted fusion. Such designs struggle to adapt to the dynamic evolution of hydrological processes and cannot reflect the latent, time-varying spatial relationships in flow data, making it difficult for models to grasp the evolving interactions among streamflow nodes over time. On the other hand, while the Transformer possesses powerful global modeling capabilities for capturing long-distance dependencies, its vanilla attention mechanism lacks granularity in handling local mutations and transient features. This can lead to the loss of local temporal information, manifesting as significant performance degradation, particularly during extreme flood events. Motivated by these challenges, this study proposes a dynamic graph-enhanced spatio-temporal fusion architecture. Through the innovative design of a multi-channel adaptive graph constructor and a dual-stream temporal prediction module, it achieves an organic unification of spatio-temporal dynamic information and driving data in deep learning. The core contributions are as follows:
(1)
A multi-channel dynamic graph constructor that constructs three complementary adjacency matrices—hydrological topology, spatial vector similarity, and statistical correlation patterns—and adaptively fuses them via learnable weights to capture evolving node interactions and enable intelligent information propagation.
(2)
A lightweight local Temporal Pattern Enhancement module that integrates extended convolutional downsampling with multi-head self-attention to enhance the model’s ability to address short-term fluctuations and local anomalies while maintaining global dependency awareness.
(3)
Implementation and validation through ablation studies and comparative experiments using data from the Delaware River Basin, demonstrating the effectiveness of our dynamic graph strategy and local–temporal synergy mechanism under diverse hydrological conditions.
It is important to note that streamflow data, as an integrated reflection of basin hydrological response, encapsulates information from rainfall, underlying surface characteristics, and human activities. The shape of its hydrograph is considered an objective “fingerprint” of hydrological events [41]. This study utilizes instrumental streamflow data alone as the “fuel” for the data-driven model. Our focus under these relatively streamlined conditions is on leveraging spatio-temporal dependencies to enhance the forecasting capability of data-driven models for streamflow.

2. Materials and Methods

This section systematically elaborates on our proposed dynamic graph-enhanced spatio-temporal deep learning architecture. The model utilizes a multi-channel dynamic graph fusion module and employs a dual-stream temporal predictor. The Delaware River Basin in the United States then serves as the study area for ablation studies and multi-step forecasting experiments.

2.1. Methodology

Here, we introduce the DynaSTG-Former, an enhanced Transformer architecture that integrates adaptive construction capabilities for spatial dynamic graphs of streamflow gauges and strengthens local perception characteristics. This model aims to leverage multi-perspective dynamic spatial information to improve multi-step time-series simulation performance across multiple streamflow gauges within river basins. As illustrated in Figure 1, the DynaSTG-Former incorporates both the physical constraints of river network topology and data-driven features from historical observations. Moreover, it constructs evolving relational graphs among gauges by fusing statistical, pattern-based, and physical perspectives, a process enhanced by an adaptive learning mechanism. This multi-source synergistic mechanism enables the model to accurately reflect the dynamic evolutionary characteristics of streamflow sources within the basin. Building upon this dynamic spatial representation, the DynaSTG-Former employs a coordinated mechanism of global temporal flow and local feature flow to capture temporal patterns, overcoming the modeling limitations of traditional Transformers for localized anomalies and transient variations.
The model architecture consists of two key phases: a dynamic spatial dependency extraction component and a temporal feature capture component. This spatial-then-temporal modeling approach has been validated as effective in the time-series forecasting domain. For instance, in studies [42,43], researchers utilized the Temporal Graph Convolutional Network (TGCN) and Attention Temporal Graph Convolutional Network (A3T-GCN) model structures for traffic flow forecasting, demonstrating the superior performance of this architecture. Consequently, this study constructs a Dynamic Adaptive Spatial Graph Constructor and a Dual-stream Enhanced Temporal Predictor. Crucially, these components are enhanced through multiple residual mechanisms, forming a closed-loop spatio-temporal forecasting system that incorporates a feedback mechanism to continuously optimize the model’s understanding of spatio-temporal relationships.

2.1.1. Dynamic Adaptive Spatial Graph Constructor for Multi-Perspective Spatio-Temporal Graph Construction

Capturing spatial dependencies is critical due to the inherent relationship between hydrological processes and geographic/spatial features. The flow value at downstream gauges inherently incorporates characteristic quantities from upstream gauges. These dependencies are expressed as the upstream and downstream spatial relationships of the river network, and align with a graph-like data structure [40,44]. To formalize the spatial relationships among gauges, the basin monitoring network is modeled as a graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ ($N$ denotes the number of nodes) represents the gauge set, and $E$ constitutes the edge set, characterizing hydrological connectivity. Crucially, river discharge dependencies are not static and evolve dynamically over time [45]. The proposed dynamic adaptive spatial graph constructor comprises three stages: multi-channel dynamic graph construction, adaptive channel fusion, and GCN-based feature extraction.
Stage 1, multi-channel dynamic graph construction. Static graphs based solely on physical connectivity may inadequately capture latent interdependencies among stream gauges and dynamic transformations of basin sources [46]. Therefore, additional graph-based feature extraction tools from diverse perspectives are incorporated to enhance the model’s ability to capture temporally evolving characteristics. We construct three types of graphs: a static topology graph, a Pearson correlation coefficient graph, and a cosine similarity graph. These graphs characterize both static topological connectivity and dynamic correlations among nodes.
The static topology graph is constructed based on the physical spatial relationships among stream gauges, represented by a static adjacency matrix $A_{static} = (a_{ij})_{N \times N}$, where $a_{ij} = 1$ indicates that nodes $v_i$ and $v_j$ are hydrologically connected, and $a_{ij} = 0$ otherwise. Although river flow is uni-directional, information propagation exhibits bidirectional influences. To maximize information utilization, this study employs undirected graph modeling.
The Pearson correlation graph $A_{pearson} = (\rho_{ij})$ is derived from the Pearson correlation coefficients between gauge flow series. This metric quantifies the linear dependence between two variables $(x_i, x_j)$ and is defined as follows:

$$\rho_{ij} = \frac{\operatorname{Cov}(x_i, x_j)}{\sigma_{x_i}\sigma_{x_j}} = \frac{\sum_{t=1}^{T}(x_{t,i}-\bar{x}_i)(x_{t,j}-\bar{x}_j)}{\sqrt{\sum_{t=1}^{T}(x_{t,i}-\bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{T}(x_{t,j}-\bar{x}_j)^2}}, \qquad \bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{t,i},$$

where $\rho_{ij} \in [-1, 1]$. Values of 1, −1, and 0 indicate perfect positive correlation, perfect negative correlation, and no linear correlation, respectively.
The cosine similarity graph $A_{cosine} = (cosine_{ij})$ is constructed using the cosine similarity between flow vectors. This evaluates directional alignment while ignoring magnitude differences. For two gauges’ flow vectors $x_i = (x_{1,i}, x_{2,i}, \ldots, x_{T,i})$ and $x_j = (x_{1,j}, x_{2,j}, \ldots, x_{T,j})$ within a time window, the cosine similarity is as follows:

$$cosine_{ij} = \frac{\sum_{t=1}^{T} x_{t,i}\, x_{t,j}}{\sqrt{\sum_{t=1}^{T} x_{t,i}^2}\,\sqrt{\sum_{t=1}^{T} x_{t,j}^2}},$$

with values in $[-1, 1]$. Values closer to 1 indicate greater directional similarity, while values near −1 indicate opposite directions.
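As an illustration of the two dynamic graph channels defined above, the following NumPy sketch computes Pearson correlation and cosine similarity adjacency matrices from one window of gauge flow series. This is a toy example with assumed data, not the paper’s sliding-window implementation:

```python
import numpy as np

def pearson_graph(X):
    """Pearson correlation adjacency from a (T, N) window of gauge flows."""
    Xc = X - X.mean(axis=0, keepdims=True)      # centre each gauge series
    num = Xc.T @ Xc                             # covariance numerators
    norm = np.sqrt((Xc ** 2).sum(axis=0))       # per-gauge sqrt of sum of squares
    return num / np.outer(norm, norm)

def cosine_graph(X):
    """Cosine similarity adjacency: direction only, magnitude ignored."""
    norm = np.linalg.norm(X, axis=0)
    return (X.T @ X) / np.outer(norm, norm)

# toy window: 3 gauges observed over 5 timesteps (columns = gauges)
X = np.array([[1.0, 2.0, 10.0],
              [2.0, 4.0, 9.0],
              [3.0, 6.0, 8.0],
              [4.0, 8.0, 7.0],
              [5.0, 10.0, 6.0]])
A_p = pearson_graph(X)
A_c = cosine_graph(X)
```

In this toy window, gauges 0 and 1 rise in lockstep (ρ = 1) while gauge 2 recedes (ρ = −1), showing how the two metrics capture flow synchronization and trend direction.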
Among these three graphs, the static topology graph remains generally invariant. Both the Pearson correlation and cosine similarity graphs are dynamically updated using a sliding window mechanism, incorporating complete historical sequences from the initial timestep to the current moment. This enables dynamic capture of flow synchronization patterns (via Pearson) and trend direction consistency (via cosine similarity). Each graph undergoes row normalization:
$$\tilde{A}_{k,i,j} = \frac{A_{k,i,j}}{\sum_{j=1}^{N} A_{k,i,j} + \varepsilon}, \qquad k \in \{1, 2, 3\}, \; i, j \in \{1, \ldots, N\}, \; \varepsilon = 10^{-8}.$$
This is followed by multi-channel stacking:

$$A = \operatorname{stack}\big(\tilde{A}_{pearson}, \tilde{A}_{cosine}, \tilde{A}_{static}\big) \in \mathbb{R}^{3 \times N \times N}.$$
This ensures comparable fusion of dimensionally heterogeneous graphs (correlation coefficients, similarity metrics, and binary connections) within a unified framework.
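The normalization and stacking steps above can be sketched as follows; the random matrices are stand-ins for the actual Pearson, cosine, and topology graphs, and the tiny $\varepsilon$ matches the definition above:

```python
import numpy as np

def row_normalize(A, eps=1e-8):
    """Row-normalize so each node's outgoing weights sum to ~1."""
    return A / (A.sum(axis=1, keepdims=True) + eps)

N = 4
rng = np.random.default_rng(0)
A_pearson = rng.random((N, N))                      # stand-in dynamic graph
A_cosine = rng.random((N, N))                       # stand-in dynamic graph
A_static = np.array([[0, 1, 1, 0],                  # stand-in topology graph
                     [1, 0, 1, 1],
                     [1, 1, 0, 0],
                     [0, 1, 0, 0]], dtype=float)

# stack the three row-normalized channels into a (3, N, N) tensor
A = np.stack([row_normalize(A_pearson),
              row_normalize(A_cosine),
              row_normalize(A_static)])
```

Row normalization puts the heterogeneous channels (correlations, similarities, binary links) on a comparable scale before fusion.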
Stage 2, adaptive channel fusion. Significant differences exist in the dominant modes of spatial dependencies under varying basin conditions, as runoff propagation is an extremely complex process [47]. Information passing in a river network should be multi-directional and time-varying, rather than strictly following the network topography as prescribed through the physics-based connectivity [44]. Therefore, we use an adaptive multi-channel graph fusion mechanism. This mechanism generates channel weights through learnable parameters ( α R 3 ) and the softmax function:
$$w = \operatorname{softmax}(\alpha) = \left( \frac{e^{\alpha_1}}{\sum_k e^{\alpha_k}}, \frac{e^{\alpha_2}}{\sum_k e^{\alpha_k}}, \frac{e^{\alpha_3}}{\sum_k e^{\alpha_k}} \right), \qquad w_k > 0, \quad \sum_{k=1}^{3} w_k = 1.$$
The three channels are then weighted and summed using these weights to obtain the fused adjacency matrix $A_{fused} \in \mathbb{R}^{B \times N \times N}$. Here, $B$ represents the batch size, i.e., the number of samples processed simultaneously. Channel fusion is performed independently for each sample ($b \in [1, B]$): $A_{fused}^{(b)} = \sum_{k=1}^{3} w_k A^{(k)}$. The fused adjacency matrix can subsequently serve as input to the Graph Convolutional Network (GCN) feature extractor. The model optimizes the parameters ($\alpha \in \mathbb{R}^3$) through gradient descent, autonomously adjusting the contribution of each channel to enhance generalization capability. This approach is commonly used for establishing adaptive mechanisms [48], and has also been applied to learning adaptive spatial dependency matrices in the field of hydrological forecasting [49].
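A minimal sketch of the adaptive fusion step, with toy logits standing in for the learnable parameter α (in the actual model these would be optimized by gradient descent):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(a - a.max())
    return e / e.sum()

alpha = np.array([0.5, -0.2, 1.0])   # toy channel logits (learnable in the model)
w = softmax(alpha)                   # w_k > 0 and sum(w) = 1 by construction

N = 4
A = np.stack([np.eye(N)] * 3)        # placeholder (3, N, N) channel stack
A_fused = np.tensordot(w, A, axes=1)  # sum_k w_k * A[k] -> (N, N)
```

Because the weights come through a softmax, each channel’s contribution stays positive and the three contributions always sum to one, regardless of the logit values.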
Stage 3, GCN-based feature extraction. After sparsification and threshold-based filtering, the edge index matrix ($E_{index} \in \mathbb{Z}^{2 \times E_{total}}$) and edge weight vector ($w \in \mathbb{R}^{E_{total}}$) are processed through two graph convolution operations to obtain embedded features. The first graph convolution layer is as follows:
$$H^{(1)} = \operatorname{Dropout}\big(\operatorname{ReLU}(\tilde{A} X W^{(1)})\big), \quad p = 0.3.$$
The second graph convolution layer is as follows:
$$H^{(2)} = \hat{A} H^{(1)} W^{(2)}.$$
After completing convolution and symmetric normalization, the final output for each node is represented as a $d_h$-dimensional embedding vector:

$$H_{out} = \operatorname{reshape}\big(H^{(2)}, (B, N, d_h)\big) \in \mathbb{R}^{B \times N \times d_h}.$$
This embedding integrates the node’s intrinsic features with information propagated through the graph structure from neighboring nodes, making it suitable for subsequent temporal forecasting. During this process, residual connections preserve original hydrological signals by adding raw input features to processed GCN outputs, mitigating information loss during feature transformation.
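The two-layer propagation and residual connection can be illustrated with a simplified NumPy forward pass. Dropout, sparsification, and batching are omitted; the normalized adjacency, feature matrix, and weights are all toy assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_forward(A_hat, X, W1, W2):
    """Two-layer GCN: H1 = ReLU(A_hat X W1), H2 = A_hat H1 W2 (dropout omitted)."""
    H1 = relu(A_hat @ X @ W1)
    return A_hat @ H1 @ W2

N, F = 4, 3                                  # 4 nodes, 3-dim features
rng = np.random.default_rng(1)
A_hat = np.full((N, N), 1.0 / N)             # toy normalized adjacency
X = rng.standard_normal((N, F))              # toy node features
W1 = rng.standard_normal((F, F))
W2 = rng.standard_normal((F, F))

H_out = gcn_forward(A_hat, X, W1, W2)
H_res = H_out + X                            # residual connection keeps raw signal
```

The residual addition at the end mirrors the paper’s point that raw hydrological inputs are added back to the GCN output to mitigate information loss.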

2.1.2. Dual-Stream Enhanced Temporal Predictor for Enhanced Global and Local Feature Extraction in Temporal Forecasting

Although Transformers excel in modeling long-range dependencies, their global attention mechanism exhibits limited capability in capturing localized transient events, such as storm-induced flood peaks. Prior studies [50,51] indicate that standard (unmodified) Transformer models offer no advantage over LSTM in hydrological forecasting. This limitation primarily stems from the following: (1) flattened attention diluting critical event signals; and (2) the absence of localized feature extraction mechanisms. Nevertheless, the Transformer framework possesses scalability advantages for learning and storing knowledge in large-scale datasets, while its self-attention mechanism mitigates gradient vanishing issues inherent in RNN models. Consequently, this study retains the Transformer framework but introduces a global–local collaborative dual-stream enhanced architecture. This design enables parallel processing of global temporal dependencies and transient local patterns. Here, the Global Temporal Flow employs a Transformer Encoder to model long-term dependencies, while the Local Feature Flow leverages the localized sensitivity of convolutional attention mechanisms to capture transient features.
The dual-stream enhanced temporal prediction comprises three sequential phases: feature projection, temporal flow modeling, and multi-step iterative forecasting. In the first phase, the predictor receives the enhanced feature tensor from the spatial encoder and projects it into a higher-dimensional space:
$$X_e = X W_p + b_p, \qquad W_p \in \mathbb{R}^{N \times d},$$
where $d$ represents the hidden dimension. This operation projects the spatio-temporal embeddings into the predictor’s hidden space, enabling effective feature transformation while maintaining topological integrity.
The second phase focuses on temporal flow modeling. The predictor concurrently processes global and local temporal flows, as illustrated in Figure 2.
The global temporal flow employs a Transformer encoder to capture long-range dependencies across timesteps through self-attention mechanisms:
$$H_{global} = \operatorname{TransformerEncoder}(X_e),$$
where the Transformer encoder comprises stacked encoder blocks, each containing two core components: (1) multi-head self-attention layer:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$
where $Q$, $K$, and $V$ represent the query, key, and value matrices derived from linear projections of the input, and the hyperparameter $d_k$ denotes the key vector dimensionality. The multi-head mechanism (experimentally configured with $h = 2$ heads) computes multiple attention weight sets in parallel, enhancing the model’s capacity to discern diverse temporal patterns. (2) Feed-Forward Network (FFN): $FFN(x) = \operatorname{ReLU}(x W_1 + b_1) W_2 + b_2$. This fully connected network amplifies nonlinear representational capabilities and improves training stability [26].
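The scaled dot-product attention above can be sketched for a single head. This is a self-contained NumPy illustration with assumed dimensions, not the model’s batched multi-head implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(2)
T, d = 5, 8                         # 5 timesteps, toy model dimension 8
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` is a probability distribution over the five timesteps, which is how the encoder mixes information across the whole input window.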
To address standard Transformers’ limitations in modeling transient variations, we design a parallel convolutional attention module, the Local Temporal Pattern Enhancement (LTPE):
$$H_{local} = \operatorname{PatchBlock}(X_e).$$
Implementation involves the following three operations:
1. Multi-scale convolutional downsampling:
$$X_c = \operatorname{Conv1D}(X) \in \mathbb{R}^{B \times [T/p] \times d},$$
where the kernel size $p$ controls the temporal receptive field for multi-scale feature extraction; kernel sizes {3, 5, 7} capture event-specific durations.
2. Local attention enhancement:
$$H_{local} = \operatorname{MultiHeadAttention}(X_c, X_c, X_c).$$
3. Temporal dimension recovery:
$$H_{local} = \operatorname{Interpolate}(H_{local}).$$
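The three LTPE operations can be sketched end-to-end as follows. This toy version substitutes mean pooling for the learned Conv1D downsampling and linear interpolation for the recovery step, so it only illustrates the shape transformations, not the learned behavior:

```python
import numpy as np

def ltpe_sketch(X, p):
    """Toy LTPE: pool patches of size p, self-attend over the shortened
    sequence, then interpolate back to the original length T."""
    T, d = X.shape
    Tc = T // p
    Xc = X[:Tc * p].reshape(Tc, p, d).mean(axis=1)      # (T/p, d) downsample

    # single-head self-attention on the coarse sequence
    scores = Xc @ Xc.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    Hc = w @ Xc

    # linear interpolation back to the original temporal resolution
    t_coarse = np.linspace(0, T - 1, Tc)
    t_fine = np.arange(T)
    return np.stack([np.interp(t_fine, t_coarse, Hc[:, j])
                     for j in range(d)], axis=1)

X = np.random.default_rng(3).standard_normal((20, 4))   # 20 steps, 4 features
H_local = ltpe_sketch(X, p=5)
```

The round trip (downsample, attend, upsample) returns a tensor with the original temporal length, which is what allows the local stream to be fused with the global stream via residual connections.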
Parallel outputs from global and local temporal modeling are fused via residual connections.
The third phase is multi-step iterative forecasting. An iterative prediction framework generates forecasts for K future steps:
$$\hat{Y}_t = \operatorname{Linear}_{out}(M_{t-1}), \qquad X_t = \operatorname{Concat}\big(X_{t-1}[1{:}],\, \hat{Y}_t\big),$$
This mechanism dynamically updates the input sequence to achieve temporal recursion, ensuring stability in long-term forecasts.
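The iterative rollout can be sketched as a simple loop in which each one-step prediction is appended to the window while the oldest step is dropped. The trivial `toy_model` below is a hypothetical placeholder for the full DynaSTG-Former forward pass:

```python
import numpy as np

def iterative_forecast(model, window, K):
    """Roll the window forward K steps: append each prediction, drop the oldest."""
    preds = []
    w = window.copy()
    for _ in range(K):
        y_hat = model(w)                        # one-step-ahead prediction
        preds.append(y_hat)
        w = np.concatenate([w[1:], [y_hat]])    # X_t = Concat(X_{t-1}[1:], y_hat)
    return np.array(preds)

# toy "model": predicts the mean of the current window
toy_model = lambda w: w.mean()
window = np.arange(1.0, 21.0)                   # 20-step history, as in the paper
preds = iterative_forecast(toy_model, window, K=6)
```

Because each prediction feeds the next input window, forecast errors can compound with lead time, which is consistent with the KGE decay at 72 h reported in Section 3.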

2.2. Study Area

The Delaware River Basin (DRB) (Figure 3) serves as the validation testbed for this study. Located in the northeastern United States, the DRB spans approximately 35,000 km² across portions of New York, Pennsylvania, New Jersey, and Delaware. Its waters support municipal, agricultural, and industrial needs for 14.2 million people (≈4% of the U.S. population).
The Delaware River—the basin’s primary channel—is the longest un-dammed river east of the Mississippi, with a free-flowing mainstem (dams exist only on tributaries). Topography transitions from mountainous terrain along northern/western boundaries to low-gradient plains in the south/east. This physiographic diversity underpins a heterogeneous land cover: forests (44%), agricultural areas (19%), developed land (21%), and wetlands (16%) (USGS NLCD 2021). Long-term USGS stream gauge observations confirm strong spatio-temporal variability in hydrometeorological processes, driving nonlinear streamflow generation critical for testing model robustness [52].

2.3. Data Sources and Preprocessing

The data employed in this study primarily comprise historical streamflow observations and the topological relationships among stream gauges. Historical streamflow data were obtained from the National Water Information System (NWIS), operated by the United States Geological Survey (USGS) (https://waterdata.usgs.gov/nwis, accessed on 10 May 2025). This integrated platform consolidates real-time monitoring data from over 13,500 stations across the United States, encompassing multiple parameters including surface-water, groundwater, and water-quality data, typically recorded at 15 to 60 min intervals. This study focused on the DRB; the dataset spans the period from January 2010 to December 2024. The data-processing workflow is as follows:
  • Data download and preliminary screening: downloaded water-level and flow observation data for the gauges at a 15 min resolution, then excluded gauges with >5% missing data.
  • Water-level time-series gap filling: applied linear interpolation to reconstruct isolated or small-scale gaps in water-level records.
  • Streamflow data supplement: for periods with complete water-level data but missing flow measurements, applied a multi-year comprehensive stage–discharge relationship model, using the recorded water level to calculate the corresponding missing streamflow.
  • Time-scale aggregation: after completing the above steps, aggregated all streamflow data (original 15 min resolution) into a 12 h time-series by computing the arithmetic mean, which facilitates subsequent model training and testing.
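The final aggregation step can be sketched as follows, assuming a complete, gap-filled 15 min series (48 consecutive samples per 12 h block):

```python
import numpy as np

def aggregate_12h(q_15min):
    """Arithmetic mean over consecutive blocks of 48 samples (48 x 15 min = 12 h)."""
    n = len(q_15min) // 48
    return q_15min[:n * 48].reshape(n, 48).mean(axis=1)

# two days of synthetic 15-min flows (96 samples per day) -> four 12-hour means
q = np.full(2 * 96, 5.0)
q[96:] = 7.0                      # second day carries higher flow
q12 = aggregate_12h(q)
```

The reshape-and-mean trick only works once the gap-filling steps above have produced a strictly regular 15 min series; with irregular timestamps a time-indexed resample would be needed instead.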
The river network topology was generated by extracting basin boundaries and waterbody data from the National Hydrography Dataset (NHDPlus HR [53]), alongside the river network of the study basin from the global HydroRIVERS database [54]. The topological structure of the gauge stations was constructed based on the geographic coordinates and upstream–downstream relationships of the 45 stream gauges ultimately used in this study. This defined the spatial adjacency between gauges, resulting in a schematic diagram reflecting the basin’s gauge network topology (Figure 4). A static adjacency matrix for the gauge graph was subsequently computed based on these adjacency relationships.
According to data from the Delaware River Basin Commission (DRBC) portal (https://www.nj.gov/drbc/, accessed on 10 May 2025), the basin contains 200 active stream gauges, including 187 USGS gauges. To minimize anthropogenic interference, this study first excluded gauges located upstream of tributary reservoirs. Following a systematic evaluation of the historical data quality from mainstem monitoring gauges, 45 USGS gauges located upstream of the Trenton, New Jersey section on the Delaware River were ultimately selected for analysis. No mainstem gauges downstream of the Trenton section met the required data quality standards. Detailed information for these 45 gauges is provided in Supplementary Information Table S1.
To evaluate the reliability of the streamflow data from the 45 gauging stations used for prediction in this study, statistical assessments of temporal homogeneity and spatial isotropy/anisotropy were conducted [55]. The Mann–Kendall trend test and Pettitt change-point analysis performed on the Rescaled Adjusted Partial Sums (RAPS) [56] indices of the streamflow dataset demonstrated strong temporal homogeneity, as evidenced by p-values ranging from 0.23 to 1.00 (exceeding the 0.05 significance level), Z-statistics between −0.89 and 1.19 (below the critical value of 1.96), and Sen’s slope estimates approaching 0 (−0.034 to 0.076). These results consistently confirm the absence of significant temporal trends and detectable change points in the streamflow records. The confirmed temporal homogeneity indicates that the statistical properties (e.g., mean and variance) of the streamflow series remain stable over time, providing a fundamental and ideal foundation for building robust and reliable forecasting models by ensuring that patterns learned from historical data are likely to persist into the future.
Spatial characteristics were further examined using Innovative Polygon Trend Analysis (IPTA) [57], which revealed significant spatial anisotropy rather than randomness. The results showed coherent spatial patterns across the basin, from upstream areas and major tributaries to the basin outlet, indicating consistent directional shifts in streamflow variation. This suggests that the drivers of streamflow changes are regional or global (e.g., climate fluctuations) rather than local random disturbances. The spatial coherence also exhibited distinct seasonal phase-locking characteristics; for instance, a synergistic increase in streamflow was observed across all stations during the spring months (March to May), corresponding to factors like increased precipitation or snowmelt changes, while a consistent decrease in early June (reflecting influences such as enhanced evapotranspiration) further demonstrated organized hydrological responses spatially. Even minor variations in individual station responses underscore the modulation by local underlying surfaces under dominant climatic signals, rather than negating the overall spatial synergy.
In summary, the demonstrated temporal homogeneity and spatial anisotropy collectively indicate that the streamflow data used in this study possess a stable statistical foundation and an organized spatial structure. This verification not only addresses concerns regarding data reliability but also establishes a solid physical basis for developing predictive models capable of capturing complex spatio-temporal dependencies. Analytical results are provided in Supplementary Information Table S2 and Figure S1.

2.4. Experimental Design

This study aims to evaluate the potential of spatio-temporal information accumulated from numerous stream gauges deployed within river basins for streamflow forecasting, and to investigate a low-computational-cost enhancement method for attention mechanisms to improve the Transformer’s capability in extracting local features from temporal data. Accordingly, two experiments were designed. In Experiment I, a Dynamic Adaptive Spatial Graph Constructor and a Dual-stream Enhanced Temporal Predictor were designed to achieve the main objectives of this study. The experiment was analyzed through comparisons with baseline models (an unmodified Transformer Encoder) and ablation studies, aiming to quantitatively evaluate the effectiveness of each component in the new model. In Experiment II, comparative analysis was conducted with both widely used temporal forecasting models (LSTM, GRU, Transformer Encoder) and prediction models incorporating graph attention mechanisms (TGCN, A3T-GCN) to quantify the overall superiority of the proposed model in multi-site streamflow prediction in river networks. Evaluation metrics include Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Kling–Gupta Efficiency (KGE). Metric definitions are detailed in Table 1. Notably, compared to the traditional Nash–Sutcliffe Efficiency (NSE) commonly used in hydrology, the KGE metric has demonstrated advantages in sensitivity to model systematic bias and in applications for model evaluation and calibration [58,59].
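For reference, the KGE used for evaluation can be computed from its standard formulation, KGE = 1 − sqrt((r − 1)² + (α − 1)² + (β − 1)²), where r is the linear correlation between simulated and observed flows, α the ratio of their standard deviations, and β the ratio of their means (a minimal sketch with toy observation values):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    r = np.corrcoef(sim, obs)[0, 1]       # linear correlation
    alpha = sim.std() / obs.std()         # variability ratio
    beta = sim.mean() / obs.mean()        # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
perfect = kge(obs, obs)        # a perfect forecast scores 1
doubled = kge(2.0 * obs, obs)  # perfectly correlated but biased forecast
```

Unlike NSE, KGE separates correlation, variability, and bias terms, which is why a forecast that doubles every observation is penalized even though its correlation is perfect.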
To objectively evaluate the generalizability of the DynaSTG-Former model, seven clusters were identified among the 45 gauges using the k-means clustering algorithm, based on flow means, ranges of the flow coefficient of variation, 1-step KGE values, multi-step KGE decay rates, and mainstem/tributary distribution. The centroid gauge of each cluster (seven in total) was selected as a representative node (details in Table 2). The methodology for gauge node classification is provided in Appendix A.
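Representative-node selection of this kind can be sketched as follows: standardize the gauge features, cluster them, and take the gauge nearest each centroid. The feature matrix, the minimal Lloyd's k-means, and its deterministic spread initialization are our illustrative assumptions, not the paper's implementation (which used Scikit-learn):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's k-means with a deterministic, spread-out initialization."""
    order = np.argsort(X[:, 0])
    centroids = X[order[np.linspace(0, len(X) - 1, k).astype(int)]]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

def representative_nodes(features, k=7):
    """Standardize gauge features (e.g., flow mean, CV range, 1-step KGE,
    KGE decay rate), cluster them, and return the index of the gauge
    closest to each cluster centroid (one representative per cluster)."""
    Z = (features - features.mean(axis=0)) / features.std(axis=0)
    labels, centroids = kmeans(Z, k)
    reps = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        d = np.linalg.norm(Z[members] - centroids[j], axis=1)
        reps.append(int(members[d.argmin()]))
    return reps
```

Picking the gauge nearest the centroid (rather than the centroid itself) guarantees that each representative is a real, observable station.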
Considering computational costs, a 20-time-step historical window was used to construct training instances. The forecasting horizon was set to one, three, and six steps ahead (corresponding to 12 h, 36 h, and 72 h) to test multi-step forecasting performance. Data collected from 2010 to 2024 were used with an 8:2 split into training and testing sets. For model training, the Mean Square Error (MSE) was employed as the loss function with the Adam optimizer at an initial learning rate of 1 × 10⁻⁴. All models were run on a Windows 10 system with the following configuration:
Hardware: 13th Gen Intel(R) Core(TM) i7-13620H CPU; 16 GB RAM; NVIDIA GeForce RTX 4050 GPU.
Software: Python 3.9.16 with PyTorch 2.1.1 (CUDA 11.8); Pandas 2.0.3 for time-series processing; and Scikit-learn 1.3.2 for data preprocessing.
Detailed hyperparameter sets for each model are provided in Supporting Information Table S3, including descriptions and dimensions of each module.
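The training-instance construction described above (a 20-step historical window with 1-, 3-, and 6-step-ahead targets, followed by an 8:2 train/test split) can be sketched as follows. The unshuffled, chronological split is our assumption for time-series evaluation; the training itself (MSE loss, Adam, learning rate 1 × 10⁻⁴) would be done in PyTorch as stated in the paper and is not reproduced here:

```python
import numpy as np

def make_windows(series, lookback=20, horizon=1):
    """Build supervised pairs from a multivariate series of shape (T, n_gauges):
    X = the `lookback` past steps, y = the value `horizon` steps ahead."""
    series = np.asarray(series, float)
    X, y = [], []
    for t in range(len(series) - lookback - horizon + 1):
        X.append(series[t:t + lookback])
        y.append(series[t + lookback + horizon - 1])
    return np.stack(X), np.stack(y)

def chrono_split(X, y, train_frac=0.8):
    """Chronological 8:2 split (no shuffling, so the test period is strictly
    later than the training period)."""
    n_train = int(len(X) * train_frac)
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```

With 12 h sampling, `horizon=1, 3, 6` corresponds to the 12 h, 36 h, and 72 h lead times evaluated in the paper.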

3. Results

3.1. Overall Model Performance

The multi-step-ahead forecasting performance of the DynaSTG-Former model is summarized in Table 3. Overall metrics demonstrate the model’s high proficiency at the basin scale, achieving excellent KGE values of 0.961 and 0.956 for 1-step-ahead (12 h) and 3-step-ahead (36 h) forecasts, respectively, with a negligible decay rate of merely 0.5%. For the 6-step-ahead (72 h) forecast, the overall KGE remained high at 0.855, yet it decayed by 11.0% compared to the 1-step result, highlighting the universal challenge of maintaining accuracy over extended lead times. Furthermore, results from the seven representative nodes selected via the k-means clustering method reveal spatial heterogeneity in model accuracy. Notably, the representative node for the low-flow stable cluster (Cluster 6, node_41) exhibited consistently low errors (RMSE, MAE) across all time horizons. The representative high-flow control node (Cluster 2, node_45) also demonstrated exceptional stability in multi-step forecasts, with KGE values of 0.958, 0.892, and 0.811. The representative node for Cluster 5 (node_9) even outperformed the overall model performance in the 1-step-ahead forecast, achieving a KGE of 0.966. However, as the forecast horizon extended to 72 h, the model showed increased uncertainty at the high-flow node (such as node_19), characterized by rising errors and a significant decline in KGE (from 0.916 to 0.666, 27.3%). This performance degradation is likely attributable to enhanced error propagation and increased sensitivity to initial conditions under high-flow, nonlinear hydrological regimes. In contrast, the representative nodes for the high-variability cluster (Cluster 3, node_10) and the medium-flow, medium-variability cluster (Cluster 0, node_30) experienced the most substantial performance drop, ranking the lowest among all representative nodes at each forecasting step. 
The performance at these sites underscores the model’s limitations in handling such catchments, the underlying mechanisms of which will be thoroughly analyzed in the subsequent discussion section alongside comparative experimental results.
Figure 5 illustrates the scatter plot distribution of predicted versus observed values across different forecast horizons (12 h, 36 h, 72 h). Overall, the model’s forecasts exhibit a strong positive correlation with the observations. Specifically, in Figure 5a, the scatter points are tightly clustered along the 1:1 ideal line (red dashed line), indicating the model’s high accuracy in short-term forecasting. As the forecast horizon extends, the dispersion of the scatter points gradually increases. In Figure 5b, some points begin to deviate from the ideal line, showing a degree of underestimation, particularly in the medium-to-high flow ranges. By Figure 5c, the scatter distribution becomes markedly more spread out, with a more pronounced systematic underestimation evident in the high-flow region (>1500 m³/s), accompanied by greater uncertainty. This pattern clearly demonstrates the typical decay in model performance as the forecast lead time increases. This performance degradation is primarily attributable to the propagation and accumulation of initial minor errors over successive forecasting steps, leading to significantly elevated uncertainty in longer-term forecasts. Furthermore, uncertainties in factors influencing streamflow forecasts, such as precipitation, increase substantially with longer lead times. For high-flow events, which are often driven by intense rainfall, the model’s forecasts become particularly sensitive to input errors, resulting in the observed systematic underestimation in the high value range shown in the figure.
Figure 6 depicts the training process of the proposed DynaSTG-Former model across different forecast horizons. All three models (1-step-, 3-step-, and 6-step-ahead forecasting tasks) ultimately achieved convergence yet exhibited distinct training characteristics reflective of varying forecasting task complexities. The 1-step model (Figure 6a) exhibited rapid and stable convergence, with the loss value decreasing smoothly and stabilizing after approximately 30 training epochs. This indicates that learning the immediate, single-step hydrological transitions poses a relatively well-conditioned problem for the model architecture. In contrast, the training curves for the 3-step and 6-step models (Figure 6b,c) revealed two key phenomena: (1) a notably slower rate of loss reduction during the initial phase (epochs 0–40), and (2) significant oscillations throughout the training process. Particularly for the 6-step model, higher loss values and greater instability persisted even in the later stages of training. These observations align with expectations, as predicting further into the future inherently incorporates greater uncertainty and requires modeling more complex, long-term spatio-temporal dependencies within the river network.

3.2. Ablation Study Results

To evaluate the contribution of each core component within the proposed model, a systematic ablation study was conducted. The analysis focused on the 3-step-ahead (36 h) forecasting metrics, which represent a balanced performance point, with the results summarized in Table 4. The complete model (Full Model) achieved the best or near-best performance across all evaluation metrics (RMSE = 31.87 m³/s, KGE = 0.956), validating the effectiveness of its overall architecture.
The removal of the graph attention mechanism resulted in a significant performance degradation, with the KGE dropping to 0.885, a decrease of 7.33%. Specifically, removing the Pearson correlation graph (w/o Pearson) led to the most severe performance drop (RMSE increased to 35.89 m³/s, KGE decreased to 0.937), underscoring the critical importance of capturing dynamic hydrological similarities between gauges for forecasting accuracy. Concurrently removing both the Pearson and cosine similarity graphs (w/o Pearson and Cosine) also caused a substantial performance loss (KGE decreased to 0.907), confirming the necessity of integrating multiple node relationships into a multi-perspective dynamic graph structure. In contrast, removing the Local Temporal Patch Enhancement module (w/o LTPE) had a comparatively minor impact on the core metrics: the KGE decreased to 0.940, a reduction of only 1.7%, with the RMSE remaining almost unchanged. This indicates that the Graph Neural Network module contributes far more to the model’s predictive performance than the local temporal enhancement module does. An interesting finding emerged: while removing the LTPE module from the Full Model caused a performance loss (a decrease in KGE and an increase in MAE), adding the LTPE module to a standard Transformer encoder base, conversely, led to a slight performance degradation (KGE decreased by 0.015, MAE increased by 0.26). This suggests that the LTPE module functions synergistically with the graph attention module rather than operating independently.
Figure 7 presents the performance degradation rates (expressed as percentage changes) for each metric after the removal of individual model modules, thereby indicating their relative contributions: a greater performance loss signifies a higher contribution. The results clearly demonstrate that the graph structure modules are indispensable components of the model architecture. The complete removal of the graph attention module (w/o Graph) led to the most severe performance degradation, with KGE, MAE, and RMSE deteriorating by 7.33%, 11.22%, and 15.65%, respectively. Within the dynamic graph module, the Pearson correlation graph plays a particularly critical role: the degradation caused by its individual removal (w/o Pearson; MAE: +11.86%, RMSE: +12.62%) was substantially larger than that caused by removing the cosine similarity graph (MAE: −1.59%, RMSE: +4.38%). Figure 7 also reveals the distinctive behavior of the LTPE module. An important finding is that the combined removal of the Pearson and cosine graphs resulted in a KGE loss (5.11%) approximately equal to the sum of the individual losses (2.49% + 2.76% ≈ 5.25%), and an MAE loss (10.69%) nearly matching the sum of the individual losses (11.86% − 1.59% = 10.27%). This suggests that the two graph types provide complementary hydrological information. The RMSE metric, however, exhibited a nonlinear degradation pattern: while the individual removal of the Pearson graph caused a substantial increase (12.62%) and the removal of the cosine graph a 4.38% increase, their combined removal resulted in a total increase (10.69%) smaller than the arithmetic sum would suggest. This indicates that when both graph types are absent, the model may fail to capture complex hydrological fluctuations and instead revert to more conservative predictions closer to the historical average.
The specific mechanisms underlying this phenomenon merit further investigation.
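The degradation rates in Figure 7 are signed percentage changes relative to the full model, oriented so that a positive value always means worse performance. A minimal helper (our naming; the published rates were presumably computed from unrounded metric values, so recomputation from the rounded Table 4 entries can differ in the last digit):

```python
def degradation_pct(full, ablated, higher_is_better=True):
    """Percent degradation of a metric after ablating a module, signed so
    that a positive result always means a loss of performance."""
    change = (ablated - full) / abs(full) * 100.0
    # For "higher is better" metrics (e.g., KGE) a drop is a loss;
    # for error metrics (RMSE, MAE) an increase is a loss.
    return -change if higher_is_better else change
```

For example, recomputing the w/o Pearson RMSE increase from the Table 4 values (31.87 → 35.89 m³/s) gives ≈12.6%, consistent with the ≈12.62% reported in Figure 7.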
To evaluate the predictive performance and physical interpretability of the multi-channel graph architecture under different hydrological mechanisms, systematic ablation experiments on the 3-step-ahead forecast were conducted based on seven identified clusters (Table 5). The results reveal that regions with distinct hydrological characteristics exhibit varying dependencies on graph structural information, and these differences correlate with basin physical processes.
Cluster_2 (outlet control station) and Cluster_4 (high-performance mainstem stations), which act as the primary confluence pathways in the basin, are dominated by the linear superposition of upstream inflows. The ablation experiments show that these clusters are most sensitive to the Pearson correlation graph, as indicated by KGE decreases of 7.8% and 7.5%, respectively. This suggests that capturing the linear synchrony of flow variations between gauges is key to predicting flows at these locations. Cluster_3 (low-flow, high-variability), representing rapid-response tributary basins strongly influenced by localized rainfall, suffered the most significant performance drop when the cosine similarity graph was ablated (KGE decreased by 2.6%), whereas removal of the Pearson graph resulted in only a 1.5% decrease. This implies that for such sub-basins, recognizing the instantaneous response shapes triggered by localized storm events is more critical than synchrony in absolute flow values. A notable performance decline occurred across all clusters when both the Pearson and cosine graphs were removed, demonstrating the necessity of the multi-channel architecture, which integrates linear correlation and pattern similarity to adapt to diverse hydrological mechanisms. The full model maintained the highest performance across all clusters, confirming its strong adaptability to hydrological heterogeneity within the basin.

3.3. Comparative Analysis

In this section, we conducted a multi-dimensional comparison between the DynaSTG-Former and baseline models (including LSTM, GRU, standard Transformer (Trans), TGCN, and A3T-GCN) to thoroughly evaluate our model’s streamflow forecasting capabilities. The evaluation specifically assessed performance at the basin scale, generalizability across representative gauges (nodes), sensitivity to diverse hydrological characteristics, and robustness under extreme boundary conditions.

3.3.1. Basin-Scale Benchmark Performance Evaluation

To evaluate the benchmark performance of DynaSTG-Former at the basin scale, we compared it against several mainstream baseline models. As shown in Table 6, the performance of all models across the three forecast horizons (12 h, 36 h, 72 h) was measured using basin-aggregated RMSE, MAE, and KGE. Overall, DynaSTG-Former (Ours) achieved the best or near-best performance across almost all forecast horizons and evaluation metrics. In the short-term (1-step-ahead, 12 h) forecast, our model achieved a KGE of 0.961, outperforming all baseline models by 1.0% to 2.0%, and its RMSE (16.88 m³/s) was markedly superior to that of all other models, an approximately 14.6% reduction relative to the second-best model, GRU. This demonstrates the model’s strong capability in capturing instantaneous hydrological dynamics. As the forecast horizon extended, all models exhibited the expected performance degradation, yet DynaSTG-Former declined most gradually. In the 3-step-ahead (36 h) forecast, it was the only model to maintain a KGE above 0.95 (0.956), a 5.6% improvement over the second-best model, the standard Transformer; its RMSE (31.87 m³/s) and MAE (8.98 m³/s) also led all other models. In the 6-step-ahead (72 h) forecast, DynaSTG-Former again achieved the best values for all three metrics, with a KGE of 0.855 that outperformed the second-best model, LSTM, by 5.8%, demonstrating its exceptional stability.

3.3.2. Validation of Generalizability Across Representative Nodes

To evaluate the model’s generalizability at the basin scale, we utilized the seven hydrologically representative gauges (nodes) previously selected through clustering. The median values of performance metrics (KGE, MAE, RMSE) across these nodes for three forecasting horizons were calculated for inter-model comparison. This statistic effectively mitigates the influence of outliers and better reflects the model’s consistent performance across the basin. As shown in Figure 8, DynaSTG-Former (Ours) demonstrated superior median performance across all three metrics and forecasting horizons, fully affirming its advantages in predictive accuracy and efficiency. Short-term forecasting (1-step-ahead, 12 h): Our model achieved a significantly higher median KGE (0.916) than comparative models (e.g., LSTM: 0.895; GRU: 0.904). Its median MAE (3.59) was 11.6% lower than the second-best model, and its median RMSE (7.47) was 16.1% lower than the second-best model. Medium-term forecasting (3-Step-ahead, 36 h): While all models exhibited performance decay, our model showed a decay rate of only 3.4% in KGE—less than one-fifth of LSTM’s decay rate (17.7%)—demonstrating stronger stability. Long-term forecasting (6-step-ahead, 72 h): Our model maintained the highest median KGE (0.666), outperforming the second-best model (Trans: 0.618) by 7.8% and the worst-performing models (TGCN and A3T-GCN: 0.520) by 28.1%.
Figure 9 compares the performance decay patterns, as measured by the median KGE values across seven representative nodes, for 1-step (12 h), 3-step (36 h), and 6-step (72 h) forecast horizons. The analysis reveals the following findings: all models exhibit the expected decline in predictive performance as the lead time increases; however, our proposed model (Ours) demonstrates exceptional resistance to this decay.
Specifically, the total decay in median KGE from the 1-step-ahead to the 6-step-ahead forecast is 27.3% for our model, the smallest among all models. More importantly, this advantage is particularly pronounced in the short- to medium-term range (one to three steps): here our model shows a minimal decay rate of only 3.4%, while the baseline models (e.g., LSTM, GRU) suffer sharp performance degradation, with an average KGE decay rate of 15.3%. This indicates that our model can more effectively capture and preserve the initial hydrological dynamics of the basin. Even in the long-term forecast (6-step-ahead, 72 h), our model maintains the highest median KGE, and its decay rate from the 3-step to the 6-step forecast is comparable to the average decay rate across all models. In summary, this analysis not only confirms the superior overall accuracy of our model but also highlights its robustness in handling basin hydrological heterogeneity and maintaining long-term forecasting stability.
Another interesting finding is that TGCN and A3T-GCN exhibited almost identical performance decay trajectories and rates in this study. As shown in Figure 9, their curves are nearly indistinguishable. This indicates that, under the specific hydrological conditions and data configuration of our study basin, the attention mechanism introduced in A3T-GCN failed to significantly enhance the long-term forecasting stability of its base model (TGCN). Future work could further explore its effectiveness across different river basins or under varying input conditions.

3.3.3. Sensitivity Analysis by Hydrological Attributes Grouping

To investigate the sensitivity of model performance to key hydrological attributes, we first examined the distribution of hydrological characteristics across the basin’s nodes. This analysis showed that, for the selected stream gauges, the first quartile (Q1) of the mean 12 h flow is 4.8 m³/s and the third quartile (Q3) is 89.0 m³/s, while the Q1 and Q3 of the flow coefficient of variation are 0.173 and 0.250, respectively. Traditional hydrological classification criteria based solely on quartiles of mean flow and coefficient of variation therefore showed limited applicability in this basin. To ensure objectivity and representativeness in node grouping, we instead selected four distinct clusters (Clusters 3, 6, 5, and 4) that emerged naturally from the previously applied clustering algorithm. These clusters represent four characteristic hydrological groups: Low-Flow High-Variability (LF-HV), Low-Flow Medium-Variability (LF-MV), High-Flow Low-Variability (HF-LV), and High-Flow Medium-Variability (HF-MV). Supplementary Figure S2 shows the gauge distribution of these groups, which enables analysis of the model’s sensitivity to differences in key hydrological features.
The performance comparison of various models under different hydrological conditions is shown in Figure 10. Regarding the KGE metric (Figure 10a–c), our proposed model (Ours, purple) demonstrates the best or near-best performance in the vast majority of scenarios. In short-term forecasting (1-step-ahead, 12 h; Figure 10a), DynaSTG-Former’s advantage is primarily evident in two hydrological groups: Low-Flow High-Variability (LF-HV) and High-Flow Low-Variability (HF-LV), achieving the highest median KGE values of 0.889 and 0.957 within their respective groups. As the forecast horizon extends, our model’s advantage becomes more pronounced. In both 3-step-ahead and 6-step-ahead forecasts, our model achieves the highest median KGE values across all hydrological groups, particularly reaching a high value of 0.929 under the High-Flow Medium-Variability (HF-MV) condition during the 3-step-ahead (36 h) forecast.
Conclusions based on error metrics (RMSE, MAE) remain consistent with those from KGE. As shown in Figure 10d–i, our model maintains the lowest median errors under most hydrological conditions. It is noteworthy that in predicting high-flow nodes, the performance differences among models are relatively small in the comparatively simpler HF-LV scenario (Figure 10f,i), indicating limited room for architectural improvement when hydrological processes are stable. However, as variability increases—such as in the HF-MV scenario—the errors of our model become significantly lower than those of baseline models, highlighting its superior capability in managing forecasting errors under complex hydrological dynamics. In summary, this comprehensive comparison demonstrates that the DynaSTG-Former model performs well not only under ideal conditions but also exhibits significant and consistent superiority in challenging forecasting scenarios, such as long lead times and highly variable nodes.

3.3.4. Robustness Testing in Extreme Scenarios

Finally, to examine the robustness boundaries of the model, we conducted case studies on three extreme gauges: node_45 (the ultra-high-flow gauge at the basin outlet), node_37 (where our model exhibited a collapse in long-term (6-step-ahead) forecasting), and node_17 (the gauge with the lowest KGE for our model in short-term (1-step-ahead) forecasting). The basic metrics of these three nodes are shown in Table 7.
Specifically, node_45 is the mainstem outlet gauge of the study basin, draining an area of 17,500 km²; its land cover consists of 23.5% built-up area and 7.4% cropland, with an elevation difference of 1179.4 m. Node_37 is an upstream tributary gauge draining 137 km², with 24.1% built-up area, 33.6% cropland, and an elevation difference of 218.7 m. Node_17 is another upstream tributary gauge covering 443 km², with 15.6% built-up area, 1.3% cropland, and an elevation difference of 845.5 m (land cover data sourced from study [60]).
As illustrated in Figure 11, the three extreme nodes selected based on performance characteristics reveal the model’s behavior under different challenging conditions.
In Figure 11a, our model demonstrates excellent predictive performance at the basin outlet gauge across all forecast steps, achieving the lowest RMSE among all models. Notably, it maintains a KGE of 0.811 even at the 6-step-ahead horizon, while the other models decline to 0.7 or below. Figure 11b shows significant performance degradation for our model at the high-decay gauge (node_37): while it achieves the best performance at the 1-step-ahead horizon (KGE = 0.92), its KGE declines sharply to 0.42 at the 6-step-ahead horizon, though this remains superior to the other models. This suggests that the hydrological processes at this gauge may involve strong temporal dependencies or nonlinear mechanisms that cause rapid model deterioration in long-term forecasts. In Figure 11c, all models show their globally worst performance at the 1-step-ahead horizon for the low-performance gauge (node_17), with our model achieving a KGE of only 0.773, indicating a common forecasting challenge at this location that affects all modeling approaches.
To further investigate the underlying causes of performance variations, this study systematically evaluated the multi-step forecasting performance of each model at three representative hydrological gauges (node_45, node_37, and node_17) through decomposition of the KGE components (correlation coefficient r, variability ratio α, and bias ratio β). The analysis results, shown in Figure 12, provide insights into the fundamental drivers of model performance. By examining these components, we can identify whether performance limitations originate from correlation deficiencies, variability mismatches, or systematic biases in forecasts across different gauge types and forecast horizons.
Through decomposition of the three KGE components, we identified a consistent source of our model’s performance advantage: its superior and stable performance in the bias ratio β. Specifically, across all three nodes, our model’s β values remained closest to the ideal value of 1.0 throughout the multi-step forecasts (1-step: node_45: 0.972, node_37: 0.980, node_17: 0.932; 3-step: node_45: 0.972, node_37: 1.040, node_17: 1.025; 6-step: node_45: 0.951, node_37: 0.979, node_17: 0.982). This indicates that our model achieves systematic bias control across diverse gauges and hydrological regimes, with the overall mean of its forecast series highly consistent with the observations. It effectively avoids the systematic overestimation (β > 1.1) seen in models like GRU and LSTM at node_37 and node_17, and the systematic underestimation (β < 0.9) observed in the standard Transformer (Trans) at node_37.
Concurrently, our model maintains strong competitiveness in r and α. For instance, at node_45, it achieved the highest r values at all steps (0.988, 0.947, and 0.835 for the 1-, 3-, and 6-step forecasts, respectively). At node_37, it led in α at all steps, demonstrating a more reliable capacity for capturing dynamic variations in the flow process. At node_17, our model’s α values were only average among all models (lagging in the 1-step-ahead forecast, ranking high in the 3-step, and second-best in the 6-step-ahead forecast). This pattern precisely explains the suboptimal overall performance of our model at node_17, pinpointing insufficient variability capture at this node and providing a clear direction for future model optimization.
In summary, this experiment demonstrates that our model’s superior composite KGE is not achieved by sacrificing any single metric. Instead, it results from excellent bias control (β) combined with balanced and competitive capabilities in capturing both correlation (r) and variability (α). The stability of β, even at the highly challenging node_37 and node_17, provides a solid reliability guarantee for its practical application in hydrological forecasting, as further corroborated by Figure 13.
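The KGE decomposition used in this analysis can be sketched as follows; `dominant_deficit` is our own illustrative diagnostic for naming the component (r, α, or β) farthest from its ideal value of 1, along the lines of the attribution performed above:

```python
import numpy as np

def kge_components(obs, sim):
    """Decompose KGE into r (correlation), alpha (variability ratio), and
    beta (bias ratio): KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = float(np.corrcoef(obs, sim)[0, 1])
    alpha = float(sim.std() / obs.std())
    beta = float(sim.mean() / obs.mean())
    kge = 1.0 - float(np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))
    return {"KGE": kge, "r": r, "alpha": alpha, "beta": beta}

def dominant_deficit(components):
    """Name the KGE component that deviates most from its ideal value of 1,
    i.e., the main driver of a low composite score."""
    return max(("r", "alpha", "beta"), key=lambda k: abs(components[k] - 1.0))
```

For a forecast that tracks the observed dynamics perfectly but is offset by a constant amount, only β deviates from 1, so the diagnostic correctly attributes the loss to systematic bias.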
Figure 13 presents a comparative analysis of hydrograph fitting performance during a significant flow variation period (4 November 2023 to 1 February 2024) at the three extreme nodes, encompassing two distinct flood events (17–18 December 2023 and 9–10 January 2024, as confirmed by the Delaware River Basin Commission website, https://www.nj.gov/drbc/, accessed on 1 July 2025). Overall, our proposed model (Ours) demonstrates exceptional and stable performance in 1-step-ahead (12 h) forecasts across all nodes. Its forecast trajectory (solid red line) most closely aligns with the observed values (solid blue line), exhibiting superior tracking capability and smaller forecasting deviations during rapid flow changes compared to baseline models (LSTM, GRU, Transformer, TGCN, A3T-GCN).
Specifically, at the basin outlet gauge (node_45, Figure 13a), all models perform reasonably well in 1-step-ahead predictions, but our model achieves the highest fitting accuracy across all flow quantiles and particularly during the rising limb of flood events. When the forecast horizon extends to six steps (Figure 13b), the forecast trajectories of baseline models (especially LSTM and GRU) show significant dispersion and lag, whereas our model’s trajectory remains highly consistent with the observed hydrograph trends, demonstrating its excellent generalization and resistance to error accumulation.
For the high-decay node (node_37, Figure 13c) in 1-step-ahead forecasts, our model’s advantage is particularly pronounced, showing very high fitting accuracy. Other models exhibit noticeable peak overestimation (e.g., GRU and A3T-GCN). The errors in our model primarily occur during the late recession limb and early baseflow recession period, where the baseline GRU model performs best, while our model shows clear underestimation. Future work could investigate GRU’s forecasting mechanisms under these hydrological conditions to improve the recession limb fitting accuracy of our model. In 6-step-ahead forecasts (Figure 13d), our model experiences significant performance decay (KGE collapse) and all models show substantial uncertainty and notable deviations during peak flow periods, indicating considerable challenges in long-term forecasting for this node.
In 1-step-ahead forecasts at the low-performance gauge (node_17, Figure 13e), our model fits the observed hydrograph with high accuracy under low-to-medium flow conditions (Q < Q25 or Q25 < Q < Q75). However, noticeable underestimation occurs during flood peaks and late recession/early baseflow periods. Baseline models exhibit significant deviations across all hydrological conditions. In 6-step-ahead forecasts, forecast curves from all models show considerable fluctuation compared to the actual hydrograph, though our model generally tracks the flow trend.
In conclusion, this extreme scenario testing not only validates our model’s superiority under non-extreme conditions but also highlights its outstanding capability in handling forecasting uncertainty and resisting performance degradation, which holds significant value for practical operational hydrological forecasting applications using deep learning models.

4. Discussion

Our model demonstrates superior performance over baseline models across multiple dimensions, including basin-scale forecasting, generalizability across representative nodes, adaptability to diverse hydrological scenarios, and robustness in extreme conditions. The key innovation driving this performance gain is the proposed multi-channel dynamic graph constructor, which adaptively models the time-varying spatial dependencies among stream gauges by integrating physical topology (static graph), dynamic correlation (Pearson graph), and trend similarity (cosine graph). Ablation studies confirm its critical role, showing a 7.33% decrease in KGE upon its removal (Figure 7). The single-channel static graph provides the foundational physical connectivity between gauges, serving as the basis for spatial feature extraction. However, the dependency relationships between gauges are inherently dynamic [38]. The incorporation of the Pearson correlation graph addresses this issue and plays a dominant role in error control, as evidenced by significant improvements in MAE and RMSE, a finding consistent with prior research [61]. The integration of the cosine similarity graph also contributes substantially to overall performance, particularly enhancing the KGE and RMSE metrics. Mechanistically, the Pearson graph captures linear correlations between gauges, thereby reflecting dynamic hydrological responses such as the tightly coupled discharge fluctuations between upstream and downstream gauges during storm events. The cosine similarity graph leverages its sensitivity to similarity in flow change trends between gauge sites [62], strengthening connections between nodes with similar hydrological response characteristics while ignoring magnitude differences.
The sliding-window mechanism designed into the model continuously updates these graphs to adapt to evolving basin conditions (e.g., localized precipitation, human activities), enabling dynamic information transfer across nodes.
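The two data-driven graph channels described above can be sketched as follows. Computing the cosine channel on first-differenced flows, so that the response shape is compared while magnitude is ignored, is our illustrative assumption; the static channel would simply be the fixed 0/1 river-network adjacency matrix:

```python
import numpy as np

def dynamic_graphs(window):
    """Build the two data-driven adjacency channels from a sliding window of
    shape (timesteps, n_gauges): a Pearson correlation graph (linear synchrony
    of flows) and a cosine similarity graph of flow-change trends (response
    shape, insensitive to magnitude)."""
    window = np.asarray(window, float)
    # Channel 1: Pearson correlation between gauge hydrographs in the window
    pearson = np.corrcoef(window.T)
    # Channel 2: cosine similarity between first-difference (trend) vectors
    trends = np.diff(window, axis=0)                    # (timesteps - 1, n_gauges)
    norms = np.linalg.norm(trends, axis=0, keepdims=True)
    unit = trends / np.where(norms == 0, 1.0, norms)    # guard zero-variance gauges
    cosine = unit.T @ unit
    return pearson, cosine
```

Both matrices are recomputed for each sliding window, so the edge weights evolve with basin conditions; for two gauges whose hydrographs are scaled copies of each other, both channels return a weight of 1.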
Notably, while the KGE degradation from removing both the Pearson and cosine graphs is approximately additive (total KGE loss ≈ sum of individual losses), the RMSE loss exhibits strong nonlinearity. This phenomenon aligns with the understanding that nonlinear systems may harbor extreme events [63], where the interaction between different factors undergoes a fundamental shift—transitioning from a primarily additive or synergistic state to an antagonistic one, or vice versa. The squared nature of the RMSE metric amplifies this effect.
The relationships uncovered in the graph ablation experiments across hydrological clusters reflect underlying hydrological logic. The superior performance of the Pearson graph along mainstem gauges substantiates that processes like flood wave propagation and water balance manifest as strong linear dynamic linkages in these reaches. Conversely, the better performance of the cosine graph in high-variability tributaries highlights that the rainfall–runoff response in sub-basins emphasizes the synchrony of response shapes rather than synchrony in absolute magnitudes. Furthermore, the ablation experiments confirm the fundamental role of the river network topology graph as an indispensable physical constraint. The significant performance drop upon its removal shows that any data-driven spatial dependencies must be grounded in the real physical connectivity skeleton, which aligns perfectly with basic hydrological principles.
Regarding the extreme nodes, our model achieved significantly higher forecasting accuracy (KGE = 0.914) than the baseline models in the short-term forecast (1-step-ahead, 12 h) for node_37 (the high-decay gauge). However, its performance degraded sharply in the 6-step-ahead (72 h) forecast, although it remained superior to the other models. Anthropogenic interference is the primary cause of this degradation: the sub-basin containing this gauge has a high built-up area percentage (24.1%), where urban drainage networks disrupt natural flow paths and cause a phase shift in the forecast peak flow (as illustrated in Figure 13d). Furthermore, cropland constitutes 33.6% of the area, leading to abrupt, artificially induced flow changes that significantly challenge long-horizon forecasting [64]. Another plausible, non-excludable reason is the scarcity of gauges with similar hydrological behavior within the graph, leaving the model with insufficient learning samples to develop an appropriate response. For node_17 (the low-performance gauge), its transitional topographic characteristics (an elevation difference of 845.5 m over 443 km²) sharply increase the nonlinearity of the flow concentration process, leading to universally poor performance across all models [65]. This reveals a common limitation of current data-driven models in areas with intense human disturbance and complex geomorphology [63,66].
Although our model outperformed other baseline models at these extreme nodes, the systematic underestimation observed during the late recession limb and early baseflow period at both node_37 and node_17 (e.g., Figure 12c and Figure 13e) warrants attention. A potential explanation is that as surface runoff diminishes during this phase, the contribution of interflow gradually increases until groundwater discharge becomes dominant. This process exhibits a highly nonlinear, coupled recession effect from multiple water sources. During this period, water contributions from various upstream sub-basins are relatively independent and sub-basin specific, causing spatial correlations to become less effective. Consequently, the current graph attention mechanism might introduce noise rather than useful information. This interpretation is supported by our model’s performance at node_45. As shown in Figure 13a,b, the basin outlet gauge is strongly influenced by upstream confluence, allowing the graph attention mechanism’s advantages to be fully utilized, resulting in good performance even during the late recession and early baseflow periods.
In comparison, the GRU model performed relatively better during this specific period at both node_37 and node_17. This advantage may be attributed to its more direct state-transmission mechanism, which could reduce error accumulation during the recession period [67]. GRU also shows certain comparative advantages over the other baseline models. Studies have likewise indicated that, compared to GRU, LSTM exhibits slightly inferior consistency in flood forecasting and assessment applications, with the greatest instability and error in flood volume estimation [68,69]. These findings are consistent with the phenomena observed in our baseline comparison experiments. Ultimately, the black-box nature of DL models necessitates substantial further experimentation in future research for validation, or efforts to enhance model interpretability and uncover their intrinsic mechanisms [70].
Beyond the aspects requiring further investigation mentioned above, our model exhibits a significant dependency on data availability. The construction of dynamic graphs necessitates sufficient historical data, which is often difficult to obtain in practice. Future work will focus on integrating transfer learning and optimizing feature extraction for specific nodes subject to strong anthropogenic influences, such as node_37. Additionally, the current model does not incorporate explicit physical constraints or mechanistic rules. Although physical characteristics have been indirectly introduced via the static graph, the lack of explicit embedding remains a limitation. Explicitly integrating physical mechanisms will constitute a major focus of subsequent research.

5. Conclusions

Accurate streamflow forecasting plays a critical role in water resources management and planning. This study introduces a novel spatio-temporal deep learning model, termed DynaSTG-Former. The model adaptively captures time-varying spatial dependencies through a multi-channel dynamic graph fusion module, which integrates physical-, statistical-, and trend-based perspectives. Subsequently, a dual-stream temporal predictor, combining a Transformer for long-range dependencies and a Local Temporal Patch Enhancement (LTPE) module for transient features, collaboratively processes these dependencies. The final multi-step streamflow forecasts are generated through an iterative mechanism. Empirical research conducted in the Delaware River Basin demonstrates that DynaSTG-Former outperforms baseline models, exhibiting superior performance at the basin scale, generalizability across representative nodes, adaptability to diverse hydrological scenarios, and robustness in extreme conditions. These findings confirm our initial hypothesis regarding the benefits of incorporating spatial data into hydrological modeling.
A key insight from this study is that stream gauges within a basin are not isolated but form an interconnected organic system via the river network. Traditional single-site forecasting methods (e.g., LSTM, GRU, Transformer) overlook these spatial correlations. In contrast, DynaSTG-Former effectively captures this spatial dependency through its graph structure. This is evidenced by its KGE value of 0.951 even at the 36 h forecast horizon at the basin scale, representing improvements of 12.0%, 6.9%, and 5.7% over LSTM, GRU, and Transformer, respectively. This robustly validates the scientific value of hydrological modeling from an integrated basin system perspective.
Hydrological systems exhibit dynamic, multi-scale spatial dependencies. Compared to traditional static graph methods (e.g., TGCN, A3T-GCN), our model adaptively adjusts the weights of different spatial correlation patterns via the multi-channel dynamic graph fusion module. The Pearson correlation graph plays a dominant role in error control, while the cosine similarity graph contributes significantly to overall performance enhancement. This approach led to KGE improvements of 13.9% and 8.8% over TGCN and A3T-GCN, respectively, at the 36 h horizon, revealing the dynamic evolution of hydrological spatial correlations with hydrological processes.
The study also confirms that while the Transformer excels at capturing long-range dependencies like baseflow, the introduced LTPE module effectively complements it by capturing local transient features. Although the LTPE module’s contribution to the overall performance gain is moderate, it generates a synergistic effect with the graph-based features.
In summary, the development of DynaSTG-Former is not merely an algorithmic exploration; it also deepens our understanding of hydrological system complexity. It demonstrates the significant potential of tightly coupled spatio-temporal modeling in hydrology, indicating that future models can transcend pure time-series analysis and deeply integrate spatial heterogeneity information. For smart water management and precision basin management, such models provide more reliable decision support for water resources optimization, forecasting, and early warning, and their basin-wide, accurate forecasting capability is of considerable value.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/hydrology12120322/s1, Table S1: Basic information for the 45 selected stream gauges; Table S2: Homogeneity assessment of streamflow RAPS indices across the 45 gauges; Table S3: Summary of model parameters and hyperparameters; Figure S1: Spatial coherence of monthly streamflow patterns across the DRB revealed by Innovative Polygon Trend Analysis (IPTA); Figure S2: Basic information for the 45 selected stream gauges.

Author Contributions

Conceptualization, X.Z. and B.L.; methodology, B.L. and Q.L.; software, Q.L.; validation, Q.L., B.L. and H.L.; formal analysis, B.L.; investigation, Q.L.; resources, X.Z.; data curation, B.L.; writing—original draft preparation, B.L.; writing—review and editing, X.Z.; visualization, B.L.; supervision, M.D.; project administration, X.Z. and H.L.; funding acquisition, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic and Cross-Cutting Frontier Scientific Research Pilot Projects of Chinese Academy of Sciences, grant number: XDB0720102.

Data Availability Statement

The data presented in this study are publicly available in the U.S. Geological Survey (USGS) National Water Information System (NWIS). The streamflow data for the selected gauges within the Delaware River Basin can be accessed directly through the NWIS web interface at: https://waterdata.usgs.gov/nwis (accessed on 10 May 2025).

Acknowledgments

During the preparation of this work, the authors used Deepseek-R1 in order to improve language and ensure consistency in the paper. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANN: Artificial Neural Network
A3T-GCN: Attention-Based Temporal Graph Convolutional Network
CV: Coefficient of Variation
DL: Deep Learning
DynaSTG-Former: Dynamic Spatio-Temporal Graph Transformer
DRB: Delaware River Basin
GNN: Graph Neural Network
GCN: Graph Convolutional Network
GAT: Graph Attention Mechanisms
GRU: Gated Recurrent Unit
KGE: Kling–Gupta Efficiency
LSTM: Long Short-Term Memory
LTPE: Local Temporal Pattern Enhancement
MAE: Mean Absolute Error
IPTA: Innovative Polygon Trend Analysis
RMSE: Root Mean Square Error
RAPS: Rescaled Adjusted Partial Sums
TGCN: Temporal Graph Convolutional Network
Trans: Transformer

Appendix A. Methodology for Gauge Node Classification

This appendix details the classification process for stream gauge nodes. Departing from traditional classification based on predefined hydrological characteristics, this study employed unsupervised clustering analysis to identify inherent, more representative grouping patterns within the data. This strategy of “letting the data speak for itself” was selected for its particular suitability to the data-driven paradigm of this study. The effectiveness of using clustering methods to uncover hydrologically meaningful patterns has been well established in the field [71].
The classification procedure involved two primary steps: first, the Interquartile Range (IQR) statistical method was used to exclude anomalous nodes, followed by application of the k-means clustering algorithm. Clustering was performed on the following features: flow mean, flow coefficient of variation, 1-step-ahead Kling–Gupta Efficiency (KGE), and multi-step KGE decay rate (from 1- to 6-step-ahead forecasts). The specific steps are outlined below:
Step 1: IQR Statistical Filtering. The IQR method [72] was utilized to identify the core data distribution. The IQR is defined as follows:
IQR = Q3 − Q1.
The normal value boundaries for the station data were defined as follows:
Normal Value Boundaries = [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR],
where Q1 is the 25th percentile and Q3 is the 75th percentile. The calculated screening indicators based on this statistical method are summarized in Table A1.
Table A1. Summary of IQR statistics for screening indicators.
Indicator | Q1 | Median | Q3 | IQR | Normal Value Range
KGE_1-step | 0.866 | 0.900 | 0.931 | 0.065 | [0.769, 1.029]
KGE_Decay | 22.2% | 35.6% | 41.2% | 19.0% | [6.3%, 69.7%]
Q_CV | 0.18 | 0.20 | 0.25 | 0.07 | [0.08, 0.36]
Using the calculated screening indicators and considering the value ranges of the parameters, nodes were filtered according to the following criteria:
  • 1-step KGE ∈ [0.769, 1.0], resulting in the exclusion of one node (node_18) with an abnormally low value;
  • KGE decay rate ∈ [6.3%, 69.7%], a criterion met by all nodes;
  • Flow coefficient of variation (Q_CV) ∈ [0.08, 0.36], resulting in the exclusion of one node (node_14) with an abnormally high value;
  • Q_mean > 0, a criterion met by all stations.
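The Step 1 screening can be sketched minimally in NumPy (the example KGE values below are hypothetical, chosen only to show the filter in action):

```python
import numpy as np

def iqr_bounds(values):
    """Tukey fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical 1-step KGE values for a handful of gauges; the outlier
# (0.45) falls below the lower fence and would be excluded.
kge_1step = np.array([0.87, 0.90, 0.93, 0.88, 0.45, 0.91])
lo, hi = iqr_bounds(kge_1step)
keep = (kge_1step >= lo) & (kge_1step <= hi)   # gauges passing the screen
```

The same bounds computation, applied per indicator, reproduces the normal value ranges reported in Table A1.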
Step 2: k-means Clustering. To determine the optimal grouping of representative nodes, we employed the k-means clustering algorithm and conducted a comprehensive evaluation based on the Silhouette Score [73] and Feature Coverage. As shown in Table A2, when the number of clusters K = 7, the Silhouette Score reached its maximum value (0.391), indicating the highest level of intra-cluster consistency under this scheme. Simultaneously, the Feature Coverage also peaked at 97.7%, demonstrating that the selected stations can maximally cover all key hydrological response patterns within the study area. The sample size distribution across clusters was balanced, meeting the minimum sample size requirement. Consequently, we finalized the selection of the seven clusters generated with K = 7, using the centroid node of each cluster as a representative of the basin’s hydrological diversity.
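Step 2 can be sketched with a self-contained NumPy implementation (the feature values are hypothetical; in practice a library such as scikit-learn would supply k-means and the silhouette score):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm (illustrative stand-in for a library k-means)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def silhouette(X, labels):
    """Mean silhouette coefficient (Rousseeuw, 1987)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n, scores = len(X), []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        if not own.any():                      # singleton-cluster convention
            scores.append(0.0)
            continue
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical gauge features [Q_mean, Q_CV, KGE_1-step, KGE_decay],
# standardized; 43 gauges remain after the IQR screening step.
rng = np.random.default_rng(42)
X = rng.normal(size=(43, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

scores = {k: silhouette(X, kmeans(X, k)[0]) for k in range(3, 10)}
best_k = max(scores, key=scores.get)

# Representative node of each cluster = the sample closest to its centroid.
labels, centers = kmeans(X, best_k)
reps = [int(np.argmin(np.linalg.norm(X - c, axis=1))) for c in centers]
```

With the study's real features, the K maximizing the silhouette score (K = 7, per Table A2) determines the clusters, and the centroid-nearest gauges become the representative nodes of Table A3.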
Table A2. Evaluation of clustering performance for different numbers of clusters (K).
K | Silhouette Score | Feature Coverage
3 | 0.3565 | 0.9535
4 | 0.3739 | 0.8372
5 | 0.3700 | 0.8837
6 | 0.3777 | 0.9535
7 | 0.3912 | 0.9767
9 | 0.3823 | 0.9302
Table A3 summarizes the key parameters and characteristics of the seven representative nodes.
Table A3. Parameters of the seven representative nodes.
Node Number | Cluster | Q_Mean | Q_CV | KGE_1-step | KGE_Decay (1- to 6-Step) | Stream
node_3 | 0 | 16.000 | 0.180 | 0.869 | 40.78% | Tributary
node_6 | 1 | 16.117 | 0.170 | 0.939 | 20.77% | Tributary
node_45 | 2 | 428.000 | 0.210 | 0.958 | 15.33% | Mainstem
node_10 | 3 | 2.001 | 0.254 | 0.866 | 41.15% | Tributary
node_19 | 4 | 186.000 | 0.205 | 0.916 | 27.28% | Mainstem
node_9 | 5 | 89.001 | 0.134 | 0.966 | 35.60% | Tributary
node_41 | 6 | 3.300 | 0.200 | 0.883 | 17.66% | Tributary

References

  1. Wang, X.; Tian, W.; Zheng, W.; Shah, S.; Li, J.; Wang, X.; Zhang, X. Quantitative Relationships between Salty Water Irrigation and Tomato Yield, Quality, and Irrigation Water Use Efficiency: A Meta-Analysis. Agric. Water Manag. 2023, 280, 108213. [Google Scholar] [CrossRef]
  2. Huan, S. Geographic Heterogeneity of Activation Functions in Urban Real-Time Flood Forecasting: Based on Seasonal Trend Decomposition Using Loess-Temporal Convolutional Network-Gated Recurrent Unit Model. J. Hydrol. 2024, 636, 131279. [Google Scholar] [CrossRef]
  3. Bai, X.; Zhao, W. Impacts of Climate Change and Anthropogenic Stressors on Runoff Variations in Major River Basins in China since 1950. Sci. Total Environ. 2023, 898, 165349. [Google Scholar] [CrossRef] [PubMed]
  4. Jesus, G.; Mardani, Z.; Alves, E.; Oliveira, A. Deep Learning-Based River Flow Forecasting with MLPs: Comparative Exploratory Analysis Applied to the Tejo and the Mondego Rivers. Sensors 2025, 25, 2154. [Google Scholar] [CrossRef]
  5. Danandeh Mehr, A.; Kahya, E.; Olyaie, E. Streamflow Prediction Using Linear Genetic Programming in Comparison with a Neuro-Wavelet Technique. J. Hydrol. 2013, 505, 240–249. [Google Scholar] [CrossRef]
  6. Guo, J.; Zhang, M.; Shang, Q.; Liu, F.; Wu, A.; Li, X. River Basin Cyberinfrastructure in the Big Data Era: An Integrated Observational Data Control System in the Heihe River Basin. Sensors 2021, 21, 5429. [Google Scholar] [CrossRef]
  7. Kratzert, F.; Klotz, D.; Herrnegger, M.; Sampson, A.K.; Hochreiter, S.; Nearing, G.S. Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning. Water Resour. Res. 2019, 55, 11344–11354. [Google Scholar] [CrossRef]
  8. Ougahi, J.H.; Rowan, J.S. Enhanced Streamflow Forecasting Using Hybrid Modelling Integrating Glacio-Hydrological Outputs, Deep Learning and Wavelet Transformation. Sci. Rep. 2025, 15, 2762. [Google Scholar] [CrossRef]
  9. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis, 1st ed.; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2008; ISBN 978-0-470-27284-8. [Google Scholar]
  10. Thompstone, R.; Mcleod, A. Forecasting Quarter-Monthly Riverflow. Water Resour. Bull. 1985, 21, 731–741. [Google Scholar] [CrossRef]
  11. Li, B.; Jin, C.; Lin, R.; Zhou, X.; Deng, M. A Method for Constructing Open-Channel Velocity Field Prediction Model Based on Machine Learning and CFD. Comput. Intell. 2025, 41, e70043. [Google Scholar] [CrossRef]
  12. Karunanithi, N.; Grenney, W.J.; Whitley, D.; Bovee, K. Neural Networks for River Flow Prediction. J. Comput. Civ. Eng. 1994, 8, 201–220. [Google Scholar] [CrossRef]
  13. Jain, S.K.; Das, A.; Srivastava, D.K. Application of ANN for Reservoir Inflow Prediction and Operation. J. Water Resour. Plann. Manag. 1999, 125, 263–271. [Google Scholar] [CrossRef]
  14. Zealand, C.; Burn, D.; Simonovic, S. Short Term Streamflow Forecasting Using ANNs. In Proceedings of the Water Resources and the Urban Environment University of Manitoba, Chicago, IL, USA, 7–10 June 1998; Loucks, E., Ed.; pp. 229–234. [Google Scholar]
  15. Zealand, C.M.; Burn, D.H.; Simonovic, S.P. Short Term Streamflow Forecasting Using Artificial Neural Networks. J. Hydrol. 1999, 214, 32–48. [Google Scholar] [CrossRef]
  16. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer New York: New York, NY, USA, 2000; ISBN 978-1-4419-3160-3. [Google Scholar]
  17. Okkan, U.; Serbes, Z.A. Rainfall–Runoff Modeling Using Least Squares Support Vector Machines. Environmetrics 2012, 23, 549–564. [Google Scholar] [CrossRef]
  18. Sivapragasam, C.; Liong, S.-Y. Flow Categorization Model for Improving Forecasting. Hydrol. Res. 2005, 36, 37–48. [Google Scholar] [CrossRef]
  19. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  20. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  21. Kratzert, F.; Klotz, D.; Brenner, C.; Schulz, K.; Herrnegger, M. Rainfall–Runoff Modelling Using Long Short-Term Memory (LSTM) Networks. Hydrol. Earth Syst. Sci. 2018, 22, 6005–6022. [Google Scholar] [CrossRef]
  22. Hu, C.; Wu, Q.; Li, H.; Jian, S.; Li, N.; Lou, Z. Deep Learning with a Long Short-Term Memory Networks Approach for Rainfall-Runoff Simulation. Water 2018, 10, 1543. [Google Scholar] [CrossRef]
  23. Wu, S.; Dong, Z.; Guzmán, S.M.; Conde, G.; Wang, W.; Zhu, S.; Shao, Y.; Meng, J. Two-Step Hybrid Model for Monthly Runoff Prediction Utilizing Integrated Machine Learning Algorithms and Dual Signal Decompositions. Ecol. Inform. 2024, 84, 102914. [Google Scholar] [CrossRef]
  24. Ghimire, S.; Yaseen, Z.M.; Farooque, A.A.; Deo, R.C.; Zhang, J.; Tao, X. Streamflow Prediction Using an Integrated Methodology Based on Convolutional Neural Network and Long Short-Term Memory Networks. Sci. Rep. 2021, 11, 17497. [Google Scholar] [CrossRef]
  25. Huang, J.; Chen, J.; Huang, H.; Cai, X. Deep Learning-Based Daily Streamflow Prediction Model for the Hanjiang River Basin. Hydrology 2025, 12, 168. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  27. Girihagama, L.; Naveed Khaliq, M.; Lamontagne, P.; Perdikaris, J.; Roy, R.; Sushama, L.; Elshorbagy, A. Streamflow Modelling and Forecasting for Canadian Watersheds Using LSTM Networks with Attention Mechanism. Neural Comput. Applic 2022, 34, 19995–20015. [Google Scholar] [CrossRef]
  28. Yin, R.; Ren, J. Sequence-to-Sequence LSTM-Based Dynamic System Identification of Piezo-Electric Actuators. In Proceedings of the 2023 American Control Conference (ACC), San Diego, CA, USA, 31 May–2 June 2023; pp. 673–678. [Google Scholar]
  29. Zhang, S.; Zhang, X.; Zhao, X.; Fang, J.; Niu, M.; Zhao, Z.; Yu, J.; Tian, Q. MTDAN: A Lightweight Multi-Scale Temporal Difference Attention Networks for Automated Video Depression Detection. IEEE Trans. Affect. Comput. 2024, 15, 1078–1089. [Google Scholar] [CrossRef]
  30. Xu, Y.; Lin, K.; Hu, C.; Wang, S.; Wu, Q.; Zhang, L.; Ran, G. Deep Transfer Learning Based on Transformer for Flood Forecasting in Data-Sparse Basins. J. Hydrol. 2023, 625, 129956. [Google Scholar] [CrossRef]
  31. Yin, H.; Guo, Z.; Zhang, X.; Chen, J.; Zhang, Y. RR-Former: Rainfall-Runoff Modeling Based on Transformer. J. Hydrol. 2022, 609, 127781. [Google Scholar] [CrossRef]
  32. Subhadarsini, S.; Kumar, D.N.; Govindaraju, R.S. Enhancing Hydro-Climatic and Land Parameter Forecasting Using Transformer Networks. J. Hydrol. 2025, 655, 132906. [Google Scholar] [CrossRef]
  33. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 35, pp. 11106–11115. [Google Scholar]
  34. Ghobadi, F.; Yaseen, Z.M.; Kang, D. Long-Term Streamflow Forecasting in Data-Scarce Regions: Insightful Investigation for Leveraging Satellite-Derived Data, Informer Architecture, and Concurrent Fine-Tuning Transfer Learning. J. Hydrol. 2024, 631, 130772. [Google Scholar] [CrossRef]
  35. Hall, J.; Arheimer, B.; Borga, M.; Brázdil, R.; Claps, P.; Kiss, A.; Kjeldsen, T.R.; Kriaučiūnienė, J.; Kundzewicz, Z.W.; Lang, M.; et al. Understanding Flood Regime Changes in Europe: A State-of-the-Art Assessment. Hydrol. Earth Syst. Sci. 2014, 18, 2735–2772. [Google Scholar] [CrossRef]
  36. Zhao, Q.; Zhu, Y.; Shu, K.; Wan, D.; Yu, Y.; Zhou, X.; Liu, H. Joint Spatial and Temporal Modeling for Hydrological Prediction. IEEE Access 2020, 8, 78492–78503. [Google Scholar] [CrossRef]
  37. Roudbari, N.S.; Poullis, C.; Patterson, Z.; Eicker, U. TransGlow: Attention-Augmented Transduction Model Based on Graph Neural Networks for Water Flow Forecasting. In Proceedings of the 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 15 December 2023; pp. 626–632. [Google Scholar]
  38. Feng, J.; Sha, H.; Ding, Y.; Yan, L.; Yu, Z. Graph Convolution Based Spatial-Temporal Attention LSTM Model for Flood Forecasting. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18 July 2022; pp. 1–8. [Google Scholar]
  39. Lu, J.; Xie, Z.; Chen, J.; Li, M.; Xu, C.; Cao, H. GC-SALM: Multi-Task Runoff Prediction Using Spatial-Temporal Attention Graph Convolution Networks. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Oahu, HI, USA, 1 October 2023; pp. 3633–3638. [Google Scholar]
  40. Deng, L.; Zhang, X.; Slater, L.J.; Liu, H.; Tao, S. Integrating Euclidean and Non-Euclidean Spatial Information for Deep Learning-Based Spatiotemporal Hydrological Simulation. J. Hydrol. 2024, 638, 131438. [Google Scholar] [CrossRef]
  41. Hu, Y.; Li, H.; Zhang, C.; Xu, B.; Chu, W.; Shen, D.; Li, R. Streamflow Regime-Based Classification and Hydrologic Similarity Analysis of Catchment Behavior Using Differentiable Modeling with Multiphysics Outputs. J. Hydrol. 2025, 653, 132766. [Google Scholar] [CrossRef]
  42. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Trans. Intell. Transport. Syst. 2020, 21, 3848–3858. [Google Scholar] [CrossRef]
  43. Bai, J.; Zhu, J.; Song, Y.; Zhao, L.; Hou, Z.; Du, R.; Li, H. A3T-GCN: Attention Temporal Graph Convolutional Network for Traffic Forecasting. IJGI 2021, 10, 485. [Google Scholar] [CrossRef]
  44. Sun, A.Y.; Jiang, P.; Yang, Z.-L.; Xie, Y.; Chen, X. A Graph Neural Network (GNN) Approach to Basin-Scale River Network Learning: The Role of Physics-Based Connectivity and Data Fusion. Hydrol. Earth Syst. Sci. 2022, 26, 5163–5184. [Google Scholar] [CrossRef]
  45. Sun, A.Y.; Jiang, P.; Mudunuru, M.K.; Chen, X. Explore Spatio-Temporal Learning of Large Sample Hydrology Using Graph Neural Networks. Water Resour. Res. 2021, 57, e2021WR030394. [Google Scholar] [CrossRef]
  46. Bai, T.; Tahmasebi, P. Graph Neural Network for Groundwater Level Forecasting. J. Hydrol. 2023, 616, 128792. [Google Scholar] [CrossRef]
  47. Weiler, M.; McGlynn, B.L.; McGuire, K.J.; McDonnell, J.J. How Does Rainfall Become Runoff? A Combined Tracer and Runoff Transfer Function Approach. Water Resour. Res. 2003, 39, 2003WR002331. [Google Scholar] [CrossRef]
  48. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. arXiv 2019, arXiv:1906.00121. [Google Scholar]
  49. Gao, S.; Zhang, S.; Huang, Y.; Han, J.; Zhang, T.; Wang, G. A Hydrological Process-Based Neural Network Model for Hourly Runoff Forecasting. Environ. Model. Softw. 2024, 176, 106029. [Google Scholar] [CrossRef]
  50. Liu, J.; Bian, Y.; Lawson, K.; Shen, C. Probing the Limit of Hydrologic Predictability with the Transformer Network. J. Hydrol. 2024, 637, 131389. [Google Scholar] [CrossRef]
  51. Yin, H.; Zheng, Q.; Wei, C.; Liang, C.; Fan, M.; Zhang, X.; Zhang, Y. Monthly Streamflow Forecasting with Temporal-Periodic Transformer. J. Hydrol. 2025, 660, 133308. [Google Scholar] [CrossRef]
  52. Smith, J.A.; Baeck, M.L.; Villarini, G.; Krajewski, W.F. The Hydrology and Hydrometeorology of Flooding in the Delaware River Basin. J. Hydrometeorol. 2010, 11, 841–859. [Google Scholar] [CrossRef]
  53. Moore, R.B.; McKay, L.D.; Rea, A.H.; Bondelid, T.R.; Price, C.V.; Dewald, T.G.; Johnston, C.M. User’s Guide for the National Hydrography Dataset Plus (NHDPlus) High Resolution: U.S.; Geological Survey Open-File Report 2019–1096; U.S. Environmental Protection Agency: Washington, DC, USA, 2019; 66p. [Google Scholar]
  54. Lehner, B.; Grill, G. Global River Hydrography and Network Routing: Baseline Data and New Approaches to Study the World’s Large River Systems. Hydrol. Process. 2013, 27, 2171–2186. [Google Scholar] [CrossRef]
  55. Đurin, B.; Raič, M.; Banejad, H. Analysis of Homogeneity and Isotropy of the Flow in the Watercourses by Applying the RAPS and IPTA Methods. ACAE 2024, 15, 67–83. [Google Scholar] [CrossRef]
  56. Şen, Z. Innovative Trend Analysis Methodology. J. Hydrol. Eng. 2012, 17, 1042–1046. [Google Scholar] [CrossRef]
  57. Şen, Z.; Şişman, E.; Dabanli, I. Innovative Polygon Trend Analysis (IPTA) and Applications. J. Hydrol. 2019, 575, 202–210. [Google Scholar] [CrossRef]
  58. Gupta, H.V.; Kling, H.; Yilmaz, K.K.; Martinez, G.F. Decomposition of the Mean Squared Error and NSE Performance Criteria: Implications for Improving Hydrological Modelling. J. Hydrol. 2009, 377, 80–91. [Google Scholar] [CrossRef]
  59. Knoben, W.J.M.; Freer, J.E.; Woods, R.A. Technical Note: Inherent Benchmark or Not? Comparing Nash–Sutcliffe and Kling–Gupta Efficiency Scores. Hydrol. Earth Syst. Sci. 2019, 23, 4323–4331. [Google Scholar] [CrossRef]
  60. Potapov, P.; Hansen, M.C.; Pickens, A.; Hernandez-Serna, A.; Tyukavina, A.; Turubanova, S.; Zalles, V.; Li, X.; Khan, A.; Stolle, F.; et al. The Global 2000-2020 Land Cover and Land Use Change Dataset Derived From the Landsat Archive: First Results. Front. Remote Sens. 2022, 3, 856903. [Google Scholar] [CrossRef]
  61. Ke, W.; Hui-qin, W.; Ying, Y.; Li, M.; Yi, Z. Time Series Prediction Method Based on Pearson Correlation BP Neural Network. Opt. Precis. Eng. 2018, 26, 2805–2813. [Google Scholar] [CrossRef]
  62. Bin-lin, Y.; Wen-sheng, W.; Man, Y. Application of NNBR model based different similarity index in medium-long-term runoff prediction. Water Resour. Power 2017, 35, 14–17. [Google Scholar]
  63. Chen, N.; Majda, A.J. Predicting Observed and Hidden Extreme Events in Complex Nonlinear Dynamical Systems with Partial Observations and Short Training Time Series. Chaos Interdiscip. J. Nonlinear Sci. 2020, 30, 033101. [Google Scholar] [CrossRef]
  64. Liu, J.; Cho, H.-S.; Osman, S.; Jeong, H.-G.; Lee, K. Review of the Status of Urban Flood Monitoring and Forecasting in TC Region. Trop. Cyclone Res. Rev. 2022, 11, 103–119. [Google Scholar] [CrossRef]
  65. Wang, F.; Mu, J.; Zhang, C.; Wang, W.; Bi, W.; Lin, W.; Zhang, D. Deep Learning Model for Real-Time Flood Forecasting in Fast-Flowing Watershed. J. Flood Risk Manag. 2025, 18, e70036. [Google Scholar] [CrossRef]
  66. Luo, Y.; Zhou, Y.; Chen, H.; Xiong, L.; Guo, S.; Chang, F.-J. Exploring a Spatiotemporal Hetero Graph-Based Long Short-Term Memory Model for Multi-Step-Ahead Flood Forecasting. J. Hydrol. 2024, 633, 130937. [Google Scholar] [CrossRef]
  67. Gao, S.; Huang, Y.; Zhang, S.; Han, J.; Wang, G.; Zhang, M.; Lin, Q. Short-Term Runoff Prediction with GRU and LSTM Networks without Requiring Time Step Optimization during Sample Generation. J. Hydrol. 2020, 589, 125188. [Google Scholar] [CrossRef]
  68. Heidari, E.; Samadi, V.; Khan, A.A. Leveraging Recurrent Neural Networks for Flood Prediction and Assessment. Hydrology 2025, 12, 90. [Google Scholar] [CrossRef]
  69. Rugină, A.M. Alternative Hydraulic Modeling Method Based on Recurrent Neural Networks: From HEC-RAS to AI. Hydrology 2025, 12, 207. [Google Scholar] [CrossRef]
  70. Ding, Y.; Zhu, Y.; Feng, J.; Zhang, P.; Cheng, Z. Interpretable Spatio-Temporal Attention LSTM Model for Flood Forecasting. Neurocomputing 2020, 403, 348–359. [Google Scholar] [CrossRef]
  71. Sawicz, K.; Wagener, T.; Sivapalan, M.; Troch, P.A.; Carrillo, G. Catchment Classification: Empirical Analysis of Hydrologic Similarity Based on Catchment Function in the Eastern USA. Hydrol. Earth Syst. Sci. 2011, 15, 2895–2911. [Google Scholar] [CrossRef]
  72. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley Series in Behavioral Science; Addison-Wesley Pub. Co.: Reading, MA, USA, 1977; ISBN 978-0-201-07616-5. [Google Scholar]
  73. Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
Figure 1. The architecture of the proposed model DynaSTG-Former. The architecture consists of two key components: (1) Dynamic Spatial Graph Constructor (multi-channel graphs fusion), and (2) Dual-stream Temporal Predictor integrating global transformer and local patch-based attention. The solid arrows illustrate the workflow of the model, from the input of gauge data and river network topology to the iterative output of forecasts, which incorporates residual and feedback connections.
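As a rough illustration of the multi-channel graph construction summarized in the Figure 1 caption, the sketch below builds the three adjacency channels (physical topology, Pearson correlation, and cosine trend similarity) from gauge flow series and fuses them with softmax-normalized channel weights. The function names, the edge-list input format, and the additive softmax fusion are illustrative assumptions, not the paper's exact formulation (the actual model learns the fusion weights end-to-end).

```python
import numpy as np

def build_graph_channels(flows, edges, n):
    """flows: (n_gauges, T) streamflow series; edges: river-network links (i, j)."""
    A_topo = np.zeros((n, n))                 # channel 1: physical topology
    for i, j in edges:
        A_topo[i, j] = A_topo[j, i] = 1.0
    A_pearson = np.corrcoef(flows)            # channel 2: statistical correlation
    d = np.diff(flows, axis=1)                # channel 3: trend similarity
    d_unit = d / np.linalg.norm(d, axis=1, keepdims=True)
    A_cosine = d_unit @ d_unit.T              # cosine similarity of flow trends
    return [A_topo, A_pearson, A_cosine]

def fuse_channels(channels, logits):
    """Softmax-weighted fusion of the adjacency channels. The weights are
    learned in the actual model; here they are passed in for illustration."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * A for wi, A in zip(w, channels))
```

With zero logits the three channels are simply averaged; a trained model would instead shift the weights toward whichever dependency pattern is most informative.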
Figure 2. Global–local temporal encoding architecture. This diagram illustrates the dual-branch temporal encoding framework composed of the following: (1) Global path (left): Standard Transformer encoder with stacked layers for modeling long-term dependencies. (2) Local path (right): Enhanced local feature extractor combining 1D convolution (patch-based kernels) and multi-head self-attention.
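To make the global–local split of Figure 2 concrete, here is a minimal numpy sketch of the dual-stream idea: one attention pass over the whole sequence for long-range structure, and one restricted to short patches for local transients. It omits the learned query/key/value projections, the 1D convolutional patch embedding, and multi-head splitting, and the additive fusion of the two streams is an assumption for illustration only.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention (learned projections omitted)."""
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def dual_path_encode(x, patch=4):
    """Global path: attention over the full sequence (long-term dependencies).
    Local path: attention restricted to non-overlapping patches (transients)."""
    g = self_attention(x)                     # (T, d) global stream
    T, _ = x.shape
    local = np.zeros_like(x)
    for s in range(0, T, patch):              # patch-wise local stream
        local[s:s + patch] = self_attention(x[s:s + patch])
    return g + local                          # additive fusion (assumed)
```

Restricting the second pass to patches is what lets the local branch react to sharp, short-lived flow changes that a full-sequence softmax tends to smooth over.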
Figure 3. Location of the Delaware River Basin (DRB). The inset shows its position in the northeastern United States, and the main map shows the terrain, river network, and basin boundary of the DRB. The transition from the northern highlands to the southern coastal plain is clearly visible. The pronounced hydrometeorological spatio-temporal heterogeneity of the DRB, driven by its elevation gradients and dense river network, makes it an ideal basin for testing model robustness.
Figure 4. River network structure of the DRB and topology of the 45 selected streamflow gauges. All gauges lie upstream of Trenton, New Jersey, on the Delaware River, and together they form a connected graph reflecting the upstream–downstream dependencies of streamflow in the DRB.
Figure 5. Forecasting performance of the DynaSTG-Former for 1- (a), 3- (b), and 6-step-ahead (c) streamflow forecasting.
Figure 6. Model training loss over epochs for (a) 1-step- (12 h), (b) 3-step- (36 h), and (c) 6-step-ahead (72 h) streamflow forecasting. The loss curves illustrate the convergence behavior and stability of the proposed model during the training phase.
Figure 7. Impact of individual components on model performance in the ablation study. The bars depict the relative performance decrement (in %) in the KGE, MAE, and RMSE metrics as the LTPE module, the dynamic graph channels (Pearson, cosine), and the graph modules are progressively removed.
Figure 8. Model performance comparison heatmaps based on median values of representative nodes (by forecast horizon): (a) 1-step (12 h), (b) 3-step (36 h), and (c) 6-step (72 h). Note: Heatmap data reflect median values across seven hydrologically representative nodes. Lighter shades indicate better performance for each metric (KGE: higher better; MAE/RMSE: lower better).
Figure 9. Comparison of KGE performance decay across models at representative nodes (1- to 6-step horizons). The solid red line represents our proposed model (Ours). The solid blue line and the light blue shaded area represent the mean and standard deviation (±1 SD), respectively, of all baseline models (LSTM, GRU, Trans, TGCN, A3T-GCN). The decay rates of the KGE mean values for our model and all models are labeled in the figure.
Figure 10. Comparison of multi-step forecasting performance metrics across models under different hydrological regimes. (ac) KGE metric at 1-, 3-, and 6-step forecasting horizons; (df) RMSE metric at 1-, 3-, and 6-step forecasting horizons; (gi) MAE metric at 1-, 3-, and 6-step forecasting horizons. Boxplots present performance distributions across four hydrological groups: Low-Flow High-Variability (LF-HV), Low-Flow Medium-Variability (LF-MV), High-Flow Medium-Variability (HF-MV), and High-Flow Low-Variability (HF-LV). The proposed model (Ours) is shown in purple with median values annotated.
Figure 11. Multi-model forecast performance comparison at extreme nodes: (a) outlet gauge (node_45); (b) high-decay gauge (node_37); and (c) low-performance gauge (node_17). Error bars (RMSE) and lines (KGE) share the same color scheme for each model.
Figure 12. Ternary analysis of multi-step-ahead forecasting performance across models at hydrologically extreme nodes. Each vertex represents a Taylor metric—correlation coefficient (r), variability ratio (α), and bias ratio (β). Proximity to the central ideal point (1.0, 1.0, 1.0) indicates superior holistic performance. The ★ symbol denotes our model, consistently closest to the ideal point across all nodes.
Figure 13. Multi-model streamflow forecasting comparison at hydrologically extreme nodes during a representative period of high hydrological variability: (a,c,e) 1-step-ahead vs. (b,d,f) 6-step-ahead forecasts at the outlet (node_45), high-decay (node_37), and low-performance (node_17) gauges. Dashed lines indicate the historical flow quantiles Q25/Q75/Q95, and colored lines show the forecasts from the different models. The period was selected to focus on high-variability phases.
Table 1. Summary of the statistical metrics used for model evaluation.

| Metric | Equation | Range |
|---|---|---|
| Root Mean Squared Error (RMSE) | $RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(Q_{obs,t}-Q_{sim,t}\right)^{2}}$ | $[0, +\infty)$, closer to 0 is better |
| Mean Absolute Error (MAE) | $MAE = \frac{1}{n}\sum_{t=1}^{n}\left|Q_{obs,t}-Q_{sim,t}\right|$ | $[0, +\infty)$, closer to 0 is better |
| Kling–Gupta Efficiency (KGE) | $KGE = 1-\sqrt{(r-1)^{2}+(\alpha-1)^{2}+(\beta-1)^{2}}$ | $(-\infty, 1]$, closer to 1 is better |

Note: $Q_{sim}$ and $Q_{obs}$ are the simulated and observed streamflow, $r$ denotes Pearson's correlation coefficient, $\alpha$ is the ratio of the simulated to observed coefficients of variation, and $\beta$ is the ratio of the simulated to observed mean values.
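The three metrics in Table 1 can be computed directly from paired observation/simulation series; a minimal sketch, using the CV-based variability ratio α stated in the note, might look like:

```python
import numpy as np

def rmse(obs, sim):
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    return float(np.mean(np.abs(obs - sim)))

def kge(obs, sim):
    """Kling–Gupta Efficiency with the variability term taken as a ratio of
    coefficients of variation, as defined in the note to Table 1."""
    r = np.corrcoef(obs, sim)[0, 1]                              # correlation
    beta = sim.mean() / obs.mean()                               # bias ratio
    alpha = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # CV ratio
    return float(1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))
```

A perfect forecast gives RMSE = MAE = 0 and KGE = 1; unlike the error metrics, KGE has no lower bound, so strongly biased forecasts can drive it far below zero.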
Table 2. Hydrological classification of basin gauges using k-means clustering.

| Cluster ID | Representative Node | Hydrological Response Pattern | Hydrological Signatures |
|---|---|---|---|
| Cluster 0 | node_30 | Medium-Flow High-Decay Predictability | Medium flow, medium coefficient of variation (CV), and good short-term forecast accuracy but significant performance degradation in long-term forecasts. |
| Cluster 1 | node_6 | Medium-Flow Stable Prediction | Medium flow, low flow variability, and excellent short-term forecasting capability with maintained stability in long-term forecasts. |
| Cluster 2 | node_45 | High-Flow Control Gauge | Extremely high flow, high short-term forecasting accuracy, and low performance-decay rate in long-term forecasts. |
| Cluster 3 | node_10 | Low-Flow High-Variability Challenge | Low flow, high hydrological variability, and acceptable short-term forecasting accuracy but severe performance degradation in long-term forecasts. |
| Cluster 4 | node_19 | High-Flow General Performance | High flow, moderate CV, and good short-term forecasting performance with moderate decay in long-term forecasts. |
| Cluster 5 | node_9 | High-Accuracy Low-Decay Excellence | Exceptional short-term forecast accuracy, outstanding long-term forecasting stability, and stable hydrological conditions with low variability. |
| Cluster 6 | node_41 | Low-Flow Low-Decay Stability | Low-flow conditions, moderate variability, and good short-term forecasting performance with a relatively low decay rate in long-term forecasts. |
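The clustering behind Table 2 can be reproduced in outline with a plain k-means on standardized per-gauge signatures (e.g., mean flow, CV, and horizon-wise forecast skill). The feature set and the tiny Lloyd's-algorithm implementation below are illustrative assumptions; the paper's exact signature vector and its silhouette-validated choice of seven clusters are not reproduced here.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with greedy farthest-point seeding.
    X: (n_gauges, n_features) matrix of hydrological signatures."""
    X = (X - X.mean(0)) / X.std(0)            # z-score each signature column
    centers = [X[0]]                          # farthest-first initialization
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == c].mean(0) if (labels == c).any() else centers[c]
                        for c in range(k)])
        if np.allclose(new, centers):         # converged
            break
        centers = new
    return labels
```

Standardizing before clustering matters here: raw discharge spans two orders of magnitude across the gauges (compare node_45 with node_37 in Table 7), and unscaled features would let mean flow dominate the distance.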
Table 3. Multi-step forecasting performance metrics for the overall and representative nodes.

| Horizon | Node | RMSE (m³/s) | MAE (m³/s) | KGE (-) |
|---|---|---|---|---|
| 1-step (12 h) | Overall | 16.88 | 5.33 | 0.961 |
| | node_6 | 7.47 | 3.36 | 0.939 |
| | node_9 | 19.13 | 8.81 | 0.966 |
| | node_10 | 1.41 | 0.47 | 0.866 |
| | node_19 | 28.61 | 8.08 | 0.916 |
| | node_30 | 7.43 | 3.59 | 0.869 |
| | node_41 | 0.53 | 0.28 | 0.883 |
| | node_45 | 53.74 | 30.08 | 0.958 |
| 3-step (36 h) | Overall | 31.87 | 8.98 | 0.956 |
| | node_6 | 11.18 | 5.21 | 0.885 |
| | node_9 | 43.48 | 20.11 | 0.915 |
| | node_10 | 2.45 | 1.04 | 0.797 |
| | node_19 | 58.65 | 27.63 | 0.929 |
| | node_30 | 11.72 | 6.14 | 0.734 |
| | node_41 | 0.80 | 0.40 | 0.866 |
| | node_45 | 113.15 | 60.28 | 0.892 |
| 6-step (72 h) | Overall | 57.60 | 14.10 | 0.855 |
| | node_6 | 16.43 | 7.61 | 0.744 |
| | node_9 | 82.18 | 34.77 | 0.622 |
| | node_10 | 3.72 | 1.46 | 0.510 |
| | node_19 | 125.33 | 52.90 | 0.666 |
| | node_30 | 14.86 | 8.33 | 0.515 |
| | node_41 | 1.03 | 0.61 | 0.727 |
| | node_45 | 184.07 | 82.99 | 0.811 |
Table 4. Ablation study of the proposed model. The performance of the full model and various ablated variants is evaluated using RMSE, MAE, and KGE.

| Model Configuration | RMSE (m³/s) | MAE (m³/s) | KGE (-) | Abbreviation | Purpose |
|---|---|---|---|---|---|
| Full Model | 31.87 | 8.98 | 0.956 | Full | Complete proposed model |
| Static + Pearson + Cosine + Transformer | 31.88 | 8.49 | 0.940 | w/o LTPE | Ablates local perception |
| Static + Cosine + LTPE + Transformer | 35.89 | 10.04 | 0.937 | w/o Pearson | Ablates Pearson graph |
| Static + Pearson + LTPE + Transformer | 33.26 | 8.83 | 0.929 | w/o Cosine | Ablates cosine graph |
| Static Graph + LTPE + Transformer | 34.52 | 9.94 | 0.907 | w/o Pearson and Cosine | Ablates cosine and Pearson graphs |
| P | 36.85 | 9.99 | 0.885 | w/o Graph | Tests temporal module only |
| Transformer Baseline | 40.28 | 9.73 | 0.900 | Base | Baseline model |
Table 5. Ablation study results (KGE) of the proposed graph architecture for the hydrological clusters.

| Cluster ID | Full | w/o Cosine | w/o Pearson | w/o Pearson and Cosine | w/o Graph |
|---|---|---|---|---|---|
| Cluster_0 | 0.886 | 0.813 | 0.800 | 0.704 | 0.709 |
| Cluster_1 | 0.862 | 0.849 | 0.865 | 0.819 | 0.794 |
| Cluster_2 | 0.938 | 0.899 | 0.865 | 0.868 | 0.784 |
| Cluster_3 | 0.802 | 0.781 | 0.790 | 0.746 | 0.694 |
| Cluster_4 | 0.938 | 0.886 | 0.868 | 0.831 | 0.752 |
| Cluster_5 | 0.883 | 0.845 | 0.861 | 0.805 | 0.760 |
| Cluster_6 | 0.882 | 0.820 | 0.814 | 0.816 | 0.740 |
Table 6. Performance comparison of different models for multi-step-ahead streamflow forecasting. The proposed model (Ours) achieves superior performance across almost all horizons and metrics, demonstrating significant advantages especially in short-term forecasting.

| Method | Horizon | RMSE (m³/s) | MAE (m³/s) | KGE (-) |
|---|---|---|---|---|
| LSTM | 1-step (12 h) | 24.48 | 6.85 | 0.951 |
| | 3-step (36 h) | 58.37 | 16.14 | 0.849 |
| | 6-step (72 h) | 62.54 | 15.99 | 0.808 |
| GRU | 1-step (12 h) | 19.77 | 6.40 | 0.946 |
| | 3-step (36 h) | 46.06 | 12.77 | 0.890 |
| | 6-step (72 h) | 67.20 | 18.91 | 0.781 |
| Standard Transformer (Trans) | 1-step (12 h) | 20.97 | 5.02 | 0.941 |
| | 3-step (36 h) | 40.28 | 9.73 | 0.900 |
| | 6-step (72 h) | 68.54 | 16.34 | 0.812 |
| TGCN | 1-step (12 h) | 25.80 | 6.39 | 0.951 |
| | 3-step (36 h) | 42.79 | 11.90 | 0.835 |
| | 6-step (72 h) | 59.44 | 14.75 | 0.781 |
| A3T-GCN | 1-step (12 h) | 24.60 | 5.96 | 0.943 |
| | 3-step (36 h) | 47.90 | 11.01 | 0.874 |
| | 6-step (72 h) | 63.23 | 16.32 | 0.783 |
| DynaSTG-Former (Ours) | 1-step (12 h) | 16.88 | 5.33 | 0.961 |
| | 3-step (36 h) | 31.87 | 8.98 | 0.956 |
| | 6-step (72 h) | 57.60 | 14.10 | 0.855 |
Table 7. Hydrological and predictive performance characteristics of extreme nodes for model robustness boundary testing. Note: Q_Mean denotes mean discharge, Q_CV denotes the coefficient of variation in discharge, and KGE denotes the KGE value of our model.

| Node_ID | Q_Mean (m³/s) | Q_CV (-) | Stream | KGE 1-Step (-) | KGE 3-Step (-) | KGE 6-Step (-) | KGE_Decay (1 to 6 Step) |
|---|---|---|---|---|---|---|---|
| node_45 | 428.0 | 0.210 | Mainstem | 0.958 | 0.892 | 0.811 | 15.33% |
| node_37 | 3.8 | 0.200 | Tributary | 0.914 | 0.638 | 0.417 | 54.34% |
| node_17 | 7.0 | 0.167 | Tributary | 0.773 | 0.693 | 0.461 | 40.41% |
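The KGE_Decay column is the relative drop from the 1-step to the 6-step KGE. Recomputing it from the rounded three-decimal KGEs in Table 7 reproduces the published percentages to within a few hundredths of a point (the published values were presumably derived from unrounded KGEs):

```python
def kge_decay(kge_1step, kge_6step):
    """Relative KGE decay (%) between the 1-step and 6-step horizons."""
    return (kge_1step - kge_6step) / kge_1step * 100.0

# node_45 (mainstem outlet): tabled value 15.33%
print(round(kge_decay(0.958, 0.811), 2))  # 15.34 from the rounded inputs
```

Applied to node_37 and node_17 the same formula gives about 54.4% and 40.4%, matching the table and confirming that the two tributary gauges lose roughly half their 1-step skill by the 6-step horizon.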


Li, B.; Li, Q.; Zhou, X.; Deng, M.; Ling, H. Dynamic Graph Transformer with Spatio-Temporal Attention for Streamflow Forecasting. Hydrology 2025, 12, 322. https://doi.org/10.3390/hydrology12120322

