Monthly Runoff Prediction for Xijiang River via Gated Recurrent Unit, Discrete Wavelet Transform, and Variational Modal Decomposition

Abstract: Neural networks have become widely employed in streamflow forecasting due to their ability to capture complex hydrological processes and provide accurate predictions. In this study, we propose a framework for monthly runoff prediction using antecedent monthly runoff, water level, and precipitation. This framework integrates the discrete wavelet transform (DWT) for denoising, variational modal decomposition (VMD) for sub-sequence extraction, and gated recurrent unit (GRU) networks for modeling individual sub-sequences. Our findings demonstrate that the DWT–VMD–GRU model, utilizing runoff and rainfall time series as inputs, outperforms other models such as GRU, long short-term memory (LSTM), DWT–GRU, and DWT–LSTM, consistently exhibiting superior performance across various evaluation metrics. During the testing phase, the DWT–VMD–GRU model yielded RMSE, MAE, MAPE, NSE, and KGE values of 245.5 m³/s, 200.5 m³/s, 0.033, 0.997, and 0.978, respectively. Additionally, optimal sliding window durations for different input combinations typically range from 1 to 3 months, with the DWT–VMD–GRU model (using runoff and rainfall) achieving optimal performance with a one-month sliding window. The model's superior accuracy enhances water resource management, flood control, and reservoir operation, supporting better-informed decisions and efficient resource allocation.


Introduction
Accurate runoff forecasts play a crucial role in water resource management, flood control, and reservoir operation. While physically based hydrological models typically involve complex computations and rely on abundant data [1], data-driven models have garnered considerable attention due to their ability to provide satisfactory forecast results even in regions with limited data availability [2].
The long short-term memory (LSTM) neural network model represents an advancement over the recurrent neural network (RNN) architecture, addressing the inherent challenge of vanishing or exploding gradients in RNNs [3,4]. LSTM is specifically designed to effectively capture and retain dependencies between input and output variables, with a focus on long-term dependency [5]. Both basic and hybrid forms of LSTM have found extensive application in runoff simulation due to their ability to accurately capture underlying patterns in time series data [6]. Recent studies have indicated that the gated recurrent unit (GRU), a variant of LSTM, achieves comparable performance with a simpler structure and lower computational burden compared to LSTM, extreme learning machines [7], and support vector machines (SVM) in monthly runoff prediction [8].
Wavelet transform (WT), empirical mode decomposition (EMD) [9], and variational mode decomposition (VMD) [10,11] are commonly employed to identify and separate fluctuation components (referred to as noise) in time series data to achieve satisfactory simulations [7,12,13]. WT serves as an effective data processing technique to enhance the efficiency of any network by reducing noise in model input data. The discrete WT (DWT), a type of WT, offers computational advantages over the continuous WT, requiring less computational time and data. Additionally, decomposition-based methods decompose hydrological time series with high non-stationarity into sub-signals, thereby extracting useful information across multiple time scales. Mode mixing remains a concern after EMD decomposition, but VMD demonstrates robustness to noise and effectively mitigates mode mixing to achieve satisfactory performance, albeit requiring careful parameter selection for optimal decomposition [14].
In runoff prediction, the GRU model has shown superior performance compared to the SVM model, with further enhancement in prediction accuracy observed upon incorporating sequence processing technologies [15]. For instance, the RMSE of the DWT-GRU model decreased in comparison to the single GRU model [16]; the VMD-GRU model exhibited higher prediction accuracy than the standalone GRU model for monthly runoff prediction [17]; and the CEEMDAN-FE-VMD-SVM-GRU model outperformed eight other models for daily runoff prediction [18]. Additionally, identifying the optimal combination of input variables, typically involving rainfall and runoff measurements [19], is crucial for a DWT-VMD-GRU runoff model.
In this study, the DWT-VMD-GRU framework is devised and contrasted against other models. We hold that this framework is capable of capturing the nonlinear relationships between rainfall and runoff, thereby enhancing the precision of monthly runoff forecasts, and that it is particularly well suited for runoff prediction in regions with limited observational hydrological data. Consequently, this study first employs DWT to denoise the runoff sequence, mitigating the impact of sequence noise on model accuracy. Subsequently, by decomposing the sequence with the VMD technique, sequence features are accentuated, facilitating sequence recognition. Finally, four different input combinations are utilized as inputs for two neural network models (GRU and LSTM), with a performance comparison conducted for each model.

Study Area and Data
The Xijiang River, serving as the principal stem of the Pearl River, assumes a pivotal socioeconomic role within South China (Figure 1). Characterized by a humid subtropical monsoon climate, the region experiences concentrated rainfall primarily from April to September, constituting around 65% of the annual total. Similarly, high runoff predominantly occurs from June to August, accounting for approximately 57% of the yearly sum. The hydrological dynamics of this basin are profoundly influenced by shifts in seasonal climate patterns and hydrological cycles. Within this context, the intricate river network of the basin, coupled with the scarcity of historical flood data, poses challenges in acquiring comprehensive flood series. Given the anticipated increase in the frequency and intensity of floods, accurate prediction of flood events holds paramount importance in mitigating potential losses.
Situated within the Pearl River Basin, the Wuzhou hydrological station (111.33° E, 23.46° N) occupies a strategic location in the lower reaches of the Xijiang River, specifically within the eastern region of Guangxi province. This station experiences an annual average precipitation of 1618 mm and an annual average runoff of 75,277 m³/s, with a significant drainage area of 327,006 km². The monitoring section at the station records a range of monthly average water levels, with maximum, mean, and minimum values of 20.41 m, 6.78 m, and 6.23 m, respectively. The upstream landscape is plateau terrain, while the river opens downstream toward the Greater Bay Area, a region with a significant emphasis on flood control measures. The topography surrounding Wuzhou exhibits distinctive features, with a northwest-high and southeast-low configuration, marked by the intersection of mountainous and plain terrains. Notably, during episodes of concurrent flooding originating from both the main river and its tributaries, the Wuzhou station consistently encounters elevated flows, large flood volumes, and prolonged flooding periods.

Discrete Wavelet Transform
Two types of wavelet transforms exist: the DWT and the continuous wavelet transform. Rainfall and runoff time series inherently exhibit non-stationarity and disruptive noise. To derive meaningful insights for runoff simulation, it is imperative to subject the original series to noise reduction processing. Given the discrete nature of hydrological time series, the former is typically preferred [20]. The DWT [21,22] is formulated as

$f(g, h) = a_0^{-g/2} \sum_{t=0}^{n-1} S(t)\, \varphi^{*}\!\left(a_0^{-g} t - h\right)$

where $a_0$ represents the scale parameter, typically set to 2; g and h are integers that determine the wavelet scaling and translation amplitudes, respectively; S(t) signifies the original series at the t-th time step; n denotes the sequence length; f(g, h) is the DWT coefficient; φ is the wavelet function; and * represents the conjugate. The initial sequence (i.e., the noisy signal) undergoes decomposition into approximate and detail signals (Figure 3). Continued decomposition effectively unveils concealed noisy signals present within the original sequence. While the DWT theoretically accommodates an infinite number of layers, this approach escalates computational complexity significantly. However, the number of DWT layers minimally impacts model prediction [23].

Noise reduction for the detail signals is achieved through threshold processing on the decomposition coefficients. A decomposition coefficient f(g, h) that falls below $\lambda_i$ (the threshold value for the i-th decomposition layer) is predominantly noise-induced, warranting its removal; conversely, coefficients surpassing this threshold suggest proximity to the original signal and are retained. The selection of the wavelet threshold directly influences denoising outcomes, with excessively high values inducing signal distortion and overly low values allowing noise persistence. Notably, $\lambda_i$ diminishes as the number of decomposition layers increases, as proximity to the original signal heightens. The threshold is determined as

$\lambda_i = \frac{\sigma \sqrt{2 \ln N_i}}{\ln(i + 1)}$

where i is the decomposition scale and $N_i$ is the signal length of the i-th layer.
The standard deviation of the noise (σ) is quantified as

$\sigma = \frac{\mathrm{median}(|D_1|)}{0.6745}$

where $D_1$ denotes the detail signal after the initial wavelet transform and 0.6745 serves as the adjustment factor. The effectiveness of noise reduction is significantly influenced by the choice of wavelet basis function and the number of wavelet layers employed. An optimal selection of these factors facilitates an enhanced representation of noise within the detail coefficients, simplifying its removal and leading to superior noise reduction outcomes. To quantitatively assess noise reduction performance, the signal-to-noise ratio (SNR) is employed; a larger SNR value corresponds to a more pronounced noise reduction effect:

$\mathrm{SNR} = 10 \log_{10}\!\left(\frac{P_1}{P_2}\right)$

where $P_1$ and $P_2$ signify the effective powers of the original and noise signals, respectively. It is crucial to emphasize that the noise signal represents the disparity between the original and reconstructed signals. An empirical formula [24] guides the determination of the optimal number of layers,

$M = \mathrm{int}\!\left[\log_2 n\right]$

where M is the number of layers and int denotes the rounding function.
Water 2024, 16, 1552

The thresholds for DWT on the monthly runoff, water level, and rainfall sequences were computed to effectively implement this approach. Employing the wavelet toolbox, the hard thresholding method was selected to process the signals, culminating in the derivation of denoised signals.
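The study performs this denoising in MATLAB's wavelet toolbox with Daubechies wavelets; as a rough illustration of the decompose–threshold–reconstruct pipeline described above, the following sketch uses a single-level Haar DWT instead (the function names and the Haar choice are ours, not the paper's):

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def haar_idwt(a, d):
    """Inverse one-level Haar DWT."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def denoise(x):
    """Hard-threshold the detail coefficients, then reconstruct the signal."""
    a, d = haar_dwt(x)
    sigma = np.median(np.abs(d)) / 0.6745        # noise std estimate
    lam = sigma * np.sqrt(2.0 * np.log(len(d)))  # universal threshold
    d_hard = np.where(np.abs(d) > lam, d, 0.0)   # hard thresholding
    return haar_idwt(a, d_hard)

def snr_db(original, reconstructed):
    """SNR = 10 log10(P1 / P2), where the noise is original - reconstructed."""
    p1 = np.mean(original ** 2)
    p2 = np.mean((original - reconstructed) ** 2)
    return 10.0 * np.log10(p1 / p2)
```

A multi-level Daubechies decomposition, as used in the paper, follows the same structure but iterates the split on the approximation coefficients.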

Variational Modal Decomposition
VMD is an effective approach tailored to nonlinear and nonstationary time series; it resolves the mode mixing problem encountered in EMD through the assignment of the total number (K) of intrinsic mode functions (IMFs), which are subsequently computed iteratively. The k-th IMF is characterized within VMD as a modulation function ($u_k$) defined by the following expression [25]:

$u_k(t) = A_k(t) \cos\!\left(\Phi_k(t)\right)$

where t is the time step of $\mathrm{IMF}_k$ (in months), $A_k(t)$ is the upper envelope of the signal, and $\Phi_k(t)$ is a non-decreasing phase function.
The number (K) of IMFs is commonly determined by the center frequency method: the optimal value of K is ascertained when the center frequencies of the modal components in the K-th and (K + 1)-th layers exhibit similarity. The fundamental principle of VMD revolves around the formulation and solution of a variational problem, comprising two distinct steps. In the first step, the variational problem is constructed. The spectrum ($F_k$) of the k-th IMF ($\mathrm{IMF}_k$) is expressed as

$F_k(t) = \left[\left(\delta(t) + \frac{j}{\pi t}\right) * u_k(t)\right] e^{-j\omega_k t}$

where δ(t) represents the Dirac distribution and j is the complex unit (i.e., $j^2 = -1$). Subsequently, the variational problem is solved through the augmented Lagrangian

$L\!\left(\{u_k\}, \{\omega_k\}, \lambda\right) = \alpha \sum_{k} \left\| \partial_t F_k(t) \right\|_2^2 + \left\| f(t) - \sum_{k} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k} u_k(t) \right\rangle$

where L signifies the augmented Lagrangian function; $\omega_k$ denotes the center frequency (Hz); λ is the Lagrangian multiplier; α represents the penalty factor; and f(t) stands for the original signal. The symbol ⟨ ⟩ represents the scalar product [26]. The iterative process entails updating $u_k$, $\omega_k$, and λ. Upon completion of the iteration, the decomposed subsequences and $\omega_k$ are obtained.
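The iterative updates are most compact in the Fourier domain, where each mode is obtained by Wiener filtering the residual around its current center frequency. The following minimal NumPy sketch (our simplification of the standard algorithm, not the study's code; parameter defaults are illustrative) shows the mode and center-frequency updates:

```python
import numpy as np

def vmd(signal, K, alpha=2000.0, tau=0.0, tol=1e-7, max_iter=200):
    """Minimal VMD sketch: Wiener-filter mode updates and power-weighted
    centre-frequency updates in the Fourier domain.
    Returns (modes, sorted centre frequencies in cycles/sample)."""
    T = len(signal)
    # Mirror-extend the signal to soften boundary effects
    f = np.concatenate([signal[:T // 2][::-1], signal, signal[T // 2:][::-1]])
    N = len(f)
    freqs = np.fft.fftfreq(N)           # normalised frequencies
    f_hat = np.fft.fft(f)
    u_hat = np.zeros((K, N), dtype=complex)
    omega = np.linspace(0.05, 0.45, K)  # initial centre frequencies
    lam = np.zeros(N, dtype=complex)    # Lagrangian multiplier (tau=0 disables it)
    pos = freqs >= 0
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            # Wiener-filter update of mode k around its centre frequency
            residual = f_hat - u_hat.sum(axis=0) + u_hat[k] + lam / 2.0
            u_hat[k] = residual / (1.0 + 2.0 * alpha * (freqs - omega[k]) ** 2)
            # Centre frequency = power-weighted mean of positive frequencies
            power = np.abs(u_hat[k, pos]) ** 2
            omega[k] = np.sum(freqs[pos] * power) / (np.sum(power) + 1e-12)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        change = (np.sum(np.abs(u_hat - u_prev) ** 2)
                  / (np.sum(np.abs(u_prev) ** 2) + 1e-12))
        if change < tol:
            break
    # Back to the time domain; crop the mirrored padding
    u = np.real(np.fft.ifft(u_hat, axis=1))[:, T // 2: T // 2 + T]
    return u, np.sort(omega)
```

Production implementations additionally enforce spectral symmetry and handle a possible DC mode; this sketch only illustrates how the two tones of a mixed signal are pulled apart into separate center frequencies.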

Gated Recurrent Unit
GRU stands as a variant of RNN distinguished by its straightforward architecture, characterized by reset and update gates (Figure 4a) [27]. The reset gate governs the extent to which newly input information aligns with preceding memory; a gating signal approaching 0 denotes a diminished correlation. Conversely, the gating signal of the update gate influences the extent of past information retention: ranging between 0 and 1, a value near 0 signifies heightened information loss, while a value closer to 1 signifies preservation [19]. The training of the GRU model was accomplished using the 'grulayer' function within MATLAB 2021a.
The reset gate ($r_t$), update gate ($z_t$), candidate hidden state ($h'_t$), and final hidden state ($h_t$) are calculated as follows [28]:

$r_t = \sigma\!\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right)$

$z_t = \sigma\!\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right)$

$h'_t = \tanh\!\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right)$

$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$

where σ represents the logistic sigmoid function, compressing activation outcomes within [0, 1], with proximity to 0 indicating information discard; $x_t$ represents the input sequence at time t (the current moment); $h_{t-1}$ denotes the hidden state at moment t − 1; tanh denotes the hyperbolic tangent activation function; ⊙ symbolizes the Hadamard product operator; and $W^{(r)}$, $U^{(r)}$, $W^{(z)}$, $U^{(z)}$, W, and U correspond to the weight parameters to be learned.
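A single GRU time step following these gate equations can be written directly in NumPy (a didactic sketch only; the study used MATLAB's 'grulayer', and biases are omitted as in the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One GRU time step: reset gate, update gate, candidate state, new state."""
    r = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate
    z = sigmoid(Wz @ x_t + Uz @ h_prev)           # update gate
    h_cand = np.tanh(W @ x_t + U @ (r * h_prev))  # candidate hidden state
    return z * h_prev + (1.0 - z) * h_cand        # z near 1 preserves the past
```

With all weights at zero, both gates open halfway (σ(0) = 0.5) and the candidate is zero, so the new state is half the previous one, matching the interpolation role of the update gate.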


Long Short-Term Memory
LSTM, a derivative of RNNs equipped with LSTM blocks [29], presents a more intricate architecture compared to GRU (Figure 4b). LSTM has exhibited superior performance over alternative RNNs across a diverse array of applications, including time series prediction [30]. The 'lstmlayer' function in MATLAB was used to train the LSTM models.

LSTM transmits the hidden state from the previous moment to the current moment, thereby enabling the learning of both short- and long-term dependencies. In the LSTM architecture, every cell comprises three layers: the input layer, the recurrent layer, and the output layer. The output layer interfaces with the cell's three gates: the input gate ($i_t$, regulating the influx of information from both the prior instant and the current input into the network structure), the forget gate ($f_t$, determining the extent of memory erasure from the preceding instant), and the output gate ($o_t$, governing the extent of information emission at the current instant).
The gates and states are computed as

$i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$

$f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$

$o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$

$\tilde{c}_t = \tanh\!\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

$h_t = o_t \odot \tanh(c_t)$

where $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate at time t, respectively; σ is the logistic sigmoid function, compressing the activation result into [0, 1]; $W_i$, $W_f$, $W_o$, and $W_c$ are the weight matrices; $b_i$, $b_f$, $b_o$, and $b_c$ are the biases, which control the activation of the neurons' states; $\tilde{c}_t$ is the candidate (potential) cell state; $c_t$ is the cell state; ⊙ is the Hadamard product operator; and $h_t$ is the hidden state.
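Analogously, one LSTM time step can be sketched in NumPy (didactic only; the study used MATLAB's 'lstmlayer', and the concatenated-input convention here is one common formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wi, bi, Wf, bf, Wo, bo, Wc, bc):
    """One LSTM time step; each gate sees the concatenation [h_prev, x_t]."""
    v = np.concatenate([h_prev, x_t])
    i = sigmoid(Wi @ v + bi)         # input gate
    f = sigmoid(Wf @ v + bf)         # forget gate
    o = sigmoid(Wo @ v + bo)         # output gate
    c_cand = np.tanh(Wc @ v + bc)    # candidate cell state
    c = f * c_prev + i * c_cand      # new cell state
    h = o * np.tanh(c)               # new hidden state
    return h, c
```

Compared with the GRU step, the separate cell state `c` is what lets the LSTM carry information across long horizons independently of the emitted hidden state.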

Model Development
The hydrological time series were subjected to the DWT and VMD techniques to derive reconstructed sequences encompassing runoff, water level, and rainfall data. Subsequently, a GRU model was trained to simulate runoff patterns, denoted DWT-VMD-GRU. To ascertain the optimal configuration for the GRU model, initial experimentation was conducted within a range of hidden units spanning from 10 to 30; through systematic trial and error, the most suitable number of hidden units was identified as 28. Additionally, an exploration of sliding window parameters [31] was undertaken, considering an initial range of 1 to 10 time steps (i.e., 1 month to 10 months). For the DWT-VMD-GRU (Q&P) model, the input variables are runoff and rainfall, so the number of features n = 2; the runoff is simulated along the time series with a sliding window of 1 month, so the sliding window size (lead time) ω = 1; and the time series length is 155 with the sliding step m set to 1, so the dataset generated after sliding 155 steps has the shape [155, 2, 1].
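The sliding-window construction can be illustrated with a small helper (hypothetical code, not the study's; `features` holds the monthly predictor columns, e.g. runoff and rainfall, and `target` the runoff to be predicted one step ahead):

```python
import numpy as np

def make_windows(features, target, window):
    """Build supervised samples: each input is `window` consecutive months of
    the feature columns; the label is the runoff of the following month."""
    X, y = [], []
    for t in range(window, len(target)):
        X.append(features[t - window:t])  # shape (window, n_features)
        y.append(target[t])
    return np.asarray(X), np.asarray(y)
```

For a one-month window (ω = 1), each sample is simply the previous month's runoff and rainfall paired with the current month's runoff as the label.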

Evaluation Metrics
To assess the effectiveness of the models, we employed a comprehensive set of indices, including the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) [32], Nash–Sutcliffe efficiency (NSE), and Kling–Gupta efficiency (KGE) [33]. RMSE, MAE, and MAPE take values within [0, +∞), with proximity to 0 indicative of enhanced performance. NSE and KGE range from −∞ to 1, with higher values denoting better agreement between the observed and predicted data (e.g., a value of 1 signifies a perfect prediction). The simulation effect of the model can be considered satisfactory when the NSE exceeds 0.5 [34,35]. KGE provides a more comprehensive and balanced assessment by incorporating correlation, bias, and variability, whereas NSE primarily focuses on the proportion of explained variance [36].
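These five metrics are straightforward to compute; a reference sketch using the standard formulas (our code, not the study's):

```python
import numpy as np

def metrics(obs, sim):
    """RMSE, MAE, MAPE, NSE, and KGE between observed and simulated series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    mae = np.mean(np.abs(sim - obs))
    mape = np.mean(np.abs((sim - obs) / obs))
    nse = 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
    r = np.corrcoef(obs, sim)[0, 1]     # linear correlation
    alpha = sim.std() / obs.std()       # variability ratio
    beta = sim.mean() / obs.mean()      # bias ratio
    kge = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "NSE": nse, "KGE": kge}
```

The KGE line makes the text's point explicit: correlation, variability, and bias each contribute a separate term, whereas NSE collapses everything into one squared-error ratio.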

Performance of Discrete Wavelet Transform
The application of DWT served to diminish the noise inherent within the monthly runoff, water level, and rainfall time series. Initially, the optimal decomposition layer was determined, leading to the establishment of a two-layer decomposition. Subsequently, Daubechies 5–9 were used as wavelet basis functions (WBFs), and the resulting SNR values after decomposition were computed.
Figure 5 shows the SNR values after decomposition with different WBFs; Daubechies 5 emerged as the selected WBF for decomposing the monthly runoff, water level, and rainfall sequences. For the runoff sequence, the SNR values of Daubechies 5 and Daubechies 8 amounted to 19.45 and 19.52, respectively, the two largest values. Furthermore, concerning the water level and rainfall time series, the Daubechies 5 wavelet consistently yielded the highest SNR values, indicating its efficacy in denoising. With the runoff sequence as the object of simulation, and with these two SNR values closely similar (19.45 and 19.52), the preference was for the wavelet with the slightly smaller SNR value, Daubechies 5, to preserve more informative content from the original sequence. This finding aligns with the research conducted by Song et al. (2021) [37], which also set the WBF to db5. Within this framework, the original monthly runoff, water level, and rainfall sequences underwent decomposition via DWT, generating both approximate and detail components (Figure 6). Notably, the signals attributed to D1 exhibited considerable noise, whereas those associated with D2 displayed comparatively reduced noise. Accordingly, we chose to further denoise the D1 signals.

Performance of Variational Modal Decomposition
Table 1 presents the outcomes of VMD applied to the sequences reconstructed by DWT noise reduction. Notably, among all adjacent modal components, the central frequencies of IMF2 and IMF3 are the closest. As K increases from 6 to 8, the central frequencies of these two modal components converge. This trend suggests that, starting from K = 6, modal overlap may become apparent; conversely, if K is too small, insufficient decomposition may result [38]. Consequently, K = 5 is determined as the optimal number of decomposition layers for VMD. Figure 7 shows the VMD outcomes for the monthly runoff sequence with K = 5. Evidently, VMD successfully decomposes the monthly runoff sequence into five IMFs characterized by discernible periodic oscillations. Furthermore, each IMF demonstrates a progressive attenuation in oscillation amplitude as time elapses.
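The closeness criterion used in Table 1 can be checked numerically: estimate each mode's spectral centroid and watch the smallest gap between adjacent centroids shrink as K grows. A minimal sketch (helper names are ours):

```python
import numpy as np

def centre_frequency(mode):
    """Spectral centroid of a mode, in cycles per sample."""
    spec = np.abs(np.fft.rfft(mode)) ** 2
    freqs = np.fft.rfftfreq(len(mode))
    return np.sum(freqs * spec) / np.sum(spec)

def min_adjacent_gap(modes):
    """Smallest gap between the sorted centre frequencies of the modes;
    a very small gap signals the onset of mode mixing."""
    cf = np.sort([centre_frequency(m) for m in modes])
    return np.min(np.diff(cf))
```

Running a VMD for increasing K and stopping just before `min_adjacent_gap` collapses reproduces the selection logic that led to K = 5 here.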


Performance of Neural Network Models
Figure 8 presents an overview of the errors observed across the GRU, LSTM, DWT-GRU, DWT-LSTM, and DWT-VMD-GRU models. Particularly noteworthy is the superior performance exhibited by the DWT-VMD-GRU model across both the training and testing phases. The main findings can be summarized as follows: (1) The DWT-VMD-GRU model, utilizing runoff and rainfall as inputs, exhibits superior generalization ability, consistently outperforming other models across diverse performance metrics. This underscores the effectiveness of the integrated denoising, decomposition, and neural network methodology. (2) Overall, GRU demonstrates superior performance compared to LSTM. However, the mere integration of DWT with the GRU or LSTM models may yield adverse effects. Incorporating VMD alongside DWT substantially improves the simulation efficacy of GRU models.

Figure 9 presents the outcomes of GRU(Q&P), LSTM(Q&P), DWT-GRU(Q&P), DWT-LSTM(Q&P), and DWT-VMD-GRU(Q&P). Analysis of the results during the testing phase highlights that DWT-VMD-GRU exhibits the most favorable simulation effect, followed by GRU, DWT-GRU, and DWT-LSTM, with LSTM displaying the least desirable effect. Hence, the following conclusions can be drawn: (1) GRU surpasses the LSTM model in simulation performance. (2) The composite model, integrating denoising, decomposition, and neural network components, exhibits superior simulation performance compared to the individual models.

Window Size and Input Combinations of DWT-VMD-GRU
Figure 10 illustrates the KGE values of runoff prediction across various input combinations and sliding window settings. The following was observed: (1) The preferred sliding window duration for diverse combinations of input variables predominantly ranged from 1 to 3 months. Among the four model input combinations, KGE values showed a decreasing trend as the sliding window lengthened, suggesting that model performance deteriorates with longer windows. This observation aligns with the findings of a study conducted by Weng et al. (2023) [39]. (2) All input combinations demonstrated good performance at each sliding window configuration, consistently achieving a KGE exceeding 0.91. Notably, the DWT-VMD-GRU model with runoff and rainfall as inputs exhibited the highest performance when utilizing a one-month sliding window. This outcome can be attributed to the causal relationship between rainfall and runoff, rooted in physical processes, whereas the association between water level and runoff possesses quasi-linear characteristics. This finding aligns with the research by Zhang et al. (2021) [40], which emphasizes the substantial influence of input variables on the accuracy of runoff prediction using the RNN model. Regarding the performance in predicting the maximum annual runoff, the DWT-VMD-GRU model demonstrates that the combination of runoff and rainfall as inputs yields the best results, followed by runoff and water level, and finally, runoff alone. In summary, the model utilizing runoff and rainfall as inputs outperforms the model using only runoff as input in simulating the overall sequence.
Figure 12a depicts flow duration curves (FDCs) for the various neural networks, which closely match the observed data, indicating robust performance across the training and testing phases. The FDCs perform well in high flows and less so in low flows. In the testing phase, the DWT-VMD-GRU(Q&P) model outperforms the others in low flows, followed by the GRU(Q&P) model. Figure 12b shows the FDCs for DWT-VMD-GRU models with different input combinations, with all curves closely matching the observed series, suggesting strong overall model performance. The FDCs of the testing phase perform well in high flows, capturing extreme flow events. The DWT-VMD-GRU(Q&P) and DWT-VMD-GRU(Q) models excel in high flows and low flows, respectively.
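An FDC is simply the flows sorted in descending order plotted against their exceedance probabilities; a minimal sketch (our code, using the Weibull plotting position, one common convention):

```python
import numpy as np

def flow_duration_curve(flows):
    """Return (exceedance probability, sorted flows) for an FDC plot:
    flows sorted descending; exceedance p = rank / (n + 1) (Weibull)."""
    q = np.sort(np.asarray(flows, float))[::-1]
    p = np.arange(1, len(q) + 1) / (len(q) + 1.0)
    return p, q
```

Comparing the observed and simulated curves at small p (high flows) and large p (low flows) gives exactly the high-flow/low-flow diagnosis discussed above.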

Conclusions
This study presents a DWT-VMD-GRU model for predicting monthly runoff and evaluates its predictive performance against alternative models. The following conclusions can be drawn: (1) The DWT-VMD-GRU model, utilizing runoff and rainfall time series as inputs, demonstrates superior generalization compared to other models, including GRU, LSTM, DWT-GRU, and DWT-LSTM, consistently exhibiting better performance across various evaluation metrics. This superiority is attributed to the denoising effect of DWT, the enhanced sequential feature extraction of VMD, and the analytical capability of GRU on the input data. (2) Optimal sliding window durations for the different input combinations typically range from 1 to 3 months, with the DWT-VMD-GRU model (using runoff and rainfall) showing peak performance with a one-month sliding window.
Despite the limitations of the dataset, which was constrained by the availability of measured runoff data, this study provides a foundation for future research directions: (1) Evaluate the robustness of the DWT-VMD-GRU model across diverse hydrological stations and its capability to forecast daily and hourly runoff. (2) Evaluate the computational time and learning efficiency of the neural networks, employing regularization to reduce the risk of overfitting and utilizing GPU acceleration. (3) Analyze parameter sensitivity, including hyperparameter optimization of DWT, VMD, and the neural networks, and conduct principal component analysis [40] to determine a more rational combination of input variables. (4) Explore the integration of other signal decomposition methods, deep learning models, and probabilistic forecasting techniques.
Our dataset comprises the monthly average runoff, water level, and rainfall data from the Wuzhou Station, spanning February 2005 to December 2017 (155 months). The data are partitioned 80% to 20% into training and test phases: the calibration phase covers February 2005 to May 2015 (124 months) and the subsequent test phase covers June 2015 to December 2017 (31 months).
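The chronological split and the sliding-window sample construction used throughout the study can be sketched as follows. The window length `w`, the `make_windows` helper, and the linear stand-in series are illustrative assumptions; the study's reported optimal windows range from 1 to 3 months.

```python
import numpy as np

def make_windows(series, w):
    """Build (samples, window) antecedent inputs and next-month targets."""
    X = np.stack([series[i:i + w] for i in range(len(series) - w)])
    y = series[w:]
    return X, y

# Illustrative stand-in for the 155-month Wuzhou runoff record.
runoff = np.linspace(1000.0, 10000.0, 155)
w = 1  # one-month sliding window, the optimum reported for DWT-VMD-GRU (Q&P)
X, y = make_windows(runoff, w)

# 80/20 chronological split: the first 124 months calibrate, the last 31 test.
split = 124 - w  # the first w months are consumed as antecedent inputs
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
```

Splitting by position rather than at random preserves the temporal order, so the test phase is a genuine out-of-sample forecast of the most recent 31 months.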

Figure 1. Study site.

Figure 3. Diagram of discrete wavelet transform (DWT). Approximate components, Ai; detail components, Di. i and M are the serial number and total number of decomposed layers.
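The study selects a Daubechies basis via the signal-to-noise comparison in Figure 5; as a simplified illustration of how the approximate (Ai) and detail (Di) components in Figure 3 arise, the sketch below performs one level of a Haar transform, the simplest wavelet, in plain numpy. The function names and signal are illustrative only.

```python
import numpy as np

def haar_dwt_step(s):
    """One DWT level: approximate (low-pass) and detail (high-pass) parts."""
    s = np.asarray(s, dtype=float)
    a = (s[0::2] + s[1::2]) / np.sqrt(2)  # approximate component A1
    d = (s[0::2] - s[1::2]) / np.sqrt(2)  # detail component D1
    return a, d

def haar_idwt_step(a, d):
    """Invert one level; zeroing small detail coefficients denoises."""
    s = np.empty(2 * a.size)
    s[0::2] = (a + d) / np.sqrt(2)
    s[1::2] = (a - d) / np.sqrt(2)
    return s

signal = np.array([4.0, 2.0, 5.0, 7.0, 1.0, 3.0, 6.0, 8.0])
a1, d1 = haar_dwt_step(signal)
assert np.allclose(haar_idwt_step(a1, d1), signal)  # perfect reconstruction
```

Applying `haar_dwt_step` again to `a1` yields A2 and D2, and so on up to the M decomposed layers of Figure 3; denoising keeps the approximate components and thresholds the detail coefficients before reconstruction.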

Figure 4. Architectures of (a) gated recurrent unit and (b) long short-term memory. σr and σz stand for the logical sigmoid functions of the reset and update gates. σf, σi, and σo stand for the logical sigmoid functions of the forget, input, and output gates at the t-th time step.
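The gating in Figure 4a can be written out as a single GRU time step. The sketch below uses the standard GRU update equations with random placeholder weights; the dimensions (two inputs, e.g. runoff and rainfall, and four hidden units) are illustrative, not the study's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One gated-recurrent-unit step.

    W, U, b hold the reset (r), update (z), and candidate (n) parameters;
    sigma_r and sigma_z in Figure 4a correspond to the two sigmoids below.
    """
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])          # reset gate
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])          # update gate
    n = np.tanh(W["n"] @ x_t + U["n"] @ (r * h_prev) + b["n"])    # candidate
    return (1.0 - z) * n + z * h_prev                              # new state

rng = np.random.default_rng(0)
dx, dh = 2, 4  # e.g. (runoff, rainfall) inputs, 4 hidden units
W = {k: rng.standard_normal((dh, dx)) for k in "rzn"}
U = {k: rng.standard_normal((dh, dh)) for k in "rzn"}
b = {k: np.zeros(dh) for k in "rzn"}
h = gru_step(rng.standard_normal(dx), np.zeros(dh), W, U, b)
```

With only two gates the GRU has fewer parameters per hidden unit than the LSTM of Figure 4b, which maintains a separate cell state and three gates; this is the source of the lower computational burden noted in the Introduction.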

Figure 5. Signal-to-noise ratio after decomposition with various basis functions. Daubechies, DB.

Figure 6. Discrete wavelet transform coefficients of (a) runoff, (b) water level, and (c) rainfall time series. Ai, i-th approximate component; Di, i-th detail component; S, original signal.

Figure 7. Decomposition of the monthly runoff series by variational modal decomposition with K = 5, including the denoised sequence and intrinsic mode functions (IMFs).
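VMD extracts K band-limited intrinsic mode functions, each concentrated around an adaptively updated center frequency, by solving a variational problem with ADMM in the frequency domain. The following is a minimal numpy sketch of that scheme, not the authors' implementation; the parameters (penalty `alpha`, `tau = 0` relaxing exact reconstruction, a two-tone test signal instead of the runoff series with K = 5) are illustrative assumptions.

```python
import numpy as np

def vmd(f, K, alpha=2000.0, tau=0.0, n_iter=500, tol=1e-7):
    """Minimal variational modal decomposition (after Dragomiretskiy & Zosso).

    Returns K modes (time domain) and their center frequencies (cycles/sample).
    """
    T = len(f)
    # Mirror-extend the signal to reduce boundary effects.
    fm = np.concatenate([f[T // 2 - 1::-1], f, f[:T // 2 - 1:-1]])
    N = len(fm)
    freqs = np.fft.fftshift(np.fft.fftfreq(N))
    f_hat_plus = np.where(freqs >= 0, np.fft.fftshift(np.fft.fft(fm)), 0)

    u_hat = np.zeros((K, N), dtype=complex)
    omega = np.linspace(0, 0.5, K, endpoint=False)  # initial center freqs
    lam = np.zeros(N, dtype=complex)  # Lagrange multiplier spectrum
    half = N // 2
    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter update: narrow-band filter centered at omega[k].
            u_hat[k] = (f_hat_plus - others + lam / 2) / (
                1 + 2 * alpha * (freqs - omega[k]) ** 2)
            power = np.abs(u_hat[k, half:]) ** 2
            omega[k] = np.sum(freqs[half:] * power) / np.sum(power)
        lam = lam + tau * (f_hat_plus - u_hat.sum(axis=0))
        if np.sum(np.abs(u_hat - u_prev) ** 2) < tol:
            break
    # Back to the time domain (real part of the analytic modes), un-mirrored.
    u = np.real(np.fft.ifft(np.fft.ifftshift(2 * u_hat, axes=-1), axis=-1))
    order = np.argsort(omega)
    return u[order][:, T // 2:T // 2 + T], omega[order]

# Two tones at normalized frequencies 0.005 and 0.05 (illustrative).
t = np.arange(1000)
sig = np.cos(2 * np.pi * 0.005 * t) + 0.5 * np.cos(2 * np.pi * 0.05 * t)
modes, omega = vmd(sig, K=2)
```

The converged `omega` values correspond to the center frequencies tabulated in Table 1; in the study, each of the K = 5 modes is then modeled by its own GRU and the sub-predictions are summed.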

Figure 8 presents an overview of the errors observed across the GRU, LSTM, DWT-GRU, DWT-LSTM, and DWT-VMD-GRU models. Particularly noteworthy is the superior performance exhibited by the DWT-VMD-GRU model across both the training and testing phases. The main findings can be summarized as follows: (1) The DWT-VMD-GRU model, utilizing runoff and rainfall as inputs, exhibits superior generalization ability, consistently outperforming the other models across diverse performance metrics. This underscores the effectiveness of the integrated denoising, decomposition, and neural network methodology. (2) Overall, GRU demonstrates superior performance compared to LSTM. However, merely integrating DWT with the GRU or LSTM models may yield adverse effects, whereas incorporating VMD alongside DWT substantially improves the simulation efficacy of the GRU models. Figure 9 presents the outcomes of GRU (Q&P), LSTM (Q&P), DWT-GRU (Q&P), DWT-LSTM (Q&P), and DWT-VMD-GRU (Q&P). Analysis of the results during the testing phase highlights that DWT-VMD-GRU exhibits the most favorable simulation performance, followed by GRU, DWT-GRU, and DWT-LSTM, with LSTM displaying the least desirable performance. Hence, the following conclusions can be drawn: (1) GRU surpasses the LSTM model in simulation performance. (2) The composite model, integrating denoising, decomposition, and neural network components, exhibits superior simulation performance compared to the individual models.
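The error comparison above rests on the five metrics reported in the study (RMSE, MAE, MAPE, NSE, KGE). The sketch below computes them with their standard definitions (MAPE as a fraction, matching the reported 0.033; KGE in the Gupta et al. 2009 form); the observed series is synthetic and for illustration only.

```python
import numpy as np

def metrics(obs, sim):
    """RMSE, MAE, MAPE, Nash-Sutcliffe efficiency, Kling-Gupta efficiency.

    MAPE is returned as a fraction and requires nonzero observations.
    """
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    err = sim - obs
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / obs))
    nse = 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    kge = 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "NSE": nse, "KGE": kge}

obs = np.array([1200.0, 3400.0, 8100.0, 5600.0, 2300.0])
m = metrics(obs, obs)  # perfect prediction: errors zero, efficiencies one
assert m["RMSE"] == 0.0 and m["NSE"] == 1.0
```

RMSE, MAE, and MAPE measure error magnitude (smaller is better), while NSE and KGE measure efficiency against the observed mean and its statistics (closer to 1 is better), which is why both families are reported side by side.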

Figure 9. Observed monthly runoff series and predicted series by the neural network models. The input variables for all models comprise the monthly runoff and rainfall series. Discrete wavelet transform, DWT; gated recurrent unit, GRU; long short-term memory, LSTM; variational modal decomposition, VMD.

Figure 10. Kling-Gupta efficiency values of DWT-VMD-GRU in predicting monthly runoff with different input combinations and sliding windows. Dots indicate the optimal sliding window for each input combination. Discrete wavelet transform, DWT; gated recurrent unit, GRU; variational modal decomposition, VMD.

Figure 11 depicts the runoff time series simulated by the DWT-VMD-GRU model utilizing various input variables. Our analysis yields the following conclusions: (1) Examination of the residual plot indicates that the DWT-VMD-GRU model utilizing runoff and rainfall as inputs exhibits superior performance compared to the three other input combinations, a finding consistent with Tiwari et al.'s (2022) conclusion [41]. (2) Regarding the performance in predicting the maximum annual runoff, the DWT-VMD-GRU model demonstrates that the combination of runoff and rainfall as inputs yields the best results, followed by runoff and water level, and finally runoff alone. In summary, the model utilizing runoff and rainfall as inputs outperforms the model using only runoff in simulating the overall sequence.

Figure 12. Flow duration curves: (a) optimal models GRU, LSTM, DWT-GRU, DWT-LSTM, and DWT-VMD-GRU; (b) DWT-VMD-GRU models with various input combinations. Solid lines indicate flow duration curves across both the training and testing phases; dashed lines illustrate flow duration curves for the testing phase only. Discrete wavelet transform, DWT; gated recurrent unit, GRU; long short-term memory, LSTM; variational modal decomposition, VMD. Q, L, and P denote the runoff, water level, and precipitation time series, respectively.

Table 1. Center frequencies of modal components in different variational modal decomposition scenarios.