3.1. Runoff Simulation Results of Process-Driven Models
To maintain consistency with the calibration and validation periods used in the subsequent machine learning models, both physically based hydrological models used 2005–2018 for calibration and 2019–2021 for validation. The Xinanjiang model was calibrated on the 2005–2018 data, and the calibrated parameters were then held fixed to simulate runoff for 2019–2021.
The results indicate that both models performed well in the study area, as summarized in Table 4. The NSE values for the calibration period reached 0.829 and 0.806 for the Xinanjiang and SWAT models, respectively, while those for the validation period were 0.840 and 0.825. Comparative analysis shows that the Xinanjiang model achieved slightly higher simulation accuracy than the SWAT model, with better performance in all evaluation metrics during both the calibration and validation periods. The detailed daily runoff simulation results for both models during the validation period are shown in Figure 6 and Figure 7.
Although the SWAT model theoretically provides a more detailed representation of hydrological processes across heterogeneous landscapes by accounting for spatial variability, its actual performance in this study area was slightly inferior to that of the Xinanjiang model. This may be attributed to the combined influence of regional factors such as topography, land use, and meteorological conditions, under which the simpler structure and fewer parameters of the Xinanjiang model allowed it to better adapt to the hydrological characteristics of the basin. Therefore, considering all aspects, the Xinanjiang model is deemed more suitable for daily runoff simulation in this study area.
Figure 7. Daily Runoff Simulation Results of the Xinanjiang Model during the Validation Period.
3.2. Runoff Simulation Results of Data-Driven Models
In the process-driven modeling stage, two types of meteorological data—observed and gridded—were used for runoff simulation. To ensure comparability, the same two datasets were adopted as input for the data-driven models in this section, with historical runoff data additionally included to improve simulation and prediction accuracy. The effects of different input data types and model structures on runoff prediction performance were analyzed by comparing evaluation metrics across experiments.
Four data-driven models were employed in this study, and five different input data combinations were designed:
1. Using only historical runoff for prediction;
2. Using rainfall and evaporation data;
3. Using rainfall, evaporation, and historical runoff;
4. Using six meteorological variables for prediction; and
5. Using six meteorological variables together with historical runoff.
Each of the five input combinations was applied to the four data-driven models, resulting in a total of 20 model configurations. The objective was to compare and analyze the impact of different input datasets and model architectures on prediction accuracy. The specific scheme configurations are summarized in Table 5.
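The 5 × 4 experimental design can be enumerated programmatically. The following is a minimal sketch, not the study's actual code: the labels follow the scheme names that appear in the text (e.g., XGB-3, Seq2seq-3), the `SVR` prefix is assumed by analogy, and the feature-group names are illustrative placeholders.

```python
from itertools import product

# Model prefixes follow the scheme labels used in the text (e.g. "XGB-3");
# the "SVR" prefix is an assumption made by analogy.
MODELS = ["SVR", "XGB", "GRU", "Seq2seq"]

# Input combinations 1-5; the feature-group names are illustrative only.
INPUT_SCHEMES = {
    1: ["historical_runoff"],
    2: ["observed_rainfall", "observed_evaporation"],
    3: ["observed_rainfall", "observed_evaporation", "historical_runoff"],
    4: ["gridded_meteorology"],
    5: ["gridded_meteorology", "historical_runoff"],
}


def build_configurations():
    """Map each scheme label (such as 'GRU-3') to its list of input groups."""
    return {
        f"{model}-{scheme}": list(features)
        for model, (scheme, features) in product(MODELS, INPUT_SCHEMES.items())
    }


configs = build_configurations()
print(len(configs))  # 20 configurations: 4 models x 5 input schemes
```

Keeping the design in one table-like structure makes it straightforward to loop over every configuration with the same training and evaluation routine.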
A comparison of the best-performing schemes among different models is presented in Table 6. As shown in the table, the XGB-3 and GRU-3 models exhibit comparable prediction accuracy, with NSE values of 0.844 and 0.846, respectively, during the validation period. The Seq2seq-3 model achieved the highest accuracy, with a validation NSE of 0.859, indicating that under identical input data conditions, the Seq2seq model performs best in this study area.
A comparison of the effects of different input schemes on runoff prediction accuracy is illustrated in Figure 8. As shown in the figure, during the validation period, all models achieved their highest NSE values when using the input data from Scheme 3, followed by Scheme 2. In contrast, when using Schemes 4 and 5, the NSE values during the validation period generally ranged between 0.5 and 0.6, indicating relatively poor predictive performance.
Scheme 2 was based on observed rainfall and evaporation data from rain gauge and hydrological stations, while Scheme 3 further incorporated observed runoff data, leading to noticeable improvements in the results across all four data-driven models compared with Scheme 2. Scheme 4 utilized gridded meteorological data—including rainfall, relative humidity, and temperature—and Scheme 5 additionally included observed runoff data. However, the predictive performance of Schemes 4 and 5 did not differ significantly from that of Scheme 1, which used only observed runoff as input.
These results suggest that the relatively small size of the study basin and the coarse spatial resolution of the gridded dataset limit its ability to accurately capture regional variations. To further substantiate this conclusion, we conducted a quantitative evaluation of the CN05.1 gridded precipitation dataset. CN05.1 has a spatial resolution of 0.25° × 0.25°, with only six grid cells covering the study basin. Basin-averaged daily rainfall was derived for both station observations and gridded data using the Thiessen polygon method. The evaluation results show that CN05.1 exhibits noticeable discrepancies relative to gauge observations, with MAE = 3.97 mm, RMSE = 8.69 mm, and NSE = 0.40. The basin-wide mean bias is −0.255 mm (bias ratio −2.9%), indicating a slight overall underestimation. In addition, the mean bias at mountainous stations (−0.48 mm) is substantially larger than that at plain stations (−0.10 mm), reflecting the inability of coarse-resolution grids to capture orographic precipitation enhancement.
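The evaluation statistics quoted above can be reproduced with a few lines of code. The sketch below is illustrative rather than the study's implementation: the function names are hypothetical, the Thiessen average is written as the usual area-weighted mean, and the bias ratio is taken here as the mean bias divided by the mean observed value, which is an assumption because the text does not state its definition.

```python
import math


def thiessen_average(values, polygon_areas):
    """Area-weighted mean of station (or grid-cell) values for one day."""
    total_area = sum(polygon_areas)
    return sum(v * a for v, a in zip(values, polygon_areas)) / total_area


def evaluate_against_gauge(obs, sim):
    """Compare a gridded basin-average series (sim) with the gauge series (obs)."""
    if len(obs) != len(sim) or not obs:
        raise ValueError("obs and sim must be non-empty and of equal length")
    n = len(obs)
    errors = [s - o for o, s in zip(obs, sim)]
    mean_obs = sum(obs) / n
    squared_error = sum(e * e for e in errors)
    bias = sum(errors) / n  # negative values indicate underestimation
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(squared_error / n),
        # Nash-Sutcliffe efficiency: 1 - SSE / variance of observations (unnormalized)
        "NSE": 1.0 - squared_error / sum((o - mean_obs) ** 2 for o in obs),
        "bias": bias,
        "bias_ratio": bias / mean_obs,  # assumed definition, see lead-in
    }


# Toy series for illustration only (not data from the study basin).
metrics = evaluate_against_gauge([0.0, 5.0, 10.0, 5.0], [1.0, 4.0, 8.0, 5.0])
print(metrics)
```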
Considering the spatial continuity of rainfall, we further computed Pearson correlation coefficients for all 15 pairwise combinations among the six stations/grids using annual mean precipitation from 2005 to 2020. The average spatial correlation among the gauge stations is 0.911, whereas CN05.1 shows a much higher average correlation of 0.975. This inflated correlation indicates a pronounced smoothing effect: the coarse grid spacing suppresses spatial variability and causes neighboring grid cells to become overly similar, thereby artificially increasing spatial correlation.
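As a rough illustration of this calculation, the sketch below averages the Pearson coefficient over every pair of sites; with six sites this gives C(6, 2) = 15 pairs. The function names and the toy input are assumptions for illustration, and the annual precipitation series themselves are not reproduced here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(series_by_site):
    """Return (mean correlation, number of pairs) over all site pairs."""
    pairs = list(combinations(sorted(series_by_site), 2))
    coefficients = [pearson(series_by_site[a], series_by_site[b]) for a, b in pairs]
    return sum(coefficients) / len(coefficients), len(pairs)


# Toy example: six perfectly correlated sites (synthetic values, not study data).
base = [900.0, 1100.0, 1000.0, 1250.0, 950.0]
toy_sites = {f"site_{i}": [value * (i + 1) for value in base] for i in range(6)}
mean_r, n_pairs = mean_pairwise_correlation(toy_sites)
print(n_pairs)  # 15
```

Running the same function once on the gauge series and once on the grid-cell series would yield the two averages that are compared in the text.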
These results confirm that although CN05.1 exhibits internally high spatial consistency, its substantial bias and lower predictive performance are mainly attributable to its coarse spatial resolution, which fails to represent the true spatial heterogeneity and localized rainfall features of the study basin. In contrast, traditional observational data more effectively capture spatial rainfall variability within the basin. Moreover, rainfall and evaporation data play a non-negligible role in improving runoff prediction accuracy in this region.
Figure 8. Bar Chart of NSE Evaluation Indicators for Model Runoff Prediction Results.
Based on multiple evaluation metrics, this section evaluated and compared the performance of four data-driven models (SVR, XGBoost, GRU, and Seq2seq) in daily runoff prediction. The Seq2seq model exhibited the best overall performance, with Scheme Seq2seq-3 achieving a validation NSE of 0.859, the highest among the four models. The GRU and XGB models demonstrated comparable accuracy, with NSE values of 0.846 and 0.844, respectively, while the SVR model performed the worst, with an NSE of 0.723.
Through comparative analysis of different input schemes, it was found that all models achieved the highest NSE values when using the input data from Scheme 3, followed by Scheme 2, while Schemes 4 and 5 showed relatively poor predictive performance. These findings indicate that, in this study area, observed data outperform gridded datasets, and the inclusion of observed runoff data plays a crucial role in enhancing the predictive capability of data-driven models.
In addition to predictive performance, we also evaluated the computational practicality of the data-driven models, as operational flood forecasting requires models that are both accurate and efficient. All experiments were conducted on a standard workstation. For a dataset of approximately 5000 time-series samples, the GRU model required about 36 s to complete 100 training epochs, while the Seq2seq model required around 48 s due to its encoder–decoder architecture. Despite the moderately higher training cost, the Seq2seq model’s improved multi-step prediction accuracy suggests that its computational demands remain acceptable for practical hydrological forecasting applications.
3.3. Runoff Simulation Results of Physical–Data-Driven Hybrid Models
When the outputs of the physical models (Xinanjiang and SWAT) were incorporated as additional features into the data-driven models, the runoff prediction accuracy improved significantly.
When the surface runoff data generated by the Xinanjiang model were integrated, the Seq2seq model (Scheme XAJ-Seq2seq-2) achieved a validation NSE of 0.912, representing a substantial improvement over the best pre-integration scheme (Seq2seq-3, NSE = 0.859). This demonstrates that intermediate variables provided by the physical model, which directly reflect the runoff generation process, serve as highly valuable prior information for data-driven models.
Similarly, after incorporating the soil moisture data from the SWAT model, the Seq2seq model (Scheme SWAT-Seq2seq-1) also achieved a validation NSE of 0.912. This finding indicates that the inclusion of a key physical variable—basin water storage status (soil moisture)—helps the model more accurately determine runoff responses following rainfall events.
Overall, the success of the feature-fusion strategy verifies that physical models can provide valuable internal hydrological state information that data-driven models cannot directly extract from raw observations, thereby effectively enhancing prediction accuracy.
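The feature-fusion idea can be sketched as appending a physical-model output to the lagged observational inputs. The snippet below is a minimal illustration under stated assumptions: the function name, the single-step lag, and the exact feature set are hypothetical, and the study's actual window length and preprocessing are not given in this section.

```python
def build_fused_samples(rainfall, evaporation, runoff, physical_output, lag=1):
    """Build (features, target) pairs in which each day's runoff is predicted
    from inputs observed `lag` days earlier, plus one physical-model variable
    (e.g. simulated runoff generation or soil moisture) from the same day."""
    n = len(runoff)
    if not (len(rainfall) == len(evaporation) == len(physical_output) == n):
        raise ValueError("all input series must have the same length")
    samples = []
    for t in range(lag, n):
        features = [
            rainfall[t - lag],
            evaporation[t - lag],
            runoff[t - lag],
            physical_output[t - lag],  # the fused physical-model feature
        ]
        samples.append((features, runoff[t]))
    return samples


# Toy series for illustration only (not data from the study basin).
samples = build_fused_samples(
    rainfall=[0.0, 12.0, 3.0, 0.0],
    evaporation=[2.0, 1.0, 1.5, 2.5],
    runoff=[5.0, 6.0, 14.0, 9.0],
    physical_output=[0.1, 0.2, 4.0, 1.0],
)
print(samples[0])
```

Because the physical-model variable enters as an ordinary input column, the same sample builder can serve either the Xinanjiang or the SWAT output without changes to the downstream model.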
2. Evaluation of the Physics-Constrained Strategy
Model performance also improved through the introduction of synthetic datasets and physically constrained loss functions, particularly in the simulation of extreme events, as summarized in Table 7.
Under the optimal scheme (PG-Seq2seq-4), the validation NSE reached 0.898. Although this value is slightly lower than that achieved by the feature-fusion strategy, it is still significantly higher than that of the standalone data-driven model.
Comparison of flood hydrographs revealed that, although the physics-constrained models did not exhibit absolute superiority in overall NSE, they demonstrated excellent capability in capturing flood peaks—especially the highest peaks during the validation period—where predicted values closely matched observations. This finding directly confirms that the designed synthetic rainfall dataset and rainfall-constraint loss function effectively enhance the model’s responsiveness and accuracy in simulating high-flow events.
Six optimal hybrid schemes were compared with a focus on flood-period prediction results: those integrating the Xinanjiang model (XAJ-Seq2seq-2 and XAJ-Seq2seq-3), the SWAT model (SWAT-Seq2seq-1 and SWAT-Seq2seq-2), and the physics-guided models (PG-Seq2seq-1 and PG-Seq2seq-4). The comparison is illustrated in Figure 9.
As shown in the figure, although the overall accuracy of the physics-guided models was slightly lower than that of the other schemes, the introduction of synthetic rainfall datasets and rainfall-constraint loss functions enabled the two hybrid models, PG-Seq2seq-1 and PG-Seq2seq-4, to exhibit superior performance in fitting runoff peaks. In particular, during the validation period, these models more accurately reproduced the highest daily runoff peaks. This finding indicates that the synthetic rainfall dataset and rainfall-constraint loss function effectively enhance the model’s responsiveness to high-flow events.
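Peak fit is assessed here from the hydrographs, but it can also be quantified. The following sketch shows one common way to do so, using the relative error of the peak magnitude and the offset in peak timing. These two measures are an illustrative choice and are not stated in the text as the metrics used in this study; the function name and the toy values are hypothetical.

```python
def peak_errors(observed, simulated):
    """Relative peak-magnitude error and peak-timing offset for one flood event."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    obs_day = max(range(len(observed)), key=lambda i: observed[i])
    sim_day = max(range(len(simulated)), key=lambda i: simulated[i])
    obs_peak, sim_peak = observed[obs_day], simulated[sim_day]
    return {
        # negative values mean the simulated peak is too low
        "relative_peak_error": (sim_peak - obs_peak) / obs_peak,
        # positive values mean the simulated peak arrives late
        "timing_error_days": sim_day - obs_day,
    }


# Toy flood event for illustration only (not data from the study basin).
result = peak_errors([10.0, 50.0, 200.0, 80.0], [12.0, 60.0, 180.0, 90.0])
print(result)
```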
3.4. Interpretability Analysis
Using SHAP analysis, this study investigated the internal decision-making mechanisms of different hybrid models in runoff prediction. In addition to the qualitative visualization provided by the Global SHAP Contribution Plot (Figure 10), we further summarized the relative importance of key predictors by reporting their mean absolute SHAP values across samples, enabling a more objective comparison of feature importance.
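The mean absolute SHAP value of a feature is the average of |SHAP| over all samples. A minimal sketch of that aggregation is given below, assuming the per-sample SHAP values have already been computed and arranged as one row per sample and one column per feature. The function name and the toy numbers are hypothetical and do not reproduce the study's results.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| across samples (largest first)."""
    n_samples = len(shap_values)
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    importance = [(name, total / n_samples) for name, total in zip(feature_names, totals)]
    return sorted(importance, key=lambda item: item[1], reverse=True)


# Toy SHAP matrix: two samples, two features (illustrative values only).
ranking = mean_abs_shap(
    [[0.5, -0.2], [-0.3, 0.1]],
    ["simulated_runoff_t-1", "rainfall_t-1"],
)
print(ranking)
```

Taking the absolute value before averaging prevents positive and negative contributions from cancelling, which is why this summary complements the signed contributions shown in a global SHAP plot.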
In the Seq2seq model that incorporated runoff-generation data from the Xinanjiang model (XAJ-Seq2seq-2), the most influential feature was the simulated runoff generation at time t − 1, with a mean |SHAP| value of 0.44. In the model integrating simulated discharge, the simulated runoff at time t − 1 showed the highest importance (0.42).
In contrast, raw rainfall and historical runoff features exhibited lower contributions, with mean |SHAP| values ranging from 0.25 to 0.30. This indicates that the hybrid models successfully shifted their decision focus toward more physically meaningful hydrological variables.
In the SWAT-Seq2seq-1 model, rainfall at time t − 1 remained the most significant feature (mean |SHAP| = 0.35), followed by soil moisture at time t − 1, which reflects the hydrological principle that runoff generation depends jointly on rainfall occurrence and antecedent wetness.
2. Physics-Constrained Models:
In the physics-constrained models (PG-Seq2seq-1 and PG-Seq2seq-4), rainfall at time t − 1 continued to be the most influential feature. An interesting observation is that, due to the inclusion of numerous artificially generated drought scenarios during training, the relative importance of the evaporation feature increased and showed a predominantly negative contribution—higher evaporation corresponded to lower runoff. This finding demonstrates that, through physical constraints, the model was guided to focus on key hydrological processes under specific physical conditions (e.g., drought periods), thereby making its behavior more consistent with physical laws. This change in feature importance is fully consistent with hydrological theory, as evapotranspiration becomes the dominant control on soil-water depletion and streamflow recession under drought conditions.