Monthly Rainfall-Runoff Modeling at Watershed Scale: A Comparative Study of Data-Driven and Theory-Driven Approaches

Data-driven machine learning approaches have been rapidly developed in the past 10 to 20 years and applied to various problems in the field of hydrology. To investigate the capability of data-driven approaches in rainfall-runoff modeling in comparison to theory-driven models, we conducted a comparative study of simulated monthly surface runoff at 203 watersheds across the contiguous USA using a conceptual model, the proportionality hydrologic model, and a data-driven Gaussian process regression model. With the same input variables of precipitation and mean monthly aridity index, the two models showed similar performance. We then introduced two more input variables in the data-driven model: potential evaporation and the normalized difference vegetation index (NDVI), which were selected based on hydrologic knowledge. The modified data-driven model performed much better than either the conceptual or original data-driven model. A sensitivity analysis was conducted on all three models tested in this study, which showed that surface runoff responded positively to increased precipitation. However, a confounding effect on surface runoff sensitivity was found among mean monthly aridity index, potential evaporation, and NDVI. This confounding was caused by complex interconnections among energy supply, vegetation coverage, and climate seasonality of the watershed system. We also conducted an uncertainty analysis on the two data-driven models, which showed that both models had reasonable predictability within the 95% confidence interval. With the additional two input variables, the modified data-driven model had lower prediction uncertainty and higher prediction accuracy.


Introduction
With the rapid development of computational capability, data-driven machine learning methods have become more popular in the past decade in all fields related to data and modeling, including hydrology (e.g., [1][2][3][4][5][6][7]). Unlike typical hydrologic models, data-driven approaches do not rely directly on explicit physical knowledge of the target process. Instead, they build a purely empirical model based on observed relationships between input and output variables. Using various learning algorithms (e.g., [8][9][10][11][12]), data-driven approaches provide a flexible way to model complex phenomena such as runoff generation. State-of-the-art algorithms such as artificial neural network [1,4] and Gaussian process regression [13] have been applied to hydrologic modeling. For example, Sun et al. [13] built a data-driven model based on the Gaussian process regression algorithm to predict monthly streamflow at over 400 watersheds in the US and showed generally good performance of the model. framework. Next, we conducted a sensitivity analysis on all three models to show the controlling effect of input variables on monthly surface runoff and investigate the physical plausibility of the models. Finally, we conducted an uncertainty analysis on the two data-driven models to evaluate their prediction coverage within a 95% confidence interval.

Study Watersheds and Data Sources
Hydrometeorological data from the Model Parameter Estimation Experiment (MOPEX) were used in this study [27]. From a total of 438 MOPEX watersheds across the contiguous US, we selected 203 with higher data accuracy [25,26] (Figure 1). For the study watersheds, we collected daily precipitation and streamflow data from 1983 to 2002 from MOPEX. The streamflow data were separated into surface runoff and baseflow using the one-parameter digital filter method with a filter parameter value of 0.925 [28]. The surface runoff data from the separation were used in this study. The monthly potential evaporation data from 1983 to 2002 were collected from Zhang et al. [29] and validated with flux tower measurements. Values of mean monthly aridity index were obtained from Chen et al. [25] to determine seasonality of the study watersheds. Finally, we used the normalized difference vegetation index (NDVI) to represent vegetation coverage. Bimonthly NDVI data were collected from Advanced Very High Resolution Radiometer (AVHRR) imagery from the Global Inventory Modeling and Mapping Studies (GIMMS), which can be downloaded at http://staff.glcf.umd.edu/sns/htdocs/data/gimms/ index.shtml [30]. All the different types of data collected in this study were converted to watershed scale with monthly time steps.
Water 2018, 10, x FOR PEER REVIEW 3 of 22 effect of input variables on monthly surface runoff and investigate the physical plausibility of the models. Finally, we conducted an uncertainty analysis on the two data-driven models to evaluate their prediction coverage within a 95% confidence interval.

Study Watersheds and Data Sources
Hydrometeorological data from the Model Parameter Estimation Experiment (MOPEX) were used in this study [27]. From a total of 438 MOPEX watersheds across the contiguous US, we selected 203 with higher data accuracy [25,26] (Figure 1). For the study watersheds, we collected daily precipitation and streamflow data from 1983 to 2002 from MOPEX. The streamflow data were separated into surface runoff and baseflow using the one-parameter digital filter method with a filter parameter value of 0.925 [28]. The surface runoff data from the separation were used in this study. The monthly potential evaporation data from 1983 to 2002 were collected from Zhang et al. [29] and validated with flux tower measurements. Values of mean monthly aridity index were obtained from Chen et al. [25] to determine seasonality of the study watersheds. Finally, we used the normalized difference vegetation index (NDVI) to represent vegetation coverage. Bimonthly NDVI data were collected from Advanced Very High Resolution Radiometer (AVHRR) imagery from the Global Inventory Modeling and Mapping Studies (GIMMS), which can be downloaded at http://staff.glcf.umd.edu/sns/htdocs/data/gimms/index.shtml [30]. All the different types of data collected in this study were converted to watershed scale with monthly time steps.

Theory-Driven Conceptual Hydrologic Model
The theory-driven model we selected in this study is the proportionality hydrologic model (PHM), a simple conceptual hydrologic model with 2 equations for the surface runoff simulation [26]:

Theory-Driven Conceptual Hydrologic Model
The theory-driven model we selected in this study is the proportionality hydrologic model (PHM), a simple conceptual hydrologic model with 2 equations for the surface runoff simulation [26]: where Q d is surface runoff (mm), P is precipitation (mm), λ is initial wetting fraction, and W is wetting capacity (mm). In this model, P is the only monthly input, while λ and W are parameters with fixed values for each watershed. In addition, as part of model input preparation, we divided monthly data into energy-limited months with subscript e and water-limited months with subscript w, based on the mean monthly aridity index [25]: where AI m is the mean monthly aridity index of month m, E Pm is the mean monthly potential evaporation (mm), P m is the mean monthly precipitation (mm), and ∆S m is the mean monthly water storage change (mm). The values of AI m for the 203 study watersheds were obtained from Chen et al. [25]. Energy-limited months are the ones with AI m ≤ 1 and water-limited months are the ones with AI m > 1. Surface runoff in energy-limited months was modeled using Equation (1) and in water-limited months by Equation (2). The equations are similar, but with 2 different sets of λ and W. All the energy-limited months in a year form the energy-limited season and the water-limited months in a year form the water-limited season. This model is based on the proportionality hypothesis [31], which is derived from the Soil Conservation Service (SCS) curve number method [32]. This model has shown capability to simulate surface runoff and streamflow in various watersheds at an intra-annual scale [26,33]. In PHM, monthly precipitation is the only monthly input. The mean monthly aridity index is used to define the seasonality of each month, which can be considered as indirect input. For the PHM simulation, the parameter values of λ e , λ w , W e , and W w were calibrated using the data from 1983 to 1992. The calibration was done by simulating surface runoff using all possible combinations of parameter values within predefined parameter value ranges [26] and selecting the combination with the highest Nash-Sutcliffe efficiency (NSE) value. The calibrated PHM was then used to simulate monthly surface runoff using precipitation as input in the period 1993 to 2002 for model validation.
The main reason that we chose the Gaussian process regression model is its parsimonious structure compared to other popular approaches such as artificial neural network (ANN) (e.g., [45]). Our GPR has a similar level of complexity to the conceptual PHM. Both models have 2 main equations. PHM has 4 parameters and GPR has 7. Another reason is that the Gaussian process model can provide a competitive solution in a highly automated manner. In fact, the method requires specification of only the covariance structure, for which the Matérn covariance function can be used in most cases [34]. Due to this advantage, we can model various types of complex relationships between the target response variable (i.e., monthly runoff) and the input variables for a large number of watersheds with high flexibility. Moreover, due to its parsimonious structure, GPR often does not suffer from overfitting issues, unlike more parameterized methods such as ANN.
We trained the following Gaussian process model to learn about the relationship between the input variables X 1 , . . . , X p , which are also called predictors (p) in data-driven methods, and the surface runoff Q d : Cov(w(t), w(s)) = ψI(t = s) + σ 2 where Q d (t) and X 1 , . . . , X p are surface runoff and predictors at time point t; θ is the power for variable transformation; β 0 and β 1 , . . . , β p are intercept and regression coefficients; w(t) is a Gaussian process with the covariance function defined in Equation (5); I(.) is an indicator function with a value of 1 if the condition in (.) holds and 0 otherwise; the parameters ψ, κ, φ 1 , . . . , φ p >0 are covariance parameters to be estimated as part of model training; and the correlation function K is a prespecified correlation function. In this study, we used a Matérn correlation function with smoothness parameter 1.5, but the result is not sensitive to the choice of smoothness parameter. The coefficients β 0 , . . . , β p and the covariance parameters σ 2 and φ 1 , . . . , φ p are estimated using maximum restricted likelihood estimation (see Stein [34], Section 6.4 for details), meaning that the objective function we used in model fitting is the restricted likelihood function. In our study, we set the power θ as 1/3, which resulted in the best modeling performance among other tried transformations, the square root transformation (θ = 1/2) and the log transformation. Altogether, this approach allows flexible modeling with 2p + 3 parameters (7 parameters for a 2-input model and 11 parameters for a 4-input model).
One advantage of GPR is that it is much easier to add new input variables to an existing model than it is in conceptual models. This is because, unlike conceptual models, adding new input variables does not usually require changing the structure of the existing model. The model can simply be refit using the same algorithm on the new dataset that includes new input variables. However, this also poses challenges in variable selection. In many hydrological problems, there are vast numbers of input parameters that can be added to the model, and the availability of potential input variables is growing due to advancing information technology. One way to tackle this issue is data mining or a purely data-driven approach, in which the modeler considers all available input variables and employs an input variable selection method [46][47][48][49][50][51][52]. The limitation of this input selection approach is that the physical reasoning behind the change in modeling performance due to different combinations of inputs is usually not clear.
In our study, the 2 input variables of precipitation and mean monthly aridity index for PHM and GPR were selected based on the input requirement of PHM. In this way, we could compare how PHM and GPR performed given the same input variables. We then selected 2 additional input variables for an extended Gaussian process regression (EGPR) model, potential evaporation and NDVI. Potential evaporation is one of the main external drivers of land surface water partitioning [53], and therefore has been widely used as an input or internal variable in hydrologic models for streamflow simulation (e.g., [54][55][56]). Vegetation coverage, which is represented by NDVI, has also shown controlling effects on intra-annual streamflow in previous studies (e.g., [13,26]). As a result, in the basic GPR, we have 2 input variables and 7 parameters, and in EGPR, we have 4 input variables and 11 parameters. Compared with the 4-parameter PHM, the number of parameters is slightly higher for GPR and EGPR, but still at a similar level. As with the PHM, the GPR and EGPR were trained using the data from 1983 to 1992 and used to predict runoff from 1993 to 2002 for validation.

Comparative Study, Sensitivity Analysis, and Uncertainty Analysis
To evaluate model performance, we used 2 error metrics, Nash-Sutcliffe efficiency (NSE) [57] and normalized root mean square error (NRMSE): where Q d,s,i and Q d,o,i are the simulated and observed surface runoff at time step i, Q d,o is the average observed surface runoff, and n is the total number of time steps during the simulation period. After performance evaluation, a sensitivity analysis was conducted on all 3 models to assess the level of control of each input variable on the simulated surface runoff. In the sensitivity analysis, we varied each of the monthly inputs from 10% to 190% of the observed mean values, while fixing the other input variables at their monthly averages for individual watersheds. Then we used these inputs to simulate monthly surface runoff using calibrated PHM, GPR, and EGPR, respectively. In this way, we obtained curves to represent the sensitivity of surface runoff to the change of input variables in each model for each study watershed. The results of the sensitivity analysis were used to check the physical plausibility of GPR and EGPR.
We also quantified prediction uncertainties by computing the 95% prediction intervals corresponding to each predicted value. The upper and lower limits were computed by following the standard procedure for computing the prediction standard deviation for GPR and utilizing the conditional multivariate normal distribution given by the GPR model [34]. The resulting prediction intervals provided a useful way to measure the amount of uncertainty regarding the prediction through its width. They also provided a useful way to examine whether the prediction models were well calibrated by comparing their nominal coverage to the actual coverage in the test dataset.

Model Performance of PHM and GPR
We used the exceedance frequency curve of NSE to show the overall performance of PHM, GPR, and EGPR in 203 MOPEX watersheds in the validation period ( Figure 2). PHM and GPR had similar model performance. GPR had slightly better performance, in that it had fewer watersheds with NSE lower than 0. This result indicates that with the same amount of input information, our data-driven model achieved a comparable level of modeling accuracy on simulated monthly surface runoff to our theory-driven conceptual hydrologic model. With the additional input information of potential evaporation and NDVI, selected based on hydrologic knowledge, EGPR had much better performance than both GPR and PHM.
To compare the performance of the three models spatially, we mapped the NSE values of watersheds in the US ( Figure 3). Again, GPR and PHM had similar spatial distributions of model performance, while PHM had slightly better performance in the Midwest and GPR had slightly better performance in the Northwest. Performance improvement from GPR to EGPR was found in the Northwest, High Plains, Midwest, and Southeast regions. We selected one watershed in the Northwest region and one in the High Plains region where GPR and EGPR had better performance and two watersheds in the Midwest where PHM had better performance to show the simulation results in time series in the validation period ( Figure 4). For the two Western watersheds (Figure 4a,b), both GPR and PHM had weak performance, with PHM having worse NSEs. In both watersheds, GPR underestimated surface runoff peaks, while PHM had trouble capturing the main peaks. EMLM had much better performance than both PHM and GPR. The improvement of EGPR from GPR was mainly due to better accuracy of the peak magnitude estimation. On the other hand, PHM had acceptable performance in the two selected Midwestern watersheds (Figure 4c,d). GPR overestimated the peaks in Little Blue River and underestimated the peaks in East Fork White River. PHM even outperformed EGPR in these two watersheds, especially at Little Blue River. EGPR still had large error on peak magnitude estimation in this region. Based on this comparison, GPR and EGPR did a better job of capturing the timing and magnitude of runoff peaks in the Northwestern regions, while PHM had higher accuracy on peak magnitude estimation in the Midwest.  In general, EGPR had the best performance among the three models, with the highest NSEs and lowest NRMSEs in the 203 MOPEX watersheds across the contiguous US (Table 1).   In general, EGPR had the best performance among the three models, with the highest NSEs and lowest NRMSEs in the 203 MOPEX watersheds across the contiguous US (Table 1). In general, EGPR had the best performance among the three models, with the highest NSEs and lowest NRMSEs in the 203 MOPEX watersheds across the contiguous US (Table 1).

Sensitivity Analysis
After evaluating the models' performance, we conducted a sensitivity analysis on each model. Our main objective was to show the sensitivity of surface runoff to the variability of input variables in these three models to gain insight into how the controlling effect of input variables on surface runoff is represented in the models. We employed graphic analysis to examine the effect of each input variable. We first tested the sensitivity of surface runoff to precipitation change in PHM across the 203 study watersheds ( Figure 5). Different from performance evaluation, sensitivity analysis was performed for the water-limited season and energy-limited season separately, since PHM simulates the two seasons separately. For both seasons, simulated surface runoff increased with increased precipitation, which was expected based on the form of the PHM equations. Percentage-wise, the sensitivity of surface runoff to precipitation change was higher in water-limited seasons ( Figure 5a). Also, the variability of sensitivity among watersheds in water-limited seasons was higher than that in energy-limited seasons. It should be noted that in both seasons, there was a small number of

Sensitivity Analysis
After evaluating the models' performance, we conducted a sensitivity analysis on each model. Our main objective was to show the sensitivity of surface runoff to the variability of input variables in these three models to gain insight into how the controlling effect of input variables on surface runoff is represented in the models. We employed graphic analysis to examine the effect of each input variable. We first tested the sensitivity of surface runoff to precipitation change in PHM across the 203 study watersheds ( Figure 5). Different from performance evaluation, sensitivity analysis was performed for the water-limited season and energy-limited season separately, since PHM simulates the two seasons separately. For both seasons, simulated surface runoff increased with increased precipitation, which was expected based on the form of the PHM equations. Percentage-wise, the sensitivity of surface runoff to precipitation change was higher in water-limited seasons ( Figure 5a). Also, the variability of sensitivity among watersheds in water-limited seasons was higher than that in energy-limited seasons. It should be noted that in both seasons, there was a small number of watersheds with negative NSE values, indicating that the surface runoff generation process at these watersheds was not well captured in PHM due to the simplicity of the model structure. These watersheds were eliminated from the PHM sensitivity analysis.
watersheds with negative NSE values, indicating that the surface runoff generation process at these watersheds was not well captured in PHM due to the simplicity of the model structure. These watersheds were eliminated from the PHM sensitivity analysis. We then tested the sensitivity of surface runoff to each input variable in GPR and EGPR by varying each variable while fixing the other variables at their monthly means (Figures 6 and 7). Similar to PHM, we also eliminated watersheds with negative NSEs in GPR and EGPR. The variability among sensitivity curves in GPR and EGPR was much higher than in PHM. To identify common patterns in these various complicated responses, we conducted clustering analysis on the individual sensitivity curves. We used hierarchical clustering [58] with the correlation distance [59] as the measure of similarity. The correlation distance, given as 2(1 − ), where is the correlation between two curves being compared, is an increasing function of the correlation coefficient, hence clustering based on this distance metric groups curves with similar shapes into the same cluster. The resulting clusters are shown as differently colored curves, and the overall median curve is shown in black in Figures 6 and 7. We used three clusters for GPR and four for EGPR. For the sake of comparison, GPR and EGPR sensitivity analysis was also performed on the two seasons separately. In these two models, the responses of surface runoff to changes in precipitation were quite similar, in that surface runoff increased with increased precipitation. The model responses to changes in mean monthly aridity index were more complex. The median trend was still similar in GPR and EGPR, in that surface runoff generally decreased with the increase of aridity index, and sensitivity was higher in energy-limited seasons. However, clusters of curves had different trends from the main trend, which will be discussed along with other clustering results later in this section.
For the two additional inputs in EGPR, the sensitivities of surface runoff to potential evaporation differed depending on the season. During water-limited seasons, the median response of surface runoff to the increase in potential evaporation monotonically increased (Figure 7e), while in energylimited seasons, the median surface runoff showed an increasing trend first and then changed to a decreasing trend (Figure 7f). In terms of NDVI, the curves are similar to the sensitivity results of aridity index. The median surface runoff response monotonically decreased with the increase of NDVI in water-limited seasons (Figure 7g). In energy-limited seasons, the surface runoff response still mainly decreased with the increase of NDVI. However, at the end, the curve shows an increasing trend. (Figure 7h).
The clustering of response curves provides further insights about the sensitivity differences between the study watersheds. For the increase of surface runoff response to precipitation, both GPR and EGPR had relatively unified patterns of monotonic increase in both seasons, except for a few We then tested the sensitivity of surface runoff to each input variable in GPR and EGPR by varying each variable while fixing the other variables at their monthly means (Figures 6 and 7). Similar to PHM, we also eliminated watersheds with negative NSEs in GPR and EGPR. The variability among sensitivity curves in GPR and EGPR was much higher than in PHM. To identify common patterns in these various complicated responses, we conducted clustering analysis on the individual sensitivity curves. We used hierarchical clustering [58] with the correlation distance [59] as the measure of similarity. The correlation distance, given as 2(1 − ρ), where ρ is the correlation between two curves being compared, is an increasing function of the correlation coefficient, hence clustering based on this distance metric groups curves with similar shapes into the same cluster. The resulting clusters are shown as differently colored curves, and the overall median curve is shown in black in Figures 6 and 7. We used three clusters for GPR and four for EGPR. For the sake of comparison, GPR and EGPR sensitivity analysis was also performed on the two seasons separately. In these two models, the responses of surface runoff to changes in precipitation were quite similar, in that surface runoff increased with increased precipitation. The model responses to changes in mean monthly aridity index were more complex. The median trend was still similar in GPR and EGPR, in that surface runoff generally decreased with the increase of aridity index, and sensitivity was higher in energy-limited seasons. However, clusters of curves had different trends from the main trend, which will be discussed along with other clustering results later in this section.
For the two additional inputs in EGPR, the sensitivities of surface runoff to potential evaporation differed depending on the season. During water-limited seasons, the median response of surface runoff to the increase in potential evaporation monotonically increased (Figure 7e), while in energy-limited seasons, the median surface runoff showed an increasing trend first and then changed to a decreasing trend (Figure 7f). In terms of NDVI, the curves are similar to the sensitivity results of aridity index. The median surface runoff response monotonically decreased with the increase of NDVI in water-limited seasons (Figure 7g). In energy-limited seasons, the surface runoff response still mainly decreased with the increase of NDVI. However, at the end, the curve shows an increasing trend. (Figure 7h).
The clustering of response curves provides further insights about the sensitivity differences between the study watersheds. For the increase of surface runoff response to precipitation, both GPR and EGPR had relatively unified patterns of monotonic increase in both seasons, except for a few outliers. The spatial distribution of the clusters shows that these outliers are in the Northern regions (Figures 8 and 9). In terms of mean monthly aridity index, GPR shows a decreasing trend in general, which is represented by the cluster of watersheds in the Eastern and Midwest regions. However, a smaller cluster of watersheds shows an opposite trend of increasing, especially in water-limited seasons, mostly located in the Western regions. Also, some of the Midwest watersheds are in a different cluster from the main one. For EGPR, the mean monthly aridity index cluster pattern is more complex. The difference between watersheds in Eastern and Western regions is not as distinguished as in GPR; instead, the number of clusters of watersheds in the Midwest increased from three to four. The complex clustering in the Midwest may be related to the agricultural activities in this region. Potential evaporation sensitivity clustering is similar to the clustering of mean monthly aridity index, in that some Midwest watersheds are in small clusters that have different trends from the main cluster's trend. This difference is more significant in energy-limited seasons. Last but not least, the clustering of NDVI sensitivity is slightly different from mean monthly aridity index and potential evaporation sensitivity clustering. Watersheds in the Midwest and Northeast are mostly in one cluster and the Southeastern watersheds are in other clusters. The physical interpretation of the sensitivity results is discussed in Section 5.2. outliers. The spatial distribution of the clusters shows that these outliers are in the Northern regions (Figures 8 and 9). In terms of mean monthly aridity index, GPR shows a decreasing trend in general, which is represented by the cluster of watersheds in the Eastern and Midwest regions. However, a smaller cluster of watersheds shows an opposite trend of increasing, especially in water-limited seasons, mostly located in the Western regions. Also, some of the Midwest watersheds are in a different cluster from the main one. For EGPR, the mean monthly aridity index cluster pattern is more complex. The difference between watersheds in Eastern and Western regions is not as distinguished as in GPR; instead, the number of clusters of watersheds in the Midwest increased from three to four. The complex clustering in the Midwest may be related to the agricultural activities in this region. Potential evaporation sensitivity clustering is similar to the clustering of mean monthly aridity index, in that some Midwest watersheds are in small clusters that have different trends from the main cluster's trend. This difference is more significant in energy-limited seasons. Last but not least, the clustering of NDVI sensitivity is slightly different from mean monthly aridity index and potential evaporation sensitivity clustering. Watersheds in the Midwest and Northeast are mostly in one cluster and the Southeastern watersheds are in other clusters. The physical interpretation of the sensitivity results is discussed in Section 5.2.

Uncertainty Analysis
We also investigated how GPR and EGPR perform in terms of uncertainty quantification. We computed 95% prediction intervals for each predicted value in each study watershed. Figure 10 shows the time series of the observed and predicted values and their corresponding 95% prediction intervals for the same four example watersheds as in Figure 4. In these watersheds, especially the first two (Gauge IDs 12459000 and 9292500), the upper limits of the prediction intervals from GPR do not capture the surges in observed surface runoff, while the upper limits from EGPR closely follow the individual peaks in the observed series. Therefore, including the two additional input variables not only increases the prediction accuracy but also improves the uncertainty quantification performance, especially for the higher values of surface runoff. The actual coverage probability and average width were first computed for individual catchments and then converted into box plots to show the distribution of those two quantities among the study watersheds ( Figure 11). The results show that both GPR and EGPR have actual coverages that are reasonably close to the nominal 95% coverage. The actual coverage of EGPR seems to be slightly lower, indicating that including the two additional input variables makes it slightly more difficult to obtain well-calibrated prediction intervals. The width of the 95% predication interval (PI) is overall shorter in EGPR, hence using the additional two input variables reduces the prediction uncertainty while maintaining a similar level of actual coverage.

Uncertainty Analysis
We also investigated how GPR and EGPR perform in terms of uncertainty quantification. We computed 95% prediction intervals for each predicted value in each study watershed. Figure 10 shows the time series of the observed and predicted values and their corresponding 95% prediction intervals for the same four example watersheds as in Figure 4. In these watersheds, especially the first two (Gauge IDs 12459000 and 9292500), the upper limits of the prediction intervals from GPR do not capture the surges in observed surface runoff, while the upper limits from EGPR closely follow the individual peaks in the observed series. Therefore, including the two additional input variables not only increases the prediction accuracy but also improves the uncertainty quantification performance, especially for the higher values of surface runoff. The actual coverage probability and average width were first computed for individual catchments and then converted into box plots to show the distribution of those two quantities among the study watersheds ( Figure 11). The results show that both GPR and EGPR have actual coverages that are reasonably close to the nominal 95% coverage. The actual coverage of EGPR seems to be slightly lower, indicating that including the two additional input variables makes it slightly more difficult to obtain well-calibrated prediction intervals. The width of the 95% predication interval (PI) is overall shorter in EGPR, hence using the additional two input variables reduces the prediction uncertainty while maintaining a similar level of actual coverage.

Model Performance Comparison
In our comparative study, GPR and PHM show similar performance on simulated monthly surface runoff at watershed scale when given the same amount of input data. PHM uses a physically motivated relationship between surface runoff and precipitation, while GPR learns the relationship between those variables based on the data. GPR overestimated several summer runoff peaks in the Midwest, as shown in Little Blue River, KS (Gauge ID 6884400), and Little Nemaha River, NE (Gauge ID 6811500). The difficulty of GPR accurately simulating high flows has been reported in previous studies [60], which is related to the dependence of prediction errors on the magnitude of observed values. When the distribution of a variable, such as surface runoff, is highly right-skewed, it is common to observe increased variability for higher magnitudes of values. The prediction error is also affected by this increase of variability. Since we built our prediction model for cubic root transformed data, the prediction errors for the higher variables will be amplified when we transform the data back to the original scale. However, building a model without transformation would also cause modeling problems, such as underestimation of the upper limits. Therefore, despite the error-amplifying effect, cubic root transformation still yields the best modeling performance. In addition, the uncertainty analysis

Model Performance Comparison
In our comparative study, GPR and PHM show similar performance on simulated monthly surface runoff at watershed scale when given the same amount of input data. PHM uses a physically motivated relationship between surface runoff and precipitation, while GPR learns the relationship between those variables based on the data.
GPR overestimated several summer runoff peaks in the Midwest, as shown in Little Blue River, KS (Gauge ID 6884400), and Little Nemaha River, NE (Gauge ID 6811500). The difficulty of GPR accurately simulating high flows has been reported in previous studies [60], which is related to the dependence of prediction errors on the magnitude of observed values. When the distribution of a variable, such as surface runoff, is highly right-skewed, it is common to observe increased variability for higher magnitudes of values. The prediction error is also affected by this increase of variability. Since we built our prediction model for cubic root transformed data, the prediction errors for the higher variables will be amplified when we transform the data back to the original scale. However, building a model without transformation would also cause modeling problems, such as underestimation of the upper limits. Therefore, despite the error-amplifying effect, cubic root transformation still yields the best modeling performance. In addition, the uncertainty analysis Surface runoff (mm) Surface runoff (mm) Figure 11. Box plots for catchment-wise (a) actual coverage and (b) width of 95% prediction interval for all watersheds with NSE > 0.

Model Performance Comparison
In our comparative study, GPR and PHM show similar performance on simulated monthly surface runoff at watershed scale when given the same amount of input data. PHM uses a physically motivated relationship between surface runoff and precipitation, while GPR learns the relationship between those variables based on the data.
GPR overestimated several summer runoff peaks in the Midwest, as shown in Little Blue River, KS (Gauge ID 6884400), and Little Nemaha River, NE (Gauge ID 6811500). The difficulty of GPR accurately simulating high flows has been reported in previous studies [60], which is related to the dependence of prediction errors on the magnitude of observed values. When the distribution of a variable, such as surface runoff, is highly right-skewed, it is common to observe increased variability for higher magnitudes of values. The prediction error is also affected by this increase of variability. Since we built our prediction model for cubic root transformed data, the prediction errors for the higher variables will be amplified when we transform the data back to the original scale. However, building a model without transformation would also cause modeling problems, such as underestimation of the upper limits. Therefore, despite the error-amplifying effect, cubic root transformation still yields the best modeling performance. In addition, the uncertainty analysis results of GPR show that the observed surface runoff exceeding the upper limit of the 95% prediction interval (3.3%) is reasonably close to the nominal tail probability (2.5%), hence we think that our model still provides a valid prediction of surface runoff.
For Western watersheds, especially in the High Plains region, GPR shows better performance than PHM, in that the timing and magnitude of streamflow peaks are captured better by GPR than by PHM. This result indicates that the form of conceptual equations in PHM may not be suitable for these watersheds, or for the High Plains region in general. As discussed in Chen and Wang [26], the snow effect in cold regions is not considered in this PHM. As a result, the performance in these regions is generally not very good. The snowfall and snow melt processes are different from the rainfall-runoff process, therefore we need to model the snow processes separately, which is beyond the scope of this study.
On the one hand, GPR's modeling performance relies heavily on the level of representativeness and variability of input data because of its data-driven nature. On the other hand, PHM is based on a set of predefined conceptual equations that may not be capable of capturing rainfall-runoff processes in different watersheds with a wide range of hydrologic characteristics.
The performance of EGPR is much better than both PHM and GPR. The improvement of modeling performance indicates that, besides the original two input variables, the additional two variables selected based on hydrologic knowledge are also major controlling factors of monthly surface runoff variability. We caution that our results do not indicate that simply adding more input variables can improve the simulation performance of our GPR approach. In fact, we have tried to use other monthly variables as input series, such as monthly maximum temperature, monthly precipitation maxima, length of the longest wet spell, and a number of other variables. We found, however, that including these variables did not improve the simulation performance, indicating that they do not contain the same necessary information as the selected variables to predict monthly runoff. It might be tempting to include all available input variables in the machine learning model, but adding redundant variables in GPR often negatively affects the simulation performance by increasing the variance of predicted values [61]. The selection of input variables is therefore crucial to the GPR approach. With only two additional input series selected based on hydrologic knowledge, we significantly improved our GPR's performance.
To further investigate the reason that EGPR outperformed PHM, we generated a map of the study watersheds with a correlation coefficient between input precipitation and observed surface runoff (Figure 11a). Comparing the performance of PHM and EGPR, PHM works well with watersheds with a strong positive correlation between precipitation and surface runoff but poorly with watersheds with a weak positive or even negative correlation between precipitation and surface runoff. EGPR outperformed PHM in these watersheds, since it has more inputs and a more flexible model structure. Therefore, to improve the model robustness of PHM, we focused on these watersheds with weak positive or negative correlation between precipitation and surface runoff. These watersheds are strongly affected by snow processes [56].

Physical Interpretation of Sensitivity Analysis Results
The sensitivity analysis results show a variety of clusters of shapes, especially in GPR and EGPR, representing a wide range of responses of monthly surface runoff to the variability of input variables.
First, the results show a monotonic increase of surface runoff with increased precipitation in PHM, GPR, and EGPR models. Across 203 watersheds with different climate and land surface characteristics, the change in strong positive sensitivity of surface runoff to precipitation is consistent. However, there are outliers in the GPR and EGPR analysis results that show negative sensitivity between precipitation and surface runoff. The abnormal pattern of these outliers is also shown in the correlation between input precipitation and observed surface runoff (Figure 12a), which could be due to snow effects.
In terms of mean monthly aridity index, surface runoff had a negative sensitivity trend in general in both GPR and EGPR. This negative trend was more significant in energy-limited seasons. The negative sensitivity of surface runoff to mean monthly aridity was expected based on the complementary Budyko-type relationship [53,62], in which the ratio of runoff to precipitation decreases with increased aridity index. The clustering result of surface runoff sensitivity to mean monthly aridity index shows that the majority of watersheds in the Eastern regions form the main cluster, while watersheds in the Midwest are in small clusters different from the main cluster in both seasons. This result indicates unique watershed behaviors in the Midwest, which could be related to the agricultural activities in this region. This spatial pattern of sensitivity is also consistent with the correlation between input aridity index and observed surface runoff (Figure 12b). It should be noted that the spatial patterns shown in Figure 12b-d are similar, indicating that the input variables of aridity index, potential evaporation, and NDVI have similar effects on surface runoff variability. complementary Budyko-type relationship [53,62], in which the ratio of runoff to precipitation decreases with increased aridity index. The clustering result of surface runoff sensitivity to mean monthly aridity index shows that the majority of watersheds in the Eastern regions form the main cluster, while watersheds in the Midwest are in small clusters different from the main cluster in both seasons. This result indicates unique watershed behaviors in the Midwest, which could be related to the agricultural activities in this region. This spatial pattern of sensitivity is also consistent with the correlation between input aridity index and observed surface runoff (Figure 12b). It should be noted that the spatial patterns shown in Figure 12b-d are similar, indicating that the input variables of aridity index, potential evaporation, and NDVI have similar effects on surface runoff variability. For the two additional input variables of EGPR, the positive sensitivity of surface runoff to the change of potential evaporation in the main cluster of watersheds, especially in water-limited seasons, is surprising. This positive trend contradicts the results of some previous studies, which showed that runoff and potential evaporation are negatively correlated [63]. It also contradicts the correlation results shown in Figure 12c. To further investigate the sensitivity of surface runoff to potential evaporation changes, we used only precipitation and potential evaporation as input variables to run an additional machine learning simulation and perform a sensitivity analysis on this additional model following the same procedure. The model shows similar performance to the GPR, with an average NSE of 0.37, and the sensitivity of surface runoff to the increase of potential evaporation became a decreasing trend (Figure 13a,b), which agrees with previous studies. This change of sensitivity trend from increasing to decreasing and the similarity shown in Figure 12b-d indicate that there is confounding among potential evaporation, mean monthly aridity index, and NDVI. That is, potential evaporation is related not only to surface runoff but also to aridity index and vegetation coverage. Therefore, the response of surface runoff to NDVI change is also affected by confounding. Similar to the sensitivity analysis results of the aridity index, the surface runoff response generally decreased with the increase of NDVI (Figure 7g,h). The spatial pattern of surface runoff sensitivity to NDVI can be related to the type of vegetation coverage: Midwest and Northeast For the two additional input variables of EGPR, the positive sensitivity of surface runoff to the change of potential evaporation in the main cluster of watersheds, especially in water-limited seasons, is surprising. This positive trend contradicts the results of some previous studies, which showed that runoff and potential evaporation are negatively correlated [63]. It also contradicts the correlation results shown in Figure 12c. To further investigate the sensitivity of surface runoff to potential evaporation changes, we used only precipitation and potential evaporation as input variables to run an additional machine learning simulation and perform a sensitivity analysis on this additional model following the same procedure. The model shows similar performance to the GPR, with an average NSE of 0.37, and the sensitivity of surface runoff to the increase of potential evaporation became a decreasing trend (Figure 13a,b), which agrees with previous studies. This change of sensitivity trend from increasing to decreasing and the similarity shown in Figure 12b-d indicate that there is confounding among potential evaporation, mean monthly aridity index, and NDVI. That is, potential evaporation is related not only to surface runoff but also to aridity index and vegetation coverage. Therefore, the response of surface runoff to NDVI change is also affected by confounding. Similar to the sensitivity analysis results of the aridity index, the surface runoff response generally decreased with the increase of NDVI (Figure 7g,h). The spatial pattern of surface runoff sensitivity to NDVI can be related to the type of vegetation coverage: Midwest and Northeast regions are covered by temperate deciduous forest, while the Southeast is covered by subtropical evergreen forest.
Water 2018, 10, x FOR PEER REVIEW 18 of 22 regions are covered by temperate deciduous forest, while the Southeast is covered by subtropical evergreen forest. The confounding between input variables in our models is due to the interconnection between these variables in the hydrologic cycle. Within the four input variables we selected, aridity index, potential evaporation, and NDVI have similar correlation with surface runoff (Figure 12b-d), indicating that their effects are partially overlaid. We also calculated the correlation coefficient r values between the three overlying input variables for each of the 203 watersheds. Across the study watersheds, the average r values of aridity index vs. potential evaporation, aridity index vs. NDVI, and potential evaporation vs. NDVI are 0.78, 0.69, and 0.84, respectively. This result confirms that these three input variables are closely related and therefore their effects on surface runoff are partially overlaid. Even with this confounding issue, the model performance is improved with the help of two additional input variables. This improvement may be due to the nonoverlaid parts of the additional input variables. In other words, despite the confounding issue, the additional two input variables can provide helpful information to improve the modeling performance.

Conclusions
In this study, we systematically compared the performance of a theory-driven conceptual PHM and a data-driven GPR model to simulate monthly surface runoff at 203 watersheds across the contiguous US. With the same level of input information and model structure complexity, both models had similar and acceptable performance, indicating that a data-driven approach can achieve a similar level of performance as a theory-driven hydrologic model. Then, we added two more input variables, selected based on hydrologic knowledge, to our data-driven model. This extended datadriven model had better performance than both the conceptual hydrologic model and the original data-driven model.
We also conducted a sensitivity analysis to see the models' responses to variability of input variables. The result showed that the simulated surface runoff in all three models was positively sensitive to the change of precipitation. On the other hand, the sensitivity analysis showed a confounding effect among mean monthly aridity index, potential evaporation, and NDVI, indicating that these three variables have similar effect on surface runoff response.
In future studies, we will include more streamflow-related processes, such as baseflow generation and snow processes, in our modeling framework. We will also further investigate the confounding effect of aridity index, potential evaporation, and vegetation coverage on surface runoff. The confounding between input variables in our models is due to the interconnection between these variables in the hydrologic cycle. Within the four input variables we selected, aridity index, potential evaporation, and NDVI have similar correlation with surface runoff (Figure 12b-d), indicating that their effects are partially overlaid. We also calculated the correlation coefficient r values between the three overlying input variables for each of the 203 watersheds. Across the study watersheds, the average r values of aridity index vs. potential evaporation, aridity index vs. NDVI, and potential evaporation vs. NDVI are 0.78, 0.69, and 0.84, respectively. This result confirms that these three input variables are closely related and therefore their effects on surface runoff are partially overlaid. Even with this confounding issue, the model performance is improved with the help of two additional input variables. This improvement may be due to the nonoverlaid parts of the additional input variables. In other words, despite the confounding issue, the additional two input variables can provide helpful information to improve the modeling performance.

Conclusions
In this study, we systematically compared the performance of a theory-driven conceptual PHM and a data-driven GPR model to simulate monthly surface runoff at 203 watersheds across the contiguous US. With the same level of input information and model structure complexity, both models had similar and acceptable performance, indicating that a data-driven approach can achieve a similar level of performance as a theory-driven hydrologic model. Then, we added two more input variables, selected based on hydrologic knowledge, to our data-driven model. This extended data-driven model had better performance than both the conceptual hydrologic model and the original data-driven model.
We also conducted a sensitivity analysis to see the models' responses to variability of input variables. The result showed that the simulated surface runoff in all three models was positively sensitive to the change of precipitation. On the other hand, the sensitivity analysis showed a confounding effect among mean monthly aridity index, potential evaporation, and NDVI, indicating that these three variables have similar effect on surface runoff response.
In future studies, we will include more streamflow-related processes, such as baseflow generation and snow processes, in our modeling framework. We will also further investigate the confounding effect of aridity index, potential evaporation, and vegetation coverage on surface runoff.
Author Contributions: W.C. developed the machine learning modeling framework and conducted the sensitivity analysis and uncertainty analysis; X.C. contributed to the conceptual modeling part and provided physical interpretation of the study results. All authors contributed equally to the research.
Funding: This research received no external funding.