Development of a Deep Learning Emulator for a Distributed Groundwater–Surface Water Model: ParFlow-ML

Integrated hydrologic models solve coupled mathematical equations that represent natural processes, including groundwater, unsaturated, and overland flow. However, these models are computationally expensive. It has been recently shown that machine leaning (ML) and deep learning (DL) in particular could be used to emulate complex physical processes in the earth system. In this study, we demonstrate how a DL model can emulate transient, three-dimensional integrated hydrologic model simulations at a fraction of the computational expense. This emulator is based on a DL model previously used for modeling video dynamics, PredRNN. The emulator is trained based on physical parameters used in the original model, inputs such as hydraulic conductivity and topography, and produces spatially distributed outputs (e.g., pressure head) from which quantities such as streamflow and water table depth can be calculated. Simulation results from the emulator and ParFlow agree well with average relative biases of 0.070, 0.092, and 0.032 for streamflow, water table depth, and total water storage, respectively. Moreover, the emulator is up to 42 times faster than ParFlow. Given this promising proof of concept, our results open the door to future applications of full hydrologic model emulation, particularly at larger scales.


Introduction
Large-scale hydrologic model predictions can address many grand water challenges, such as flood, drought, and climate change predictions. These modeling approaches are constantly challenged by the need to represent smaller-scale processes and heterogeneities, yet cover large relevant domains while being computationally efficient. As pointed out recently [1,2], increases in model resolution are often needed to make large-scale predictions relevant for a local policy decision. However, the computational cost of these models is large [1], resulting in long run times even for modern supercomputers [3,4]. Several approaches are being leveraged to accelerate this process, including graphical processing unit (GPU) architectures [5,6]; however, further acceleration will enable more a locally relevant simulation. In recent years, there have been tremendous developments in the use of machine learning emulators for complex physical problems as a way to accelerate complex and high-fidelity physical models [7][8][9].
For a DL approach to be a meaningful emulator of large-scale hydrologic models, the spatially and temporally based approaches need to be combined. However, efforts to combine these two approaches are still at an early stage. One such effort is the convolutional long short-term-memory (ConvLSTM) method [29] that combines spatial and temporal autocorrelations. ConvLSTM has demonstrated better performance than CNN [30] and LSTM [29] individually, particularly on video interpolation and extrapolation. However, as the lead in prediction time increases, the vanishing gradient problem (see Section 2.2) of RNNs decreases the prediction accuracy [29], something overcome in recent applications: predictive recurrent neural network (PredRNN [31,32]).
Here, we detail the development of a deep learning emulator for the integrated hydrologic model ParFlow [33]. The emulator is built using the PredRNN model [31,32]. We train the emulator based on simulation results from ParFlow. This emulator takes ParFlow model parameters (e.g., topographic slopes, saturated hydraulic conductivity tensors, and initial pressure) and rainfall information as inputs and produces three-dimensional transient outputs (e.g., pressure head and saturation). We conduct simulations using different rainfall-runoff scenarios in two realistic setups, the Taylor River basin in the Rocky Mountains and the Little Washita basin in Southwestern Oklahoma.

Materials and Methods
In this section, we briefly introduce the ParFlow model and equations. Then we present the DL model, PredRNN, used to construct the ParFlow emulator.

The Integrated Hydrologic Model, ParFlow
We use the integrated hydrologic model ParFlow [34][35][36][37][38] to simulate the response of a pressure head resulting from synthetic rainfall scenarios. ParFlow computes subsurface fluxes and surface fluxes by solving the three-dimensional Richards equation [39] and two-dimensional kinematic wave equations. Subsurface and surface fluxes are integrated using a free surface overland flow boundary condition.
The Richards equation for a variably saturated flow is given as [37]: where h is the pressure head (L), z is the elevation (L), S s is the specific storage (L −1 ), S w (h) is the relative saturation (-), ϕ is the porosity (-), K s (x) is the saturated hydraulic conductivity tensor (LT −1 ), k r (h) is the relative permeability (-), q r is a sink/source term (T −1 ), and θ x is the local angle of a topographic slope (S). The overland flow equation is given as [36,38]: where k is the unit vector in the vertical (-), h, 0 denotes the maximum of the two quantities, v sw is the two-dimensional depth-averaged surface water velocity (LT −1 ), and λ is a constant equal to the inverse of the grid scaling (L −1 ) [38]. The van Genuchten [40] equations are used to relate relative saturation to water head given as: where α (L −1 ) and n (-) are soil parameters, s sat (-) is the relative saturated water content, and s res (-) is the relative residual saturation.

The Emulator Version of ParFlow, ParFlow-ML
The emulator model (ParFlow-ML) is constructed using a predictive recurrent neural network (PredRNN) developed by Wang et al. [31,32]. The PredRNN architecture has two major components: (1) a new spatiotemporal memory mechanism, called causal LSTM, to increase the short-term modeling capacity, and (2) the gradient highway recurrent unit (GHU) to capture the influence of long-term information.
In a causal LSTM, the recurrence depth is increased along the spatiotemporal transition pathway (Figure 1), which includes dual memories, the temporal memory C k t and the spatial memory M k t . The GHU is based on a study by Srivastava et al. [41], which indicates that highway layers can deliver gradients efficiently in deep feed-forward networks. The structure of the GHU is shown with the final PredRNN architecture in Figure 2; equations of the GHU are as follows: Water 2021, 13, x 3 of 18 where is the unit vector in the vertical (-), ‖ℎ, 0‖ denotes the maximum of the two quantities, is the two-dimensional depth-averaged surface water velocity (LT −1 ), and is a constant equal to the inverse of the grid scaling (L −1 ) [38].
The van Genuchten [40] equations are used to relate relative saturation to water head given as: where α (L −1 ) and n (-) are soil parameters, ssat (-) is the relative saturated water content, and sres (-) is the relative residual saturation.

The Emulator Version of ParFlow, ParFlow-ML
The emulator model (ParFlow-ML) is constructed using a predictive recurrent neural network (PredRNN) developed by Wang et al. [31,32]. The PredRNN architecture has two major components: (1) a new spatiotemporal memory mechanism, called causal LSTM, to increase the short-term modeling capacity, and (2) the gradient highway recurrent unit (GHU) to capture the influence of long-term information.
In a causal LSTM, the recurrence depth is increased along the spatiotemporal transition pathway (Figure 1), which includes dual memories, the temporal memory and the spatial memory . The GHU is based on a study by Srivastava et al. [41], which indicates that highway layers can deliver gradients efficiently in deep feed-forward networks. The structure of the GHU is shown with the final PredRNN architecture in Figure  2; equations of the GHU are as follows: = ⊙ + (1 − ) ⊙ (7) Figure 1. From Wang et al. [31], schematic of the causal LSTM, in which the temporal and spatial memories are connected in a cascaded way through gated structures. Colored parts are newly designed operations, concentric circles denote concatenation, and σ is the element-wise Sigmoid function.  [31], schematic of the causal LSTM, in which the temporal and spatial memories are connected in a cascaded way through gated structures. Colored parts are newly designed operations, concentric circles denote concatenation, and σ is the element-wise Sigmoid function.
tween the transformed inputs and the hidden states [31]. In the final PredRNN architecture, the GHU enables an alternative quick route from the first to the last timestep (the blue line in Figure 2). These two components enable the PredRNN to learn complex video sequence dynamics and predict state-of-the-art results [31,32]. The first component, the causal LSTM with a cascade dual memory structure, enhances the ability of the PredRNN to capture short-term dynamics. The second component, the GHU, which links future predictions to distant inputs, alleviates the vanishing gradient problem.

Experiment Design
We train the emulator on two different river basins, the Taylor and Little Washita. In this section, we discuss the study areas, the emulator setup, the hydrologic model setup, and the rainfall-runoff scenarios.

Study Areas
The two modeling domains used in this study are the Taylor River basin and the Little Washita basin ( Figure 3). The Taylor River basin, located in the Rocky Mountains, has an area of 1236 km 2 and a mean elevation of 3500 m ( Figure 3). The outlet of the basin is located at Almont, Colorado, USA. The Little Washita basin, located in the Southwestern Oklahoma, has an area of 600 km 2 and is characterized by rolling terrain. The outlet of the basin is located at Smithville, Oklahoma, USA. Here, W xx are the convolutional filters, P t are the transformed inputs, and Z t−1 are the hidden states. S t is named as switch gate, which enables an adaptive learning between the transformed inputs and the hidden states [31]. In the final PredRNN architecture, the GHU enables an alternative quick route from the first to the last timestep (the blue line in Figure 2). These two components enable the PredRNN to learn complex video sequence dynamics and predict state-of-the-art results [31,32]. The first component, the causal LSTM with a cascade dual memory structure, enhances the ability of the PredRNN to capture short-term dynamics. The second component, the GHU, which links future predictions to distant inputs, alleviates the vanishing gradient problem.

Experiment Design
We train the emulator on two different river basins, the Taylor and Little Washita. In this section, we discuss the study areas, the emulator setup, the hydrologic model setup, and the rainfall-runoff scenarios.

Study Areas
The two modeling domains used in this study are the Taylor River basin and the Little Washita basin (Figure 3). The Taylor River basin, located in the Rocky Mountains, has an area of 1236 km 2 and a mean elevation of 3500 m ( Figure 3). The outlet of the basin is located at Almont, Colorado, USA. The Little Washita basin, located in the Southwestern Oklahoma, has an area of 600 km 2 and is characterized by rolling terrain. The outlet of the basin is located at Smithville, Oklahoma, USA. Water 2021, 13, x 5 of 18

The Emulator Setup
The emulator, ParFlow-ML, is trained with ParFlow's inputs to predict ParFlow outputs of the pressure head and relative saturation. There are two types of ParFlow inputs: static inputs (discussed in Section 2.3.3) and dynamic inputs (discussed in Section 2.3.4). Static inputs include surface and subsurface information, such as topographic slopes, Manning's roughness values, saturated hydraulic conductivity tensors, and domain's initial pressure. Static inputs are used as the initial spatial memory for ParFlow-ML. ParFlow's rainfall information is used as the dynamic input for ParFlow-ML. For each dynamic input timestep, ParFlow-ML outputs the corresponding pressure head and relative saturation.

Model Setup
In this study, the model setups are the same for the Taylor River basin and the Little Washita River basin. Both model setups include five vertical layers that are 0.1, 0.3, 0.6, 1.0, and 100 m thick. The two basins are implemented with a lateral resolution of 1 km. Model inputs consist of two types of data: surface and subsurface.
The surface inputs, topographic slopes and land cover, are evaluated as follows. Topographic slopes are calculated from the elevation input from the hydrological data and maps based on shuttle elevation derivatives at multiple scales (HydroSHEDS) using the priority flow toolbox [42]. The land cover characteristic is upscaled from the original 30 m resolution National Land Cover Database (NLCD).
The subsurface inputs include four soil layers at the top and one geological layer at the bottom, including soil and geological layer characteristics, such as saturated hydraulic conductivity and van Genuchten parameters [43]. Soil and geological layer categories were taken from the Soil Survey Geographic Database (SSURGO) and from a global permeability map developed by Gleeson et al. [44]. More details about the subsurface inputs and their configuration can be found in Maxwell et al. [38] and Maxwell and Condon [45].

Rainfall-Runoff Scenarios
To simulate rainfall-runoff simulations using ParFlow, we specify a "rainfall-recession" cycle as a boundary condition for the top of the simulation domain. Rain length and rain frequency can be specified by setting two ParFlow input keys, Cy-

The Emulator Setup
The emulator, ParFlow-ML, is trained with ParFlow's inputs to predict ParFlow outputs of the pressure head and relative saturation. There are two types of ParFlow inputs: static inputs (discussed in Section 2.3.3) and dynamic inputs (discussed in Section 2.3.4). Static inputs include surface and subsurface information, such as topographic slopes, Manning's roughness values, saturated hydraulic conductivity tensors, and domain's initial pressure. Static inputs are used as the initial spatial memory M k 0 for ParFlow-ML. ParFlow's rainfall information is used as the dynamic input for ParFlow-ML. For each dynamic input timestep, ParFlow-ML outputs the corresponding pressure head and relative saturation.

Model Setup
In this study, the model setups are the same for the Taylor River basin and the Little Washita River basin. Both model setups include five vertical layers that are 0.1, 0.3, 0.6, 1.0, and 100 m thick. The two basins are implemented with a lateral resolution of 1 km. Model inputs consist of two types of data: surface and subsurface.
The surface inputs, topographic slopes and land cover, are evaluated as follows. Topographic slopes are calculated from the elevation input from the hydrological data and maps based on shuttle elevation derivatives at multiple scales (HydroSHEDS) using the priority flow toolbox [42]. The land cover characteristic is upscaled from the original 30 m resolution National Land Cover Database (NLCD).
The subsurface inputs include four soil layers at the top and one geological layer at the bottom, including soil and geological layer characteristics, such as saturated hydraulic conductivity and van Genuchten parameters [43]. Soil and geological layer categories were taken from the Soil Survey Geographic Database (SSURGO) and from a global permeability map developed by Gleeson et al. [44]. More details about the subsurface inputs and their configuration can be found in Maxwell et al. [38] and Maxwell and Condon [45].

Rainfall-Runoff Scenarios
To simulate rainfall-runoff simulations using ParFlow, we specify a "rainfall-recession" cycle as a boundary condition for the top of the simulation domain. Rain length and rain frequency can be specified by setting two ParFlow input keys, Cycle.rainrec.rain.Length and Cycle.rainrec.rec.Length, respectively. Rain values can be specified by setting the Patch.top.BCPressure.rain.Value input key. ParFlow outputs the pressure head, and relative saturation files consist of five layers.

Scenarios
Rain

Training Process
We use both the train and the validation sets in the training process. While the train set is used for the ParFlow-ML model to learn, the validation set is used for providing an unbiased evaluation during training and for avoiding overfitting. ParFlow-ML is evaluated with the validation set every 20 iterations. If the validation loss does not change for the consecutive 200 iterations, the training process will be terminated. The training processes for the Taylor River basin and the Little Washita River basin are similar. Here, we only describe the training process for the Taylor River basin.
For each rainfall-runoff scenario, ParFlow outputs the pressure head and relative saturation at all model layers at each simulation timestep. These model outputs are used as targets for the ML model, and ParFlow's rainfall information is used as an input. In addition to rainfall information, other ParFlow model input parameters are also used but as static input: topographic slopes, Manning's roughness values, saturated hydraulic conductivity tensors, and domain's initial pressure. The static inputs are the same for all rainfall-runoff scenarios.

Training Process
We use both the train and the validation sets in the training process. While the train set is used for the ParFlow-ML model to learn, the validation set is used for providing an unbiased evaluation during training and for avoiding overfitting. ParFlow-ML is evaluated with the validation set every 20 iterations. If the validation loss does not change for the consecutive 200 iterations, the training process will be terminated. The training processes for the Taylor River basin and the Little Washita River basin are similar. Here, we only describe the training process for the Taylor River basin.
For each rainfall-runoff scenario, ParFlow outputs the pressure head and relative saturation at all model layers at each simulation timestep. These model outputs are used as targets for the ML model, and ParFlow's rainfall information is used as an input. In addition to rainfall information, other ParFlow model input parameters are also used but as static input: topographic slopes, Manning's roughness values, saturated hydraulic conductivity tensors, and domain's initial pressure. The static inputs are the same for all rainfall-runoff scenarios.
We configured ParFlow-ML with 8 stacked CausalLSTM layers, each including 1024 nodes. The total number of trainable parameters in ParFlow-ML was approximately 210 million. We choose to train the model for 2000 iterations (for each iteration, the model has to run through the whole training batch of 16 scenarios) initially, then assess the loss. We choose the Adam optimizer [46] with an initial learning rate of 1 × 10 −3 . After 2000 iterations, loss (mean squared error) of the model compared with both the train and the validation sets decreased to 2.55 × 10 −3 and 2.57 × 10 −3 , respectively ( Figure 5). Since both training and validation losses did not change over the last 200 iterations, we decided to impose early stopping after 2000 iterations.
In the training process, the time for ParFlow-ML to predict one scenario is roughly 5.4 s. Since there are 16 scenarios in the train set, the training batch size is 16. For 2000 We configured ParFlow-ML with 8 stacked CausalLSTM layers, each including 1024 nodes. The total number of trainable parameters in ParFlow-ML was approximately 210 million. We choose to train the model for 2000 iterations (for each iteration, the model has to run through the whole training batch of 16 scenarios) initially, then assess the loss. We choose the Adam optimizer [46] with an initial learning rate of 1 × 10 −3 . After 2000 iterations, loss (mean squared error) of the model compared with both the train and the validation sets decreased to 2.55 × 10 −3 and 2.57 × 10 −3 , respectively ( Figure 5). Since both training and validation losses did not change over the last 200 iterations, we decided to impose early stopping after 2000 iterations.

Performance Metrics
Predictions of the pressure head and relative saturation from ParFlow and ParFlow-ML are used to compute streamflow, water table depth (WTD), and total water storage (TWS) and then compared. These three derived variables are evaluated in the test dataset using four metrics, namely, Spearman's rho, total absolute relative bias, Kling-Gupta ef- In the training process, the time for ParFlow-ML to predict one scenario is roughly 5.4 s. Since there are 16 scenarios in the train set, the training batch size is 16. For 2000 iterations, the whole training progress takes around 50 h. All simulations were undertaken in the Princeton Hydrologic Data Center (PHDC), and the training was conducted on NVIDIA A100 GPUs.

Performance Metrics
Predictions of the pressure head and relative saturation from ParFlow and ParFlow-ML are used to compute streamflow, water table depth (WTD), and total water storage (TWS) and then compared. These three derived variables are evaluated in the test dataset using four metrics, namely, Spearman's rho, total absolute relative bias, Kling-Gupta efficiency (KGE [47,48]) and the volumetric hit index (VHI [49]). Spearman's rho assesses the differences in timing for variables resulting from ParFlow and PredRNN with values closer to one indicating good agreement. The relative bias measures the differences in volume for variables predicted from ParFlow and PredRNN. The KGE coefficient provides an aggregation metric of mean, standard deviation, and correlation between the "simulated" and the "observed" variables. Lastly, the VHI compares the volume of correctly detected simulations with the total volume of correctly detected simulations and missed observations.

Results
The test set is used to provide an unbiased assessment of ParFlow-ML. As stated above, the pressure and saturation output from both ParFlow and PredRNN are used to compute streamflow (m 3 /s), water table depth (m), and total water storage (m 3 ). Below is an evaluation for each of the variables.

Streamflow Evaluation
Outflows at gauge locations predicted by ParFlow-ML are evaluated against the ones simulated by ParFlow for all testing scenarios. For the Taylor River basin, the KGE values are 0.747, 0.96, 0.97, and 0.787 for scenarios 21, 22, 23, and 24, respectively ( Table 2). We show two representative testing scenarios, 22 and 23, with different peak times and magnitudes in Figure 6a. Predicted outflows from ParFlow-ML (dashed lines) closely match (average KGE of 0.97) the ones simulated by ParFlow in both scenarios (Figure 6a). The ParFlow-ML results compare well with those of ParFlow for the three small rainfall events (scenario 2) and for one large rainfall event (scenario 3).  (Table 2). Two representative test scenarios, 21 and 24, with different peak times and magnitudes are shown in Figure 7a. In scenario 24 (blue lines), ParFlow-ML can predict the peak timing accurately (average Spearman's rho value of 0.98). However, ParFlow-ML underestimates the first peak and overestimates the second peak.   (Table 2). Two representative test scenarios, 21 and 24, with different peak times and magnitudes are shown in Figure 7a. In scenario 24 (blue lines), ParFlow-ML can predict the peak timing accurately (average Spearman's rho value of 0.98). However, ParFlow-ML underestimates the first peak and overestimates the second peak.
The spatially distributed outputs of overland flow for baseflow (Figures 6b and 7b) and peak flow (Figures 6c and 7c) from ParFlow-ML and ParFlow also agree well. Par-Flow-ML captures the stream network location and change in magnitude through time, agreeing with the flow routing physics (i.e., flow increases from upstream to downstream cells). Outside of the stream networks, ParFlow-ML exhibits some prediction artifacts and several points where pressure values are predicted as high by ML but as very low in Par-Flow. For these locations, ParFlow-ML appears to be predicting a timeseries with a mean close to the predicted values in ParFlow but with much more noise, resulting in pressure fluctuations around a low mean.   The spatially distributed outputs of overland flow for baseflow (Figures 6b and 7b) and peak flow (Figures 6c and 7c) from ParFlow-ML and ParFlow also agree well. ParFlow-ML captures the stream network location and change in magnitude through time, agreeing with the flow routing physics (i.e., flow increases from upstream to downstream cells). Outside of the stream networks, ParFlow-ML exhibits some prediction artifacts and several points where pressure values are predicted as high by ML but as very low in ParFlow. For these locations, ParFlow-ML appears to be predicting a timeseries with a mean close to the predicted values in ParFlow but with much more noise, resulting in pressure fluctuations around a low mean.  For the WTD timeseries evaluation, over the Taylor River basin, we choose three representative points (shown in Figure 8 as A, B, and C) to represent the behavior of the stream, shallow, and deep WTD cells, respectively (Figure 8a,b). Point A starts with a deep WTD, which then decreases rapidly as rain is applied to the domain, becoming overland flow. A decrease in WTD also happens at point B, however, with a smaller magnitude. The WTD at point C remains constant during the simulation period. For points A and B, ParFlow-ML captures the temporal change in WTD for all the scenarios (Figure 8c,d). For point C, similar behaviors for ParFlow-ML in streamflow prediction can also be seen here with WTD where ParFlow-ML fluctuates around the mean (true) value from ParFlow (Figure 8e).

Water Table Depth (WTD) Evaluation
We observe a different performance in the Little Washita River domain (Figure 9), mainly due to the differences in the domain. The Little Washita has a more gentle topography and does not demonstrate the large differences in overall water table depth seen in For the WTD timeseries evaluation, over the Taylor River basin, we choose three representative points (shown in Figure 8 as A, B, and C) to represent the behavior of the stream, shallow, and deep WTD cells, respectively (Figure 8a,b). Point A starts with a deep WTD, which then decreases rapidly as rain is applied to the domain, becoming overland flow. A decrease in WTD also happens at point B, however, with a smaller magnitude. The WTD at point C remains constant during the simulation period. For points A and B, ParFlow-ML captures the temporal change in WTD for all the scenarios (Figure 8c,d). For point C, similar behaviors for ParFlow-ML in streamflow prediction can also be seen here with WTD where ParFlow-ML fluctuates around the mean (true) value from ParFlow (Figure 8e).
We observe a different performance in the Little Washita River domain (Figure 9), mainly due to the differences in the domain. The Little Washita has a more gentle topography and does not demonstrate the large differences in overall water table depth seen in the more mountainous Taylor River system. The average VHI over the simulation period for scenarios 21, 22, 23, and 24 are 0.85, 0.89, 0.82, and 0.90, respectively. The two points (A, more shallow; B, deeper) chosen in the Little Washita domain both undergo rapid changes as rain is applied, and this behavior is captured well by the ParFlow-ML results.

Total Water Storage Evaluation
We evaluate the TWS predicted by ParFlow and ParFlow-ML over both simulation domains to evaluate the ability of ParFlow-ML to capture the water balance over the catch ment. This is an important evaluation as while the ParFlow model is both globally and locally mass conserving, there is no such intrinsic mass conservation in the ML emulator Other than being trained on the loss function of total pressure for the given inputs, no external criteria that enforce water balance are imposed. The four test scenarios (21)(22)(23)(24) have different TWSs during the simulation period. While scenarios 21 and 22 in the Taylo River basin show flashier TWS behaviors resulting from shorter rainfall events, TWS changes in scenarios 23 and 24 are slower and exhibit higher magnitudes from longe rainfall events (Figure 10).

Total Water Storage Evaluation
We evaluate the TWS predicted by ParFlow and ParFlow-ML over both simulation domains to evaluate the ability of ParFlow-ML to capture the water balance over the catchment. This is an important evaluation as while the ParFlow model is both globally and locally mass conserving, there is no such intrinsic mass conservation in the ML emulator. Other than being trained on the loss function of total pressure for the given inputs, no external criteria that enforce water balance are imposed. The four test scenarios (21)(22)(23)(24) have different TWSs during the simulation period. While scenarios 21 and 22 in the Taylor River basin show flashier TWS behaviors resulting from shorter rainfall events, TWS changes in scenarios 23 and 24 are slower and exhibit higher magnitudes from longer rainfall events ( Figure 10).

Execution Time
An important consideration of an emulation is the reduction in time compared with the original model. We compare the execution times of ParFlow and ParFlow-ML for each of the 24 scenarios in the Taylor River basin. For one scenario prediction, ParFlow uses four CPU cores, while ParFlow-ML uses an A100 GPU.
Since each scenario is characterized by rain intensity (I), rain length (RL), and rain frequency (RecL), for the purposes of execution time comparison, we come up with a combined value, rainfall intensity-duration (RID), computed as follows: As in Figure 12, the higher the value of RID is, the longer it takes for ParFlow to solve. This is typical for physical hydrologic models as the nonlinear problem becomes more difficult to solve with greater water input necessitating more linear and nonlinear iterations of the solver. We see in this figure, however, that the execution time for ParFlow-ML is independent from RID; it is constant for all the scenarios with a value of 5.4 s. We also see that the execution times are sped up from 5 to 42 times. This is an advantage of the DL models in general and ParFlow-ML since DL treats the hydrological input purely as images.

Execution Time
An important consideration of an emulation is the reduction in time compared with the original model. We compare the execution times of ParFlow and ParFlow-ML for each of the 24 scenarios in the Taylor River basin. For one scenario prediction, ParFlow uses four CPU cores, while ParFlow-ML uses an A100 GPU.
Since each scenario is characterized by rain intensity (I), rain length (RL), and rain frequency (RecL), for the purposes of execution time comparison, we come up with a combined value, rainfall intensity-duration (RID), computed as follows: As in Figure 12, the higher the value of RID is, the longer it takes for ParFlow to solve. This is typical for physical hydrologic models as the nonlinear problem becomes more difficult to solve with greater water input necessitating more linear and nonlinear iterations of the solver. We see in this figure, however, that the execution time for ParFlow-ML is independent from RID; it is constant for all the scenarios with a value of 5.4 s. We also see that the execution times are sped up from 5 to 42 times. This is an advantage of the DL models in general and ParFlow-ML since DL treats the hydrological input purely as images.

Discussion
In this manuscript, we presented the first emulator of a complex surface-subsurface hydrologic model evaluated over two watersheds. The agreement in prediction from the ParFlow-ML emulator and the original ParFlow hydrologic model is good both spatially and temporally. The temporal behavior is captured by the ParFlow-ML well, with streamflow and water storage peak times predicted by ParFlow-ML matching the output from ParFlow closely for both basins (Figures 6 and 7). The average Spearman's rho values in all the test scenarios for the Taylor River and the Little Washita River basins are 0.909 and 0.957, respectively ( Table 2). The average VHI for the Taylor and the Little Washita River basins are 0.82 and 0.87, respectively. This match in peak timing indicates that ParFlow-ML is able to learn the dynamics of both surface and subsurface water routing correctly. However, there are some biases in streamflow volume and total water storage predicted by ParFlow-ML. The scenario that has the highest streamflow volume or total water storage bias is different between the Taylor and the Little Washita River basins (Table 2).
Grid cells that have constant values over time are also challenging for ParFlow-ML to predict. ParFlow-ML tends to yield noisy values that fluctuate around the true near-

Discussion
In this manuscript, we presented the first emulator of a complex surface-subsurface hydrologic model evaluated over two watersheds. The agreement in prediction from the ParFlow-ML emulator and the original ParFlow hydrologic model is good both spatially and temporally. The temporal behavior is captured by the ParFlow-ML well, with streamflow and water storage peak times predicted by ParFlow-ML matching the output from ParFlow closely for both basins (Figures 6 and 7). The average Spearman's rho values in all the test scenarios for the Taylor River and the Little Washita River basins are 0.909 and 0.957, respectively ( Table 2). The average VHI for the Taylor and the Little Washita River basins are 0.82 and 0.87, respectively. This match in peak timing indicates that ParFlow-ML is able to learn the dynamics of both surface and subsurface water routing correctly. However, there are some biases in streamflow volume and total water storage predicted by ParFlow-ML. The scenario that has the highest streamflow volume or total water storage bias is different between the Taylor and the Little Washita River basins (Table 2).
Grid cells that have constant values over time are also challenging for ParFlow-ML to predict. ParFlow-ML tends to yield noisy values that fluctuate around the true nearsteady values simulated by ParFlow. These near-steady value grid cells do not provide much information for the recurrent module of ParFlow-ML to interpret. However, the magnitudes of these fluctuations are in some cases quite small (Figure 8).
In general, ParFlow-ML's predictions are comparable to the ones from ParFlow in both the Taylor River and the Little Washita River basins. The two basins have different sizes, topographies, and land covers, and thus different hydrologic behaviors. The average TWS KGE values of ParFlow-ML for the Taylor River and the Little Washita River basins are 0.931 and 0.957, respectively. ParFlow-ML performs well without imposing mass balance conservation. Beucler et al. [50] suggesting that applying strict conservations of mass and energy to the machine learning model architecture might reduce the risk of unreliable predictions. By performing well in two different basins, ParFlow-ML shows promise in being transferable to other basins or to larger systems.
The ParFlow simulation time is dependent on the characteristics of the rainfall ( Figure 12). The more intense the rainfall is, the more time it takes for ParFlow to complete a simulation. The ParFlow-ML simulation time is constant and is independent from the rainfall intensity. This enables ParFlow-ML to predict up to 42 times faster than the original ParFlow ( Figure 12).

Conclusions
In this study, we discuss the development of an emulator of a complex integrated hydrologic model, ParFlow, based upon the PredRNN deep learning model. By training the ParFlow-ML model on direct inputs and outputs of ParFlow, the emulator was able to predict the evolution of the pressure head and the relative saturation for all grid cells of the model domain over time. We compare the simulation results of ParFlow-ML with those of ParFlow for models of the Taylor River and the Little Washita River basins. Streamflow, water table depth, and total water storage were calculated from the pressure fields for both ParFlow and the emulator.
Spearman's rho, relative bias, and Kling-Gupta efficiency were used as metrics of comparison. The emulator and ParFlow outputs matched well: the average relative biases of streamflow, water table depth, and total water storage were 0.070, 0.092, and 0.032, respectively. The simulation time of the emulator is up to 42 times smaller than the ones of ParFlow.
While the results are promising, this study still only serves as a proof of concept with synthetic data over two relatively small domains. Nevertheless, this proof of concept demonstrates the ability of an ML emulator that duplicates an entire set of hydrologic model inputs and outputs. This novel study is a step toward constructing larger and more comprehensive ML emulator models that may ultimately ingest meteorological forcing as input and predict land surface states and fluxes in addition to pressure head. Our work provides an additional path to a more efficient hydrologic simulation over large domains at a high resolution.