Novel Salinity Modeling Using Deep Learning for the Sacramento–San Joaquin Delta of California

Abstract: Water resources management in estuarine environments for water supply and environmental protection typically requires estimates of salinity under various flow and operational conditions. This study develops and applies two novel deep learning (DL) models, a residual long short-term memory (Res-LSTM) network and a residual gated recurrent unit (Res-GRU) model, to estimate the spatial and temporal variations of salinity. Four other machine learning (ML) models, previously developed and reported, consisting of a multi-layer perceptron (MLP), a residual network (ResNet), LSTM, and GRU, are utilized as baselines to benchmark the performance of the two novel models. All six models are applied at 23 study locations in the Sacramento–San Joaquin Delta (Delta), the hub of California's water supply system. Model input features include observed or calculated tidal stage (water level), flow and salinity at model upstream boundaries, salinity control gate operations, crop consumptive use, and pumping for the period of 2001–2019. Field observations of salinity at the study locations during the same period are also utilized to develop the models' predictive capability. Results indicate that the proposed DL models generally outperform the baseline models in simulating and predicting salinity on both daily and hourly scales at the study locations. The absolute bias is generally less than 5%, and the correlation coefficients and Nash–Sutcliffe efficiency values are close to 1. In particular, Res-LSTM performs slightly better than Res-GRU. Moreover, the study investigates the overfitting potential of both the DL and baseline models and finds that overfitting is not notable. Finally, the study compares the performance of Res-LSTM against that of an operational process-based salinity model, showing that Res-LSTM consistently outperforms the process-based model across all study locations.
Overall, the study demonstrates the feasibility of DL-based models in supplementing the existing operational models in providing accurate and real-time estimates of salinity to inform water management decision making.


Background
This study develops novel machine learning (ML) approaches to salinity modeling in an estuarine environment. The innovations include (a) proposing novel ML models not previously explored in the literature and (b) applying the proposed ML models at a finer, hourly time scale in both simulation and prediction.
Salinity management in estuarine environments can impact a region's water supply and ecology. Estuarine salinity is linked to changes in migration patterns, spawning habitat, fish distribution, and survivability, and affects the water quality of freshwater withdrawals [1,2]. Globally, there are many examples of estuarine systems sensitive to salinization.

Literature Review
Optimizing the operation of the State Water Project (SWP) and the Central Valley Project (CVP) requires estimating salinity for a wide range of climate and operational scenarios. Process-based models have traditionally been developed and utilized for this purpose [12][13][14][15][16][17][18][19][20]. However, applying these models can be time-consuming, particularly in studies that require numerous model runs with long simulation periods. There is a need for fast simulation models with reasonable accuracy. Data-driven ML models have been explored to that end.
The earliest attempt was probably using the multi-layer perceptron (MLP), a type of artificial neural network (ANN), to simulate flow-salinity relationships in the Delta [21]. The study developed MLP models with one or two hidden layers and showed that they could significantly outperform empirical models. The MLP model was later refined in terms of (a) identifying the most effective input features and training strategy; (b) increasing its robustness by training with a variety of hydrologic and operational conditions; and (c) simplifying its implementation into operational water planning models [22][23][24][25]. The resulting MLP model with seven input variables and two hidden layers (with eight and two neurons, respectively) was implemented into California's latest water resources planning model, CalSim3 [26]. CalSim3 simulates the operations of the SWP and CVP under different planning scenarios constrained by regulatory requirements, including allowable salinity levels at various locations across the Delta [26]. Ref. [27] further enhanced the MLP models of [26] in the context of (a) changing the learning paradigm from single-task learning (one MLP per study location) to multitask learning (MTL; one MLP model for all study locations together); and (b) adding a convolution layer before the hidden layers to pre-process input data. These enhancements helped reduce training time and increase the accuracy of the salinity estimates.
In addition to simulating the flow-salinity relationships, ANNs have also been developed to emulate process-based models directly in the Delta. Ref. [2] incorporated the Bayesian ANNs with the delta salinity gradient (DSG) model [17] in a hybrid manner for salinity simulation in the Delta. Ref. [28] utilized ANNs to emulate DSM2 in simulating volumetric fingerprints of flow sources for several study locations across the Delta. Salinity levels at these locations were then derived by multiplying the fingerprints by their corresponding salinity levels at flow source locations. Ref. [10] developed deep learning models, including the long short-term memory (LSTM) networks and convolutional neural networks, to emulate a salinity generation model at the downstream boundary of the Delta. Ref. [29] explored the use of both conventional ML models and deep learning models in emulating DSM2 for salinity estimation at 28 locations across the Delta. However, Ref. [29] did not explore the forecasting capability of the ML models and focused only on the daily scale using simulated salinity data.
Despite their scientific advances and practical values, those Delta salinity ML studies generally have four limitations in common [29]. Firstly, the ML models developed in these studies were mostly applied in simulating salinity under different planning scenarios. The forecasting capability of the ML models was largely unexplored. Reliable and intelligent forecasting is one major practical application for water-resource studies. Over the past decades, machine learning (ML) methods have gained more and more popularity in this area [30,31], due to their ability to handle big data at different scales as well as their flexible structure for identifying non-linear and complex relationships between input and output data. Researchers worldwide have applied novel ML methods in forecasting various variables that are important to water resources management, including streamflow [32][33][34][35][36]; groundwater level [37], groundwater quality [38], groundwater storage change [39], and sediment [40,41], among others. Popular ML algorithms explored include random forest [42][43][44], artificial neural network [45,46], support vector regression [45,47], LSTM [48], regression trees [49], extreme learning machine [45,50], wavelet transform [46,50,51], and adaptive neuro-fuzzy inference system [50,51].
Secondly, those Delta salinity ML studies focused on daily or coarser temporal scales probably due to the prohibitively expensive computing requirement associated with finer time scales. However, sub-daily scales (e.g., tidal scale, hourly scale) are also meaningful for water resources planning and management practices in the Delta. For instance, farmers may need to make water diversion schedules on when to pump water from Delta channels to irrigate their crop lands during a day. Understanding the sub-daily variations of salinity would help inform their relevant decision making to avoid diverting salty water that may have a detrimental effect on crops.
Thirdly, ML models in those studies were typically trained using salinity simulations from process-based models. Simulated data are generally "noise-free", as they follow the physical laws embedded in the advection-dispersion governing equations hardwired in process-based models. This characteristic makes it straightforward for ML models to learn the underlying patterns or signals in simulated data. However, training on noise-free data limits the applicability of those ML models for certain purposes, including forecasting. To forecast the spatial and temporal variations of salinity in the near future, it would be ideal for the ML models to be trained and tested using field observations directly so that they can be utilized to predict what would happen in the field. These field observations reflect real-world salinity conditions and contain information not captured by process-based models, which are, at most, simplified representations of reality.

Scope of the Current Work
The current study attempts to tackle these limitations highlighted above. Specifically, built upon the success of previous studies, particularly the study of [29], which developed four ML models (MLP, LSTM, GRU, and ResNet) to simulate salinity at multiple locations in the Delta, this study proposes two novel ML models, Res-LSTM and Res-GRU, which are less complex but more efficient compared to their vanilla versions (i.e., LSTM and GRU). This study utilizes salinity observations as the target to train these six ML models and assesses their performance on both daily and hourly time scales. Moreover, this study explores the forecasting capability of the two proposed novel ML models. Furthermore, the study discusses the overfitting potential of the proposed models and evaluates model performance against that of a process-based model.
The paper is organized as follows. In Section 2, we describe the methodology, including the proposed ML models and their setup, the study locations, the study dataset, and the evaluation metrics. In Section 3, we illustrate the performance of these models as well as their forecasting capability. In Section 4, we discuss the results, the scientific and practical values of the study, and potential future work. The study is concluded in Section 5.

Study Area and Dataset
This study exemplifies the development and use of novel ML models in an estuarine system: the Delta of California (Figure 1). The Delta is a transition zone between freshwater and saltwater, where freshwater inflows from the Sacramento and San Joaquin rivers are conveyed westward toward the San Francisco Bay through a series of channels and tributaries. Managing Delta salinity is important to maintain the region's ecological health, freshwater supply reliability, and regulatory compliance. Salinity is monitored at sparse locations across the Delta and is typically represented as electrical conductivity (EC) in micro-Siemens/cm (µS/cm), which indicates the amount of dissolved salt in water. Specifically, this study focuses on 23 salinity-monitoring locations with a reasonably long record of observed data (Figure 1). These locations include freshwater pumping locations, key flow junctions, and locations of ecological significance in the Delta. Hourly salinity measurements during the study period from 1 January 2000 to 31 December 2019 at these 23 stations are used for ML model training and testing.
Salinity in the Delta can vary from fully marine to near-zero, depending on the location and the interaction between tides and freshwater inflows [52]. Figure 2 visualizes the range of EC values at each station over the 20-year study period. The orange line on each box represents the median EC value for the corresponding study location. Each box represents the interquartile range from the 25th to the 75th percentiles. The upper bar indicates the maximum EC value within 1.5 times the interquartile range above the 75th percentile, and the lower bar indicates the minimum EC value within 1.5 times the interquartile range below the 25th percentile. The open circles represent outliers. Figure 2 illustrates that the EC between the least saline and most saline locations can span two orders of magnitude. Generally, water is least saline toward the northern Delta, where lower salinity water from the Sacramento River and eastern tributaries (e.g., Cosumnes River, Mokelumne River, and Calaveras River) enters the Delta. The northernmost locations considered in this study (RSMKL008 #1, RSAN032 #2 and RSAN037 #3) exhibit median EC between 100 and 300 µS/cm. Median salinity near the major pumping locations (CHVT000 #7, CHSWP003 #8, CHDMC006 #9) ranges from 350 to 450 µS/cm. San Joaquin River inflows into the southern Delta (RSAN072 #12) have a higher median EC between 600 and 800 µS/cm, due to having higher salt content from agricultural drainage [53]. In the brackish Suisun Marsh (SLMZU011 #20, SLSUS012 #21, SLCBN002 #22), where the saltwater from the San Francisco Bay meets the freshwater from the Sacramento and San Joaquin Rivers, the median salinity ranges from 8000 to 10,000 µS/cm. The westernmost locations considered in this study (RSAC075 #16 and RSAC064 #23) have high median salinity and variability due to their proximity to the ocean and influence from tidal cycles [53].
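The box-and-whisker statistics described above (median, interquartile range, whiskers at the most extreme values within 1.5 times the interquartile range, and outliers beyond them) can be computed as in the following sketch; the quantile interpolation method is an assumption, since the paper does not state it.

```python
import numpy as np

def box_stats(ec):
    """Box-and-whisker statistics as described for Figure 2: median,
    interquartile range (IQR), whiskers at the most extreme data points
    within 1.5 * IQR of the quartiles, and outliers beyond the fences."""
    ec = np.asarray(ec, dtype=float)
    q1, med, q3 = np.percentile(ec, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = ec[(ec >= lo_fence) & (ec <= hi_fence)]
    lower_whisker, upper_whisker = inside.min(), inside.max()
    outliers = ec[(ec < lo_fence) | (ec > hi_fence)]
    return med, (q1, q3), (lower_whisker, upper_whisker), outliers
```

Applied per station to the 20-year hourly EC record, this reproduces the elements plotted in Figure 2.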
As with any real-time observations, there can be missing data. The ratios of available data in Figure 2, expressed in percent, indicate the data available during the 20-year observation period. An ideal ratio of 100% represents no missing data in the 20-year period. Most (14 out of 23) stations have over 90% data availability. Stations with low data availability generally lack observations for consecutive years in the early part of the 20-year period, likely because sensors had not yet been installed. For example, data for Dutch Slough (SLDUT007) are available starting from 2010. From 2010 onward, the data are continuously available for all stations with minor, intermittent dropouts. To maintain consistency with our previous study [29], we use the same set of eight variables as input features to the ML models proposed in the current study (Table 1). Both salinity measurements and input variables during the study period have been rigorously quality controlled and applied to calibrate Delta Simulation Model II (DSM2), the operational hydrodynamics and water quality model providing simulations of flow, water stage, and water quality variables (including salinity) to guide real-time and long-term planning of water operations in the Delta [54]. The source of data utilized in the current study is provided in Appendix A, and the details of the datasets utilized in previous studies are summarized in Appendix B. Following [29], we randomly select 70% of the historical salinity measurements for training and use the remaining 30% for testing the proposed ML models.
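The 70%/30% random split can be sketched as follows; the seeded generator is an assumption made for reproducibility, not something the paper specifies.

```python
import numpy as np

def train_test_split_indices(n_samples, train_frac=0.7, seed=0):
    """Randomly partition sample indices into train/test sets,
    mirroring the 70%/30% random split described in the text."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)       # shuffle all sample indices
    n_train = int(round(train_frac * n_samples))
    return idx[:n_train], idx[n_train:]
```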

Machine Learning Models
This study explores the use of six different ML models in a multi-task learning framework (i.e., one ML model for all study locations). Four of them were investigated in our previous study [29], where they were trained and tested using noise-free salinity simulations rather than real-world field observations of salinity. The current study adopts the same architectures applied in [29]: a multi-layer perceptron (MLP) network, a residual network (ResNet), a long short-term memory (LSTM) network, and a gated recurrent unit (GRU) network. The workflow of each of these four models is provided in Appendix C. These models are briefly described as follows; for a detailed explanation, readers are referred to [29].
The MLP model consists of one input layer, two fully connected (FC) hidden layers, and an output layer. The number of neurons in the two hidden layers is the number of study locations multiplied by 8 and 2, respectively. The input time series are pre-processed before being sent to the MLP model, as discussed in detail in Section 2.3. Results in [29] show that an MLP model achieves satisfactory salinity estimates, but the pre-defined input pre-processing procedure can still lead to unavoidable information loss in the input data. To improve the performance of the MLP model, a ResNet [55] model is developed by adding a shortcut side path, including two convolutional layers and an FC layer, to the vanilla MLP model to skip the pre-processing step so that the temporal information in the inputs is preserved. In addition to MLP and ResNet, Ref. [29] also explored two recurrent neural networks (RNNs): GRU and LSTM. RNNs maintain an internal memory in order to preserve essential temporal information from the inputs and have achieved great success in time series analysis and processing tasks. GRU is a popular RNN architecture designed for processing sequential data, which consists of a reset gate and an update gate connected to its hidden state. Similar to GRU, LSTM models keep a hidden state, which is the short-term memory, while storing an additional cell state, also known as long-term memory. Following [29], in the current study, we set the numbers of neurons in the FC layers in ResNet, and the numbers of RNN units in the LSTM and GRU models, equivalent to the numbers of neurons in the baseline MLP model.
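A minimal numpy forward pass of the baseline MLP can be sketched from the sizes stated in the text: 144 pre-processed inputs (8 variables × 18 aggregated values, per Section 2.3), hidden layers of 184 (= 23 × 8) and 46 (= 23 × 2) neurons, and 23 outputs (one per station). The sigmoid activation and the linear output layer are assumptions; this excerpt does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    """Forward pass of the baseline MLP: 144 inputs -> 184 -> 46 -> 23.
    Sigmoid hidden activations and a linear output are assumptions."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = sigmoid(x @ w1 + b1)   # first hidden layer (184 neurons)
    h2 = sigmoid(h1 @ w2 + b2)  # second hidden layer (46 neurons)
    return h2 @ w3 + b3         # one output per monitoring station

def init_params(rng, sizes=(144, 184, 46, 23)):
    """Small random weights and zero biases for each layer."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        params += [rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)]
    return params

rng = np.random.default_rng(42)
y = mlp_forward(rng.normal(size=144), init_params(rng))
```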
In addition to these four models, we devise two novel architectures: Res-LSTM and Res-GRU. LSTM and GRU models applied in [29] are beneficial for time-series-related tasks, as they are capable of keeping track of memory when processing sequential inputs in an iterative manner. However, they have two major drawbacks. Firstly, they usually appear as complex models with more parameters than non-RNN models such that sufficient memory can be retained. Secondly, as the subsequent cell states in GRU and LSTM depend on the previous ones, computations cannot be processed in parallel, making them run slower than MLP or ResNet by design.
In order to address these two issues, inspired by the shortcut design in ResNet, we propose the Res-LSTM and Res-GRU models, which are less complex than the baseline LSTM or GRU models, respectively. According to our previous work [29], a vanilla MLP model already yields satisfactory results. Therefore, we directly use the MLP model as the main branch in Res-LSTM or Res-GRU models, as illustrated in Figures 3 and 4 below. The numbers of neurons, 184 and 46, of the two hidden FC layers in the main branch are identical to those in the original MLP baseline model. In addition, we add the shortcut connection, consisting of a single LSTM or GRU layer, such that the error, or the residual, between the ground truth and the outputs of the MLP model can be captured. Taking advantage of the powerful MLP baseline model, we can reduce the complexity of the LSTM or GRU layer in the shortcut path in comparison with the baseline LSTM or GRU models. Here, we arbitrarily pick the number of units in the LSTM or GRU layers to be 46, which is equivalent to the number of neurons in the second hidden layer of the MLP baseline model. The number of parameters of the six aforementioned models can be found in Table 2.
The LSTM or GRU layer in the shortcut branch takes the time series of the eight input variables as input. At each of the 118 daily time steps in the inputs, the RNN layer processes one set of daily values of the eight input variables to update its hidden state and/or cell state accordingly. At the final (118th) daily time step, the LSTM or GRU layer outputs its final hidden state, which is trained to fit the error, or residual, defined above. When applying the Res-RNN models for salinity forecasting, we simply shift the target salinity forward by a given lead time such that the outputs generated by these Res-RNN models represent the salinity values in the future. As can be seen in Table 2, Res-LSTM and Res-GRU use fewer parameters than their corresponding vanilla LSTM and GRU models. Meanwhile, in theory, they ought to outperform the vanilla MLP model, since the shortcut side branch can learn to compensate for residual errors of the MLP main branch. Additionally, they need shorter training times.
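The Res-LSTM architecture described above can be sketched in numpy: an MLP main branch (144 → 184 → 46 → 23) plus an LSTM shortcut with 46 units run over the 118-day input sequence, whose final hidden state is projected to the 23 outputs and added to the MLP output. The standard LSTM gate equations are used; the sigmoid MLP activations and the dense projection at the end of the shortcut are assumptions, as the excerpt only states the branch sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_hidden(x_seq, W, U, b, hidden=46):
    """Run one LSTM layer over the (118, 8) sequence and return the
    final hidden state, as the shortcut branch does."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in x_seq:
        z = x_t @ W + h @ U + b            # stacked gate pre-activations
        i = sigmoid(z[:hidden])            # input gate
        f = sigmoid(z[hidden:2*hidden])    # forget gate
        g = np.tanh(z[2*hidden:3*hidden])  # candidate cell state
        o = sigmoid(z[3*hidden:])          # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def res_lstm_forward(x_seq, x_agg, p):
    """Res-LSTM sketch: MLP main branch plus LSTM shortcut whose final
    hidden state is projected to 23 outputs and added to the MLP output."""
    h1 = sigmoid(x_agg @ p["w1"] + p["b1"])
    h2 = sigmoid(h1 @ p["w2"] + p["b2"])
    main = h2 @ p["w3"] + p["b3"]
    resid = lstm_last_hidden(x_seq, p["W"], p["U"], p["bg"]) @ p["wr"] + p["br"]
    return main + resid

rng = np.random.default_rng(0)
p = {
    "w1": rng.normal(0, 0.1, (144, 184)), "b1": np.zeros(184),
    "w2": rng.normal(0, 0.1, (184, 46)),  "b2": np.zeros(46),
    "w3": rng.normal(0, 0.1, (46, 23)),   "b3": np.zeros(23),
    "W":  rng.normal(0, 0.1, (8, 4*46)),  "U": rng.normal(0, 0.1, (46, 4*46)),
    "bg": np.zeros(4*46),
    "wr": rng.normal(0, 0.1, (46, 23)),   "br": np.zeros(23),
}
y = res_lstm_forward(rng.normal(size=(118, 8)), rng.normal(size=144), p)
```

A Res-GRU sketch would replace `lstm_last_hidden` with a GRU cell (reset and update gates, no cell state) of the same hidden size.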

Input Preprocessing
The inputs for the proposed ML models are generated using the same pre-processing strategy as introduced in [29]. Specifically, given the long memory of the Delta, where flow and operations in the past several months have lagged impacts on current salinity conditions [26][27][28], we aggregate 118 antecedent daily values of each of the 8 input variables into 18 values per input variable. These 18 values consist of one measurement from the current day, 7 daily measurements from the most recent 7 preceding days, and 10 non-overlapping 11-day averages of the prior 110 days.
We use bold symbols for vectors and regular symbols for scalars. We utilize superscripts to represent time indices and subscripts to represent input and output variable indices. For instance, $x_i^t$ stands for the i-th input variable on day t. We write the mean of the i-th input variable computed over a time frame from day $t_1$ to day $t_2$, with $t_1 < t_2$, as

$$\bar{x}_i^{t_1 \to t_2} = \frac{1}{t_2 - t_1 + 1} \sum_{t = t_1}^{t_2} x_i^t \quad (1)$$

We apply linear min-max normalization to the input time series and the salinity at each study station, mapping each to the range [0, 1]. That means that for each input feature or each salinity sequence at each station over the 20-year time span, the minimum value is transformed to 0, while the maximum value is transformed to 1. To be more specific, taking $x_i^t$, the i-th input variable on day t, as an example, it is normalized according to

$$\hat{x}_i^t = \frac{x_i^t - \min_{1 \le \tau \le T} x_i^\tau}{\max_{1 \le \tau \le T} x_i^\tau - \min_{1 \le \tau \le T} x_i^\tau} \quad (2)$$

where T is the total number of samples.
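The min-max normalization of Equation (2) is a one-liner in practice; a minimal sketch:

```python
import numpy as np

def minmax_normalize(x):
    """Linear min-max normalization to [0, 1] over the full record,
    as applied to each input feature and each station's salinity."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)
```

Note that the minimum and maximum are taken over the entire 20-year record of each sequence, so the same transform must be stored and reused when new data are normalized.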
The inputs for the proposed MLP network used to estimate reference salinity lev- , and 10 average valueŝ computed on 10 successive 11-day sliding windows, without overlapping, on the prior 110 days. The 18 processed values of each input variable make up the 8 × 18 = 144 inputs parameters for the MLP network. This pre-defined pre-processing method reduces the dimension of the input vector and avoids unnecessarily increasing the complexity of proposed MLP model, as well as preserving historical memory in the input time series.
For the proposed ResNet, LSTM, GRU, Res-LSTM, and Res-GRU models, as they are designed to be more complex than the MLP model and require more detailed information from the input data, we directly input the 118 daily values $\hat{x}_i^t, \ldots, \hat{x}_i^{t-117}$ of each input variable. For salinity estimation on day t, namely, the case where the lead time $t_l = 0$, the salinity levels observed on day t at each of the 23 monitoring stations, denoted $y_k^t$, where k = 1, 2, . . . , 23 is the index of the monitoring stations, are set as target outputs during training or testing.

Forecasting Setup
In our previous work [10,29], we focused only on the investigation of same-day salinity estimation (i.e., the lead time is zero). In practice, forecasting near-term salinity is critical to informing real-time water management decision making. In this work, we extend the scope of the proposed Res-LSTM and Res-GRU models to salinity forecasting up to 14 days into the future (lead time equals 14 days). Specifically, one ML model is trained for each lead day. A total of 14 Res-LSTM models and 14 Res-GRU models are developed.
For salinity forecasting on day t with a lead time of t_l, we perform the following pre-processing steps:

Step 1: We prepare model inputs in the same way as discussed in Section 2.3.

Step 2:
We formulate the target output values by shifting the salinity values forward by t l days, represented by y t+t l k , k = 1, 2, . . . , 23. In this way, after training, the models shall be capable of predicting daily salinity levels at the 23 monitoring stations ahead of time by t l days.
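The target shift in Step 2 can be sketched as follows: inputs for day t are paired with salinity observed t_l days later, so a model trained on these pairs predicts that far ahead.

```python
import numpy as np

def make_forecast_pairs(X, y, lead_days):
    """Pair inputs for day t with targets shifted forward by the lead
    time t_l, so a trained model predicts salinity lead_days ahead.
    X: (n_days, n_features); y: (n_days, 23), one column per station."""
    if lead_days == 0:
        return X, y  # nowcasting / salinity estimation case
    return X[:-lead_days], y[lead_days:]
```

One such paired dataset is built per lead day (1 through 14), matching the 14 separate Res-LSTM and 14 Res-GRU models described above.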
In the remainder of the paper, ML models trained with a lead time of zero (t_l = 0) are referred to as "salinity estimation" models, while models trained with a lead time greater than or equal to 1 (t_l ≥ 1) are referred to as "salinity forecasting" models. It is worth noting that the forecasting models here differ from models applied in real-time forecasting operations, which use forecasted model inputs to drive the model and generate forecasts. The forecasting models developed for each lead time (i.e., day 1 through day 14 into the future) in the current study use purely historical data up to a lead time of zero.

Evaluation Metrics
The proposed models are trained with the Adam optimization algorithm [56] based on the widely used mean squared error (MSE) loss function. Four statistical evaluation metrics, consisting of the square of the correlation coefficient (r²), percent bias, the RMSE-observations standard deviation ratio (RSR), and the Nash-Sutcliffe efficiency coefficient (NSE), are employed to assess ML model performance. Each of the four metrics evaluates the modeled salinity from a different perspective: r² quantifies the strength of the linear relationship between modeled salinity and target salinity; percent bias indicates whether the models over- or underestimate the salinity; RSR is a standardized representation of the root mean squared error (RMSE) between model outputs and targets; and NSE compares the predictive capacity of the models with the global mean of the target sequences. For r² and NSE, a value close to 1 indicates desirable model performance. For percent bias and RSR, a value close to 0 designates good model performance. Table 3 provides detailed descriptions and definitions of these four metrics. Here, S represents the salinity sequence, S̄ indicates the overall average of the salinity levels S, and the subscripts ANN and Observed indicate ANN-estimated and observed salinity, respectively.
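The four metrics can be computed as in the sketch below, using their commonly used definitions (the paper's Table 3 gives the exact formulas, which are not reproduced in this excerpt).

```python
import numpy as np

def skill_metrics(sim, obs):
    """r^2 (squared Pearson correlation), percent bias, RSR (RMSE
    divided by the standard deviation of observations), and NSE,
    following the commonly used definitions of these metrics."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r2 = np.corrcoef(sim, obs)[0, 1] ** 2
    pbias = 100.0 * (sim - obs).sum() / obs.sum()
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    rsr = rmse / obs.std()
    nse = 1.0 - ((sim - obs) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
    return {"r2": r2, "pbias": pbias, "rsr": rsr, "nse": nse}
```

A perfect prediction yields r² = 1, percent bias = 0, RSR = 0, and NSE = 1, matching the ideal values stated above.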

Implementation Details
Our experiments are carried out on a public platform: Google Colaboratory. Hyperparameters such as batch size, learning rate, and number of epochs may affect model performance. The authors of a previous study [57] optimized some hyperparameters of a backpropagation neural network (BPNN) architecture, including the number of nodes, the learning rate, and the number of epochs. In contrast, we used a constant small learning rate of 0.001 with the Adam optimizer [56] to train our models and stopped training if the mean squared error (MSE) on the test set did not decrease for 50 epochs. In addition, to prevent overtraining, we limited the maximum number of epochs to 5000. In this way, we did not have to specifically optimize the learning rate or the number of epochs.
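The training-control logic described (stop when test MSE has not improved for 50 epochs, hard cap of 5000 epochs) can be sketched as a framework-agnostic loop; `eval_epoch` is a hypothetical stand-in for one epoch of training followed by a test-set MSE evaluation.

```python
def train_with_early_stopping(eval_epoch, patience=50, max_epochs=5000):
    """Stop training when the test-set MSE has not decreased for
    `patience` epochs, with a hard cap of `max_epochs` epochs.
    `eval_epoch(epoch)` trains one epoch and returns the test MSE."""
    best_mse = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        mse = eval_epoch(epoch)
        if mse < best_mse:
            best_mse, best_epoch = mse, epoch  # new best; reset patience
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_mse
```

With this scheme, only the learning rate (fixed at 0.001) and the patience need to be chosen, as the text notes.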

Results
This section presents the results of the proposed ML models, particularly the two novel models. Firstly, the training and testing performance of all six models is scrutinized in terms of the skill metrics described in Section 2.5 as well as visual inspection of modeled salinity against the corresponding observations. Next, the forecasting capability of the two proposed novel models is examined. Finally, model performance is evaluated on the hourly time step.

Figure 6 shows the corresponding exceedance probability curves and daily time series plots of Res-LSTM simulations compared with the observed data at selected locations. Both types of plots demonstrate that the simulations mimic the target observed salinity very well, with the time series plots showing that the temporal pattern and magnitude are generally captured. Another notable pattern is that, despite the marginal discrepancies between simulations and observations across the full salinity spectrum, the time series plots reveal that Res-LSTM slightly underestimates high salinity, especially for RSAC092 and RSAN018.

Figure 7 shows the statistical metrics for each study location, calculated over three ranges, illustrating the performance of the Res-LSTM model compared to observed data on a daily time step. For the metrics r², NSE, and RSR, yellow indicates satisfactory performance. For the percent bias metric, shaded blue and orange represent underestimation and overestimation, respectively. Overall, model performance is most satisfactory when salinity is in the low-middle range (0-75%) and decreases in the high (75-95%) and extremely high (95-100%) salinity ranges. Several notable observations are further discussed below.

Performance at location RSAC092 (Sacramento River at Emmaton) is lower in the low salinity range but is consistent with the other locations in the high and extremely high ranges. Despite the departure from the other stations, overall, r², NSE, and RSR for RSAC092 are acceptable.
The Res-LSTM model underestimates salinity in the low-middle range, where the percent bias is −11%. This is because the Res-LSTM often estimates zero EC at Emmaton, but this generally does not occur in the observed data. At location RSAC064 (Port Chicago), r 2 , NSE, and RSR are acceptable in the low-middle and extremely high ranges, but less satisfactory under the high range. The Res-LSTM is less able to capture the salinity variability at Port Chicago under the high range, but the percent bias is acceptable and consistent with other locations. At locations in the Suisun Marsh (SLMZU011, SLSUS012, SLCBN002), the Res-LSTM tends to overestimate salinity, especially in the low-middle range.

Model Performance on the Daily Scale
In short, the novel Res-LSTM and Res-GRU models can satisfactorily estimate salinity at the locations studied, while achieving similar or better performance compared with their more complex LSTM and GRU counterparts. Generally, performance is best at stations with lower median salinity and variability and degrades at stations with higher salinity and variability. The combination of a simpler architecture and performance comparable to the vanilla LSTM and GRU models indicates that the new models show promise in estimating Delta salinity on a daily time step.

Forecasting Performance
Generally speaking, model performance declines smoothly as the lead time increases, and all the metrics are within a reasonable range. Even with a lead time of 14 days, the Res-LSTM model predictions are satisfactory. For all lead times evaluated under training and testing, NSE is above 0.94, r² is above 0.95, and percent bias centers around zero, indicating excellent predictive performance without a tendency to systematically underestimate or overestimate.

Figure 9 shows the corresponding performance of Res-GRU models based on the four criteria (r², percent bias, RSR, and NSE) in two rows. The first row (Figure 9a-d) and the second row (Figure 9e-h) display the performance of Res-GRU on the training and test datasets, respectively. Results on the test dataset (Figure 9e-h) indicate that the nowcasting model (forecasting with zero lead time) performs best and that forecasting accuracy decreases as the lead time increases, which is expected of any forecasting model. However, the models with lead times of 6 and 12 days do not follow this pattern, and the model with a 6-day lead time provides the worst performance, though it is still satisfactory (r² and NSE are high, while RSR and percent bias are low). This suggests that historical data up to a lead time of zero alone may not be ideal for forecasting these two lead times with the Res-GRU model. Overall, all the metrics are within a reasonable range. For all lead times evaluated under training and testing, NSE is above 0.94 and r² is above 0.94, indicating satisfactory performance overall. The percent bias metric shows higher variability than the Res-LSTM predictions (Figure 5) but does not show a clear systematic bias toward underestimation or overestimation. All in all, for all lead times considered, Res-LSTM and Res-GRU can forecast salinity levels at all study locations with satisfactory performance.

Model Performance on the Hourly Scale
The results presented so far are all trained and tested using daily salinity data aggregated from the hourly observations of salinity. In this sub-section, the six proposed ML models are trained using the hourly observations directly, though the input data supplied to the models are still on the daily scale. Figure 10 compares the performance of these models during the training (Figure 10a-d) and test (Figure 10e-h) runs, using box-and-whisker plots for the four metrics r², percent bias, RSR, and NSE. Each plot includes one box and whisker for each model evaluated. Based on these metrics, the Res-LSTM model generally outperforms all the other ML models tested during the training and test runs. On average, Res-LSTM has the highest r² and NSE; it also has the lowest percent bias and RSR for both training and testing. The performance of Res-GRU is close to, but not as good as, that of Res-LSTM. In contrast, ResNet has slightly inferior performance compared to the other ML models, followed by MLP. Compared to their counterparts on the daily scale (Figure 5), the skill metrics r², RSR, and NSE are notably inferior, indicative of stronger performance on the daily (versus hourly) scale for all six models.

Figure 11 shows the corresponding exceedance probability curves and hourly time series plots to evaluate the performance of Res-LSTM at six selected locations in the Delta. In general, the differences between model simulations and the corresponding observations are marginal. However, the time series subplots indicate that the Res-LSTM models slightly underestimate the peak values at some of these locations. Nevertheless, the plots show remarkable similarity between models and observations, and the Res-LSTM model skillfully captures the temporal pattern of observed salinity. Comparing Res-LSTM performance on the daily scale (Figure 6) versus the hourly scale (Figure 11), the metrics associated with the daily scale are generally superior.
In particular, the r² and NSE are slightly higher while the RSR is generally lower on the daily scale. This is also observed for the other models, as illustrated in Figures 5 and 10. Analogous to Figure 7, Figure 12 shows heatmaps that summarize the performance of the Res-LSTM model at hourly time steps using the statistical metrics r², percent bias, RSR, and NSE for each study location. In general, model performance is most satisfactory for salinities in the low-middle range across most stations, but lower for the high and extreme high ranges. Compared to the daily time step simulation results in Figure 7, the metrics associated with the hourly time step are inferior for most locations. The detailed values of all four study metrics on both daily and hourly scales are provided in Appendix D. In a nutshell, all six proposed models achieve satisfactory performance at the finer hourly scale, and Res-LSTM slightly outperforms the other five. The differences between model simulations and the corresponding observations are small on average. The performance of Res-LSTM is highest in the low-middle range, but relatively lower for the high and extreme high ranges. Compared to their counterparts on the daily scale, the ML models on the hourly scale generally have slightly degraded performance.
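The four evaluation metrics used throughout (r², percent bias, RSR, and NSE) can be computed from paired observed and simulated series. The following is a minimal NumPy sketch following common hydrologic-modeling conventions (the function name and exact formulas are ours, not taken from the study):

```python
import numpy as np

def evaluation_metrics(obs, sim):
    """Compute r^2, percent bias, RSR, and NSE for paired series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]                   # Pearson correlation
    pbias = 100.0 * np.sum(sim - obs) / np.sum(obs)   # percent bias
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    rsr = rmse / np.std(obs)                          # RMSE / std. dev. of observations
    nse = 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return r ** 2, pbias, rsr, nse
```

A perfect simulation yields r² = 1, percent bias = 0, RSR = 0, and NSE = 1, which is why values close to these targets indicate satisfactory skill.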

Discussions
This section first discusses the overfitting potential of the ML models proposed in the study. Second, the performance of a selected ML model is compared with that of the process-based model DSM2, which is widely used to inform water operations in the Delta. Next, the section discusses the scientific and practical implications of the study, followed by discussions on study limitations and planned future work.

Overfitting Potential versus Model Complexity
Overfitting happens when an ML model fits the training data too closely, picking up details including noise, and consequently fails to generalize to unseen data [58,59]. Overfitting is a central problem in data-driven ML, as it degrades a model's performance on new data. It is more likely to occur when a model's structure is too complex for the task. To avoid this problem, the number of neurons or units in the layers needs to be determined carefully.
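A toy illustration of this failure mode (our own, not from the study): a 1-nearest-neighbor regressor memorizes its training data, achieving zero training error by construction, yet still errs on held-out data, whereas a smoother model (k = 5) trades some training error for better generalization behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                        # true underlying signal
x_tr = rng.uniform(-1, 1, 40)
y_tr = f(x_tr) + rng.normal(0, 0.2, 40)            # noisy training labels
x_te = rng.uniform(-1, 1, 40)
y_te = f(x_te) + rng.normal(0, 0.2, 40)            # held-out test data

def knn_predict(x_query, x_ref, y_ref, k):
    """Average the labels of the k nearest reference points."""
    return np.array([y_ref[np.argsort(np.abs(x_ref - q))[:k]].mean()
                     for q in x_query])

mse = lambda y, p: float(np.mean((y - p) ** 2))
train_k1 = mse(y_tr, knn_predict(x_tr, x_tr, y_tr, k=1))  # memorization: zero error
test_k1 = mse(y_te, knn_predict(x_te, x_tr, y_tr, k=1))   # but nonzero on new data
train_k5 = mse(y_tr, knn_predict(x_tr, x_tr, y_tr, k=5))  # smoothing: nonzero train error
```

The gap between training and test error is the practical signature of overfitting that the complexity-performance analysis below looks for.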
In this study, we proposed six different ML models. In this sub-section, we explore the relationship between model complexity and salinity estimation performance by reducing the number of neurons in the hidden layers of each model. The detailed results are reported in Tables A10 and A11 of Appendix E. We plot model performance (in terms of the average value of each evaluation metric across the 23 study locations) versus model complexity (in terms of the number of parameters) in Figure 13. In general, model performance improves as complexity grows. In both the training and test plots of r², RSR, and NSE, the MLP models show the best complexity-performance trade-off; that is, the MLP model can achieve performance comparable to the other models with a relatively small number of parameters. However, during the grid search process when designing these models, we observed that the MLP model hits a performance plateau earlier than the other models: its test performance stops improving even if we keep increasing the number of neurons in its hidden layers. In contrast, the ResNet model gives the worst complexity-performance trade-off because the extra FC hidden layer in its shortcut branch adds a large number of parameters to the model. Comparing the vanilla RNNs (i.e., LSTM and GRU) with their corresponding Res-RNN models, we see that the simple RNN module in the residual path ensures satisfactory model performance while keeping the complexity down.
In brief, model performance improves as complexity grows, but both training and test performance hit a plateau at some point. No obvious overfitting was observed during our exploration, suggesting that the proposed models are not over-parameterized and are well trained. It is worth noting that there are other methods that can be applied to assess the overfitting potential of machine learning models. The current study briefly explored two of them, namely data distortion and cross-validation, for demonstration purposes. The results (Figures A5-A8 in Appendix F) also indicated no evident overfitting.
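The k-fold cross-validation procedure can be sketched generically as follows; this is a minimal NumPy version for illustration, and the study's actual 5-fold setup in Appendix F may differ in its details:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)          # k near-equal, disjoint folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

Each sample appears in exactly one validation fold, so averaging the k validation scores gives a performance estimate that is less sensitive to any single train/test split and helps reveal overfitting.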

Comparing with a Process-Based Model
It is shown in Section 3 that the proposed ML models in this study, particularly the novel Res-LSTM and Res-GRU models, can simulate and predict real-world salinity in the Delta well. Traditionally, process-based models, including the operational hydrodynamics and water quality model DSM2, have been applied to simulate and predict the spatial and temporal variations of the salinity across the Delta. It is imperative to assess the performance of these ML models against that of the operational process-based models. For illustration purposes, this sub-section compares simulations from the Res-LSTM model and its counterparts from DSM2.
The comparison is conducted by means of both visual inspection and statistical metrics. Figure 14 compares time series plots of the measured, DSM2-simulated, and Res-LSTM-simulated EC data at six selected key stations. Four study metrics (r², percent bias, RSR, and NSE) of both models are also displayed side-by-side for quantitative comparison. The corresponding comparison for all 23 study locations is provided in Appendix G. Upon visual inspection of the time series plots, both simulations mimic the target field measurements of salinity very well. The overall biases of both models are generally small. For Res-LSTM, the absolute bias ranges from 0.3% to 4.6%. The r², RSR, and NSE values of Res-LSTM are notably better than their DSM2 counterparts at all six locations. Collectively, these observations indicate that Res-LSTM yields comparable or more desirable salinity simulations than the operational DSM2 model.
Nevertheless, it should be pointed out that DSM2 can generate salinity simulations not only at the 23 study locations but in all channels across the Delta. The ML models proposed in this study can only be applied at locations where they have been trained, and thus are not meant to be substitutes for process-based models, including DSM2.

Implications
This study has important scientific implications. Firstly, it exemplifies the feasibility of applying ML, particularly deep learning models, as a new scientific exploratory tool to tackle a complex problem. Secondly, it proposes two novel deep learning models (i.e., Res-LSTM and Res-GRU) that have not been explored before. These novel models can be applied to simulate other variables, including water temperature, suspended sediment, dissolved oxygen, and other water quality variables, that are important for guiding water management practices in the Delta. Thirdly, the study illustrates the forecasting capability of the newly developed deep learning models. Effective and efficient forecasting models are valuable tools that can guide real-time operations in light of forecasted near-term salinity conditions. There are also important practical implications. The study demonstrates that the proposed ML models are capable of generating desirable salinity simulations and predictions even on the hourly scale. The overall absolute biases are generally less than 5%. The correlation coefficients, RSR, and NSE values are generally satisfactory, and are comparable or superior to the corresponding metrics of the DSM2 model. In addition to being accurate, the proposed ML models are also highly efficient. DSM2 runs can take hours depending on the simulation period length [28], while the runtime of the trained ML models for the same inputs is measured in seconds. This is particularly appealing, for instance, for real-time operations that require quick turn-around and for studies that require running multiple scenarios over the historical period.

Limitations and Future Work
Despite these scientific and practical implications, this study has several limitations. First of all, the current study randomly splits the dataset into a training subset and a test subset. Though a random training/test split is not uncommon in ML studies in the Delta [2,10,26-28,54], other data-split methods are available. Ref. [29] examined chronological split and manual split and found that the random split method yielded improved results over the other methods. Since the observed data are about one-third shorter than the simulated data used in [29], this study did not employ the chronological/manual methods. In future work focusing on larger datasets, we will explore chronological and manual split methods. In addition, explainable artificial intelligence (XAI) is an active research area [60], and a number of XAI approaches have been developed. In our future work, we plan to explore various XAI approaches, including the gradient-based method [61], the backpropagation-based method [62], and input variable sensitivity analysis [57], and conduct an in-depth investigation of the importance of different input features in different regions and locations of the Delta.
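The two split strategies differ only in how indices are assigned; a minimal sketch of both (our own illustration, with an arbitrary 80/20 split fraction):

```python
import numpy as np

def chronological_split(n, test_frac=0.2):
    """Reserve the most recent fraction of the record for testing."""
    cut = int(n * (1 - test_frac))
    return np.arange(cut), np.arange(cut, n)

def random_split(n, test_frac=0.2, seed=0):
    """Assign each time step to train or test at random."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * (1 - test_frac))
    return np.sort(idx[:cut]), np.sort(idx[cut:])
```

Under a random split, test samples interleave in time with training samples, which is one reason it can score higher; a chronological split is the stricter test of extrapolation to unseen periods.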
This study used eight empirical variables as input features to the proposed models. They were shown to yield desirable salinity estimation at the study locations. However, other variables, including precipitation and wind speed, also influence the circulation and mixing of freshwater and sea water and thus affect the salinity level in the study area. In our future work, we will also explore the impacts of additional input features on the ability of the proposed models to estimate Delta salinity.
In this study, the ML models are trained and tested on historical time series. The selected range of data may not capture potential hydrologic extremes or increased water use. Climate change is expected to cause larger storm-driven streamflows and altered runoff timing [63]. In addition, in the coming decades, municipal, industrial, and agricultural water demand in California is projected to increase due to growing urbanization and changing agricultural practices [64]. Models trained on historical data, therefore, will not have been exposed to the range of inflows and operational constraints resulting from potential future conditions. Additionally, the ML models are trained using data from 23 study locations and can thus only be applied at those locations. In the future, we will explore generic ML models capable of generating salinity estimates at locations they were not trained on.
Exclusively using the historical record for training may also introduce shortcomings for long-term planning studies, as the training dataset is not modified beyond the scope of historical operational considerations. In real-time operations and planning, measures such as emergency barriers and temporary operational regimes may be implemented to manage flow and salinity. Operational measures include emergency temporary barriers [65] to manage salinity intrusion and maintain acceptable water quality at pumping locations. The historical time series reflects limited use of emergency operational measures and does not consider the operation of the Suisun Marsh Salinity Control Gates (SMSCG). The limitations associated with using a historical dataset during training will be addressed in future work, where the models will be trained using synthetically augmented datasets.
Data augmentation is a technique for generating synthetic data for model training, providing an enlarged and diversified dataset that better represents extreme and possible future conditions. In our ongoing preliminary tests, we use the DSM2 historical simulation as the baseline and then apply several modifications, including (1) scaling the magnitude of major boundary flows; (2) temporally shifting major boundary flows; and (3) changing the operations of key Delta structures, such as operable gates. All of the above aim to reduce overfitting and thus improve the generalization ability of the trained neural networks. Another benefit of data augmentation is that it can provide time series long enough for chronological-split training, bypassing the limitations of the random split method.
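The three modifications listed above can be sketched as simple transforms on daily boundary time series. This is illustrative only; the function names, edge-padding choice, and the binary gate encoding are our assumptions, not the study's implementation:

```python
import numpy as np

def scale_flow(flow, factor):
    """(1) Scale the magnitude of a boundary flow series."""
    return flow * factor

def shift_flow(flow, days):
    """(2) Shift a series later in time by `days`, holding the first value at the edge."""
    if days == 0:
        return flow.copy()
    return np.concatenate([np.full(days, flow[0]), flow[:-days]])

def toggle_gate(gate_op, start, end):
    """(3) Flip a binary gate-operation series within a window (0 = open, 1 = closed)."""
    out = gate_op.copy()
    out[start:end] = 1 - out[start:end]
    return out
```

Applying combinations of these transforms to the baseline simulation yields many plausible variant scenarios from a single historical record.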
In another follow-up work, we plan to explore physics-informed neural networks (PINNs), a cutting-edge class of neural networks that embed knowledge of physical laws into training. Most physical laws governing the dynamics of a system can be described by partial differential equations (PDEs). A PINN adds the underlying PDE of the dynamics directly into the loss function of the neural network [66,67]. For our study, we plan to implement a PINN based on the one-dimensional advection-dispersion equation for salinity transport [14,68].
A PINN can be viewed as a regularization that limits the space of admissible solutions by adding prior knowledge of general physical laws to the training of neural networks. Its benefits include increasing the correctness of the function approximation, helping the learning algorithm capture the right solution, generalizing well even with a small number of training examples, and providing a mesh-free alternative to traditional approaches. A PINN can also be viewed as a paradigm that bridges the gap between process-based models, which are developed from the known PDEs of the dynamics, and machine learning models, which are driven purely by existing data. By integrating the best of both, namely the physics information of process-based models and the training data employed in machine learning models, a PINN could learn the underlying solution of the dynamics more accurately and more efficiently.
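For reference, the one-dimensional advection-dispersion equation for salinity transport and the resulting composite PINN loss can be written as follows (the notation is ours: C is salinity concentration, u flow velocity, D the dispersion coefficient, hat-C the network prediction, and λ a weighting hyperparameter):

```latex
% 1-D advection-dispersion equation governing salinity transport
\frac{\partial C}{\partial t} + u\,\frac{\partial C}{\partial x}
  = \frac{\partial}{\partial x}\!\left( D\,\frac{\partial C}{\partial x} \right)

% Composite PINN loss: data misfit plus PDE residual penalty at collocation points
\mathcal{L}
  = \frac{1}{N_d}\sum_{i=1}^{N_d} \left( \hat{C}(x_i, t_i) - C_i \right)^2
  + \lambda\,\frac{1}{N_r}\sum_{j=1}^{N_r}
    \left(
      \frac{\partial \hat{C}}{\partial t} + u\,\frac{\partial \hat{C}}{\partial x}
      - \frac{\partial}{\partial x}\!\left( D\,\frac{\partial \hat{C}}{\partial x} \right)
    \right)^{\!2} \Bigg|_{(x_j, t_j)}
```

The first term fits the observed salinity data, while the second penalizes violations of the transport PDE, evaluated via automatic differentiation at collocation points.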
Moreover, this study explored the forecasting capability of the two proposed novel ML models using historical data up to the forecast time only, mainly because forecasted input variables were not available. The performance of the ML models is expected to be even more satisfactory should forecasted model inputs be used to drive them. In a follow-up study, we plan to collect archived forecasted inputs and salinity data from real-time operations, develop ML models based on them, and test the trained models in hindcasting mode.
Lastly, we are developing an interactive dashboard to integrate and visualize our modeling results. It is designed as a front-end to the trained neural network model engine, allowing users to customize data inputs, run the ANNs, and query results. The dashboard could generate visualizations of the time-series outputs and the associated metrics for all specified locations across the Delta. This tool could facilitate management decisions in historical, real-time, and forecast applications.

Conclusions
Built upon the success of relevant previous studies that explored machine learning applications in salinity modeling in the Delta, this study develops two novel deep learning models and applies them to both salinity simulation and prediction, including a finer hourly time scale that had not been investigated before. The study shows that the proposed novel models can effectively simulate and predict salinity at all study locations across the Delta. In addition, compared to traditional process-based models, the proposed models run significantly faster once trained. Their effectiveness and efficiency make them viable supplements to operational process-based models for providing salinity estimates to inform both real-time and long-term water management and planning practices.

Acknowledgments: The authors would like to thank the editors and two anonymous reviewers for providing thoughtful and insightful comments that helped to improve the quality of this study. The views expressed in this paper are those of the authors, and not of the State of California.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Data Sources
Input data for the ML models along with observed salinity data are available at the following link: https://data.cnra.ca.gov/dataset/dsm2-v8-2-1 (accessed on 1 July 2022). The input features of this study are daily freshwater flow to the estuary, daily mean coastal water level, and the daily tidal range for water years 1922-2012. Labels are salinity data from nine locations collected from sensors in the Delta.

Appendix B. Summary of Datasets from Previous Studies
Chen et al. (2018) [28]: In this study, the machine learning emulator is based on data generated using DSM2 (a process-based model), including its outputs at 17 locations for 10 scenarios (two decades each). The use of 10 scenarios is intended to augment the dataset to bound the range of possible water management operations. The study period 1990-2010 was strategically selected as it contains widely varying hydrology and is a period where the DSM2 model is well-calibrated.

Mosavi et al. (2018) [31]: In this study, the authors examined studies that used field data from rain gauges and other sensing devices, including data from remote sensing technologies.

The input features of this study are eight inputs representing boundary flows or operating rules for Delta flow and salinity management. DSM2-simulated daily salinity at the 28 study locations during 1990-2019 is used as the training label dataset.

Figure A1. Diagram of the MLP network from [29]. The number in the input layer denotes input shape and those in the subsequent FC layers represent the numbers of neurons of the layers.

Figure A2. Diagram of the ResNet network from [29]. The number in the input layer denotes input shape and those in the FC layers represent the numbers of units/neurons of the layers. In the convolutional layers following the input layer, "f" denotes the number of convolutional filters, "k" denotes the size of the convolutional kernels, and "s" denotes the stride.

Figure A3. Diagram of the LSTM network from [29]. The number in the input layer denotes input shape and those in the subsequent layers represent the numbers of units/neurons of the layers.

Figure A4. Diagram of the GRU network from [29]. The number in the input layer denotes input shape and those in the subsequent layers represent the numbers of units/neurons of the layers.

Figure A5. Comparison of six models on observed data without ("w/o") or with ("w/") data distortion.

Figure A6. Comparison of the 5-fold cross-validation on the MLP architecture using observed data. "SP" stands for "split".

Figure A7. Comparison of the 5-fold cross-validation on the Res-LSTM architecture using observed data. "SP" stands for "split".