Using Deep Learning Algorithms for Intermittent Streamﬂow Prediction in the Headwaters of the Colorado River, Texas

: Predicting streamﬂow in intermittent rivers and ephemeral streams (IRES), particularly those in climate hotspots such as the headwaters of the Colorado River in Texas, is a necessity for all planning and management endeavors associated with these ubiquitous and valuable surface water resources. In this study, the performance of three deep learning algorithms, namely Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Self-Attention LSTM models, were evaluated and compared against a baseline Extreme Learning Machine (ELM) model for monthly streamﬂow prediction in the headwaters of the Texas Colorado River. The predictive performance of the models was assessed over the entire range of ﬂow as well as for capturing the extreme hydrologic events (no-ﬂow events and extreme ﬂoods) using a suite of model evaluation metrics. According to the results, the deep learning algorithms, especially the LSTM-based models, outperformed the ELM with respect to all evaluation metrics and offered overall higher accuracy and better stability (more robustness against overﬁtting). Unlike its deep learning counterparts, the simpler ELM model struggled to capture important components of the IRES ﬂow time-series and failed to offer accurate estimates of the hydrologic extremes. The LSTM model (K.G.E. > 0.7, R 2 > 0.75, and r > 0.85), with better evaluation metrics than the ELM and CNN algorithm, and competitive performance to the SA–LSTM model, was identiﬁed as an appropriate, effective, and parsimonious streamﬂow prediction tool for the headwaters of the Colorado River in Texas.


Introduction
The cessation of flow for at least a portion of a year is a defining characteristic of intermittent rivers and ephemeral streams (IRES) [1]. Different forms of IRES, from headwater streams to the tributaries of mountainous rivers and snow-fed streams, make up about 60% of the river network in the United States [2,3] and more than 50% of all streams globally [4]. These streams play a crucial role in their landscape's environmental and hydrological connectivity [5][6][7]. The transition between wet and dry states in IRES is an influential factor in promoting the peak biodiversity of riparian vegetation [8], controlling the kinetics of biogeochemical cycles [9], and channel geomorphology [10]. Additionally, IRES offer beneficial ecosystem services like forage, nesting sites, and transportation routes for both aquatic and terrestrial wildlife [11][12][13][14][15]. Further, there is significant interest in utilizing IRES systems to address anthropogenic water supply needs [16].
Flow in IRES is primarily influenced by soil and precipitation characteristics, both of which are heavily affected by changes in climatic patterns [4,17,18]. Many perennial streams are projected to become ephemeral as a result of climate change [19][20][21]. Hence, reliable models are required to capture the link between meteorological variables (e.g., precipitation, temperature) and streamflow in IRES. Accurate streamflow prediction in IRES settings is an essential step for including these increasingly necessary water resources in various planning and management endeavors, from floodplain design and ecosystem conservation efforts to long-term supply management and climate impact analysis.
Rahmani-Rezaeieh et al. [35] used an ensemble gene expression programming (EGEP) modeling approach for 1-day-and 2-day-ahead streamflow forecasting in Iran's Shahrchay River and reported competitive performance, sometimes with higher accuracy, compared to regular ANN. Mehr and Gandomi [36] developed MSPG-LASSO, a new multi-stage genetic programming technique coupled with multiple regression LASSO methods, for univariate streamflow forecasting in Turkey's Sedre River and found it superior to a series of models from the genetic programming variant family. Kisi et al. [37] investigated the predictive abilities of the Extreme Learning Machines (ELM) model coupled with Discrete Wavelet Transform for monthly intermittent streamflow forecasting and found it superior to regular ANN models. Li et al. [38] devised a staged error model that treats zero flows as censored data for hourly streamflow forecasting over 18 ephemeral streams in Australia. In one of the most recent researches on IRES flow, Alizadeh et al. [39] developed an attention-based Long-Short Term Memory (LSTM) cell deep learning (DL) model and examined it for oneto seven-day-ahead predictions of daily flows for four basins in different climatological regimes of the United States and reported accurate and promising results.
While the reported advancements have improved the forecasting abilities of IRES systems, capturing the high variability in most intermittent flow series and modeling their extreme hydrologic events is still challenging. The streamflow models introduced in the literature tend to over-predict the low flow events and under-predict the high flows. Further, in many of these models, lagged values of streamflow (i.e., the model's output at previous time-steps) are utilized as inputs, which severely restricts their application for long-term forecasting, as propagating the prediction errors using endogenous lags causes the accuracy of the prediction to deteriorate quickly [40]. Many water planning and flood management activities (e.g., damage mitigation, food production, environmental protection) associated with IRES, such as the headwaters of the Colorado River in Texas, depend upon accurate streamflow forecasts with the longest possible lead time [36,41]. Therefore, flexible and exogenous (inputs independent of output) hydro-meteorological models are required to model the flow dynamics in intermittent settings and deliver reliable long-term streamflow predictions.
Following the most recent recommendations on modeling IRES flow data, the main objective of this study is to investigate the application of deep learning algorithms for predicting intermittent streamflow in the headwaters of the Colorado River in Texas. Three models, namely a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM), and a Self-Attention LSTM (SA-LSTM) model, were chosen to represent deep learning algorithms of different levels of complexity. An Extreme Learning Machine (ELM) model was developed as a baseline shallow learning model for better comparisons, and to highlight the impacts of the use of deep learning versus shallow learning models. Considering the importance of the Colorado River for long-term water planning for the state, the heavy influence of climate on flow generation in IRES, and the location of the Texas Colorado River headwaters in a climate hotspot, this study adopted a monthly timescale and focused on capturing the links between climate variables and streamflow. This research seeks to answer the following questions about the intermittent headwaters of the Colorado River in Texas: (a) What is the difference between the performance of the deep learning algorithms and that of the baseline ELM model in terms of capturing the hydrological extremes and the entire range of flowrates? (b) Are deep learning algorithms appropriate for intermittent streamflow prediction? (c) How much complexity is warranted for predicting intermittent streamflow using deep learning algorithms?

Study Area
As illustrated in Figure 1a, The Colorado River rises near the Texas-New Mexico border (south of Lubbock, TX, USA) and flows southeast for 1560 km into the Gulf of Mexico at Matagorda Bay, making it the longest and largest river by length and drainage area in Texas [42,43]. Its drainage area, which is about 16 percent of the total area of Texas, is stretched from the dryer west side of Texas, with higher elevations and lower precipitation rates, to the more humid and lesser-elevated southeast of the state (Figure 1b,c). The average annual runoff of the Texas Colorado River reaches a volume of more than 2 billion cubic meters near the Gulf of Mexico [44]. Several dams (e.g., the J.B. Thomas, E.V. Spence, and O.H. Ivie) and lakes (e.g., the Texas Highland Lakes) along the Colorado River serve water supply, flood mitigation, recreational, and energy production purposes. The headwaters of the Texas Colorado River are in the High Plains level III ecoregion at an elevation of 1195 m asl, where the annual average temperature is 13.9 • C, and the average yearly precipitation is 40 cm [45][46][47].
As illustrated in Figure 1a, The Colorado River rises near the Texas-New Mexico border (south of Lubbock, TX) and flows southeast for 1560 km into the Gulf of Mexico at Matagorda Bay, making it the longest and largest river by length and drainage area in Texas [42,43]. Its drainage area, which is about 16 percent of the total area of Texas, is stretched from the dryer west side of Texas, with higher elevations and lower precipitation rates, to the more humid and lesser-elevated southeast of the state (Figure 1b,c). The average annual runoff of the Texas Colorado River reaches a volume of more than 2 billion cubic meters near the Gulf of Mexico [44]. Several dams (e.g., the J.B. Thomas, E.V. Spence, and O.H. Ivie) and lakes (e.g., the Texas Highland Lakes) along the Colorado River serve water supply, flood mitigation, recreational, and energy production purposes. The headwaters of the Texas Colorado River are in the High Plains level III ecoregion at an elevation of 1195 m asl, where the annual average temperature is 13.9 °C, and the average yearly precipitation is 40 cm [45][46][47].

Data
Streamflow data of the headwaters of the Colorado River in Texas were accessed from the closest USGS streamflow monitoring station to the point of origin (Station 08117995, located near Gail in Borden County, TX) and used for this study. This USGS station has a drainage area of 1290 square kilometers and is located upstream of Lake J.B. Thomas, one of the three reservoirs operated by the Colorado River Municipal Water District, supplying water to the rapidly growing Midland-Odessa region in West Texas. Monthly streamflow records were obtained from March 1988 to May 2022 [48], with a total of 0.5% (the equivalent of 2 months) missing data. Kalman filtering was applied as an imputation technique to fill these missing records based on the available data [49,50].
The streamflow records indicate that Texas Colorado River headwaters were dry for 127 months during the study period, making its intermittency ratio (ratio of the duration of dry runs to the total duration of the study) equal 31%. The frequency of extreme high flow events has increased in the stream over the last decade because of changing climatic patterns. The station of interest has recorded three extreme flooding events (greater than 5 cubic meters per second) in the last six years, including the floods of September 2014

Data
Streamflow data of the headwaters of the Colorado River in Texas were accessed from the closest USGS streamflow monitoring station to the point of origin (Station 08117995, located near Gail in Borden County, TX, USA) and used for this study. This USGS station has a drainage area of 1290 square kilometers and is located upstream of Lake J.B. Thomas, one of the three reservoirs operated by the Colorado River Municipal Water District, supplying water to the rapidly growing Midland-Odessa region in West Texas. Monthly streamflow records were obtained from March 1988 to May 2022 [48], with a total of 0.5% (the equivalent of 2 months) missing data. Kalman filtering was applied as an imputation technique to fill these missing records based on the available data [49,50].
The streamflow records indicate that Texas Colorado River headwaters were dry for 127 months during the study period, making its intermittency ratio (ratio of the duration of dry runs to the total duration of the study) equal 31%. The frequency of extreme high flow events has increased in the stream over the last decade because of changing climatic patterns. The station of interest has recorded three extreme flooding events (greater than 5 cubic meters per second) in the last six years, including the floods of September 2014 Water 2022, 14, 2972 5 of 24 (14.29 cubic meters per second) and May 2015 (20.10 cubic meters per second) that were almost two and three times greater than the greatest previously recorded flood (7.45 cubic meters per second in May 1992), respectively.
Climate variables were required to develop appropriate hydro-meteorological streamflow prediction models. Utilizing precipitation and evaporation data for modeling IRES systems is recommended in the literature [51][52][53][54]. Monthly precipitation and temperature data were extracted for the location of the streamflow monitoring site and the same 410-month period (March 1988-May 2022) from PRISM [55]. The Thronthwaite method was used to calculate potential evapotranspiration based on temperature records [56]. A summary of the collected data is provided in Table 1. PPT and PET autocorrelation function (ACF) plots ( Figure 2) revealed that the first two lags correlated positively with observed precipitation and evapotranspiration fluxes. This finding implies that the previous two months' rainfall and evaporative fluxes influenced streamflow observations during any given month. Seasonality of rainfall and PET can be seen in these plots, but higher lags were not taken into account due to the parsimony principle [57].  Correlation analysis between the climate variables (and their first two lags) and the streamflow data ( Figure 3) indicated the flowrate at any given month shows a strong correlation with the observed precipitation in that month (r = 0.7) and moderate correlation with the PET (r = 0.27) and the first lag of precipitation (r = 0.31) and PET (r = 0.24). The second lags of precipitation and PET had weak correlations with streamflow (r < 0.2) and, therefore, were not included in the final set of the inputs. All parameters were standardized by subtracting the mean values and dividing them by the standard deviations so that the scale effects were removed, and the impacts of outliers were minimized.

Extreme Learning Machine
An Extreme Learning Machine is composed of three main layers: input, hidden, and output layer, which employ various weights to convey information through the network (Figure 4a). Huang et al. [58,59] suggested the ELM method in which the weights from the input layer to the hidden layer are randomly assigned. ELM reduces the computational time and enhances the generalization ability of the single-layer Artificial Neural Network (ANN) model [58,60,61]. ELMs have gained popularity in the hydrologic literature [62][63][64][65][66][67] and are established as fast and effective streamflow forecasting models [37,[68][69][70][71][72][73].

Extreme Learning Machine
An Extreme Learning Machine is composed of three main layers: input, hidden, and output layer, which employ various weights to convey information through the network (Figure 4a). Huang et al. [58,59] suggested the ELM method in which the weights from the input layer to the hidden layer are randomly assigned. ELM reduces the computational time and enhances the generalization ability of the single-layer Artificial Neural Network (ANN) model [58,60,61]. ELMs have gained popularity in the hydrologic literature [62][63][64][65][66][67] and are established as fast and effective streamflow forecasting models [37,[68][69][70][71][72][73].  An ANN architecture consisting of one input layer and L hidden neurons with an activation function called g(x) and a bias term (B), is presented mathematically as Equation (1): where is the vector weights that connect ith hidden neuron to the input neurons, is the vector of weights connecting ith hidden neuron to the output neurons, is the kth output vector, is the bias regarding ith hidden neuron, and is the output of ith hidden neuron. Moreover, the equation is written based on the assumption of the network being trained using a dataset composed of N arbitrary patterns ( , ). Equation (1) can be written as the following matrix [58,74,75]: In which H is the hidden layer output of the network.
The following equation provides the output weights between the hidden layer and the output layer: An ANN architecture consisting of one input layer and L hidden neurons with an activation function called g(x) and a bias term (B), is presented mathematically as Equation (1): where w i is the vector weights that connect ith hidden neuron to the input neurons, β i is the vector of weights connecting ith hidden neuron to the output neurons, o k is the kth output vector, B i is the bias regarding ith hidden neuron, and g i is the output of ith hidden neuron. Moreover, the equation is written based on the assumption of the network being trained using a dataset composed of N arbitrary patterns (X i , Y i ). Equation (1) can be written as the following matrix [58,74,75]: where, In which H is the hidden layer output of the network.
The following equation provides the output weights between the hidden layer and the output layer:β = H + T where H + represents the Moore-Penrose generalized inverse of the hidden layer output matrix H [74]. Based on what is presented in Huang et al. [58,59], in an ELM model w i weights and β i bias are randomly assigned (based on a probability density function). Then, the H matrix is calculated based on Equation (3), and finally, H + can be calculated.

Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a specific architecture of neural networks that is designed based on the weight sharing concept and employs convolution and pooling layers [76][77][78]. The family of CNN models include one-dimensional CNN (Conv1D), two-dimensional CNN (Conv2D), and three-dimensional CNN (Conv3D) models, and their primary difference is in the structure of the model inputs [79]. A standard CNN architecture consists of an input layer, an output layer, and a hidden layer that is composed of a convolution layer, a pooling layer, and a fully connected layer (Figure 4b). A unique feature of the CNN is that a given neuron is only connected to its nearby local neurons in the previous layer. While the neurons in the convolution layer are fully connected to the input layer neurons, they are not connected to all the neurons in the pooling layer.
Convolution and pooling layers, as the core building blocks in CNN, extract different features from the input layer and convert them to small dimensions by performing convolution operations on the input layer and merging neuron cluster outputs into a single neuron. The pooling mechanism significantly reduces the number of coefficients in the network and makes the training (learning) phase of the CNNs more efficient, easier, and faster than the regular ANN networks [79,80]. Following that, the fully connected layer flattens all feature maps in a feature vector and uses them as input variables to make predictions [81,82].
The application of CNN models for streamflow prediction has received more attention over the last few years, and they have been found to be relatively fast, accurate, and stable alternatives among the growing family of deep learning algorithms [78,[83][84][85].
Streamflow is a one-dimensional data; thus, for this study, a Conv1D model is adopted. The application of a rectified linear unit (ReLU) activation function for the convolutional layer is recommended to enhance the model's ability to capture non-linearity [86]. The mean squared error (MSE) was used as the loss function for the fully connected layer.

Long Short-Term Memory
Hochreiter and Schmidhuber [87] introduced the Long Short-Term Memory (LSTM) model as a form of recurrent neural network. Contextual state cells are used in LSTM models as either long-term memory cells or short-term memory cells, making them suitable substitutes for representing sequential data [88,89]. The LSTM model's architecture (shown in Figure 4c) is made up of unique units (memory blocks) in the recurrent hidden layer. Self-connected memory cells and multiplicative units implemented in the memory blocks are utilized to store the network's temporal state. The input, output, and forget gates, which are multiplicative units, are in charge of managing the information flow. The following equations are used in different LSTM cells: Forget Gate : Previous Cell State : Current Cell State : HiddenState In which, i t , f t , and O t represent the input gate, forget gate, and output gate, respectively. W i , W f , and W o stand for the weights connecting the input, forget, and output gates with the input, respectively; U i , U f , and U o are the weights from the input, forget, and output gates to the hidden layer, respectively; b i , b f , and b o indicate the input, forget, and output gate bias vectors, respectively. C t is the current state of the cell and C t is the state of the cell at the previous time. Moreover, h t refers to the output of the cell at the current time [90,91]. Additionally, the dropout mechanism was used to enhance the generalization of the model and avoid overfitting [92].

Attention-Based Long Short-Term Memory
Attention is a deep learning strategy and can be viewed as essentially implementing a neural network within another neural network to weigh various portions of a sequence for relative feature importance [103,104]. Multiplicative self-attention, a special type of attention, is used for the SA-LSTM model in this study (Figure 4d), which is the mechanism of relating different positions of a single sequence in order to compute a representation of the same sequence [39,105].
The self-attention mechanism is known as an effective technique for improving LSTMs and enhancing the model's performance by "paying attention" and assigning attention scores (weights) to each observation [106,107]. Attention-based LSTM models are among the most recent developments in machine learning that have found application for streamflow prediction [39,108,109].
The first 75% of the period of study, from April 1988 until September 2013, was used for the training phase. 25% of these 307 months was used as the evaluation phase to find the best values for the hyperparameters of the models. The period from October 2013 until May 2022 was used as the independent testing dataset. For this study, the pre-processing (e.g., Kalman filtering, data standardization) and post-processing (e.g., model evaluation metric calculations, visualizations) were done in R [110]. The models were developed and run in Python [111]. 25% of the training dataset was used for the validation process, where the best values for the hyperparameters (e.g., the number of hidden neurons, dropout rate, learning rate, and number of epochs) were determined based on grid search.

Model Evaluation Metrics
Common guidelines for model evaluation (e.g., [112,113]) were utilized here to compare the closeness of model predictions to observations over a broad range of statistical measures. The following model evaluation metrics were used in this study:

•
The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE): MAE and RMSE measure the errors associated with the low and high flowrates, respectively, and together they support model comparison with respect to accuracy. MAE and RMSE are calculated as: where N is the number of observations, and S i and O i are the simulated and observed flowrates, respectively.
• The Index of Agreement (d): Developed by Willmott [114], the index of agreement is a standardized measure between 0 and 1 and describes the degree of model prediction error. This index can identify additive and proportional differences between observed and simulated means and variances, but it should be noted that this index is extremely sensitive to extreme values [115]. The formula for the index of agreement is as follows: where, O is the average of observed flowrates.
• The Pearson's r (r): Pearson [116] developed the Pearson (Product-Moment) correlation (r), which was based on the work of others, including Galton [117], who first introduced the concept of correlation [118,119]. The r coefficient is considered the most common measure of association between variables and is widely used for describing linear relationships. Pearson's r is calculated as: where, S is the average of simulated flowrates. There are guidelines in the literature to interpret different ranges of r. According to Schober et al. [120], thresholds of 0.1, 0.39, 0.696, and 0.89 can be used to describe negligible, weak, moderate, strong, and very strong correlations, respectively. It should be noted that extraordinarily high outliers (extreme high floods in the case of this study) can have a huge effect on Pearson's r [119].

• The Coefficient of Determination (R 2 ):
The coefficient of determination describes how well observed outcomes are simulated by the model, based on the proportion of total variation of outcomes explained by the model [121].
where RSS is the sum of squares of residuals and TSS is the total sum of squares.
• The Nash-Sutcliffe Efficiency (NSE): The Nash-Sutcliffe efficiency (NSE) is a normalized statistic that measures the relative magnitude of residual variance ("noise") versus measured data variance [122]. The NSE is computed using the following equation: The NSE varies between -infinity and 1, with NSE = 1 corresponding to an ideal match between the model estimation and the observed data. While positive values of NSE are generally considered "acceptable levels of performance," a negative NSE score suggests that the mean of the observed data is a better predictor than the model [113]. The NSE is a reliable and widely used model evaluation metric in the field of hydrology [123,124]. According to Moriasi et al. [113] thresholds of 0.5, 0.65, and 0.75 can be used to rate the model performance as "very good", "good", "satisfactory", and "unsatisfactory", respectively.

• The Kling-Gupta Efficiency (KGE):
The Kling-Gupta Efficiency (KGE) was developed initially by Gupta et al. [125] and later revised by Kling et al. [112] to decompose the Nash-Sutcliffe Efficiency metric into correlation (Pearson's r), bias (the ratio between the mean of the simulated values and the mean of the observed ones), and variability components to facilitate its use and providing more insight to model performance. Similar to the NSE, the KGE metric has been increasingly used as a model evaluation metric in the hydrologic literature [126][127][128][129][130].
The KGE is computed as: where r is Pearson's r, and s is a vector of length three which contains the scaling factor for correlation, variability, and bias. Variability (vr) and bias (β) can be calculated as: where σ s and σ o are the standard deviation of simulated and observed flowrates, and µ s and µ o are the average values of simulated and observed flowrates, repextively. Knoben et al. [131] showed that a KGE > −0.41 indicates that the model is more informative than the mean of the observed data.

Predictive Performance over the Entire Range of Flowrates
A summary of the model evaluation metrics during the training and testing periods is provided in Tables 2 and 3, respectively. The error indices, MAE, and RMSE, were significantly lower for the deep learning algorithms against the ELM model. With almost 20% lower Pearson's r value, 10% lower index of agreement, and almost 30% lower R 2 , the baseline ELM model was outperformed by the more complex deep learning counterparts during the testing period. According to the NSE scores, the deep learning models achieved "very good" levels of performance (NSE > 0.75) against the unsatisfactory performance of the ELM model (NSE < 0.5). All models achieved "skillful" predictions (KGE > −0.41); however, the deep learners, particularly LSTM-based models offered better estimations with lower biases and better variability ratios and, thus, considerably better KGE scores in comparison to the ELM. A serious problem associated with utilizing artificial neural networks is overfitting or poor generalizability, which occurs when the model performs well with the training data and fails to maintain the same performance quality on the independent testing data [132][133][134][135]. A comparison of the metrics achieved by the four models during the train-ing and testing periods indicated that the ELM model exhibited the worst performance decline from training to testing (substantially higher than the deep learning models). The ELM model achieved the best scores during training (almost perfect with respect to all metrics) and the worst scores during testing compared to the other models. Finding ELM prone to overfitting is consistent with the previous reports of its application in literature and is mostly due to the large number of hidden nodes required to capture complex non-linear relationships [74,136,137]. A variety of contributing factors to the problem of overfitting are stated in the literature: from architecture-related reasons (e.g., high model complexity, an extensive number of hidden units) [138][139][140] to data-related reasons (e.g., noisy training samples, under-sampled training data) [141][142][143]. Bejani and Ghatee [144] categorize methods for controlling overfitting as passive, active, and semi-active. The pooling mechanism built into CNN [79] and the dropout mechanism [92] utilized with the LSTM and SA-LSTM models belong to the category of active (regularization) and semi-active (dynamic architecture), respectively. The results indicated that the applied overfitting control mechanisms and architectural advancement of the deep learning models granted them an enhanced ability to learn the information (input-output relationships) and distinguish the noise in data during the learning (training) phase. While the metrics of the three deep learning algorithms were close (within a 10% difference), SA-LSTM achieved slightly less errors and higher correlations, as reflected in higher KGE scores. To further explore the streamflow predictions by the four algorithms, their estimated flowrates are plotted against the observed values during the testing period in Figure 5.
During the testing period, the streamflow monitoring station at the headwaters of the Texas Colorado River recorded 35 dry months (no-flow), which is equivalent of 34% of the testing period. Additionally, three relatively large flooding events (greater than 5 cubic meters per second) were recorded that were greater than the highest previously recorded flood at the station of interest. Given the limitation of data-driven models in predicting beyond the range of training data, the first two large flooding events were expected to be more challenging for all models. According to the results, the baseline ELM model predicted a considerable number of physically unrealistic negative streamflow predictions, mostly when the stream was dry (Figure 5a). Further, ELM severely underpredicted the largest flood events. While the overall better performance of the deep learning algorithms, previously discussed using the evaluation metrics, can also be seen with the time-series plot, Figure 5b-d provide more insight into the differences between the CNN and LSTM counterparts. Compared to the ELM model, the extent of negative flowrate estimates is less severe with the deep learning models. Additionally, the more complex models provided more accurate predictions of the extremely high flows. Based on the results, the LSTM and SA-LSTM models were superior among the investigated algorithms as they captured the extreme hydrologic events more accurately. During the testing period, the streamflow monitoring station at the headwaters of the Texas Colorado River recorded 35 dry months (no-flow), which is equivalent of 34% of the testing period. Additionally, three relatively large flooding events (greater than 5 cubic meters per second) were recorded that were greater than the highest previously recorded flood at the station of interest. Given the limitation of data-driven models in pre-

Predictive Performance for the No-Flow Events
As discussed earlier, the distinguishing characteristic of IRES is the presence of noflow events (zero flowrate entries) in their flow time-series, and capturing these events is both important and challenging. Figure 6 illustrates the flowrate estimations of the four models for the cases when a no-flow event was recorded. While ideally, all the model predictions should be on the horizontal line of 0, according to the results, none of the investigated models in this study estimated an absolute zero flowrate. When predicting a true zero flowrate, the models either exhibited over-estimation and predicted a positive flowrate, or under-estimated the no-flow event and predicted physically unrealistic negative flowrates. The inability to capture the no-flow events leads to an uncertainty concern for the application of these streamflow prediction algorithms in IRES settings, and the extent of over-and under-estimation errors (as measured by the MAE) can be viewed as a measure of this uncertainty. The deep learning algorithms achieved closer values to zero when predicting no-flow events with lower errors (MAE < 0.1 m 3 /s) in comparison to the ELM model (MAE = 0.67 m 3 /s). Table 4 summarizes the percentage of negative flowrate estimations by each model during the testing period. Despite achieving a relatively low MAE, the CNN model had the highest percentage of negative predictions, followed by the baseline ELM model. However, the advanced architecture of the LSTM models, and particularly, utilizing the attention unit, considerably reduced the extent of negative flowrate estimations. There are a number of factors contributing to the limited predictive ability of these models in estimating no-flow events.
From a hydrological perspective, the flow in headwater IRES systems tends to be seasonal and is largely controlled by overland flows following rainfall or snow-melt event.
In arid and semi-arid regions, water tables tend to be deep, and the river systems are hydraulically disconnected from the water-bearing subsurface system [145]. Flow cessation (i.e., no-flow conditions) begins as water in the stream channel becomes disconnected and is present in discontinuous pockets. In the absence of precipitation, slow-moving water paths (e.g., groundwater discharges), or anthropogenic discharges (e.g., wastewater discharge from municipalities), the IRES dries up. Eventually, the intermittent headwater stream translates to ephemeral flow conditions, and the disconnected parcels of water and the exposed soils will continue to undergo evaporation until the complete flow cessation happens, resulting in no-flow conditions. As the mechanisms of flow and no-flow regimes are controlled by different hydrological processes, the assumption that they arise from a single underlying distribution is perhaps the main limitation of current data-driven models.
Further, as the streamflow prediction models try to match both zero and extremely high flows using a limited set of calibration parameters, they underestimate the high flows and overestimate the no-flow events. Therefore, the results from such models must be post-processed to induce intermittency. A cut-off threshold is often subtracted from the predicted flows to simulate intermittent flow conditions [146]. This approach, while pragmatic, is also subjective, and requires a careful assessment by experts and a detailed understanding of the surface and subsurface hydrological and geological conditions, which may not be known with certainty, even in well-characterized streams. Thus, even though the deep learning models, particularly the LSTM models, outperformed the baseline ELM model and estimated considerably fewer negative flowrate estimations, there is still room to improve the performance of IRES streamflow prediction models and develop algorithms capable of accurately capturing no-flow conditions.  Table 4 summarizes the percentage of negative flowrate estimations by each model during the testing period. Despite achieving a relatively low MAE, the CNN model had the highest percentage of negative predictions, followed by the baseline ELM model. However, the advanced architecture of the LSTM models, and particularly, utilizing the attention unit, considerably reduced the extent of negative flowrate estimations. There are a number of factors contributing to the limited predictive ability of these models in estimating no-flow events.

Predictive Performance for the Extreme High Flow Events
The errors associated with the flowrate estimations of the four models for the three extreme flooding events are summarized in Table 5. For the 2021 flood, which was relatively similar to the 1992 flood included in the training data, and for the 2015 flood, the highest recorded flowrate in the history of the Texas Colorado River headwaters, the LSTM and SA-LSTM models offered the most accurate estimates among the investigated algorithms. The baseline ELM model was outperformed by the deep learning algorithms, with almost 50% underestimation of the largest three extreme flood events. To further explore the predictive ability of the models of interest for extreme high flows, their performances were compared for the top 30% of high flow events (all the flow events that were greater than or equal to the 70th quantile of all positive flowrates). MAE, RMSE, and KGE metrics were computed (Figure 7) for the four algorithms over extreme high flow estimations.  Considering the MAE and RMSE metrics (Figure 7a,b), it is clear that the deep learning algorithms achieved similar estimation errors, which were lower (~0.8 cubic meters per second) than the baseline ELM model. Thus, the deep learning models were identified Considering the MAE and RMSE metrics (Figure 7a,b), it is clear that the deep learning algorithms achieved similar estimation errors, which were lower (~0.8 cubic meters per second) than the baseline ELM model. Thus, the deep learning models were identified as appropriate tools for extreme high flow estimation in the headwaters of the Colorado River in Texas and were advantageous compared to the shallow learning ELM counterpart.
A comparison of the KGE scores showed the advantage of the more complex LSTM models (K.G.E. > 0.65) in capturing the extreme high flows compared to the ELM and CNN algorithms (KGE < 0.55). As KGE is a composite metric that accounts for correlation, bias, and variability, it was concluded that the more advanced architecture and complex algorithm of the LSTM units were better alternatives for capturing the extreme high flows in the intermittent headwaters of the Colorado River in Texas.
There are two major limitations to the predictive ability of the data-driven models for estimating extreme high flow events, particularly in an IRES setting. First, as discussed earlier, the investigated data-driven models assume that the entire IRES flow data arise from a single distribution, and fitting a curve in the presence of numerous zero flow entries curtails the predictive performance of the models for the extreme high flowrates, resulting in underestimation of these events.
Second, the flow generation process in IRES is highly climate-driven, and changes in climatic patterns are likely to cause unprecedented flow events (e.g., record-breaking floods, prolonged dry spells) in such streams. This is exemplified by the case of the 2014 and 2015 floods in the headwaters of the Colorado River in Texas, where they broke the previous flood record by twice the magnitude. In such cases, to achieve an accurate streamflow estimation, the model must make a prediction outside its valid domain. The extrapolation problem or severe deviation of model performance when the inputs are dissimilar to the training data is a well-known weakness of the data-driven models, even the more complex deep learning algorithms [147][148][149]. According to the results, utilizing the more advanced LSTM deep learning models yielded more accurate estimates for the extrapolation cases, making them more reliable alternatives for modeling intermittent headwaters in the face of climate change. However, further research on methods to address the extrapolation problem, removing the burden of the no-flow events on extreme high flow prediction, and reducing the uncertainty associated with extreme flow analysis of IRES is needed.

Summary and Conclusions
Reliable streamflow prediction of intermittent rivers and ephemeral streams, such as the headwaters of the Colorado River in Texas, is an essential requirement for a variety of planning and management tasks associated with these streams, from drought analysis to flood warning systems, supply allocation, and riparian ecosystem conservation. In this study, the application of advanced deep learning algorithms, namely CNN, LSTM, and SA-LSTM models, were compared against a baseline ELM model for hydro-meteorological modeling and monthly streamflow prediction at the headwaters of the Texas Colorado River, located in a climate hotspot and exposed to changes in precipitation and temperature variability. The performance of these algorithms was evaluated using a suite of model evaluation metrics and compared over the entire range of flow, as well as for the no-flow events and extreme high flowrates. Here is a list of the major findings of this study for intermittent streamflow prediction at the headwaters of the Texas Colorado River: • While all the investigated models offered skillful streamflow predictions (as measured by the KGE score above −0.41), overall, the deep learning models clearly outperformed the baseline ELM model with respect to all evaluation metrics. The more advanced models better captured the flow dynamics in the IRES setting and were found to be appropriate tools for streamflow prediction at the site of study.

•
None of the investigated data-driven algorithms were able to capture absolute zero flowrates. However, deep learning models, more specifically LSTM and SA-LSTM, estimated closer values to zero and predicted considerably less unrealistic negative flowrates.
• Deep learners also offered more accurate predictions of the extreme high flows, with lower RMSE and MAE errors and higher correlations and KGE scores in comparison to the ELM. • With respect to the principle of parsimony, the LSTM model is the most appropriate model among the considered alternatives as it outperformed the ELM and CNN models with considerable higher performance metrics and achieved relatively similar results to the SA-LSTM model, despite not having the attention unit and being a slightly simpler methodology. • LSTM and SA-LSTM models outperformed their counterparts when challenged with the extrapolation problem for the unprecedented record-breaking flood events of 2014 and 2015. • Despite its simplicity and fast speed, the ELM model was found to provide unreliable streamflow estimations, and its application is not recommended for the studied stream, particularly because it exhibited the most severe underestimation of the extreme high flows. • The ELM model was found to be prone to overfitting and learning the noise in the training data, which yielded noticeably lower quality of performance during the independent testing period. • The CNN model, while achieving better evaluation metrics than the baseline ELM model, predicted a large number of negative flowrates and failed to provide accurate estimates of the extreme high flows. Hence, the application of the CNN algorithm is not recommended for the stream of study.

•
The SA-LSTM, as the cutting-edge alternative and the most complex tool among the investigated models, offered the best performance in capturing the extreme ends of the IRES streamflow spectrum: no-flow events and extreme floods.

•
The pooling mechanism in CNNs and the dropout mechanism for the LSTM-based models were found to be effective in considerably lowering the extent of performance loss from training to testing and controlling overfitting.
According to the results of this study, deep learning algorithms are powerful and effective tools for predicting streamflow in the headwaters of the Colorado River in Texas. The layered architecture and advanced algorithm of these models allow them to model various portions of the IRES flow series, including the extreme hydrologic events, with higher accuracy, enhanced reliability, and a considerably lower extent of overfitting. Deep learning streamflow prediction models offer valuable information about IRES flow dynamics to support various management and planning efforts associated with these growingly important surface water resources. However, modelers and other groups of users should be cautious with the estimations of these data-driven models due to their limitations, such as their inability to capture absolute zero flowrates or their failure to maintain high performance when applied to data dissimilar to the training set (e.g., an unprecedented flood event). Such limitations introduce uncertainties that should be considered when applying data-driven models and interpreting their results, regardless of how advanced their architecture may be. Further research is required to develop methodologies that can capture the complex streamflow generation and cessation processes in IRES, tackle the extrapolation problem with minimal performance loss, and provide reliable intermittent streamflow prediction. Additionally, increasing the quality and quantity of available hydrologic and meteorologic data in IRES sites, such as the headwaters of the Texas Colorado River, can significantly enhance the performance of data-driven models and lead to more effective water planning in these usually water-scarce regions. Funding: This research received no external funding.

Data Availability Statement:
All data that support the findings of this study are available from the corresponding author upon reasonable request.