Snowmelt-Driven Streamflow Prediction Using Machine Learning Techniques (LSTM, NARX, GPR, and SVR)

Although machine learning (ML) techniques are increasingly popular in water resource studies, they are not extensively utilized in modeling snowmelt. In this study, we developed a deep learning long short-term memory (LSTM) model for snowmelt-driven discharge modeling in a Himalayan basin. For comparison, we developed nonlinear autoregressive exogenous (NARX), Gaussian process regression (GPR), and support vector regression (SVR) models. The snow area derived from moderate resolution imaging spectroradiometer (MODIS) snow images, along with remotely sensed meteorological products, was utilized as input to the models. The Gamma test was conducted to determine the appropriate input combination for the models. The shallow LSTM model with a single hidden layer achieved better results than the deeper LSTM models with multiple hidden layers. Of the seven optimizers tested, Adamax proved to be the best-suited optimizer for this study. The ML models were evaluated using the coefficient of determination (R2), mean absolute error (MAE), modified Kling-Gupta efficiency (KGE'), Nash-Sutcliffe efficiency (NSE), and root-mean-squared error (RMSE). The LSTM model (KGE' = 0.99) enriched with the snow cover input achieved the best results, followed by NARX (KGE' = 0.974), GPR (KGE' = 0.95), and SVR (KGE' = 0.949). The outcome of this study proves the applicability of ML models, especially the LSTM model, in predicting snowmelt-driven discharge in data-scarce mountainous watersheds.


Introduction
Snowmelt is one of the major sources of fresh water in the world. Billions of people living in river basins originating from the snow-dominated Hindu Kush Himalayan (HKH) region directly or indirectly rely on these rivers for food, water, and electricity [1]. In this region, the accurate prediction of snowmelt runoff is crucial for effective water resource planning and management. However, the paucity of snow monitoring networks in the HKH region, due to its geographical conditions and extreme weather, means that inadequate ground-truth information is available for developing strategies for optimum water resource utilization. Remotely sensed snow cover and meteorological data are therefore invaluable assets for snowmelt runoff modeling in data-scarce mountainous watersheds.
Kratzert et al. [15] did not investigate the influence of hyperparameters (e.g., the number of LSTM layers and the window size) on model performance but rather used arbitrary values. Le et al. [17] and Fan et al. [16] emphasized the window size as an important hyperparameter to be tuned for the best model performance; however, the effect of other hyperparameters, such as the number of LSTM layers and the choice of optimizer, was not evaluated.
This study aims to scrutinize the ability of the state-of-the-art deep learning LSTM network in modeling daily snowmelt runoff in a Himalayan basin. Furthermore, we evaluated the effect of various hyperparameters, including the number of LSTM layers and the choice of optimizer, to achieve the best model performance. We developed three other ML models, namely nonlinear autoregressive exogenous (NARX), support vector regression (SVR), and Gaussian process regression (GPR) models, and compared their performance. In this study, remotely sensed daily precipitation, temperature, and snow products, along with antecedent discharge data, were used as inputs for one-day-ahead river discharge forecasting. The Gamma test (GT) was carried out to determine a suitable input combination for the model. The ML approach to operational river discharge forecasting will be useful for estimating water availability for reservoir management, water supply, irrigation, and hydroelectricity projects in data-scarce mountain basins.

Study Area
Although the Himalayan region is dominated by snow and glaciers, snowmelt runoff modeling using ML models has not been conducted in this region. Therefore, the Langtang basin (Figure 1), a snow- and glacier-dominated basin situated in the Central Himalayas, Nepal, was chosen for this study. The altitude of the study site ranges from 3647 to 7213 m above mean sea level. The catchment area is 354 km², comprising 110 km² of glacier area. The glacier extent was derived from RGI-GLIMS version 6 [21]. Snow and glacier melt are significant contributors to the overall runoff in this watershed [22].
Water 2020, 12, x FOR PEER REVIEW 3 of 17

Hydrometeorological Data
The hydrological data gauged at the Kyangjing station (latitude 28.216°, longitude 85.55°) were received from the Department of Hydrology and Meteorology (DHM), Nepal. The daily temperature data were derived from the 0.25° × 0.25° gridded APHRODITE product, APHRO_TAVE_MA_V1808, for the period 2002-2012 [23]. Previous studies have shown a good correlation (Spearman's rho > 0.9) of APHRODITE products with ground observations in the Langtang basin [20]. Precipitation data, acquired from the Tropical Rainfall Measuring Mission (TRMM) with a spatial resolution of 0.25°, were used in this study [24]. The 3B42RT TRMM dataset is produced in four phases. Precipitation-related data are obtained from various sensors, and the precipitation rate is computed. Infrared data collected from the satellites provide exceptional time-space coverage. These data are combined to provide the best precipitation estimate, and finally, the rain gauge data are incorporated to produce the 3B42RT TRMM dataset. TRMM products have been widely used in the Himalayan region [25]. The product is available at https://pmm.nasa.gov [26].

Snow Cover
The moderate resolution imaging spectroradiometer (MODIS) product MOD10A2 version 6 [27] was used for snow cover mapping. Several studies have verified the accuracy of MODIS datasets against ground observations. Stigter et al. [28] cross-checked the accuracy of MODIS snow data with in-situ snow observations in the Langtang basin and reported an accuracy of 83.1%. The snow mapping algorithm uses MODIS band 4 and band 6 to determine the normalized-difference snow index [29]. In the MOD10A2 product, a pixel is labeled as snow if snow cover is observed at least once in eight days, no snow if snow cover is absent on all eight days, and cloud if cloud cover is observed on all eight days. For this study, 496 images for the period 2002-2012 were downloaded from the National Snow and Ice Data Center (NSIDC) website (https://nsidc.org/data/mod10a2) [27]. The 30 m resolution Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM) was obtained from https://lpdaac.usgs.gov [30]. The boundary of the study area was delineated from the DEM. All MOD10A2 images were projected to World Geodetic System 1984 (WGS84), Universal Transverse Mercator (UTM) zone 45. The 8-day maximum SCA was extracted from the projected snow images using the delineated boundary. Images with more than 10% cloud cover were discarded. Finally, the daily SCA for the 365 days in a year was interpolated or extrapolated from the 8-day maximum SCA using the cubic spline method.
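The interpolation of daily SCA from the 8-day composites can be sketched as follows. This is a toy example using SciPy's CubicSpline with a synthetic seasonal series; the composite dates and snow-cover values are illustrative, not the actual MOD10A2 record.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical 8-day maximum snow-cover fractions (one value per composite).
# MOD10A2 composites start on day-of-year 1, 9, 17, ..., giving 46 per year.
composite_days = np.arange(1, 362, 8)
sca_8day = 50 + 30 * np.cos(2 * np.pi * (composite_days - 45) / 365)  # toy seasonal cycle (%)

# Fit a cubic spline through the 8-day values and evaluate it for every day.
spline = CubicSpline(composite_days, sca_8day, extrapolate=True)
daily_days = np.arange(1, 366)
sca_daily = spline(daily_days)

print(sca_daily.shape)  # (365,)
```

Days beyond the last composite (362-365 here) are extrapolated by the spline, matching the "interpolated or extrapolated" step described above.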

GT
For data-driven (DD) models, input selection is an important task in the model development process [31], but it is often neglected. In most studies employing ML models, inputs were determined on an ad-hoc basis by the trial and error method. In this study, we applied the GT [32], a non-parametric method for selecting input variables for a nonlinear and complex model. The GT approximates the variance of the noise related to the output. Input variables are selected based on the Gamma value and the V-ratio; the input combination with the least Gamma value and V-ratio is considered best for the model. For the selection of a suitable input combination, 15 different combinations of SCA, temperature, precipitation, and antecedent discharge were tested, and the best input combination was determined using the winGamma™ application (Cardiff University, University of Wales, Wales, UK) [33].
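The GT itself can be sketched in a few lines: for each neighbour order k, the mean squared input distance delta_k is regressed against half the mean squared output difference gamma_k, and the intercept at delta = 0 estimates the output noise variance. This is a brute-force numpy version of the usual near-neighbour formulation; winGamma's exact implementation details may differ.

```python
import numpy as np

def gamma_test(X, y, p=10):
    """Estimate the noise variance (Gamma) and V-ratio for inputs X -> y."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    # Pairwise squared Euclidean distances in input space (brute force).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)          # neighbours sorted by distance
    deltas, gammas = [], []
    for k in range(p):
        nk = order[:, k]                    # index of each point's (k+1)-th nearest neighbour
        deltas.append(d2[np.arange(n), nk].mean())
        gammas.append(0.5 * ((y[nk] - y) ** 2).mean())
    # Least-squares line gamma = A * delta + Gamma; the intercept is the estimate.
    A, Gamma = np.polyfit(deltas, gammas, 1)
    return Gamma, Gamma / y.var()

# Toy check: a smooth function plus noise of variance 0.01.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (400, 2))
y = np.sin(2 * np.pi * X[:, 0]) + np.cos(2 * np.pi * X[:, 1]) + rng.normal(0.0, 0.1, 400)
Gamma, vratio = gamma_test(X, y)
```

On this toy problem the intercept recovers roughly the injected noise variance, and the V-ratio (Gamma divided by output variance) is small, indicating a learnable input-output relationship.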

LSTM
The LSTM, proposed in [14], is a specialized RNN able to learn and preserve long-term dependencies [15]. The unique feature of the LSTM is the presence of a memory cell. In an LSTM, adding or deleting information in the cell is controlled by gates. The LSTM has three types of gates, i.e., forget, input, and output gates, to control the flow of information in the cell. The forget gate governs the amount of information to remove from the previous cell state. The information to be introduced into the cell state is controlled by the input gate. Then, a vector of candidate values is generated, which may be appended to the cell state. Next, the old cell state is updated to a new cell state according to the preceding two steps. The output gate selects information from the cell state to be used as output. The final LSTM layer at the final time step (n) passes the output (h_n) to a dense layer with a single output neuron, where the final streamflow (y) is calculated. The equations related to the LSTM cell are presented below.
Forget gate: f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (1)
Input gate: i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (2)
Potential update vector: c̃_t = tanh(W_c̃ x_t + U_c̃ h_{t-1} + b_c̃)   (3)
Cell state: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t   (4)
Output gate: o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (5)
Hidden state: h_t = tanh(c_t) ⊙ o_t   (6)
Output layer: y = W_d h_n + b_d   (7)
Sigmoid function: σ(x) = 1/(1 + e^{-x})   (8)
Tanh function: tanh(x) = (e^x - e^{-x})/(e^x + e^{-x})   (9)
where f, i, o are vectors representing the forget, input, and output gates, respectively, c_t and c̃_t are vectors for the cell states and candidate values, σ is the sigmoid function, ⊙ is element-wise multiplication, x_t is the input vector at time t, W and U are the weight matrices for the input and hidden state, and b is the bias term. A classic LSTM cell is shown in Figure 2.
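The cell equations above can be checked with a minimal numpy implementation of a single LSTM time step. This is a sketch with randomly initialized toy weights, not the Keras implementation used in the study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by 'f', 'i', 'c', 'o'
    holding input weights, recurrent weights, and biases of the
    forget/input/candidate/output parts."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate values
    c = f * c_prev + i * c_tilde                                # new cell state
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate
    h = np.tanh(c) * o                                          # hidden state
    return h, c

# Toy dimensions: 4 input features, 3 hidden units.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in 'fico'}
U = {k: rng.normal(size=(3, 3)) for k in 'fico'}
b = {k: np.zeros(3) for k in 'fico'}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, U, b)
```

Because the hidden state is a tanh of the cell state scaled by a sigmoid gate, every component of h is bounded in (-1, 1).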
In this study, the LSTM model was developed using the open-source Keras library [34] with a TensorFlow backend in Python. The presence of a large range of values in the dataset affects the learning skill as well as the convergence of the LSTM network during training. Therefore, for efficient learning and faster convergence, the input and target data are rescaled to the range (0, 1) by subtracting the minimum value and then dividing by the range, without altering the shape of the original distribution. Then, the time-series data are transformed into a supervised learning format.
The input data should be in a three-dimensional format (i.e., samples, time steps, and features) for the LSTM model. For the final prediction, the output is retransformed to the original scale. The accuracy of the model is affected by the choice of hyperparameters [15,16,35,36]. Hyperparameters such as the optimization algorithm, number of LSTM layers, hidden units, loss function, learning rate, batch size, and dropout were carefully selected based on performance or previous studies, as further discussed in Section 3.2.
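The rescaling and reshaping steps described above can be sketched as follows. The toy series, column layout, and window size are assumptions for illustration, not the study's actual data.

```python
import numpy as np

def to_supervised(data, target, window):
    """Min-max scale to (0, 1) and slice into (samples, time steps, features)."""
    lo, span = data.min(axis=0), data.max(axis=0) - data.min(axis=0)
    scaled = (data - lo) / span
    # Each sample holds `window` consecutive time steps of all features.
    X = np.stack([scaled[t:t + window] for t in range(len(data) - window)])
    y = target[window:]          # one-day-ahead target
    return X, y

series = np.random.default_rng(1).random((100, 4))   # e.g. SCA, T, P, Q columns
X, y = to_supervised(series, series[:, 3], window=2)
print(X.shape, y.shape)  # (98, 2, 4) (98,)
```

The resulting three-dimensional array is exactly the (samples, time steps, features) layout that a Keras LSTM layer expects; predictions would then be rescaled back with the stored minima and ranges.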

NARX
NARX, a recurrent dynamic network, is a powerful modeling tool with faster convergence and better generalization than simple ANNs [37]. Several researchers have successfully employed the NARX model in hydrological modeling [38-40]; however, the potential of this model in snowmelt runoff modeling is as yet unexplored. The model is called a recurrent dynamic network because feedback connections link some of its layers. Parallel (P) and series-parallel (SP) structures are the two types of NARX model architecture (Figure 3). The SP model is preferred because it is a fully feedforward architecture and, therefore, the backpropagation algorithm can be applied for training. Since the actual measured output is fed back, the SP model can produce accurate forecasts. The principal equation for the NARX model is

y(t) = f(y(t-1), y(t-2), ..., y(t-n_y), u(t-1), u(t-2), ..., u(t-n_u))

where the predicted variable y at timestep t is computed from its previous values and the previous values of the exogenous inputs u.
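As a concrete illustration of the series-parallel form, the lagged regressor matrix can be assembled as below. This is a minimal numpy sketch with illustrative lag values, not the paper's MATLAB implementation; any regressor could then be fitted on the resulting rows.

```python
import numpy as np

def narx_design(y, u, n_y=1, n_u=2):
    """Build the series-parallel NARX regressor matrix.

    Each row holds the measured outputs y(t-1..t-n_y) and exogenous inputs
    u(t-1..t-n_u); the target is y(t). Feeding measured (not estimated)
    outputs back is what makes the SP form fully feedforward."""
    start = max(n_y, n_u)
    rows = []
    for t in range(start, len(y)):
        past_y = y[t - n_y:t][::-1]             # y(t-1), ..., y(t-n_y)
        past_u = u[t - n_u:t][::-1].ravel()     # u(t-1), ..., u(t-n_u)
        rows.append(np.concatenate([past_y, past_u]))
    return np.array(rows), y[start:]

q = np.arange(10.0)                  # toy discharge series
ex = np.ones((10, 3))                # toy exogenous inputs (e.g. SCA, T, P)
X, target = narx_design(q, ex, n_y=1, n_u=2)
print(X.shape, target.shape)  # (8, 7) (8,)
```

With one output lag and two lags of three exogenous inputs, each row has 1 + 2 × 3 = 7 regressors, mirroring the feedback-delay and input-delay settings discussed later.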

Training algorithm, number of hidden nodes, input delays, and feedback delays are the important parameters affecting the performance of the NARX model. The river discharge is predicted using the antecedent discharge and the exogenous inputs, including SCA, air temperature, and precipitation. The feedback delay was chosen by the autocorrelation function of the discharge data series. The number of hidden nodes and feedback delays was selected based on a trial and error method. Different training algorithms, such as Bayesian regularization (BR), Levenberg-Marquardt (LM), and scaled conjugate gradient (SCG) were applied for training the NARX model and their performance was compared. We used trainlm, trainbr, and trainscg functions in MATLAB version 2018b to train the model by LM, BR, and SCG method, respectively.


SVR
The support vector machine (SVM), introduced by Vapnik [41], is a standard ML tool for classification and regression. SVR is a nonparametric regression technique based on the SVM. SVR applies the principle of structural risk minimization to recognize the pattern between predictor and predicted values, whereas ANNs use the empirical risk minimization principle. In the SVM model, a learning function is created for the input-output pairs ((x_1, y_1), ..., (x_n, y_n)). The approximating function f(x) is given by:

f(x) = Σ_{i=1}^{n} (α_i - α_i*) <x_i, x> + b

where <x_i, x> denotes the dot product of two vectors, x is the problem variable vector, b is the bias, and α_i and α_i* are dual variables. Various studies have confirmed that the radial basis function (RBF) performs better than other kernel functions [42]. Therefore, in this study, the Gaussian RBF kernel function is used in the SVM model, which is defined as follows:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))

For the SVR model with the Gaussian RBF kernel function, the penalty parameter (C) and epsilon (ε) are the important parameters, determined by the trial and error method in this study.
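The dual-form prediction function and RBF kernel above can be sketched as follows. This is illustrative numpy code with made-up dual coefficients standing in for a trained model; it shows the form of f(x), not a fitted SVR.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svr_predict(X_train, dual_coef, bias, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ dual_coef + bias

# Hypothetical support vectors and dual coefficients (not fitted here).
X_train = np.array([[0.0], [1.0], [2.0]])
dual_coef = np.array([0.5, -0.2, 0.7])    # alpha_i - alpha_i^*
pred = svr_predict(X_train, dual_coef, 0.1, np.array([[0.5]]))
```

In a real fit, the dual coefficients and bias would come from solving the ε-insensitive optimization problem with penalty parameter C.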

GPR
The GPR model is a kernel-based probabilistic model. Gaussian processes can also be viewed as the Bayesian version of the SVM methods. They are flexible methods that are simple to implement and appropriate for classification and prediction problems [43]. Unlike ANNs, GPR models normally do not suffer from overfitting problems. Despite these advantages, there are very few applications of GPR in water resource and hydrological modeling. Using the notations from [44], the model can be summarized as follows:

Observation model: y | f, φ ~ ∏_{i=1}^{n} p(y_i | f_i, φ)
GP prior: f(x) | θ ~ GP(m(x), k(x, x′ | θ))
Hyperprior: θ, φ ~ p(θ) p(φ)

where m(x) and k(x, x′|θ) indicate the mean and covariance functions, and θ and φ are parameters of the covariance function and the observation model, respectively. For a zero-mean prior, the joint distribution of the training outputs (f) and test outputs (f̃) is given by:

[f; f̃] ~ N(0, [K_{f,f}, K_{f,f̃}; K_{f̃,f}, K_{f̃,f̃}])

with the marginal (predictive) distribution of f̃ as

f̃ | f ~ N(K_{f̃,f} K_{f,f}^{-1} f, K_{f̃,f̃} - K_{f̃,f} K_{f,f}^{-1} K_{f,f̃})

where K_{f,f} = k(x, x|θ) and K_{f̃,f̃} = k(x̃, x̃|θ). The performance of several kernel functions, including quadratic, squared exponential, exponential, Matérn 5/2, and Matérn 3/2, was compared, and the suitable kernel function was chosen for this study.
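Under a zero-mean prior with Gaussian noise, the predictive equations above reduce to a few lines of numpy. This is a generic sketch with a squared exponential kernel and toy hyperparameters, not the toolbox used in the study.

```python
import numpy as np

def sq_exp(a, b, ell=1.0, sf=1.0):
    """Squared exponential covariance k(x, x') = sf^2 exp(-||x-x'||^2 / (2 ell^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return sf ** 2 * np.exp(-d2 / (2.0 * ell ** 2))

def gp_posterior(X, y, X_new, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP with Gaussian noise."""
    K = sq_exp(X, X) + noise * np.eye(len(X))    # K_{f,f} plus noise
    Ks = sq_exp(X_new, X)                        # K_{f~,f}
    Kss = sq_exp(X_new, X_new)                   # K_{f~,f~}
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Toy regression: recover a smooth function at its own training points.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel()
mean, var = gp_posterior(X, y, X)
```

At the training points the posterior mean closely tracks the observations, with the residual gap controlled by the assumed noise variance.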

Performance Indices
The evaluation of the model is done by several statistical error measures as described in Table 1.
NSE provides the relative magnitude of residual variance compared to the measured data variance [46].
R² indicates the strength of the relationship between measured and simulated values. Its value ranges from 0 to 1; a value closer to 0 indicates a weaker correlation, while a value close to 1 represents a strong correlation. In Table 1, Q_t (m³/s) and Q̂_t (m³/s) are the observed and simulated discharge at time t, Q̄ and Q̂̄ denote the average observed and average simulated discharge, r is Pearson's correlation coefficient, α is the measure of relative variability in the simulated and observed discharge, CV denotes the coefficient of variation, and β is the ratio of the mean simulated discharge to the mean observed discharge.
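The indices in Table 1 can be computed as below. This is a sketch of the standard formulas; the KGE' here uses the correlation, CV-ratio, and bias-ratio components described above.

```python
import numpy as np

def metrics(obs, sim):
    """NSE, R2, RMSE, MAE and modified Kling-Gupta efficiency (KGE')."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    resid = obs - sim
    rmse = np.sqrt((resid ** 2).mean())
    mae = np.abs(resid).mean()
    nse = 1 - (resid ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
    r = np.corrcoef(obs, sim)[0, 1]                               # Pearson's r
    beta = sim.mean() / obs.mean()                                # bias ratio
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())   # CV ratio
    kge = 1 - np.sqrt((r - 1) ** 2 + (gamma - 1) ** 2 + (beta - 1) ** 2)
    return {"NSE": nse, "R2": r ** 2, "RMSE": rmse, "MAE": mae, "KGE'": kge}

perfect = metrics([1, 2, 3, 4], [1, 2, 3, 4])
```

A perfect simulation gives NSE = KGE' = 1 and zero RMSE/MAE, which makes these formulas easy to sanity-check.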

Results
The SCA, air temperature (T), precipitation (P), and antecedent discharge (Q) were used as inputs to the models for the daily snowmelt runoff prediction. The hydrometeorological dataset was separated into a training set (9 years) and a testing set (2 years). SCA was derived from MOD10A2 snow images for the research period (2002-2012). The monthly variation of SCA is shown in Figure 4. The SCA is at a minimum during the monsoon (June-September) and starts increasing during the post-monsoon (October-November). A significant amount of snow is present during winter (December-February); the SCA reaches its peak during February-March and then starts declining during the pre-monsoon (March-May).

Selection of Input Combination
In this study, GT was conducted to select the best input combination for the model. The river discharge data is found to be significantly autocorrelated. Considering the lag of two days, 15 input combinations were tested by GT. Input combination M2 presented the minimum Gamma value and V-ratio as shown in Table 2, therefore, Model M2 was chosen for further study.
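The autocorrelation check that motivates using lagged discharge can be sketched as follows; a toy AR(1) series stands in for the daily discharge record.

```python
import numpy as np

def autocorr(x, max_lag=7):
    """Sample autocorrelation of a series for lags 1..max_lag."""
    x = np.asarray(x, float)
    x = x - x.mean()
    var = (x ** 2).sum()
    return np.array([(x[:-k] * x[k:]).sum() / var for k in range(1, max_lag + 1)])

# Toy persistent series: q(t) = 0.9 q(t-1) + noise, mimicking daily discharge.
rng = np.random.default_rng(2)
q = np.zeros(1000)
for t in range(1, 1000):
    q[t] = 0.9 * q[t - 1] + rng.normal()
acf = autocorr(q)
```

A strongly positive lag-1 and lag-2 autocorrelation, as in this toy series, is the kind of evidence that justifies feeding the two previous days' discharge back into the models.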


Effect of Hyperparameters on Model Performance
A batch size of 32, a default value recommended by several researchers [47], and mean square error (MSE) as the loss function were used in this study. The number of LSTM layers and hidden units was carefully chosen through several experiments. We compared the performance of the model with different numbers of LSTM layers (1, 2, and 3). The model with one LSTM layer achieved the best result (MAE = 0.126, RMSE = 0.231), as shown in Figure 5a. Similarly, based on performance, the optimum number of hidden units was found to be 20 (Figure 5b). The performance of seven optimizers, using the default hyperparameter values given in [34], was compared. Among them, the Adamax optimizer (MAE = 0.1462, RMSE = 0.3138) showed the minimum error, as shown in Figure 5c. Hence, the Adamax optimizer [48], a variant of Adam, was adopted. Dropout is a simple method to prevent overfitting. We tested different dropout values (0, 0.1, 0.2, 0.3) and selected an appropriate value (0.1). We varied the window size from 1 to 7 days, and an appropriate time step of 2 days was chosen by trial and error. The number of epochs was varied from 1 to 50. No significant improvement in performance was noticed when the number of epochs was raised above 40; therefore, 40 epochs were used in this study. The final configuration was a fully connected LSTM layer with 20 cell/hidden units followed by a dense layer, with the following hyperparameters: loss function (MSE), optimizer (Adamax), learning rate (0.002), beta1 (0.9), beta2 (0.999), dropout (0.1), time step (2 days), batch size (32), and number of epochs (40).
For the NARX model, we compared the performance of three training algorithms, namely BR, LM, and SCG. Although the LM and SCG algorithms were time-saving, their performance was inferior to that of the BR algorithm (Table 3). The suitable number of hidden nodes for the NARX model was found to be 3 by trial and error. The feedback delay was selected based on the autocorrelation of the target series, and the input delay was chosen based on performance. For this work, the input delay and feedback delay are 2 days and 1 day, respectively. For the GPR model, we compared the performance of five kernel functions on the training dataset and found the squared exponential kernel function to perform better than the others (Table 4). For the SVR model, the penalty parameter (C = 15.5) and epsilon value (ε = 0.25) were chosen by the trial and error method. After fine-tuning the ML models, the predicted discharge compares favorably with the measured discharge.

Comparison of ML models
The quantitative evaluation of the models is based on five metrics: KGE', NSE, R², RMSE, and MAE. The LSTM model (KGE' = 0.99) showed the best performance, followed by NARX (KGE' = 0.974), GPR (KGE' = 0.95), and SVR (KGE' = 0.949). As shown in Table 5, the LSTM model outperformed all other models in terms of all metrics used. The performance of NARX was inferior to LSTM but better than the other models. The performance of GPR and SVR is comparable. The qualitative assessment of the models was done by visual comparison. From Figure 6, it is observed that the performance of all four models is comparable for normal flow. However, for peak flows, only the LSTM model produced accurate predictions. The box plot depicting the median and percentiles (5th, 25th, 75th, and 95th) of the residuals of simulated river discharge is shown in Figure 7. For a good model, the mean and median of the residuals should lie near zero. The mean and median of the residuals of the LSTM model are closer to zero than those of the other models. In a scatterplot, if the points lie near the diagonal (1:1) line, the model accurately estimates the river discharge; if the points lie above the diagonal line, the model overestimates the actual flow, and if they lie below it, the model underestimates the actual flow. From the scatterplot (Figure 8), it is evident that all models predict low and medium flows well, but high flows are underestimated by all models except LSTM. The ability of the LSTM model to predict peak flows accurately makes it suitable for use in engineering projects such as reservoir management, hydropower, irrigation, and water supply, as well as early flood warning systems.


Sensitivity Analysis
As the discharge data is highly autocorrelated, antecedent discharge values were used for better prediction results. The sensitivity analysis of the models for three input conditions, i.e., all inputs (M1), all inputs without precipitation (M2), and all inputs without SCA (M3), was carried out. The results of the sensitivity analysis are presented in Table 6.

Discussion
In this study, the model efficiency (in terms of NSE) of the LSTM, NARX, GPR, and SVR models was found to be 99.5%, 99.1%, 97.3%, and 97.1%, respectively. In a previous study by Pradhananga et al. [49], a positive degree-day TI model employed in the Langtang basin for river discharge prediction achieved an NSE of up to 80%, much lower than that of all ML models used in this study. Moreover, the ANN model employed by Uysal et al. [11] for 1-day-ahead snowmelt runoff prediction in the Upper Euphrates Basin of Turkey achieved an NSE of up to 93%, also lower than that of the ML models used in this study. In the study by Le et al. [17], the LSTM model for rainfall-runoff modeling achieved an NSE of up to 99.2%, comparable to the result of the LSTM model used in this study.
Kratzert et al. [15] employed a two-layered LSTM, whereas Le et al. [17] and Fan et al. [16] used a single-layered LSTM; however, these studies did not evaluate the effect of the number of hidden layers on the predictive power of the LSTM model for runoff modeling. In this study, we compared the performance of LSTM models with one, two, and three hidden layers and observed that the model with a single hidden layer performed better than the models with multiple hidden layers. Similarly, Kratzert et al. [18] reported that a single-layered stacked LSTM model performs better than a two-layered stacked LSTM. From this result, we conclude that a single-hidden-layer LSTM model is adequate, and deeper LSTM models are not essential for streamflow prediction. Le et al. [17] used the SGD optimizer, whereas Fan et al. [16] used the Adam optimizer; neither study evaluated the influence of the optimizer on model performance. In a study by KC et al. [35], the performance of three optimizers was compared, and the Nadam optimizer was found to be more accurate than SGD and Adam for plant disease detection. In this study, we compared the performance of seven optimizers (Adam, Nadam, Adamax, Adadelta, Adagrad, SGD, and RMSprop) and found that the Adamax optimizer is superior to the others for runoff modeling.
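The single-hidden-layer configuration with the Adamax optimizer can be sketched in Keras as follows; the window length, layer width, and learning rate are illustrative placeholders, not the tuned values used in the study:

```python
import numpy as np
from tensorflow import keras

timesteps, n_features = 30, 4  # illustrative: 30-step window, 4 inputs

# Shallow network: one LSTM hidden layer feeding a single output neuron
# that predicts discharge at the next time step.
model = keras.Sequential([
    keras.Input(shape=(timesteps, n_features)),
    keras.layers.LSTM(32),   # the single hidden LSTM layer
    keras.layers.Dense(1),   # next-step discharge
])
model.compile(optimizer=keras.optimizers.Adamax(learning_rate=1e-3),
              loss="mse")

# Dummy tensors just to demonstrate the expected shapes.
X = np.random.rand(8, timesteps, n_features).astype("float32")
y = np.random.rand(8, 1).astype("float32")
model.fit(X, y, epochs=1, verbose=0)
```

Swapping the optimizer (e.g., `keras.optimizers.Adam`, `SGD`, or `RMSprop`) or stacking additional `LSTM` layers (with `return_sequences=True` on all but the last) reproduces the comparison experiments described above.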
Kratzert et al. [15] and Kratzert et al. [18] argued that the LSTM model can mimic the snow accumulation and melting process by learning the linkage between winter precipitation and spring runoff. In this study, we compared the performance of the LSTM model with different input combinations (see Table 6) and found that SCA as an input gives better snowmelt runoff predictions than precipitation. The results achieved by the previous studies [15,18] were good but, in the case of snow-influenced catchments, could have been better had they incorporated snow-related data in the input.
Several studies [39,40] used the LM algorithm for training the NARX model. We compared the performance of the LM, BR, and SCG algorithms, and the results showed that the BR algorithm outperforms the LM and SCG algorithms. The superiority of the BR algorithm over the LM algorithm was also shown by Guzman et al. [38]. Alsumaiei [39] employed the NARX model for groundwater level forecasting and reported a model efficiency of up to 99.3%, comparable to the performance of the NARX model in this study.
In most studies employing ML models [9,11,15,16,18], inputs were determined on an ad hoc basis or by trial and error. Input selection is an important task in the model development process for DD models; however, it is often neglected. Applying the GT during the initial phase of model development to determine the appropriate input combination reduced the workload that would otherwise have required several experiments. LSTM models used in rainfall-runoff modeling have relied on precipitation data as a key input [15][16][17][18]. Since precipitation is the prime source of river discharge in those catchments, it is natural to utilize rainfall as an input in those cases. However, in the Himalayan basins, due to low temperatures, precipitation is stored in the form of snow and therefore does not instantly contribute to total runoff. In a study by Thapa et al. [20], precipitation and river discharge were found to have no significant correlation, whereas SCA and river discharge are significantly correlated in the Langtang basin. The results of the sensitivity analysis make clear that the models are more sensitive to snow cover than to precipitation data in the Himalayan basins, which demonstrates the ability of ML techniques to learn the complex physical linkage between input and output.
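The Gamma test itself is straightforward to implement: for increasing neighbour counts k, it pairs the mean squared input-space distance with half the mean squared output difference, and the intercept of a line fitted through those pairs estimates the output noise variance; a smaller intercept for a candidate input set suggests those inputs explain more of the output. A small sketch under our own assumptions (SciPy k-d tree, synthetic data, hypothetical function names):

```python
import numpy as np
from scipy.spatial import cKDTree

def gamma_test(X, y, p=10):
    """Gamma test noise estimate: intercept of the regression of
    gamma(k) on delta(k) over k = 1..p near neighbours."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    # k = p + 1 because the nearest neighbour of each point is itself
    dist, idx = cKDTree(X).query(X, k=p + 1)
    deltas = [np.mean(dist[:, k] ** 2) for k in range(1, p + 1)]
    gammas = [np.mean((y[idx[:, k]] - y) ** 2) / 2.0 for k in range(1, p + 1)]
    slope, intercept = np.polyfit(deltas, gammas, 1)
    return intercept

# Synthetic demo: a smooth function of two inputs plus known noise.
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 2))
y_clean = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
noise = 0.1 * rng.normal(size=500)
print(gamma_test(X, y_clean + noise))  # intercept estimates Var(noise) ~ 0.01
```

Ranking candidate input combinations by this intercept is what allows the appropriate inputs to be chosen before any model is trained.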
Due to the sparsity of ground stations, it is hard to obtain ground truth observations in Himalayan basins. Even when available, the ground truth data are not representative of the whole basin because of the large elevation range, as most stations are located in the lower elevation zones of the Himalayas. In such cases, remotely sensed SCA and meteorological products are invaluable assets for water resource modeling. The results of this study prove the applicability of ML models in operational streamflow forecasting using remotely sensed products in data-scant mountainous basins.

Conclusions
In this study, four ML models, namely the LSTM, NARX, SVR, and GPR models, were employed for snowmelt-driven streamflow prediction. The Langtang basin, one of the snow- and glacier-dominated basins in the Himalayas, was chosen as the study area. The performance of the models was assessed by KGE', NSE, R2, RMSE, and MAE. The SCA extracted from MODIS snow products and remotely sensed meteorological data were utilized as inputs to the models, and a suitable input combination was selected based on the GT.
The LSTM model (KGE' = 0.99, RMSE = 0.173) outperformed the NARX (KGE' = 0.974, RMSE = 0.486), GPR (KGE' = 0.95, RMSE = 0.812), and SVR (KGE' = 0.949, RMSE = 0.851) models. All four ML models achieved good results in discharge prediction (KGE' > 0.94); however, the NARX, GPR, and SVR models slightly underestimated the high flows. While scrutinizing the potential of the LSTM architecture for snowmelt-runoff prediction, we found that a shallow LSTM model with a single hidden layer performs better than deeper models with multiple hidden layers. Of the seven optimizers tested, Adamax was found to be the most suitable for this study. The results of the GT and the sensitivity analysis revealed that the ML models were more sensitive to SCA than to precipitation data in the Langtang basin. Therefore, ML models enriched with snow data are appropriate for river discharge prediction in snow-dominated basins.
This study demonstrates the successful application of the LSTM, NARX, GPR, and SVR models in predicting snowmelt-driven streamflow. The approach can be readily replicated in other snow-dominated mountainous basins with diverse characteristics where sufficient past river discharge data are available. This work will be useful for estimating water availability for reservoir management, water supply, irrigation, and hydroelectricity projects in data-scant mountainous basins.