Prediction of Combined Terrestrial Evapotranspiration Index (CTEI) over Large River Basin Based on Machine Learning Approaches

Abstract: Drought is a fundamental physical feature of climate patterns worldwide. Over the past few decades, this natural hazard has accelerated in occurrence, significantly impacting agricultural systems, economies, environments, water resources, and supplies. It is therefore essential to develop new techniques that enable comprehensive determination and observation of droughts over large areas with satisfactory spatial and temporal resolution. This study modeled a new drought index, the Combined Terrestrial Evapotranspiration Index (CTEI), developed for the Ganga river basin. Five Machine Learning (ML) techniques derived from artificial intelligence theories were applied: the Support Vector Machine (SVM) algorithm, decision trees, Matern 5/2 Gaussian process regression, boosted trees, and bagged trees. These techniques were driven by twelve different models generated from input combinations of satellite data and hydrometeorological parameters. The results indicated that the eighth model performed best and was superior among all the models, with the SVM algorithm yielding an R² value of 0.82 and the lowest errors in terms of the Root Mean Squared Error (RMSE, 0.33) and Mean Absolute Error (MAE, 0.20), followed by the Matern 5/2 Gaussian model with an R² value of 0.75 and RMSE and MAE of 0.39 and 0.21 mm/day, respectively. Among all five methods, the SVM and Matern 5/2 Gaussian methods were the best-performing ML algorithms in our study of CTEI predictions for the Ganga basin.

Earlier work applied bagging, GB, RF, and DT methods, comparing them with the multilayer perceptron (MLP) and support vector regression (SVR), and reported that tree-based models achieved the best accuracy. Recently, a novel, simple tree-based ensemble method named XGBoost has been developed; it is an improved version of gradient boosting with higher computational efficiency and a better capability to deal with overfitting problems [30]. Based on the techniques mentioned above, 12 different data-driven models were investigated to forecast the CTEI in this study. The models differ in terms of their input parameters. Five variants of each model were built by changing the applied machine learning algorithm: RF, SVM, boosted trees, bagging, and Matern 5/2 Gaussian process regression (GPR). These algorithms were applied because of their high performance in previous studies; they are well suited to learning complex and highly nonlinear relationships. Therefore, the objectives of this study are to (1) fit five machine learning algorithms for modeling CTEI prediction based on advanced models derived from artificial intelligence theories; (2) compare the accuracy and stability of these models; and (3) determine the best outcomes provided by the five models based on prediction accuracy with the best combination of input variables.

Study Area
The Ganga river basin (GRB) is one of the most populous (about 440 million people) river systems in the world [10]. The basin is situated in the northern part of the country, between latitudes 21°32′8.6′′-31°27′36.2′′ N and longitudes 73°14′33.4′′-90°53′18.9′′ E, covering an area of 1,086,000 km² (Figure 1). The GRB spreads across four countries: India (79%), Nepal (14%), Bangladesh (4%), and China (3%). In India, it covers 861,452 km², nearly 26% of the country's total geographic area [50]. The GRB originates in the Himalayan Mountains at the snout of the Gangotri glacier, at an elevation of ~7000 m a.s.l. The Bhagirathi and Alaknanda Rivers meet at Devprayag, downstream of which the river is officially called the Ganga. The Ganga River's main tributaries are the Yamuna, the Ramganga, the Gomti, the Ghaghra, the Sone, the Gandak, the Kosi, and the Mahananda. It flows for about 2510 km, generally southeastward, through a vast plain to the Bay of Bengal. The primary sources of water in the Ganga River are surface runoff generated by precipitation (~66%), base flow (~14%), glacier melt (~11.5%), and snowmelt (~8.5%). The GRB receives 84% of its total rainfall during the monsoon season (June to October); the monsoon accounts for 75% of the rain in the upper basin and 85% in the lower basin [51]. The elevation across the basin ranges from sea level to the highest mountain peak (~8850 m a.s.l.).
Several researchers have documented drought years in the region. For example, the NRAA [52] reported that India had experienced 22 large-scale drought years (1891, 1896, 1899, 1905, 1911, 1915, 1918, 1920, 1941, 1951, 1965, 1966, 1972, 1974, 1979, 1982, 1986, 1987, 1988, 1999, 2000, and 2002) and that their frequency increased during the periods 1891-1920, 1965-1990, and 1999-2002. Rathore et al. [53] reported that India experienced three major droughts in 2002, 2004, and 2009. A drought occurred in Tharparkar, in Sindh province, starting in 2013 and reaching its most devastating point between March and August 2014 [54]. Kothawale and Rajeevan [55] documented several rainfall-deficit years between 1871 and 2016, reporting four deficit years (2004, 2009, 2014, and 2015).

In this study, 168 monthly GRACE gravity solutions (level-3 RL-05, spherical harmonics) at 1° × 1° spatial resolution were acquired from three research agencies, the Center for Space Research (CSR) at the University of Texas at Austin, the NASA Jet Propulsion Laboratory (JPL), and the German Research Centre for Geosciences (GFZ), and were used to determine TWSA changes from January 2003 to December 2016 (14 years) (Table 1). However, 17 months of GRACE solutions were missing during the observation period; a given month's missing solution was therefore replaced using the available adjacent months (before and after) [56,57]. The GRACE TWSA data obtained from the three data centers (JPL, GFZ, and CSR) were averaged to reduce gravity field noise [58,59]. The TWSA includes the groundwater storage anomaly (GWSA), soil moisture storage anomaly (SMSA), canopy water storage anomaly (CWSA), surface water storage anomaly (SWSA), and snow water equivalent anomaly (SWEA), as expressed by Equation (1):

TWSA = GWSA + SMSA + CWSA + SWSA + SWEA (1)

Global Land Data Assimilation System (GLDAS) Observation
The Global Land Data Assimilation System (GLDAS) is a joint project of NASA, the National Oceanic and Atmospheric Administration (NOAA), and the National Centers for Environmental Prediction (NCEP) that integrates hydrological components obtained from ground- and satellite-based observations at satisfactory spatial and temporal resolutions [60]. The details of the data products are described by Han et al. [61]. The GLDAS data comprise four land surface models (LSMs): the community land model (CLM2.0) [62], variable infiltration capacity (VIC) [63], National Oceanic and Atmospheric Administration (NOAH) [64], and Mosaic [65]. The spatial resolution of all the LSM data is about 1° × 1°. For the monthly Total Water Storage (TWS) obtained through GLDAS, the sum of the monthly soil moisture (SM) layers, snow water equivalent (SWE), and canopy water storage (CWS) from 2003 to 2016 from the four LSM datasets was used (Table 1). The average of the four LSM datasets was used to estimate the monthly TWS with minimum bias [66]. None of these LSM datasets includes groundwater storage or surface water storage (SWS) [60,62]. We assumed that the SWS was likely to be a minor component, making only a small contribution over the region; it was therefore neglected. Numerous previous studies, for instance Rodell et al. [60] and Tiwari et al. [67], have likewise neglected SWS changes. The GLDAS TWS was converted into anomalies (TWSA) with the same considerations as the GRACE data (i.e., the baseline period of January 2004 to December 2009). The GWSA was obtained by subtracting the model-based GLDAS TWSA from the GRACE TWSA [68], rearranging Equation (1); this approach is widely used in different regions of the world to isolate the GWSA from the GRACE-derived TWSA [60,67].
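The isolation of the GWSA described above amounts to simple anomaly arithmetic. The sketch below illustrates it with made-up numbers (in cm of equivalent water height); the variable names and values are ours, not the authors' data or code.

```python
import numpy as np

# Illustrative monthly TWSA values (cm equivalent water height) from the three
# GRACE processing centers; the ensemble mean reduces gravity-field noise.
twsa_csr = np.array([2.1, -0.4, 1.3])
twsa_jpl = np.array([1.9, -0.2, 1.1])
twsa_gfz = np.array([2.3, -0.6, 1.5])
twsa_grace = (twsa_csr + twsa_jpl + twsa_gfz) / 3.0

# GLDAS-based storage anomalies (SWS is neglected, as in the study).
smsa = np.array([0.8, -0.1, 0.4])   # soil moisture storage anomaly
cwsa = np.array([0.1, 0.0, 0.1])    # canopy water storage anomaly
swea = np.array([0.2, 0.1, 0.0])    # snow water equivalent anomaly
twsa_gldas = smsa + cwsa + swea

# Rearranging Equation (1): GWSA = GRACE TWSA - GLDAS TWSA
gwsa = twsa_grace - twsa_gldas
```

With the illustrative values above, the residual groundwater signal comes out directly from the subtraction, month by month.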

Tropical Rainfall Measuring Mission
The Tropical Rainfall Measuring Mission 3B43 Version 7 (TRMM-3B43-V7) monthly research precipitation data at 0.25° × 0.25° spatial resolution were used for the Indus and Ganga river basins over the period 2003-2016 [69] (Table 1). This product is commonly used worldwide for global precipitation analysis; its algorithm combines several instruments [69]. Several authors have compared the TRMM data with observational data and reported good accuracy [57,70,71].

Potential Evapotranspiration
Datasets with a spatial resolution of 1° × 1° for daily global potential evapotranspiration (PET) between 2003 and 2016 were used (Table 1). These data were generated from climate parameters extracted from the Global Data Assimilation System (GDAS) analysis fields. The NOAA generates the GDAS data every six hours, and the PET product is freely available on the USGS website (https://earlywarning.usgs.gov/fews/product/81). The daily PET is calculated on a spatial basis using the Penman-Monteith equation [72]. The monthly and yearly PET were obtained by accumulating the daily data. The monthly time series of GRACE terrestrial water storage anomalies, GLDAS observations, precipitation, and potential evapotranspiration are shown in Figure 2.


Methodology
CTEI Description and Calculation
The CTEI was derived using the GRACE TWSA data and meteorological variables (i.e., precipitation and evapotranspiration) for 2003-2016. The index was developed by Dharpure et al. [24] using hydrological and climatological conditions in the Indus, Ganga, and Brahmaputra river basins; that study also highlighted that the CTEI was positively correlated with ground observation wells and revealed that the index can quantify drought and its severity on spatial and temporal scales. This model presents a comprehensive picture for estimating drought events at a regional scale where ground observation is limited.
For the calculation of the CTEI, the difference between P and PET was first calculated for each month [15] using Equation (2):

D_t = P_t - PET_t (2)

where D indicates the difference for each month t. The difference anomaly (DA) was then derived using Equation (3) [73]:

DA_t = D_t - D_µ (3)

where D_µ indicates the average value of D, calculated for the period from January 2004 to December 2009. The monthly climatologies of the DA and TWSA were derived based on the GRACE deficit approach [74]. The climatology of each month over January 2003 to December 2016 was computed by averaging the values of the DA and TWSA for the same months of the year (e.g., all Januaries in the 14-year record were averaged) [73,74]. This monthly climatology was used to remove the influence of seasonality [75]. The monthly residuals of the DA and TWSA time series were obtained by subtracting the climatologies from each month's DA and TWSA values, respectively. The residuals were added together to obtain the combined water storage anomaly (CWSA), indicating the net deviation in the volume of water storage after accounting for seasonal variability. Finally, the CWSA was normalized by removing the mean CWSA_µ and dividing by the standard deviation CWSA_σ of each month [74], as in Equation (4):

CTEI = (CWSA - CWSA_µ) / CWSA_σ (4)
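The steps above (Equations (2)-(4)) can be sketched end-to-end in code. This is a minimal illustration on synthetic data, not the authors' implementation; for simplicity the baseline mean is taken over the full record rather than January 2004-December 2009.

```python
import numpy as np

def compute_ctei(p, pet, twsa, n_years):
    """CTEI from monthly P, PET (mm) and TWSA series of length 12 * n_years."""
    d = p - pet                          # Equation (2): monthly difference D
    da = d - d.mean()                    # Equation (3): difference anomaly DA
    da_m = da.reshape(n_years, 12)
    twsa_m = twsa.reshape(n_years, 12)
    # Remove each calendar month's climatology (all Januaries averaged, etc.)
    da_res = da_m - da_m.mean(axis=0)
    twsa_res = twsa_m - twsa_m.mean(axis=0)
    cwsa = da_res + twsa_res             # combined water storage anomaly
    # Equation (4): standardize the CWSA month by month
    return ((cwsa - cwsa.mean(axis=0)) / cwsa.std(axis=0)).ravel()

rng = np.random.default_rng(42)
n_years = 14                                   # 2003-2016, 168 months
p = rng.gamma(2.0, 50.0, 12 * n_years)         # synthetic precipitation
pet = rng.gamma(2.0, 40.0, 12 * n_years)       # synthetic PET
twsa = rng.normal(0.0, 5.0, 12 * n_years)      # synthetic TWSA
ctei = compute_ctei(p, pet, twsa, n_years)
```

By construction, each calendar month of the resulting CTEI series has zero mean and unit standard deviation over the record.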

Machine Learning Models
Support Vector Machine
A support vector machine is a supervised learning algorithm. It can also be used as a regression model, maintaining all the main features that describe the algorithm (e.g., the maximal margin). Support vector regression (SVR) uses the same theory as SVM classification, with a few slight changes: a limit of tolerance is set around the approximation, within which errors are accepted. The main aim is to minimize the error by individualizing the hyperplane that maximizes the margin, tolerating the part of the error that falls within this limit. The goal of SVM regression is to make the function f(x) as flat as possible. Hence, given a linear function, we can minimize the objective under constraints, as shown in Figure 3, where w and b define the linear function through the dot product with x; ε is the maximum deviation allowed from the observed target values; ξ, ξ* are slack variables, greater than or equal to zero; and C is a constant that trades off the flatness of the function against the tolerated deviations. The SVM estimation function (F) in any given regression scenario can be defined as in Equation (5):

F(x) = W · T_f(x) + b (5)
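As a concrete, hedged illustration of ε-insensitive support vector regression, here is a scikit-learn sketch on a toy 1-D signal; the study itself used MATLAB with a poly kernel, while this example uses an RBF kernel and illustrative hyperparameters.

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression problem: a noisy sine wave.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X).ravel() + rng.normal(0.0, 0.05, 200)

# C trades off flatness against tolerated deviations; epsilon is the
# width of the tolerance tube around f(x).
model = SVR(kernel="rbf", C=10.0, epsilon=0.01)
model.fit(X, y)
r2 = model.score(X, y)   # coefficient of determination on the training data
```

Increasing epsilon widens the tolerance tube and flattens the fitted function; decreasing C does the same by penalizing deviations less.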

In Equation (5), W is the weightage vector; T_f represents the nonlinear transfer function, which projects the input vectors into a high-dimensional feature space; and b is a constant. The parameters used for the SVM algorithm were batch size = 100, C = 1, filter type = normalized training data, and kernel = poly kernel.

Decision Trees
Decision-tree learning uses a forecasting model to go from observations about an item (represented in the branches) to predictions and conclusions about its target value (represented in the leaves). The primary aim is to create a model that forecasts the target value based on several independent inputs. Each internal node represents one of the input variables, with edges leading to children for each possible value of that variable. Each leaf holds a value of the target variable, given the input values represented by the path from the root to the leaf. In this study, several techniques were deployed to construct more than one decision tree (Figures 4 and 5).

Water 2021, 13, 547

• Boosted Tree
Boosting is an ensemble technique for creating a collection of predictors. In this technique, models are learned sequentially: early learners fit simple models to the data, and the data are then analyzed for errors. A boosted tree is built as an ensemble in which each new model is trained on a reweighted sample [76,77]. When a hypothesis misclassifies an input, that input's weight is increased, so the next hypothesis is more likely to handle it correctly. Combining the whole set at the end converts weak learners into a better-performing model. A typical paradigm is AdaBoost, which can also be used for regression-type modeling. Bootstrap-aggregated (or bagged) decision trees, an early ensemble method, construct numerous decision trees by repeatedly resampling the training data with replacement and letting the trees vote for a consensus forecast; random forest regression is a particular type of bootstrap ensemble.
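The boosting idea (sequential weak learners, with hard examples reweighted) can be illustrated with scikit-learn's AdaBoost regressor; this is our own toy example, not the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (300, 1))
y = np.sin(2.0 * np.pi * X).ravel() + rng.normal(0.0, 0.05, 300)

# A single shallow tree is a weak learner on its own...
weak = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# ...while AdaBoost combines many such trees, each trained on a sample
# reweighted toward the previous ensemble's largest errors.
boosted = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)
r2_boosted = boosted.score(X, y)
```

The combined ensemble typically fits the training signal far better than any single shallow tree, which is exactly the weak-to-strong conversion described above.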

• Bagged Tree
Bootstrap aggregation (bagging) is used when the goal is to reduce the variance of a decision tree. The bagging technique is robust and widely deployed for accurate drought estimation; it uses resamples of the training datasets [78,79]. The first step involves bootstrapping samples from the raw data to form the different training datasets. One of the bagging algorithm's main advantages is that it combines all the trees into a single combined model rather than outputting one tree. It also removes the instability present in regression tree growth, by resampling the initial training datasets rather than drawing a novel training dataset at each step. The average of all the predictions from the different trees is used, which is more robust than a single decision tree.
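The mechanics described above can be sketched by hand: bootstrap resamples of the training set, one fully grown tree per resample, predictions averaged. This is our illustration of the variance-reduction effect, not the study's setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (300, 1))
y = np.sin(2.0 * np.pi * X).ravel() + rng.normal(0.0, 0.3, 300)

# Fit each tree on a bootstrap resample (random draw with replacement).
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# The bagged prediction is the average over all trees, which has lower
# variance than any single fully grown (overfitted) tree.
X_test = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y_true = np.sin(2.0 * np.pi * X_test).ravel()
y_bag = np.mean([t.predict(X_test) for t in trees], axis=0)
y_single = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X_test)

mse_bag = np.mean((y_bag - y_true) ** 2)
mse_single = np.mean((y_single - y_true) ** 2)
```

On noisy data like this, the averaged ensemble tracks the underlying signal far more closely than a single tree that has memorized the noise.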

• Random Forest
An RF is simply a collection of decision trees whose results are aggregated into one final result [78]. Their ability to limit overfitting without substantially increasing error due to bias is why they are such powerful models. One way random forests reduce variance is by training on different samples of the data. To increase the robustness of the model, a statistical inference technique called bootstrapping is used: each of the trees making up a particular forest is built from a random sub-sample of the dataset, drawn at random with replacement. Hence the term "random" in the name of this model. The final prediction of the model is determined by an ensemble over the results of all the trees making up the forest, which can be called the "majority vote." One technique to avoid overfitting in RF applications is to limit the minimum leaf size (min-leaf). This parameter determines the minimum number of observations used to create each child node; smaller min-leaf values require a deeper learning process. The parameters used for each algorithm were bag size percentage = 100, batch size = 100, number of execution slots = 1, number of iterations = 100, and seed = 1.
At each step, the processing system divides the node into two disjoint segments so as to decrease the value of the squared deviation, described as follows in Equation (6):

SS(t) = Σ_{i=1}^{N(t)} (y_i - y_m)² (6)

where N(t) is the number of samples in node t, y_i is the value taken by the target variable for unit i, and y_m is the average of the target variable in node t.
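A hedged random forest sketch with a min-leaf constraint follows (scikit-learn's `min_samples_leaf` plays the role of the min-leaf parameter discussed above; the settings here are illustrative, not the study's bag size = 100%, 100 iterations, seed = 1 configuration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (400, 2))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, 400)

# 100 bootstrapped trees; min_samples_leaf limits the leaf size ("min-leaf"),
# trading a deeper fit against overfitting. Each split greedily minimizes the
# within-node squared deviation of Equation (6).
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1)
rf.fit(X, y)
r2 = rf.score(X, y)
```

Raising `min_samples_leaf` yields shallower, smoother trees; lowering it lets each tree fit the training data more closely.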

Matern 5/2 Gaussian Process
Matern 5/2 Gaussian process regression models the target as a random process: any point x ∈ R^d is assigned a random variable f(x). The Matern 5/2 kernel is a stationary kernel; compared with the radial basis function (RBF) kernel, it produces less smooth sample functions and does not suffer from the RBF kernel's concentration-of-measure problems in high-dimensional spaces. The Matern 5/2 GPR prior is given by Equation (7):

P(f|X) = N(f|µ, K) (7)

In Equation (7), f = (f(x_1), ..., f(x_N)), µ = (m(x_1), ..., m(x_N)), and K_ij = κ(x_i, x_j). m is the mean function; it is common to use m(x) = 0, as GPs are flexible enough to model the mean arbitrarily well. κ is a positive definite kernel (covariance) function. Thus, a Gaussian process is a distribution over functions whose shape is defined by K: if points x_i and x_j are considered similar by the kernel, the function values at these points, f(x_i) and f(x_j), can be expected to be similar too. All algorithms were implemented for CTEI modeling using MATLAB (R2019a) software. The CTEI datasets for all algorithms were divided into a training set (2003 to 2013) and a testing set (2014 to 2016).
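The same construction is available outside MATLAB; the sketch below is our illustrative scikit-learn port, with a zero mean function and a Matern ν = 5/2 covariance kernel (the length scale and noise level are assumptions, not the study's settings).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X = np.linspace(0.0, 1.0, 60).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X).ravel()

# Zero mean m(x) = 0 (the default) and a Matern nu=5/2 covariance kernel
# kappa together define the GP prior P(f|X) = N(f | mu, K) of Equation (7).
gpr = GaussianProcessRegressor(kernel=Matern(length_scale=0.2, nu=2.5),
                               alpha=1e-6, random_state=0)
gpr.fit(X, y)
mean_pred, std_pred = gpr.predict(X, return_std=True)
```

At the training points the posterior mean interpolates the data almost exactly, while `std_pred` quantifies the remaining predictive uncertainty between them.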

Statistical Analysis
Actual CTEI data and modeled values were compared for the study period. To evaluate the accuracy of the models, the following statistical indicators were selected: (1) root mean square error, (2) coefficient of determination, and (3) mean absolute error [80-83]. The parameters are defined as follows: CTEI_A^i is an observed (actual) value, CTEI_P^i is a simulated (predicted) value, CTEI_bar is the mean value of the reference samples, and N is the total number of data points.
Root mean square error (RMSE) is the standard deviation of the differences between predicted and actual values. It is given by Equation (8):

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (CTEI_P^i - CTEI_A^i)² ) (8)

The coefficient of determination (R²) is given by Equation (9):

R² = 1 - Σ_{i=1}^{N} (CTEI_A^i - CTEI_P^i)² / Σ_{i=1}^{N} (CTEI_A^i - CTEI_bar)² (9)

Mean absolute error (MAE)
The mean absolute error evaluates the mean magnitude of the errors in a set of forecasts without considering their sign. It is the average over the test sample of the absolute differences between predicted and actual values, defined as follows in Equation (10):

MAE = (1/N) Σ_{i=1}^{N} |CTEI_P^i - CTEI_A^i| (10)
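Equations (8)-(10) translate directly into code; the following self-contained check uses illustrative CTEI values of our own choosing.

```python
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((predicted - actual) ** 2))       # Equation (8)

def r2(actual, predicted):
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot                             # Equation (9)

def mae(actual, predicted):
    return np.mean(np.abs(predicted - actual))               # Equation (10)

ctei_a = np.array([0.5, -1.0, 1.5, 0.0])   # observed CTEI (illustrative)
ctei_p = np.array([0.4, -0.8, 1.2, 0.1])   # predicted CTEI (illustrative)
```

For these four illustrative points the three indicators summarize the fit in one number each: RMSE penalizes large errors quadratically, MAE weighs all errors equally, and R² measures the variance explained relative to the mean.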

Assessment of ML Model Performance
In this study, we developed twelve different data-driven models for predicting the CTEI in the Ganga basin. As discussed in the methodology, each of these models differed in the number and type of input parameters. We developed five variants of each model by changing the applied machine learning algorithm, namely RF, SVM, boosted trees, bagging, and Matern 5/2 GPR. The input parameters used in model building were the TWSA (from GRACE and GLDAS), evapotranspiration (ET), P, PET, wind speed (u), radiation, GWSA, air temperature (Ta), and surface temperature (Ts). These input parameters have also been used in previous studies [23,84]. The detailed input combination for each model is given in Table 2. The model setups range from simple models that depend only on important climatic variables, such as evapotranspiration, to very complex models including complex climatic inputs together with radiation and heat fluxes. Table 2 shows the implementation and effectiveness of all twelve models in the CTEI estimates for the Ganga basin. Some of the model setups, such as models 4 and 7, performed poorly for all five methods used. These were the models in which the following input parameters were used: PET, ET, radiation (R), net radiation (Rn), long-wave net radiation (LWN), and Ta. These models had R² lower than 0.15 in all cases, with RMSEs as high as 0.6 and correspondingly high MAEs. Based on these statistical analyses, it can be observed that model 8 showed the best predictions for the CTEI compared to the other model setups.
Moreover, among all five methods used, the SVM and Matern 5/2 GPR methods were the best-performing ML algorithms in our study of CTEI predictions in the Ganga basin (as detailed in Table 2). Model 8 showed the best performance among all the model settings: most of the ML algorithms implemented in model 8 showed higher R² values and the lowest RMSE and MAE values. The SVM algorithm was the best-performing method among all five algorithms, with an R² value of 0.82 (Figure 6) and the lowest errors in terms of the RMSE (0.33) and MAE (0.20).
The residual plot for model 8 with the SVM method shows that the maximum errors in the CTEI predictions occurred in the range of 0 to 2 mm, as shown in Figure 7. Furthermore, the GPR, boosted trees, and bagged trees algorithms also performed comparatively well, with R² values of 0.75, 0.58, and 0.52, respectively, and low error statistics in terms of the RMSE (0.39, 0.51, and 0.58, respectively) and MAE (0.21, 0.34, and 0.38, respectively). Figure 8 compares the actual CTEI with the predicted CTEI for all twelve models, using the best ML algorithm obtained for each; it shows a plot for the years from 2003 to 2014. Models 4 and 7 performed the worst, while the rest of the models performed comparatively well. As mentioned in the previous section, model 8 with the SVM algorithm provided the best CTEI prediction compared with the observed CTEI. Table 3 shows the interannual variation of the observed and predicted CTEIs for the years from 2003 to 2016. The maximum deviation in the CTEI predictions occurred in 2013 and the minimum in 2015 (10.31 and −0.02, respectively). Figure 9 compares the actual CTEI and the predicted CTEI (model 8, SVM algorithm) for all the years. It can be observed that model 8 with the SVM algorithm predicted almost concurrently with the observed datasets, except for a few years.
Figure 9 shows the correlation coefficients for both the training and testing periods of all the datasets using the best algorithm for each of the models. It can be observed that model 8 showed the highest correlation (>0.85) in both the training and testing periods. Model 9 predicted the CTEI values almost as well as model 8, with correlations up to ~0.82 for both periods. As models 4 and 7 performed poorly, they showed the lowest correlations (<0.3) throughout the simulation period. The SVM algorithm worked best compared with all the other methods, and it has also been widely used previously [32,66,82].

Comparison of Actual CTEI with Predicted CTEI
This study's findings show that machine learning algorithms are among the best, most efficient, and most powerful tools for evapotranspiration prediction when datasets are limited [23,45]. This indicates similarities to approaches with large-scale conceptual hydrological models [85,86]. It can be observed that when the models were given input variables such as climatic parameters (P, Ta, Ts), sensible heat fluxes, and the GRACE and GLDAS TWS datasets, they produced highly accurate predictions [84,87]. Considering only the temperature and extraterrestrial radiation has advantages for ML algorithms [27,28,49]. These model predictions provide the CTEI based upon net evapotranspiration without intense data requirements [22,57]. These analyses show that when various variables, ranging from climate and radiation to heat fluxes, and further including the GRACE TWSA, GLDAS TWSA, GWSA, ET, R, Ta, RN, SWN, LWN, P, and PET data, are used, a better opportunity is provided for predicting CTEI values with high accuracy. This is also supported by previous studies [10,24,29,34].

Conclusions
In this study, a new drought index called the CTEI was modeled based on five machine learning models (SVM, RF, boosted trees, bagged trees, and Matern 5/2 GPR). The index was driven by a combination of meteorological variables and the GRACE TWSA for the Ganga river basin. Monthly data were collected from 2003 to 2016 and divided into training (2003-2013) and testing (2014-2016) sets. This research aimed to investigate the performance of the five selected machine learning models in predicting the nonlinear interactions between limited input parameters to simulate the monthly values of the CTEI. Different combinations of satellite data and hydroclimatic parameters, based on limited inputs to the ML models, were used for CTEI estimation. After analyzing all the combinations, the main finding of this study was that the model 8 setting (considering the GRACE TWSA, GLDAS TWSA, GWSA, ET, R, Ta, RN, SWN, LWN, P, and PET) with the SVM algorithm showed the best performance among all the model settings in predicting the CTEI. The Matern 5/2 GPR algorithm was also satisfactory, achieving the second-highest rank for CTEI prediction. Model 6 ranked as the third-best option, achieving reasonable outcomes based on a data fusion of the GRACE TWSA, GLDAS TWSA, SWN, LWN, P, and PET.
