Reliable Evapotranspiration Predictions with a Probabilistic Machine Learning Framework

Evapotranspiration is often expressed in terms of reference crop evapotranspiration (ETo), actual evapotranspiration (ETa), or surface water evaporation (Esw), and their reliable predictions are critical for groundwater, irrigation, and aquatic ecosystem management in semi-arid regions. We demonstrated that a newly developed probabilistic machine learning (ML) model, using a hybridized “boosting” framework, can simultaneously predict the daily ETo, Esw, & ETa from local hydroclimate data with high accuracy. The probabilistic approach exhibited great potential to overcome data uncertainties, in which 100% of the ETo, 89.9% of the Esw, and 93% of the ETa test data at three watersheds were within the models’ 95% prediction intervals. The modeling results revealed that the hybrid boosting framework can be used as a reliable computational tool to predict ETo while bypassing net solar radiation calculations, estimate Esw while overcoming uncertainties associated with pan evaporation & pan coefficients, and predict ETa while offsetting high capital & operational costs of EC towers. In addition, using the Shapley analysis built on a coalition game theory, we identified the order of importance and interactions between the hydroclimatic variables to enhance the models’ transparency and trustworthiness.


Introduction
Background Evapotranspiration (ET) is the key component of groundwater budget in droughtprone regions with scarce water supplies [1][2][3], facing challenges of sustainable development and climate resilience [4]. Reliable prediction of ET is useful to determine aquifer recharge [5,6] in evaluating groundwater sustainability to meet consumptive water use and environmental flow requirements. ET is often reported as reference crop evapotranspiration (ET o ), actual evapotranspiration (ET a ), or potential evapotranspiration from wet surfaces (ET p ), or surfaces covered by large volume of water, such as wetlands or lakes (E sw ).
ET o accounts for climate-driven watershed-scale ET from a surface covered by a hypothetical grass reference crop with uniform height, fully shading the saturated soil, and hence, reflects evaporation power of the atmosphere. The FAO56 Penman-Monteith equation (FAO56 PME) is commonly used to calculate the ET o [7][8][9][10][11]. FAO56 PME-based ET o calculations depend on the climate data, involving time series of shortwave solar radiation (R s ), air temperature (T a ), atmospheric pressure (P), relative humidity (RH), and wind speed (u 2 ). Because long-term climate data are unavailable in some regions, simplified versions of the FAO56 PME with fewer climate variables have been explored and tested at watersheds with scarce data [12][13][14][15]. At the extreme edge, T a was proposed to be the best alternative to ET o [16] or used as a proxy for ET o [17] in hydroclimatic studies. simulate the complex nonlinear behavior of ET o , E p , and ET a [14,61,[64][65][66][67][68][69][70][71][72][73][74][75]. However, a critical challenge with these existing ML models is that the nonlinear relationship between climatic variables and the ET makes it difficult to account for inherent uncertainties [76]. Therefore, in this paper, we confront the uncertainties in ET predictions using a hybrid probabilistic boosting (natural gradient [77] and extreme gradient boosting [78]) ML model without compromising the accuracy of the predictions. The probabilistic model takes in respective feature values x and returns a distribution over the target y indicating the relative likelihood of different values of y. To our knowledge, ML-aided probabilistic predictions of ET o , E sw , and ET a is unprecedented. Moreover, for the first time we applied a game theory approach [79] to explain the importance of the climate variables on the ML-based ET o , E sw , and ET a predictions. This approach manifests how the individual feature values, while considering their interactions with other features, influences the models' predictions, which enhances the models' transparency and trustworthiness.

Research questions, motivations, and objectives
Considering the presence of different ET prediction methods (e.g., FAO56 PME, Meyer's Formula, Eddy covariance, Machine Learning), a follow-up research question would be: Can we have a unified machine learning model to (i) avoid calibration parameters and empirical relations in predicting E sw (or E p ), (ii) calculate ET o , E sw , and ET a simultaneously from the standard hydroclimatic data, (iii) perform probabilistic predictions over the entire solution space for more accurate assessment of uncertainties related to ET estimates, (iv) analyze the order of importance and interactions between the hydroclimatic variables, and (v) explore new hydroclimatic knowledge that may not be readily available from non-probabilistic machine learning, numerical, or empirical models? The main motivations for development of such a model are to (i) have an alternative and complementary method to the FAO56 PME [7] to circumvent net solar radiation calculations in ET o predictions; (ii) overcome uncertainties associated with the pan coefficients, pan evaporation data, and upscaling methods for E sw (or ET p ) estimates; and (iii) reduce the number of required EC towers and/or their operational time length, and hence, to offset high capital & maintenance costs of EC towers used for ET a measurements. Thus, the main objectives of the paper are to (i) develop a unified 'probabilistic' predictive ML model based on the standard local hydroclimate data to collaterally predict ET o , E sw , and ET a ; and (ii) use Shapley analysis, built on the game theory approach, to determine the order of importance and dependencies & interactions between hydroclimatic variables in ET o , E sw , and ET a predictive models, and compare the results against findings from the current literature.

Study Area & Data Availability
The karstic Edwards aquifer in semi-arid south-central Texas is the primary source of drinking water for the city of San Antonio and is also home to several threatened and endangered aquatic species (e.g., Texas blind salamander, San Marcos salamander) at the major spring outlets [80]. Up to 65% of rainfall is lost to evapotranspiration [81] in southcentral Texas, which has a few permanent surface water and experiences frequent droughts. In some years, anomalously sinking motions and divergent water vapor fluxes over the Texas area reduce precipitation and increase downward solar radiation, which results in dry and hot soil promoting the occurrence of extreme heat waves [82]. Such an extreme summer heat wave occurred in 2011 with average temperature 3 • C above the 1981-2010 mean for June through August [83]. The likelihood of exceeding a given unusually high summer temperature in the Texas region was reported to be about 10 times greater with 2011 anthropogenic emissions compared to preindustrial forcing [84]. Under the current and projected climate conditions in south-central Texas, increased air and groundwater temperatures and decreased aquifer recharge and springs flow could make endangered or threatened aquatic species vulnerable to extinction [80,85]. Therefore, reliable estimates of ET are useful for improved management of Edwards aquifer's groundwater and habitats for groundwater-dependent species, as part of the current and future resource planning.
The Edwards Aquifer Authority (EAA) initiated a pilot program in 2014 to establish a network of weather stations across the Edwards aquifer region to collect local climate data. Measured local climate data at these stations for ET o calculations include R s , P, T a , RH, and u 2 . Local climate data at the 15 min intervals from 1 September 2015 to 1 December 2020 were acquired from weather stations at the Nueces Duernell Ranch (NDR) and Bandera County River Authority and Groundwater District's office (BCRAGD) in Figure 1. Local climate data at the Camp Bullis Savanna (CBS) station was available since 25 January 2016. Local hydroclimatic data (including climate data at the NDR, BCRAGD, and CBS sites) used in the numerical and ML analyses are shown in Appendices A.1-A.4. Figure 1. Data source locations across the Edwards aquifer region. The map shows the location of EAA's weather stations with local climate data, the U.S. Geological Survey (USGS)'s station with surface water temperature data, Ingram Lake with estimated lake evaporation data, Uvalde city with the cloud cover data, and the eddy covariance tower with the actual evapotranspiration data.

FAO56 Penman-Monteith Equation (FAO56 PME)
Hourly ET o were computed at the NDR, BCRAGD, and CBS stations using the FAO56 PME. The computed ET o were then used to construct the training and testing data for the ML models. Using the FAO56 PME [7], ET o [mm/h] is computed by Hourly-averaged T a , RH, P, u 2 , e a , and e o , and hourly-aggregated R s are used in Equation (1). Net solar radiation is defined as R n = R ns − R nl , in which R ns is the measured net incoming shortwave radiation and R nl is the outgoing longwave radiation [MJ/(m 2 h)] computed as where σ is the Stefan-Boltzmann constant (2.043 × 10 −10 MJ / (K 4 m 2 h)) and T a,K is the air temperature in [K]. R so is the clear-sky radiation [MJ/(m 2 h)]. Linearized Beer's radiation law leads to R so = 0.75 + 2 × 10 −5 z R a , in which z is the elevation of the weather station above the sea level [m] and R a is the extraterrestrial radiation [MJ/(m 2 h)]. In other words, R so ∼ 0.75R a , which accounts for 25% reduction in Ra due to the interaction of R a with atmospheric gases [86,87]. (R s /R so ) is the relative shortwave radiation, representing the cloud cover, defined as in which the lower bound of 0.33 and the upper bound of 1.0 represent the dense cloud cover and clear sky on a particular day, respectively. The first, second, and third terms in Equation (2) account for the effect of air temperature, air humidity, and cloudiness on R nl . R a depends on the geographic location of the weather station and time of the day, and is computed as . ω 1 = ω − (πt 1 /24) and ω 2 = ω + (πt 1 /24). Here, R a = 0 when the sun is below the horizon at ω < −ω s or ω > ω s . To keep the cloudiness, R s /R so in Equation (3), and hence, R nl in Equation (2) finite, R s /R so at night times is set to R s /R so value 2-3 h prior to sunset. The sunset time in each day of the year can be identified by (ω s − 0.79) ≤ ω ≤ (ω s − 0.52). When the sun is above the horizon (R a > 0), G = 0.1R n corresponds to smaller heat outfluxes, promoting soil warming during day times. In contrast, when the sun is below the horizon (R a = 0), G = 0.5R n corresponds to larger heat outfluxes, promoting soil cooling at nights. Wind speed, u 2 ≥ 0.5 m/s in ET o calculations to account for the effects of boundary layer instability and buoyancy of air in promoting exchange of vapour at the surface when air calm.

Meyer's Formula (MF)
The MF model was used here to verify the reasonability of E sw data produced by upscaling the E p data, as the E sw data were subsequently used to train and test the ML models. MF is a mass transfer-based, empirically constructed formula [88] used to calculate daily or monthly E sw . It is expressed in the form of E sw = β(e o − e a ), in which β is the empirically determined constant, e o and e a are defined in terms of surface water temperature (T sw ), unlike in the FAO56 PME. Reportedly, the best form of the MF to predict daily E sw from free water surface [34] where u 2 is expressed in [mm/d], and e o and e a are expressed in [mm-Hg] in Equation (6).

Probabilistic Machine Learning Models
We used NGBoost's (natural gradient boosting [77]) modular design to hybridize it with XGBoost (extreme gradient boosting) base learners to enhance the resulting model's predictive capability-i.e., producing probabilistic predictions of the continuous variables in addition to delivering accurate point predictions. Technical explanation for the NGBoost and XGBoost models are provided in Appendix B.
As shown in Figure 2, the input feature vector x in the hybrid NGBoost-XGBoost model is passed on to the XGBoost base learners to produce a probability distribution of the predictions P θ (y|x) over the entire outcome space y (that is, ET o , ET a , and E sw ). The models are then optimized by scoring rule S(P θ , y) using a maximum likelihood estimation function that yields calibrated uncertainty and point predictions. The feature vector x for ET o predictions consists of T a , P, RH, u 2 , R s , and month; the feature vector x for ET a predictions includes ET o , T a , P, RH, u 2 , R s , and month; and the feature vector x for E sw predictions consists of T sw , T a , P, RH, u 2 , R s , and month. Since XGBoost is designed to produce only point predictions [78] and NGBoost is not specifically designed for point predictions [77], the hybridization provides the best of both models.

ET Predictions Using FAO56 PME and MF
Hourly ET o were calculated with Equation (1), using local climate data at the NDR, BCRAGD, and CBS stations. Daily and monthly ET o at the NDR site are shown as an example in Figure 3a. ET o values at nighttime hours occasionally take on negative values. In some situations, negative hourly ET o may indicate condensation of vapor during periods of early morning dew [26]. Hourly ET o < −1 × 10 −3 mm occurred <0.1% of the times in our study area. Under these conditions, net condensation of water from the atmosphere is possible. However, on a daily basis ET o values were persistently non-negative, as expected for a semi-arid region. Predicted daily and monthly E sw (from pan evaporation data) at Ingram Lake are in good agreement with the estimated E sw at Ingram Lake using Equation (6) and local climate data from the NDR station or the BCRAGD station ( Figure 3b). Figure 3c shows that ET o , in general, set the lower bound for E sw at Ingram Lake for the entire period. In 2016, ET o ∼ E sw for most of the year except in December. Although E sw > ET o in the summer of the following years, with the largest difference in the summer of 2018, ET o appears to be a reliable predictor for E sw especially in spring and winter months. Figure 3d compares daily or monthly Bowen-ratio-corrected ET a measurements from the EC tower against ET o computed by the FAO56 PME, using local climate data from the CBS station. According to this plot, ET o ≥ ET a during the monitoring period, as expected. In some summer months, ET o was about three times higher than ET a (e.g., July 2017), indicating that the soil was too dry in summer times to contribute to the evapotranspiration at the CBS site. The results in Figure 3 indicate that ET a ≤ ET o ≤ E sw , which verifies the reliability of the local hydroclimatic data used in ET o , E sw , and ET a predictions for the semi-arid region. In the subsequent analysis, hourly ET o were aggregated to daily values, as the E sw and ET a data were available to us in daily time-stamps. Daily ET o , E sw , and E a were then used to train and test the ML models. Because the data available at the CBS site was nearly half the data available at the NDR site for training the corresponding ML model, 90% of the existing data were used for training to unravel the relationship between the climatic features and the ET a . Therefore, we consistently applied the 90-10 (train-test) percentage split of datasets at all sites to train the corresponding ML models.

ET Predictions Using Probabilistic ML Models
We investigated if daily ET o can be accurately computed by the probabilistic ML model using local climate data, as an alternative to Equation (1). The ML model was trained by using 90% of the daily climate data & the month of the year as features, and the FAO56 PME-computed ET o data as the target. Subsequently, the trained ML model was used to predict ET o for the remaining 10% of the data. In the end, ML-predicted daily ET o were compared against FAO56 PME-computed daily ET o to assess the performance of the ML-model on the testing data. Differences between the ML-predicted ET o from the FAO56 PME-computed ET o on the testing dataset are shown in Figure 4a, in which 100% of the FAO56 PME-computed ET o were within the model's 95% prediction interval. In other words, the model 100% of the time successfully predicted ET o value within the 95% confidence interval. In addition, based on point statistical measures in Table 1, probabilistic prediction of ET o by the hybrid NGBoost-XGBoost model can be used as a reliable alternative method to estimate watershed-scale ET o . The total training time for the ET o hybrid model was ∼39 min that involved choosing the optimum model out of 230 candidates using a 3-fold grid search cross-validation technique, which equates to 690 model fits on an Intel Core i9-9980XE processor and 64 GB RAM computer. The main advantage of the ML-based ET o prediction model is that it avoids extra-terrestrial, clear-sky, and longwave solar radiation calculations, as part of the net solar radiation calculations. The ML-based E sw prediction model was trained by using the first 90% of the daily climate data & the month of the year as features, and the measured E sw data as the target. Subsequently, the model was tested on the remaining 10% of the data. The comparison between ML-predicted daily E sw and the measured E sw on the testing data is shown in Figure 4b. The ML-predicted E sw matched the measured E sw very closely, and 89.9% of the actual E sw were within the model's 95% prediction interval in the testing dataset. Based on the statistical measures in Table 1, the hybrid NGBoost-XGBoost model can be used as a reliable method for E sw predictions. The total training time for the E sw hybrid model was ∼ 6 min, including the elapsed time for choosing the optimum model from 690 model fits. The main advantage of the ML-based E sw prediction model is that E sw predictions are not affected by anomalies in E p measurements ( Figure A4a) or uncertainties in monthly pan evaporation coefficients.
(c) ET a predictions on the testing data Figure 4. Graphical representation of the predictive capability of the hybrid NGBoost-XGBoost model.
The ML-based ET a prediction model was trained by using the first 90% of the daily climate data, FAO56 PME-computed ET o , & the month of the year as features, and the measured ET a data as the target. Subsequently, the model was tested on the remaining 10% of the data. The comparison between ML-predicted daily ET a and the actual ET a measurements on the testing data is shown in Figure 4c, in which 93% of the actual ET a values were found within the model's 95% prediction interval. ML-based ET o predictions were more accurate than the ET a predictions due, in part, to the availability of less data from the EC tower at the CBS site than at Ingram Lake or the NDR site for the ML-model training. However, based on the statistical measures in Table 1, probabilistic prediction of ET a by the hybrid NGBoost-XGBoost ML model (with R 2 = 0.801 on testing data) can be used as a reliable method to estimate ET a . The total training time for the ET a hybrid model was ∼9 min, including the 690 model fits to choose the optimum model. The main advantage of the ML-based ET a prediction model is that it can be used to reduce the number of required EC towers and/or their operational time length, and hence, offset the high capital and maintenance costs for the installation and operation of EC towers to acquire ET a measurements.
In Ref. [59], we had compared the predictive performance of the non-probabilistic ML models-including XGBoost, support vector machines (SVM), long short-term memory networks (LSTM), deep learning (DL), random forest (RF), and linear regression (LR)-in predicting ET o from structured tabular datasets acquired from multiple weather stations. In that study, the top performing interpretable XGBoost model exhibited comparable predictive accuracy to the top performing noninterpretable DL model. RF was identified as the second best interpretable model, which resulted in comparable but, slightly lower predictive accuracy than XGBoost, and outperformed the predictive accuracy of SVM, LSTM, and LR. Based on the analysis in Ref. [59], we chose RF as the baseline model in this study to establish a point of reference for analyzing the performance of the hybrid NGBoost-XGBoost ML model. Table 1 shows that the hybrid model exhibited better predictive performance than RF in terms of point predictions of both ET o and ET a on the testing data. Conversely, RF produced better point predictions of E sw than the hybrid model but, more importantly, unlike the RF model, the hybrid model was capable of providing prediction intervals and uncertainty estimates for ET o , E sw , and ET a , and hence, overcoming the scientific challenge discussed in Section 1.

Feature Importance in ET o , E sw , and ET a Predictive ML Models Using a Game Theory Approach
It is imperative for end-users to peek inside ML models to better understand how the features contribute to the model predictions or how they affect the overall model performance. The climatic variables (T a , P, RH, u 2 , R s ) in Equation (1) were chosen as the features for the ET o predictive model. The same climatic variables were used as the features for the ET a predictive model, in addition to ET o to quantify its contribution to ET a . T sw in Equation (6) was added as a new feature to the climatic variables in the E sw predictive model. Moreover, 'month' was included as an additional feature in all predictive models based on the observed seasonality in T sw data, ET a measurements from the EC tower, and expected seasonality in soil moisture content at the site where the the EC tower is located.
We investigated the relationship and contribution of each feature to the prediction of the ET o , E sw , and ET a models (Figure 5a Figure 5a shows that the order of importance of local climate variables from the highest to the lowest on the predicted ET o involves the R s , T a , RH, u 2 , & P. The month of the year is deemed to be the second least important feature for the ET o model. Figure 5b shows that, unlike ET o , E sw is largely impacted by the T sw followed by RH. T a is given lower importance because of the high correlation (R 2 = 0.95) between T sw and T a (Figure A6b), and thus, the model considers T a as redundant. Moreover, Figure 5b unveils the model's understanding of the underlying hydrological process. For example, after being trained with the historical data, the model predicts higher E sw (represented by higher Shapley value on the x-axis) when T sw are high (represented as red dots) and RH are low (represented as blue dots), capturing the underlying physics correctly, and hence, evidencing the model capability of making learning-based conscious predictions.
For the ET a predictions, R s was followed by the month of the year, and RH, and T a are the next most important climatic features for ET a predictions, as shown in Figure 5c. However, in comparison to ET o , ET a depends less strongly on R s (R 2 = 0.77, as shown in Figure A6c), as the time-dependent soil moisture and vegetation transpiration also impact ET a measurements, unlike for FAO56 PME-computed ET o in Equation (1). Besides, the analysis did not reveal a strong impact of ET o on the ET a predictions, because FAO56 PME-ET o calculations are based on the assumption of a hypothetical reference crop growing in a saturated soil (Section 3.1), and hence, not accounting for the effect of temporally-varying transpiration rates from the actual vegetation (open oak savanna at the EC tower site) and the transient nature of the soil moisture content affecting ET a measurements. Due to the temporal variations in water uptake, vegetation transpiration, and soil moisture content on the field near the EC tower, where ET a measurements were taken, the month of the year became a strikingly more important feature in the ET a model than in the ET o model.
After all, the Shapley values represented the underlying physics accurately in ranking the hydroclimate variables in the order of their importance in predicting different ET measures. In the next section, we compare our Shapley results against the order of importance of hydroclimate variables for ET predictions in the literature that have been typically obtained from sensitivity or correlation analyses, which are not capable of accounting for the interdependency among all the features.

What the Hybrid NGBoost-XGBoost Model Accomplished?
ET a is the most critical ET estimate, especially for irrigation, agricultural, and water resources management practices. The EC method provides accurate prediction for ET a ; however, the associated capital and maintenance costs are high. For example, the capital cost for the EC tower at the CBS site was about $40,000 and required frequent maintenance. E sw measurements are important indicators of global climate change [30], which could affect the water levels & chemistry. Pan evaporation method is a simple, inexpensive, and widely-used data acquisition method to predict E sw at open water bodies, but suffers from uncertainties in pan evaporation measurements and in pan coefficients when water evaporation is upscaled from the pan-scale to the large open-water body-scale. ET o is often computed by FAO56 PME that relies on local climate variables and net solar radiation calculations.
We demonstrated in Section 4.2 that using the standard local climate data as the independent feature, the hybrid NGboost-XGBoost model can simultaneously predict (i) ET o -without requiring net solar radiation calculations -as an alternative or complementary method to FAO56 PME calculations; (ii) E sw while eliminating uncertainties associated with pan evaporation measurements and pan evaporation coefficients needed to upscale E p to E sw ; and (iii) ET a while offsetting the high capital and operational costs for EC towers by potentially reducing their numbers and operational length. In addition, the NGBoost-XGBoost model exhibited great potential in overcoming data uncertainties, in which more than 89.9% of ETo, E sw , and ET a test data were within the model's 95% prediction interval, which is essential to use such models with confidence in practice. Moreover, unlike the black-box models (e.g., deep learning models) [89], the hybrid NGboost-XGBoost model provided explanation for the nonlinear and complex ET processes, as elaborated next.

How Shapley Analysis Results Compare to Findings in the Current Literature?
We explained the nonlinear feature dependencies on the ET o , E sw , & ET a predictions using Shapley analysis, based on a game theory, to enhance the interpretability of the ML model and explainability of the ML model results. The Shapley analysis is advantageous over traditional sensitivity and correlation analysis as it explicitly considers the interaction and interdependency among the predictive variables in predicting the target variables.
The findings in Figure 5 are useful to evaluate the suitability of the simplified versions of the FAO56 PME proposed for semi-arid watersheds with scarce climate data. Irmak et al. [12] proposed two simplified FAO56 PMEs that require less number of climate variables to calculate the net radiation (R n in Equation (1)). The first equation relied on the measured T a and R s , whereas the second equation relied on predicted R s , and measured T a and RH. Although the simplified equations were used to estimate R n only, the second equation built on the three most important climate variables, identified in Figure 5a, for more accurate ET o estimates is expected to perform better for the semi-arid regions, if the predicted R s has low uncertainty. This is consistent with the conclusion by Irmak et al. [12] that the second equation accounted for 79% of the variability in R n in their case studies.
Chia et al. [73] noted that T a and R s are the most critical climate variables to estimate ET in semi-arid regions. Our results in Figure 5a agree with their statement; however, RH also needs to be included for enhanced prediction accuracy of ET o . On the other hand, RH is relatively more significant climate variable than T a to estimate ET a more accurately in a semi-arid region (Figure 5c). Moreover, the commonly used Hargreaves-Samani method [90] for ET o prediction would not provide reliable ET o estimates in a semi-arid region, as the method calculates ET o based on T a and R a only. The R a , however, is independent of R s (Equation (4)), which is the most critical climate variable to estimate ET o in a semi-arid region, as shown in Figure 5c. Besides, the Hargreaves-Samani method does not account for the effect of RH, which is nonnegligible in ET o estimates, as is evident from Figure 5a.
R s was recently used as a surrogate variable to reduce the uncertainty of ET o projection data [91]. This can be justified by the findings from Figure 5a, in which R s displayed a more profound impact on ET o than the other forcing variables. On the other hand, the mean annual temperature was used by Hartmann et al. [17] as a proxy for ET o in assessing aquifer recharge sensitivity to climate variability based on the argument that R n is temperature-dependent and temperature is the best-understood and most common climatic variable for large-scale hydrological models. Similarly, a computationally simple method of Berti et al. [16] that relies only on T a was reported to be the best alternative method to the FAO56 PME in describing spatiotemporal characteristics of ET o in different sub-regions of mainland China [13]. Such assumptions [17] and conclusions [13], however, should be made with caution in ET o calculations, especially for semi-arid regions, as the ML analysis unveiled that R s (as part of R n in Equation (1)) is more important than T a in ET o prediction. Moreover, Figure A6 revealed that the statistical correlation between R s and T a is weak with R 2 < 0.6. Thus, the use of T a as a proxy for ET o is questionable for the semi-arid region.
Gong et al. [92] noted that although the order of importance of climate variables on ET o (computed by the FAO56 PME) varied with season and region in their study, ET o in general was most sensitive to RH, followed by R s , T a and u 2 . The authors used time-histories of daily T a , u 2 , RH, and daily sunshine duration. In our analysis, however, measured climate variables were available at the 15-min intervals, including also R s and P. Unlike the general conclusion by Gong et al. [92], our ML-based feature importance calculations in Figure 5a revealed that R s and T a were more critical than RH on ET o estimates.

What Are the New Insights from the NGBoost-XGBoost That Can Not Be Obtained from Other ML Models?
Interestingly, we found many instances where the ML model shows tendency to predict higher ET a (represented by higher Shapley value on the x-axis) when RH is relatively high (represented as red dots) in Figure 5c. Such findings were also reported by Yan and Shugart [93] from ET a measurements by the EC method. High ET a at high RH could be attributed, for example, to high air-vapor uptake by water deficit soil and vegetation in hot and humid days, which are subsequently released back into air due to evaporation from soil and transpiration from vegetation; or evaporation from saturated soil and transpiration from vegetation in high RH conditions following rain events; or evaporation from moist soil on a cold day following rain events. Unlike the ML-based modeling, the dynamics between soil moisture, vegetation water uptake, rain events, T a , RH and ET a cannot be captured by one-to-one correlation, as shown in Figure A6. Additionally, Figure 6 shows that, in certain situations, the model generates low ET a predictions despite high ET o & low RH measures, which could be driven by critical moisture deficiency in the soil, especially in hot and dry summer. This could be a serious concern in future, as for a 2 • C of global warming, most of Texas was projected to experience more than a doubling in the number of days above 38 • C [94]. Such more frequent high T a over extended periods could increase the soil moisture deficiency, and decrease aquifer recharge and springs flow, which could affect sustainability of groundwater for consumptive water uses and environmental flows.  The Shap values represent the model's behavior to either push the ET a value higher or lower. A higher Shap value means that the model is trying to produce a higher ET a prediction, and vice-versa. The green boxes highlight the regions where low RH values correspond to high ET o values but low ET a predictions, which could be attributed to soil moisture deficiency.

Conclusions
We developed and presented a novel probabilistic hybrid NGBoost-XGBoost model to simultaneously predict ET a , E sw , and ET o from the standard local hydroclimatic data. Different from other ML models, the proposed hybrid ML model is able to produce point predictions as well as a probability distribution over the entire outcome space to quantify uncertainties associated with ET predictions. The proposed hybrid model could provide practitioners with a better understanding of the uncertainty in the ET o , E sw , and ET a predictions without compromising the accuracy of the predictions. Our results showed that the hybrid NGBoost-XGBoost ML model successfully predicted the FAO56 PME-computed ET o , and measured E sw and ET a , in which ≥89.9% of the target data points were within the 95% prediction interval of the model, and R 2 values for the point predictions were 0.994, 0.750, and 0.801, respectively, using data from the model testing period. These results exhibit that the proposed hybrid ML model is a reliable and robust alternative method to predict ET o , E sw , and ET a from local climate data, without implementing net solar radiation calculations required by the FAO56 PME, coping with uncertainties in E sw estimates using evaporation pans, or having expensive EC tower setups for ET a measurements.
We also demonstrated that the Shapley method, based on a game theory approach, identified the order of importance of hydroclimatic features on ET o , E sw , and ET a predictions, and we compared and contrasted these findings against the findings in the literature, which were typically performed by sensitivity and correlation analyses. The idea behind this analysis was to explain the prediction of an instance by computing the contribution of each feature to the prediction. Our analysis revealed that the shortwave solar radiation, air temperature, and relative humidity are the most critical features for the daily ET o predictions, whereas the surface water temperature, relative humidity, and the month are the most critical features for the daily E sw predictions, and the shortwave solar radiation, month, and relative humidity are the most critical features for the daily ET a predictions in the semi-arid region.
Author Contributions: H.B. and J.W. performed database management, data quality checks, and FAO56 PME and MF calculations. D.C. developed the predictive ML models and performed the feature analysis. All authors are involved in conceptualization, analyzed the results, and wrote, reviewed, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Acknowledgments:
The authors would like to thank to Newfel Mazari and Marcus Gary of EAA for their help with acquisition of climate data from EAA weather stations; Ned Troshanov and Sarah Eason of EAA for their help with preparation of the location map; John Zhu of TWDB in Austin, TX for providing us with the raw daily pan evaporation data and follow-up discussions; and Tara Bongiovanni of BEG in Austin, TX for sharing with us the daily actual evapotranspiration data from the Eddy covariance tower in Camp Bullis, TX.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
Commonly used abbreviations in the paper: For hourly-ET o calculations, hourly-averaged T a , P, RH, and u 2 and hourly-summed R s at the NDR station, shown in Figure A1, were used as input in Equation (1). The total number of missing hourly records at this site was less than 0.1%, which were filled in by linear interpolation. The NDR weather station was selected in the analysis due to its proximity to Uvalde County, TX, where monthly representative cloud cover data were obtained. The closest weather station to Lake Ingram is located at the Bandera County River Authority and Groundwater District's office. The same set of local climate data at the BCRAGD station was available for the same period at the NDR station. The total number of missing hourly data was less than 0.1%, which were filled by linear interpolation.

. Surface Water Data
Daily and monthly surface water evaporation data closest to the NDR site were obtained from Ingram Lake in Texas. Daily pan evaporation measurements (E p ) from 9/1/2015 to 12/31/2019 were taken by the Texas Water Development Board (TWDB). 1.9% of these measurements were missing, which were filled in by linear interpolation. These measurements were upscaled to daily lake evaporation totals (E sw ) using monthly-varying pan coefficients developed by the TWDB. However, sporadically extremely high and low E sw values, shown in Figure A4a, were found to be quantitatively inconsistent with the climatic data (T a , R s , and RH) trends at the BCRAGD station, provided in Appendix A.2. Therefore, this time series is regarded as anomalous. Such anomalies are quite common in E p measurements due to birds drinking from the pan, debris falling in, or water splashing out [95]. Subsequently, these anomalies are carried into the daily E sw data, but largely smoothed out in monthly-averaged E sw . Because the ML model was run with daily E sw data here, a 7-day rolling median function was used to reduce the noise and outliers in the daily E sw data ( Figure A4a). Monthly E sw , derived from daily E sw ( Figure A4) were then used to determine the suitability of the MFs to predict the monthly E sw at Ingram Lake. <HDU 'DLO\E sw PP 0HDVXUHGE sw 0HDVXUHGE sw DIWHUQRLVHRXWOLHUUHPRYDO (a) Lake evaporation (b) Surface water and air temperature Figure A4. Surface water measurements closest to the NDR weather station. Surface water temperatures at 15-min intervals were obtained from Frio river in Concan, and daily lake evaporation data were obtained from Ingram Lake in Texas.
The daily and monthly E sw rely on surface water temperature, T sw [34,35]. The closest gauging station, with the surface water temperature data from 9/1/2015 to 12/31/2019 at the 15-min (or 1-h) intervals, to Ingram Lake is the U.S. Geological Survey Station (USGS 08195000) located at the Frio River in Concan, TX. The Frio River at the USGS 08195000 and Ingram Lake are small-size surface water bodies fed by groundwater from the Trinity aquifer. Therefore, T sw from the Frio river were used in MF-based E sw calculations at Ingram Lake. Daily-averaged T sw are shown in Figure A4b. Because the Frio river is a groundwater-fed river, T sw ≥ T a in winter; whereas, T sw ≤ T a in summer. 0.26% of daily T sw were missing, which were filled in by linear interpolation.

Appendix B. NGBoost and XGBoost Models
Appendix B.1. Natural Gradient Boosting (NGBoost) NGBoost is a supervised learning algorithm with generic probabilistic prediction capability. A probabilistic prediction produces a full probability distribution over the entire outcome space; thus, enabling the users to quantify the uncertainties in the construction cost predictions produced by the model. In standard point prediction settings, the object of interest is an estimate of the scalar function E(y|x), where x is the feature vector and y is the prediction target, without accommodating uncertainty estimates. In contrast, under a probabilistic prediction setting, a probabilistic forecast with probability distribution P θ (y|x) is produced by predicting the parameters θ. NGBoost can perform probabilistic forecast with flexible tree-based models, given that NGBoost is designed to be scalable and modular with respect to the base learner (e.g., decision trees), probability distribution parameter (e.g., normal, Laplace, etc.), scoring rule (e.g., Maximum Likelihood Estimation). As shown in Figure A5, the input feature vector x in the hybrid NGBoost model is passed on to the base learners (decision trees) to produce a probability distribution of the predictions P θ (y|x) over the entire outcome space y. The models are then optimized by scoring rule S(P θ , y) using a maximum likelihood estimation function that yields calibrated uncertainty and point predictions.

Appendix B.2. eXtreme Gradient Boosting (XGBoost)
Extreme gradient boosting (XGBoost), proposed by Chen & Guestrin [78], is a variant of tree-based boosting algorithm. Conceptually, XGBoost learns the functional relationship f between the features X and target Y through an iterative process in which the individual trees are sequentially trained on the residuals from the previous tree. Mathematically the predictions from the trees can be expressed as where Y is the predicted E c , 1 ≤ k ≤ n, and n is the total number of functions learnt by the n number of trees. The following regularized objective L(φ) is minimized to learn the set of functions f k used in the model where Ω( f k ) = γT + 1 2 λ||w|| 2 , where l is a differentiable convex loss function that measures the difference betweenŷ i (prediction) and y i (target). Ω is an extra regularization term that penalizes the growing of more trees in the model to prevent complexity and thus, reduce overfitting. γ is the complexity of each leaf, T is the number of leaves in a tree, λ is a penalty parameter, and ||w|| is the vector of scores on the leaves. Note that if the regularization parameter Ω is set to zero, the objective falls back to the traditional gradient tree boosting. For more detailed information on how the model is trained in an additive manner to optimize Equation (A2), the reader is referred to Chen & Guestrin [78].