Estimation of Reference Evapotranspiration Using Spatial and Temporal Machine Learning Approaches

: Evapotranspiration (ET) is widely employed to measure amounts of total water loss between land and atmosphere due to its major contribution to water balance on both regional and global scales. Considering challenges to quantifying nonlinear ET processes, machine learning (ML) techniques have been increasingly utilized to estimate ET due to their powerful advantage of capturing complex nonlinear structures and characteristics. However, limited studies have been conducted in subhumid climates to simulate local and spatial ET o using common ML methods. The current study aims to present a methodology that exempts local data in ET o simulation. The present study, therefore, seeks to estimate and compare reference ET (ET o ) using four common ML methods with local and spatial approaches based on continuous 17-year daily climate data from six weather stations across the Red River Valley with subhumid climate. The four ML models have included Gene Expression Programming (GEP), Support Vector Machine (SVM), Multiple Linear Regression (LR), and Random Forest (RF) with three input combinations of maximum and minimum air temperature-based (Tmax, Tmin), mass transfer-based (Tmax, Tmin, U: wind speed), and radiation-based (Rs: solar radiation, Tmax, Tmin) measurements. The estimates yielded by the four ML models were compared against each other by considering spatial and local approaches and four statistical indicators; namely, the root means square error (RMSE), the mean absolute error (MAE), correlation coefﬁcient (r 2 ), and scatter index (SI), which were used to assess the ML model’s performance. The comparison between combinations showed the lowest SI and RMSE values for the RF model with the radiation-based combination. Furthermore, the RF model showed the best performance for all combinations among the four deﬁned models either spatially or locally. In general, the LR, GEP, and SVM models were improved when a local approach was used. The results showed the best performance for the radiation-based combination and the RF model with higher accuracy for all stations either locally or spatially, and the spatial SVM and GEP illustrated the lowest performance among the models and approaches.


Introduction
Evapotranspiration (ET) is a combination of two separate processes that transfer huge volumes of water and energy from the soil (evaporation) and vegetation (transpiration) to the atmosphere [1,2]. ET is the second greatest component of hydrological cycle and a major component of the Earth's surface energy balance. ET closely relates to plant growth, droughts, gas efflux, and production. Since ET plays a crucial role in watershed and agricultural water management, accurate spatial and local estimation of crop water requirements (ET a ) at the scale of human influence is a critical need for a wide range of applications [3]. Quantifying ET a from agricultural lands is vital to the management of water resources. Measurement methods of ET a are available through water vapor wavelet algorithm to estimate ET o for the southern part of Iran and obtained results stating that the coupled RF model showed a great improvement compared to the conventional RF and empirical models. To our knowledge, this model has not been applied in the Northern US for ET o studies.
According to the literature, GEP and SVM models have been frequently applied across the world in various climate conditions for ET o estimation, while the LR and RF models' applications were minimal. Besides, these models have not been compared with commonly used SVM and GEP models for subhumid climate conditions. Since the limited studies have been conducted to evaluate ML models for the Northern part of the US (which experiences a high variability of weather conditions and a huge amount of agricultural production), the objectives of this study were to (1) investigate the effect of different input combinations of meteorological data on the accuracy of daily ET o estimation in subhumid climate using the GEP, SVM, LR, and RF methods, (2) compare the spatial and local prediction capabilities of the different ML models in ET o estimation, and (3) evaluate the performance of the models based on the various study years and meteorological stations.

Study Area Climate and Reference Evapotranspiration (ET o )
The weather data for the current study were obtained from the North Dakota general regression neural network model. Shiri [41] evaluated the coupled RF with wavelet algorithm to estimate ETo for the southern part of Iran and obtained results stating that the coupled RF model showed a great improvement compared to the conventional RF and empirical models. To our knowledge, this model has not been applied in the Northern US for ETo studies. According to the literature, GEP and SVM models have been frequently applied across the world in various climate conditions for ETo estimation, while the LR and RF models' applications were minimal. Besides, these models have not been compared with commonly used SVM and GEP models for subhumid climate conditions. Since the limited studies have been conducted to evaluate ML models for the Northern part of the US (which experiences a high variability of weather conditions and a huge amount of agricultural production), the objectives of this study were to (1) investigate the effect of different input combinations of meteorological data on the accuracy of daily ETo estimation in subhumid climate using the GEP, SVM, LR, and RF methods, (2) compare the spatial and local prediction capabilities of the different ML models in ETo estimation, and (3) evaluate the performance of the models based on the various study years and meteorological stations.

Study Area Climate and Reference Evapotranspiration (ETo)
The weather data for the current study were obtained from the North Dakota  To calculate the daily reference evapotranspiration (ETo) for each study station, the ASCE standardized reference evapotranspiration equation (ASCE-EWRI) was used for the alfalfa reference crop [43]. This equation provides a standardized calculation of ETo demand that can be used in developing Kc and comparing it with other methods. Equation (1) presents the form of the standardized ETo equation for all hourly and daily calculation time steps.
where ETo is reference evapotranspiration rate (mm d −1 ), Rn is net solar radiation (MJ m −2 d −1 ), G is soil heat flux (MJ m −2 d −1 ), is psychrometric constant (KPa °C −1 ), T is the mean daily air temperature, U2 is the average wind speed at 2 m height (m s −1 ), es is saturation vapor pressure (KPa), ea is actual vapor pressure (KPa), ∆ is the slope of the saturation vapor pressure-temperature relationship (KPa °C −1 ), Cn and Cd are coefficients which are related to the crop and time step. The value for the constants Cn and Cd are 1600 and 0.38 To calculate the daily reference evapotranspiration (ET o ) for each study station, the ASCE standardized reference evapotranspiration equation (ASCE-EWRI) was used for the alfalfa reference crop [43]. This equation provides a standardized calculation of ET o demand that can be used in developing K c and comparing it with other methods. Equation (1) presents the form of the standardized ET o equation for all hourly and daily calculation time steps.
where ET o is reference evapotranspiration rate (mm d −1 ), R n is net solar radiation , T is the mean daily air temperature, U 2 is the average wind speed at 2 m height (m s −1 ), es is saturation vapor pressure (KPa), e a is actual vapor pressure (KPa), ∆ is the slope of the saturation vapor pressure-temperature relationship (KPa • C −1 ), C n and C d are coefficients which are related to the crop and time step. The value for the constants C n and C d are 1600 and 0.38 for the alfalfa reference crop. Table 1 summarized weather parameters of the study locations for the study period.

Models Structure and Application
To process the GEP and SVM algorithm, we used the GeneXpro program in Matlab, and to process the LR and RF models, a scikit-learn module embedded in the Python 3.2 programming language was used. GEP is an extension of GP [44] developed by Ferreira [45] that creates a computer program to investigate a relationship between input and output variables. GEP was developed to find a better solution for solving a particular problem relating to the understudied phenomena [46].
The application of GEP requires several steps. GEP is, like GAs and GP, a genetic algorithm, as it uses populations of individuals, selects them according to fitness, and introduces genetic variation using one or more genetic operators. The procedure to model daily evapotranspiration (as a dependent variable) by using weather variables (as independent variables) is as follows: 1. Selection of fitness function; 2. Choosing the set of terminals T and the set of functions F to create the chromosomes; 3. Choosing the chromosomal architecture; 4., Choosing the linking function; 5. Choosing the genetic operators.
The fitness function must be determined in the first step with a random generation of chromosomes of a certain program (initial population) and evaluated against a set of fitness cases [47]. Using weather station data as input variables (terminals) to model daily ET o involves the next general step. The selection of fitness functions (i.e., absolute error, relative error, and correlation coefficient) depends on the experience and intuition of the user. The GEP model in the current study was developed based on the recommended functions by Shiri et al. [17]. In the third step, the chromosomal architecture can be defined by having the weather variables as terminal and function set as chromosomes. The fourth step was to select a linking function that relates genes to each other in addition to linking the parse trees [17]. Finally, genetic operators' corresponding rates were chosen. Table 2 summarizes the commonly used parameters for each run. The SVM was developed by Cortes and Vapnik [48] and is known as the classification and regression method [34] to solve problems by applying a flexible representation of the class boundaries and implementing an automatic complexity control to reduce overfitting. In SVM, the dependency of the dependent variable to a set of independent variables is evaluated. In regression estimation with Support Vector Regression (SVR), which is used to define SVMs in the literature, a functional dependency f (x) between a set of sampled input points X = {x 1 , x 2 , x 3 , . . . , x l } (here, input sampled refer to meteorological variables) taken from R n (input vector of n dimension) and target values Y = {y 1 , y 2 , y 3 , . . . , y l } (ET o as target values) with y i ∈ R n . More detail on SVM can be found in Vapnik [49].
The LR is a statistical method used to describe a quantitative relationship between a dependent variable and one or more independent variables [27,50]. In LR, the function is a linear equation and is expressed as: where b o -b k are the fitting constants, y i is the dependent variable, and x 1 -x k are the independent variables for this system. The RF method combines a group of decision trees for either classification or regression purposes. Although each decision tree may not be capable of learning well, the combination of the decision trees results in a strong learner. Each decision tree predicts the outcome individually, and RF votes among the outcomes for classification or averages the outcomes for regression. Each decision tree is trained on a different subset of samples by a bagging extension of the RF model to reduce the risk of overfitting. Moreover, a different subset of input variables can be used in each tree to make it more useful in prediction for datasets with higher dimensions [51]. For this study, a small subset of data was used to find a good combination of parameters for the RF model. As a result, there were several trees in the forest and the minimum number of samples required for the leaf nodes were 50 and 35, respectively. The mean square error criteria are used as a procedure for estimation.
The calculated daily ET o was used to feed the GEP, SVM, LR, and RF models. Three treatments including temperature, radiation, and mass transfer-based combinations were used as input to feed the models, and each model of the combinations was assessed for spatial and local approaches. Different statistical analysis was performed to evaluate the accuracy and performance of the different combinations and approaches for each studied station. The combinations were as follows: (i) T min , T max : temperature based (GEP1, SVM1, LR1, RF1) (ii) T min , T max , R s : radiation-based (GEP2, SVM2, LR2, RF2) (iii) T min , T max , W: mass transfer based (GEP3, SVM3, LR3, RF3).

K-Fold Cross-Validation
Splitting the data into the sets of data for testing and training is a usual procedure for assessing the ML techniques. Using 10-30% of the complete dataset as a single test set is a common method for GEP evaluation. Therefore, the K-fold cross-validation technique was used to increase the evaluation performance and set of data for either training or testing purposes. Using K-fold cross-validation, the dataset was divided into K subsets, and the training process was repeated K times leaving each time a distinct set of patterns for testing until a complete testing scan for the dataset was fulfilled. Computation cost defines the minimum assembly size of the test set. Here, the minimum test size was fixed as one year for local modeling and one station for spatial modeling. Consequently, at a local scale, one year was held out each time for testing while the models were trained using the remaining 16 years; hence, a total of 612 models (17 years × 6 stations × 3 input combinations × 2 models) were established for the local k-fold testing. At the spatial scale, one station was considered as a test block each time and the models were trained using the patterns from the remaining stations; hence, a total number of 36 models (6 stations × 3 input combinations × 2 models) were constructed. The local and spatial approaches were noted with T and S in the figures.

Evaluation Criteria
To investigate the performance of the models for each combination and approaches, four statistical indicators were used, namely, the root mean square error (RMSE), the mean absolute error (MAE), correlation coefficient (r 2 ), and scatter index (SI), defined as follows: where ET e and ET o are simulated and calculated reference evapotranspiration at the i-th time step, respectively, n is a number of time steps, ET e and ET o are mean values of simulated and calculated ET o , respectively. The RMSE describes the average magnitude of errors and can take on values from 0 to ∞ indicate perfect and worst fit, respectively, and the MAE scores the error magnitudes without any specific weight to larger/smaller errors. Therefore, the lower value of the RMSE and MAE is desirable. The r 2 values around 1 indicate a perfect linear relationship between estimated and calculated values, where the closer a value is to zero, the more it demonstrates the poor relationship between simulation and calculation. Finally, SI is a dimensionless index of RMSE that gives a good insight to compare the performance of different models.

Results and Discussions
The local and spatial analysis of four models for six studied stations is shown in Table 3. According to the three combinations' performance, the radiation-based method illustrated the highest accuracy for either local or spatial approaches compared to the other combinations. The mass-transfer-based combination was the next most accurate combination. The results showed that the local trained models surpassed the spatially trained models because of using the same patterns for both training and testing models. However, the spatial models gave comparable results compared to the local model in some cases, especially for radiation-based combinations. Differences in temperature among the stations have dramatically affected the performance of both the temperature-based and the mass transfer-based models. In all cases, the minimum differences between the performance accuracy of the local and spatial models belonged to the LR model. This can be inferred to the mathematical structure of this technique, where only linear relationships can be supposed between the input and target parameters with a lower degree of flexibility compared to heuristic data driven models. Table 3. Global average performance indicators of the gene expression programming (GEP), support vector machine (SVM), multiple linear regression (LR), and random forest (RF) methods for three input combinations of local (T) and spatial (S) approaches. Among four models with three input combinations, the models relying on radiation, mass-transfer, and temperature-based combinations showed the lowest RMSE and MAE, respectively (Table 3). Comparing the GEP, SVM, LR, and RF models, the RF model illustrated the lowest rate of RMSE and MAE with the best performance for radiationbased approaches. However, the RF model improved 4.37, 5.76, and 1.49 percent from local to spatial approaches for temperature, radiation, and mass-transfer-based combinations, respectively, which was in contrast with the improvement's direction in the other models. Considering the models based on radiation combination, the spatial RF model exhibited the highest linear relationship (r 2 = 0.927) between calculated and estimated ET o in comparison with the other models. The local RF method was the next accurate approach to estimate ET o based on radiation-based data. This observation illustrated the ability of the RF algorithms to estimate ET o using data from local stations for training. Furthermore, the LR model had significant improvement for RMSE and MAE from spatial to local approaches for all three types of input combinations. For the LR model, the r 2 value was not changed for radiation-based and temperature-based combination from spatial to local approaches and the change was 0.13 percent for the mass-transfer-based model. Therefore, the LR model illustrated almost similar results for both spatial and local approaches among all models.

Input
The GEP and SVM models illustrated the great improvement rate for all three input combinations from spatial to local condition with the highest improvement of 21 percent for mass-transfer-based combination. Specifically, the GEP model showed an improvement from spatial to local approach, however, the percentage of improvement was 2.3, 3.9, and 0.13 for radiation, temperature, and mass-transfer-based combinations, respectively. In term of obtained improvement for RMSE and MAE from spatial to local approaches, both of the GEP and SVM models gained similar results. The correlation coefficient of the SVM model decreased from spatial to local approaches for about 6.3, 6.5, and 10.9 percent for radiation, temperature, and mass-transfer-based combinations, respectively. By using local radiation data for training the models, the SI indicator for the GEP and SVM models showed an improvement of 8.2 and 10 percent from spatial to local approach, respectively. This improvement was about 6.6 and 8.7 percent for mass-transfer-based and 15 and 10.9 percent for temperature-based combinations, respectively.
Statistical analysis revealed the similar performance of the local GEP and SVM models. For RMSE and MAE statistical variables, GEP and SVM models showed a greater improvement in performance for mass-transfer and temperature-based combinations, respectively. By considering correlation coefficient values, it can be concluded that the improvement in accuracy of either GEP or SVM approaches was not significant and all illustrated the ability to estimate with acceptable accuracy. Therefore, if temperature data are not available at a particular station, but they are for other stations, the GEP and SVM approaches can be useful to estimate ET o . However, due to the higher mapping ability of the GEP models, using either local or spatial GEP are preferable.
The models relying on the mass-transfer combination had slightly higher accuracy than the temperature-based approach, but lower accuracy compared to the radiationbased combination. All of the local and spatial GEP and SVM methods illustrated lower improvement compared to that for the temperature-based approach. This showed that wind speed can have a significant effect on the accurate estimation of spatial and local ET o . Due to the flat topography of the study area and being faced with lots of high-speed winds during the growing season and almost all other seasons, including the wind as a parameter to build the model architecture and estimating the ET o can increase the accuracy of the approach.
Overall, the RF and the LR models illustrated the best performance among the four models, and comparing the GEP and SVM models, the GEP model showed better performance than the SVM model for all three input combinations.
A breakdown of the models' performance accuracy at each station is shown in Figures 2-4 for all of the three input combinations, respectively. In the case of the temperature-based combination (Figure 2), the local GEP and SVM models (shown as TGEP and TSVM) gave more accurate results than the spatial (shown as SGEP and SSVM) models. For the LR and RF models, the difference in accuracy between local (TLR and TRF) and spatial approaches (SLR and SRF) was not significant and both showed better performance than GEP and SVM models since they relied on the patterns of the same location used for training and testing the models. According to Table 2, station 6 (Fargo) had the highest, and station 2 (Galesburg) had the lowest range of recorded temperature among the study stations. This range may be caused to have the lowest performance for station 2; however, it was difficult to evaluate the model's performance in the climate context of each station due to the few number of study stations. The RF model showed the best performance with higher accuracy for all stations either locally or spatially, and the spatial SVM and GEP illustrated the lowest performance among other models and approaches. The RF and LR methods showed the lowest range of SI compared to the spatial and local GEP and SVM methods. For temperature-based combinations, the spatial and local LR approaches had minimum SI ranges of 0.018 and 0.020, respectively, and the spatial SVM and GEP methods illustrated the highest SI ranges of 0.113 and 0.119, respectively. The spatial RF approaches with an average of 0.333, and spatial SVM, with an average of 0.457, showed the lowest and highest SI rate, respectively. Therefore, spatial RF approaches may be the most practical way to estimate the missing meteorological data of the study stations. Figure 3 shows the evaluation result of the radiation-based combination for the four models with spatial and local approaches. The amount of received radiation for all study stations was similar. According to the global performance of the defined models, the radiation-based combination gained the best performance among the three input combinations. Besides, the radiation-based combination had the lowest rate of RMSE, MAE, and SI, and the highest rate of r 2 for each of the study stations. Among the spatial and local scenarios, the local approach had a better performance than the spatial approach. For the radiation-based combination, the spatial RF and local RF models had an accurate estimation of ETo, respectively. For stations 3 and 4 (Leonard and Sabin) either spatial or local approaches of GEP and SVM models gained lower performance than the other stations. This could be due to the slightly higher magnitude and variations of solar radiation (Table  2) among the other stations during the study period. Among the study stations comparison, the SI range of the spatial RF was 0.018, which showed the best performance compared to the other applied methods. As obtained from the temperature-based combination, the LR method performed well in the radiationbased combination too, with an SI indicator range of 0.021 and 0.024 for local and spatial LR approaches, respectively. The worst performance was observed for spatial GEP and SVM approaches, with SI indicator of 0.128 and 0.140, respectively. According to the GEP and SVM models, the local GEP performed well compared to other approaches of the SVM and GEP models. The statistical indicators were in agreement with the spatial RF perfor-  By having wind speed and temperature data as the input variable for the mass transfer-based combination, the spatial RF approach gained the lowest SI and highest r 2 values for ETo estimation compared to all other methods. The minimum and maximum SI values for mass transfer-based combination were obtained for the spatial RF and spatial SVM approaches, which were 0.011 and 0.120, respectively. According to the performance ranking of the models based on the SI indicator, spatial LR, local LR, and local RF showed better performance after spatial RF with SI values of 0.015, 0.018, and 0.018, respectively. The local SVM, local GEP, and spatial GEP had the SI values of 0.087, 0.10, and 0.119, respectively. The average of SI for all study stations showed that the spatial and local LR had the highest and spatial and local RF had the lowest SI values, respectively. Therefore, by having the lowest range of SI and lowest value of SI for the spatial RF approach, it might be more practical to apply the spatial RF for other stations without training a specific model for each station. Accordingly, no local dataset would be needed to train the local models. This could be helpful to estimate the ETo for stations with partial or missing meteorological data.
To understand the yearly performance of the applied models based at each of the study stations, the models were assessed per test year. Figure 5 illustrates the SI values obtained from the three input combinations for each study year of the study stations. The SI values of the models fluctuated considerably for almost all stations during the test years.
As shown in Figure 5, the SI values fluctuated considerably within test years for all input combinations and approaches. Among the study stations, Prosper and Sabin stations showed the average maximum range for the SI values. The minimum average of the SI value 0.223 was observed for the RF radiation-based combination for the Fargo station, and the maximum average of the SI was obtained for the SVM temperature-based combination (SVM1) for the Galesburg station. The Galesburg station had the lowest temperature range among the study stations. According to Figure 4, the RF2 (radiation-based combination) model showed the lowest fluctuation compared to the other approaches with a similar trend among the study stations. For all of the study stations, test years, and input The RF and LR methods showed the lowest range of SI compared to the spatial and local GEP and SVM methods. For temperature-based combinations, the spatial and local LR approaches had minimum SI ranges of 0.018 and 0.020, respectively, and the spatial SVM and GEP methods illustrated the highest SI ranges of 0.113 and 0.119, respectively. The spatial RF approaches with an average of 0.333, and spatial SVM, with an average of 0.457, showed the lowest and highest SI rate, respectively. Therefore, spatial RF approaches may be the most practical way to estimate the missing meteorological data of the study stations. Figure 3 shows the evaluation result of the radiation-based combination for the four models with spatial and local approaches. The amount of received radiation for all study stations was similar. According to the global performance of the defined models, the radiation-based combination gained the best performance among the three input combinations. Besides, the radiation-based combination had the lowest rate of RMSE, MAE, and SI, and the highest rate of r 2 for each of the study stations. Among the spatial and local scenarios, the local approach had a better performance than the spatial approach. For the radiation-based combination, the spatial RF and local RF models had an accurate estimation of ET o , respectively. For stations 3 and 4 (Leonard and Sabin) either spatial or local approaches of GEP and SVM models gained lower performance than the other stations. This could be due to the slightly higher magnitude and variations of solar radiation (Table 2) among the other stations during the study period.
Among the study stations comparison, the SI range of the spatial RF was 0.018, which showed the best performance compared to the other applied methods. As obtained from the temperature-based combination, the LR method performed well in the radiation-based combination too, with an SI indicator range of 0.021 and 0.024 for local and spatial LR approaches, respectively. The worst performance was observed for spatial GEP and SVM approaches, with SI indicator of 0.128 and 0.140, respectively. According to the GEP and SVM models, the local GEP performed well compared to other approaches of the SVM and GEP models. The statistical indicators were in agreement with the spatial RF performance in which they showed the lowest rate of RMSE and MAE and the highest value of r 2 . However, comparing the MAE might not be a valid indicator due to taking into account the local order of magnitude of the target variable. The ranking of the SI indicators showed that spatial RF and LR could overcome the lack of meteorological data for the station. On the other hand, the averages of the SI values for all six study stations showed that the spatial RF and local RF had the lowest and the spatial GEP and spatial SVM had the highest rate of SI indicators. Therefore, either spatial or local RF methods could be useful to estimate the missing values for any of the stations. Figure 4 shows the statistical indices of the mass-transfer-based combination. Similar to the previous combinations, the spatial and local RF gave a more accurate estimation than other methods. On the other hand, the local SVM approach showed better estimation than the spatial SVM and GEP methods for all stations except station 2, which had the lowest range of temperature variation. The fluctuations of the indices among the stations were higher than the radiation-based combination and lower than that for temperature-based combination, which showed mediocre accuracy compared to the other combinations.
By having wind speed and temperature data as the input variable for the mass transferbased combination, the spatial RF approach gained the lowest SI and highest r 2 values for ET o estimation compared to all other methods. The minimum and maximum SI values for mass transfer-based combination were obtained for the spatial RF and spatial SVM approaches, which were 0.011 and 0.120, respectively. According to the performance ranking of the models based on the SI indicator, spatial LR, local LR, and local RF showed better performance after spatial RF with SI values of 0.015, 0.018, and 0.018, respectively. The local SVM, local GEP, and spatial GEP had the SI values of 0.087, 0.10, and 0.119, respectively. The average of SI for all study stations showed that the spatial and local LR had the highest and spatial and local RF had the lowest SI values, respectively. Therefore, by having the lowest range of SI and lowest value of SI for the spatial RF approach, it might be more practical to apply the spatial RF for other stations without training a specific model for each station. Accordingly, no local dataset would be needed to train the local models. This could be helpful to estimate the ET o for stations with partial or missing meteorological data.
To understand the yearly performance of the applied models based at each of the study stations, the models were assessed per test year. Figure 5 illustrates the SI values obtained from the three input combinations for each study year of the study stations. The SI values of the models fluctuated considerably for almost all stations during the test years.
As shown in Figure 5, the SI values fluctuated considerably within test years for all input combinations and approaches. Among the study stations, Prosper and Sabin stations showed the average maximum range for the SI values. The minimum average of the SI value 0.223 was observed for the RF radiation-based combination for the Fargo station, and the maximum average of the SI was obtained for the SVM temperature-based combination (SVM1) for the Galesburg station. The Galesburg station had the lowest temperature range among the study stations. According to Figure 4, the RF2 (radiation-based combination) model showed the lowest fluctuation compared to the other approaches with a similar trend among the study stations. For all of the study stations, test years, and input combinations, the RF models gave the best performance with the lowest SI values compared to the other models and combinations. The SVM and GEP models showed the worst SI averages for the temperature-based combinations. In this case, the order of the accuracy of the models was similar to that obtained from the station-based analysis. The radiation-based combination gave the most accurate results and estimation in comparison with the temperature or mass-transfer-based combination models.
Comparing the performance of the models relying on each of the combination methods, it can be observed that the performance of the approaches is similar. However, the range of the SI indicator for different approaches was different depending on the test years. For example, 2011, 2012, and 2013 were the dry, normal, and wet years among the study years, respectively. The result of SI per test years showed that 2012 had lower SI than 2011 and 2013 for all of the three input combinations, and the SI values of the various methods and approaches were close together. On the other hand, for the 2013 test year, some of the approaches illustrated a huge jump for the obtained SI from 2012 to 2013 due to the weakness of the model performance.
combinations, the RF models gave the best performance with the lowest SI values compared to the other models and combinations. The SVM and GEP models showed the worst SI averages for the temperature-based combinations. In this case, the order of the accuracy of the models was similar to that obtained from the station-based analysis. The radiationbased combination gave the most accurate results and estimation in comparison with the temperature or mass-transfer-based combination models.  Due to the variability in the meteorological data and the climate pattern, the variability in performance accuracy at each station was expected. Similar variability in performance of the ML approaches was observed by Shiri et al. [17]. Selection of the training set and testing set plays an important role in model performance. The existence of any abnormal year in the test years in comparison with training datasets causes it to have an inaccurate estimation [19]. By lowering the number of the input values, the validity of the training set for estimation of test years decreases. Because of the lower input values, the variable rate would be low, and this type of input would only be valid for periods with very specific trends. This explanation may clarify the performance of spatial approaches, where the relationship encountered might be site-specific. Other researchers illustrated the sitespecific performance of spatial approaches for the different study regions and climate conditions [34,35].
The comparison of the three input combinations showed that the performance of some approaches was similar for some years while the performance of methods for some years were far from each other. For example, the SVM model showed the most improvement from temperature to mass-transfer-based combination, which became like the RF method. However, depending on the station and test year, this similarity becomes even closer. All the test years and stations showed an improvement from temperature to radiation or mass-transfer-based combination except for the Prosper station, which is in agreement with the findings of Adnan et al. [15]. On the other hand, the Prosper station showed the best improvement for the SVM model for radiation-based approaches. Considering that all of the input combinations rely on the temperature and another variable (solar radiation or wind speed), it might be thought that the performance differences could be due to the effect of the second variable in the estimation of the output. Besides, when the performance of the models is similar, the impact of the secondary variable might be less than the primary variable (temperature). However, when the gap between the performance of the SI indicator increases, it proves the crucial impact of the second variable on the model performance for the specific test year or station. A similar conclusion could be drawn for the comparison between the input combinations. If each of the combinations shows a better performance than the other combination method for a specific year and station, the second variable effect should be important and might have a significant influence in the explanation of the output for that test year.

Conclusions
This paper aimed to evaluate the different ML methods including GEP, SVM, LR, and RF to estimate ET o in the Red River Valley. The external approach exempts using local data like spatial approaches in the current paper for simulating ET o values in a decisive way when local data is not available, reliable, or sufficient. Global comparison of the performance accuracy of the applied models revealed that the RF model was the best for all combinations among the four defined models. Furthermore, the RF model illustrated the best performance for spatial and local conditions for all input combinations. In general, the LR, GEP, and SVM models were improved when a local approach was used, except for the RF model, which was less accurate with a local approach. The radiation-based combination was the most accurate predictor among all models tested. As a result, this combination showed the lowest rate of improvement due to better performance in the first step.
The results showed that due to the flat topography of the study area with high wind speeds during the growing season, including the wind as a parameter to build the model architecture and estimate the ET o can increase the accuracy of the prediction. Besides, it might be more practical to apply the spatial RF model for stations with missing meteorological data without the need for local training. The recommended application of spatial RF using radiation combination allows for a more reliable estimate of ET o to fill the missing values for more precision water management purposes. Further research should confirm the current results in other geographical locations and for the various input combination methods.