Modeling Pan Evaporation Using Gaussian Process Regression, K-Nearest Neighbors, Random Forest, and Support Vector Machines: Comparative Analysis

Abstract: Evaporation is one of the most critical processes in agricultural, hydrological, and meteorological studies. Because of the interactions of multiple climatic factors, evaporation is a complex and nonlinear phenomenon to model, and machine learning methods have therefore gained popularity in this field. In the present study, four machine learning methods, Gaussian Process Regression (GPR), K-Nearest Neighbors (KNN), Random Forest (RF), and Support Vector Regression (SVR), were used to predict pan evaporation (PE). Meteorological data including PE, temperature (T), relative humidity (RH), wind speed (W), and sunny hours (S) were collected from 2011 through 2017. The accuracy of the studied methods was determined using the statistical indices of Root Mean Squared Error (RMSE), correlation coefficient (R), and Mean Absolute Error (MAE). Furthermore, Taylor diagrams were used to evaluate the accuracy of the models. The results showed that, at the Gonbad-e Kavus, Gorgan, and Bandar Torkaman stations respectively, the best models of each method reached RMSE values of 1.521, 1.244, and 1.254 mm/day for GPR; 1.991, 1.775, and 1.577 mm/day for KNN; 1.614, 1.337, and 1.316 mm/day for RF; and 1.55, 1.262, and 1.275 mm/day for SVR. GPR with input parameters T, W, and S at Gonbad-e Kavus station, and GPR with input parameters T, RH, W, and S at Gorgan and Bandar Torkaman stations, gave the most accurate predictions and are proposed for precise estimation of PE. The findings indicate that PE may be accurately estimated with a few easily measured meteorological parameters.

Their results revealed the high capability of the implemented firefly algorithm in decreasing the prediction error of the standalone ANFIS model at all studied stations. Khosravi et al. [24] examined the potential of five data mining and four ANFIS models for predicting reference evapotranspiration at two stations in Iraq. They stated that, for both stations, ANFIS-GA generated the most accurate predictions. Salih et al. [25] investigated the capability of co-ANFIS for predicting evaporation from reservoirs using meteorological parameters, and their findings indicated suitable accuracy of the co-ANFIS model in evaporation estimation. Recently, Feng et al. [26] examined the performance of two solar radiation-based models for estimating daily evaporation in different regions of China. They suggested that Stewart's model is preferable when meteorological data on sunny hours and air temperature are available. Therefore, it is possible to estimate evaporation through intrinsically nonlinear models. Qasem et al. [27] examined the applicability of wavelet support vector regression and wavelet artificial neural networks for predicting PE at the Tabriz and Antalya stations. Their results indicated that artificial neural networks performed better and that the wavelet transforms did not significantly reduce the prediction errors at either station. Yaseen et al. [28] predicted PE values using four machine learning models at two stations in Iraq. They reported that the SVM showed the best performance compared with the other studied methods.
The literature review shows that data mining methods have been applied suitably to estimating PE in different climates, but to the best of our knowledge, the application of Gaussian process regression has not been reported for estimating PE. Therefore, in the present study, the ability of four data mining methods, namely Gaussian process regression, support vector regression, K-nearest neighbors, and random forest, is studied for estimating PE rates using different combinations of meteorological parameters. The results are then compared and, using performance evaluation indices, the best method is identified for estimating evaporation in the humid regions of Iran.

Study Area
Golestan province is located southeast of the Caspian Sea, has an area of 20,387 km², and covers about 1.3% of the total area of Iran (Figure 1). The province has an average annual rainfall of 450 mm and lies between 36°25′ and 38°8′ north latitude and 53°50′ and 56°18′ east longitude. Because of its geographical location and topography, Golestan province is influenced by various climatic factors, and different climates are observed within it: a semi-arid climate along the international border and in the Atrak basin, moderate and semi-humid climates in the southern and western parts of the province, and a cold climate in the mountainous regions. The meteorological parameters used in the current research are temperature (T), relative humidity (RH), wind speed (W), sunny hours (S), and PE for the period 2011 to 2017.
A class A pan was used for the PE measurements because it is the internationally accepted reference. Table 1 presents the statistical parameters of all variables at the three studied stations, Gonbad-e Kavus, Gorgan, and Bandar Torkaman. As can be seen from Table 1, W has the highest skewness among the parameters. S and PE also show skewed distributions, whereas T and RH, with lower skewness values, are closer to normal distributions. Additionally, at all studied stations, T and S have the highest correlations with PE, and RH has an inverse correlation with PE.

Gaussian Process Regression (GPR)
A Gaussian process (GP) is defined as a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. Let X and Y be the input and output domains, respectively, from which n pairs $(x_i, y_i)$ are drawn independently and identically distributed. A Gaussian process on X is defined by a mean function $\mu: X \to \mathbb{R}$ and a covariance function $k: X \times X \to \mathbb{R}$. The primary assumption of GPR is that $y = f(x) + \zeta$, in which $\zeta$ is Gaussian noise with variance $\sigma^2$. In Gaussian process regression, there is a random variable $f(x)$ for each input $x$, which is the value of the random function $f$ at that location. In this study, the observation errors were assumed to be independent and identically distributed with zero mean and variance $\sigma^2$, and $f(x)$ was assumed to follow a zero-mean ($\mu(x) = 0$) Gaussian process on X with covariance function $k$, so that:

$$Y = (y_1, \ldots, y_n) \sim N(0, K + \sigma^2 I) \tag{1}$$

where $I$ is the identity matrix and $K_{ij} = k(x_i, x_j)$. Because $Y \mid X \sim N(0, K + \sigma^2 I)$ is normal, the conditional distribution of the test labels given the training and test data, $p(Y^* \mid Y, X, X^*)$, is also normal:

$$(Y^* \mid Y, X, X^*) \sim N(\mu, \Sigma) \tag{2}$$

with

$$\mu = K(X^*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} Y \tag{3}$$

$$\Sigma = K(X^*, X^*) + \sigma^2 I - K(X^*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, X^*) \tag{4}$$

Here X and Y are the training inputs and the training labels $y_i$, and $X^*$ is the test data. $K(X, X^*)$ is the $n \times n^*$ matrix of covariances evaluated at all pairs of training and test points, and $K(X, X)$, $K(X^*, X)$, and $K(X^*, X^*)$ are defined similarly. The covariance function is chosen so that it produces a positive semi-definite covariance matrix K, in which $K_{ij} = k(x_i, x_j)$. Once the kernel $k$ and the noise level $\sigma^2$ are specified, Equations (3) and (4) are sufficient for inference. Selecting an appropriate covariance function and its parameters is essential when training GPR models, because the covariance function $K(x, x')$ plays the central role in the model: it embeds the geometric structure of the training samples.
In other words, to generate precise predictions, the parameters of the mean and covariance functions, which are called hyperparameters, should be estimated from the data [29].
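
As an illustration of how the predictive mean and covariance of Equations (3) and (4) are evaluated, the following is a minimal pure-Python sketch for a single test point. It is not the software used in this study: the helper names (`rbf`, `solve`, `gp_predict`), the noise variance of 0.1, and the toy data are assumptions made for illustration, and the variance returned is that of the noisy test label (latent variance plus σ²). Only the kernel length scale of 3.0 is taken from the model settings reported in this paper.

```python
import math

def rbf(a, b, length_scale=3.0):
    """Radial basis (squared-exponential) kernel between two input vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_predict(X, y, x_star, noise_var=0.1):
    """Zero-mean GP: predictive mean and variance of the label at x_star."""
    n = len(X)
    # K(X, X) + sigma^2 I
    Ky = [[rbf(X[i], X[j]) + (noise_var if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    k_star = [rbf(x_star, X[i]) for i in range(n)]   # K(X*, X)
    alpha = solve(Ky, y)                             # [K + sigma^2 I]^-1 Y
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(Ky, k_star)                            # [K + sigma^2 I]^-1 K(X, X*)
    var = rbf(x_star, x_star) + noise_var - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var

# Toy data (hypothetical): temperature in deg C vs. pan evaporation in mm/day.
X_train = [[12.0], [18.0], [25.0], [31.0]]
y_train = [2.1, 3.8, 6.0, 8.2]
mean, var = gp_predict(X_train, y_train, [22.0])
print(f"predicted PE = {mean:.2f} mm/day, variance = {var:.2f}")
```

Because the prior mean is zero, predictions far from the training inputs fall back toward zero with a variance close to the prior value; in practice the targets would usually be centered before fitting.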

K-Nearest-Neighbor-IBK
Here, K-Nearest Neighbors (KNN) is implemented through instance-based learning with parameter k (IBK). In general, this algorithm is used for two purposes: (1) estimating the density function of the distribution of the data and (2) classifying the test data based on the training patterns. The first step in applying the algorithm is to choose a measure of the distance between the test and training data. The Euclidean distance is usually used for this purpose:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{5}$$

where X represents a training sample with specified parameters ($x_1$ to $x_n$), and Y represents a test sample with the same number of parameters ($y_1$ to $y_n$).
After the Euclidean distances are determined, the database samples are sorted in ascending order from the smallest distance (maximum similarity) to the largest distance (minimum similarity). The next step is to choose the number of neighboring points (k) used to estimate the characteristics of the query sample. Determining the number of neighbors is one of the most critical steps, because the efficiency of the method depends considerably on selecting the closest (most similar) samples from the reference database. If k is too small, the results are sensitive to individual anomalous points; if k is too large, points belonging to other classes may fall within the neighborhood. Usually, the best value of k is determined using cross-validation [30].
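
The procedure described above (compute distances, sort in ascending order, keep the k closest samples, and choose k by cross-validation) can be sketched as follows. This is an illustrative sketch rather than the IBK implementation used in this study: the function names, the unweighted averaging of the neighbors' target values, the leave-one-out scheme, and the toy data are all assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two samples with the same number of parameters."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_X, train_y, query, k=5):
    """Average the targets of the k training samples closest to the query."""
    # Sort training samples from smallest distance (most similar) to largest.
    ranked = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

def loo_rmse(X, y, k):
    """Leave-one-out cross-validated RMSE for a given number of neighbors k."""
    sq_errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        sq_errors.append((knn_predict(rest_X, rest_y, X[i], k) - y[i]) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Toy data (hypothetical): one input parameter and a linear response.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.0, 1.0, 2.0, 3.0, 4.0]
best_k = min([1, 2, 3, 4], key=lambda k: loo_rmse(X, y, k))
print("k with the lowest leave-one-out RMSE:", best_k)
```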

Random Forest (RF)
Random forest (RF) uses classification and regression trees (CART) as its base learning algorithm. An RF is an ensemble of decision trees, in each of which the space of the variables is divided into smaller subspaces so that the data in each region are as homogeneous as possible. This partitioning is represented by a decision tree structure, in which a point where a branch splits into two sub-branches is called a node. The first node of the tree is called the root, and the terminal nodes are the leaves [31]. In RF, each tree is grown on a bootstrap sample of the original data, and the best split at each node is searched among m randomly selected variables [32]. In the RF method, the dissimilarity between data points is determined in a completely different way from the usual distance functions: similarity is measured by how often points fall in the same leaves (the final subspaces). The similarity between points i and j, s(i, j), is defined as the fraction of trees in which the two points fall in the same leaf. The random forest similarity matrix is symmetric and positive, and it is converted into a dissimilarity matrix by the following transformation:

$$d(i, j) = \sqrt{1 - s(i, j)} \tag{6}$$

Because the construction of the trees does not depend on the scale of the variables, the random forest dissimilarity can be applied to all types of variables [33]. The splitting procedure is repeated in each tree until a predefined stopping condition is reached.
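
As a small worked example of the leaf-based similarity described above, the sketch below computes s(i, j) from a table of leaf assignments and converts it to a dissimilarity, assuming the commonly used transformation d(i, j) = sqrt(1 − s(i, j)). The leaf assignments, the function names, and the use of all trees (rather than only those in which both points are out-of-bag) are illustrative assumptions.

```python
import math

def rf_similarity(leaf_ids, i, j):
    """Fraction of trees in which samples i and j end up in the same leaf."""
    same_leaf = sum(1 for tree in leaf_ids if tree[i] == tree[j])
    return same_leaf / len(leaf_ids)

def rf_dissimilarity(leaf_ids, i, j):
    """Dissimilarity derived from the similarity, assuming d = sqrt(1 - s)."""
    return math.sqrt(1.0 - rf_similarity(leaf_ids, i, j))

# Hypothetical leaf assignments: leaf_ids[t][n] is the leaf of sample n in tree t.
leaf_ids = [
    [0, 0, 1],
    [2, 3, 3],
    [1, 1, 1],
    [0, 0, 2],
]
print(rf_similarity(leaf_ids, 0, 1))     # samples 0 and 1 share a leaf in 3 of 4 trees
print(rf_dissimilarity(leaf_ids, 0, 1))
```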

Support Vector Regression (SVR)
The support vector machine is a learning method introduced by Boser et al. [34] on the basis of statistical learning theory. In the following years, they introduced the theory of the optimum hyperplane for linear classifiers and extended it to nonlinear classifiers by means of kernel functions [35]. Support vector machine models are divided into two main groups: (a) support vector classifiers and (b) support vector regression models. A support vector classifier is used to assign data to different classes, whereas the SVR model is used for prediction problems. Regression aims to obtain a hyperplane that is fitted to the given data, and the distance of any point from this hyperplane indicates the error of that point. The method usually suggested for linear regression is least squares. However, in the presence of outliers the least-squares estimator may be unreliable, and the resulting regressor performs poorly. Therefore, a robust estimator that is not sensitive to small variations in the model should be developed. For this purpose, an ε-insensitive penalty function is defined as follows [36]:

$$L_\varepsilon(y) = \begin{cases} 0, & |f(x) - y| \le \varepsilon \\ |f(x) - y| - \varepsilon, & \text{otherwise} \end{cases} \tag{7}$$

The training data set is $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and the class of functions is $f(x) = w^T x + b$, where $w$ is the weight vector and $b \in \mathbb{R}$. If a data point deviates from the fitted function by more than ε, a slack variable must be defined according to the size of the deviation. In accordance with the penalty function, the minimization problem is defined as follows:

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{subject to} \quad \begin{cases} y_i - w^T x_i - b \le \varepsilon + \xi_i \\ w^T x_i + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \tag{8}$$

where $\|w\|^2$ is the squared norm of the weight vector, $\xi_i$ and $\xi_i^*$ are the slack variables, and the parameter C is the trade-off coefficient between the complexity of the machine and the deviations larger than ε, obtained by trial and error.
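
To make the role of ε and C concrete, the sketch below evaluates the ε-insensitive penalty and the corresponding SVR objective for a fixed linear model. It only evaluates the objective and does not solve the optimization problem; at the optimum the sum of the two slack variables of a point equals its ε-insensitive penalty, which is what the code uses. The function names and the numbers are illustrative assumptions.

```python
def eps_insensitive(y_true, y_pred, eps):
    """Penalty is zero inside the eps-tube and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_objective(w, b, X, y, C, eps):
    """0.5 * ||w||^2 + C * (sum of eps-insensitive penalties) for f(x) = w.x + b."""
    regularization = 0.5 * sum(wj * wj for wj in w)
    total_slack = 0.0
    for xi, yi in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total_slack += eps_insensitive(yi, prediction, eps)
    return regularization + C * total_slack

# Hypothetical one-dimensional example with f(x) = 2x.
X = [[1.0], [2.0]]
y = [2.0, 5.0]
print(svr_objective(w=[2.0], b=0.0, X=X, y=y, C=10.0, eps=0.5))
```

Only the second point lies outside the ε-tube (its residual is 1.0 against ε = 0.5), so it alone contributes to the penalty term; a larger C would weight that deviation more heavily against the flatness term.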

Model Development
To develop the GPR, IBK, RF, and SVR models, different model settings were examined by trial and error. For the GPR model, a radial basis kernel function was used with a kernel length scale of 3.0 and a kernel bias of 1.0, and the maximum number of basis vectors was set to 100. In the IBK algorithm, k is typically a small, positive, odd integer, so k was set to 5. For RF, the number of trees was 100, the maximum depth was 10, and the subset ratio was 0.2. For SVR, a radial basis kernel function was used with a kernel parameter gamma of 1.0; the maximum number of iterations and the convergence epsilon were 100,000 and 0.001, respectively.

Evaluation Parameters
The errors between the computed and observed data are evaluated by the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), and the correlation coefficient (R), defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2} \tag{9}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |x_i - y_i| \tag{10}$$

$$R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{11}$$

where $x_i$ and $y_i$ are the observed and estimated PE, $\bar{x}$ and $\bar{y}$ are their means, and n is the number of observations. Additionally, Taylor diagrams were employed to inspect the accuracy of the implemented models. Notably, a Taylor diagram presents the observations and several corresponding statistics at the same time: different points on a polar plot are used to investigate the differences between observed and estimated values, with R and the normalized standard deviation indicated by the azimuth angle and the radial distance from the origin, respectively [37].
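
The three indices can be computed directly from paired observed and estimated values using their standard definitions, as in the following minimal sketch. The function names and the sample values are illustrative assumptions and are not data from this study.

```python
import math

def rmse(obs, est):
    """Root mean square error between observed and estimated values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(obs, est)) / len(obs))

def mae(obs, est):
    """Mean absolute error between observed and estimated values."""
    return sum(abs(x - y) for x, y in zip(obs, est)) / len(obs)

def pearson_r(obs, est):
    """Pearson correlation coefficient between observed and estimated values."""
    mean_obs = sum(obs) / len(obs)
    mean_est = sum(est) / len(est)
    cov = sum((x - mean_obs) * (y - mean_est) for x, y in zip(obs, est))
    ss_obs = sum((x - mean_obs) ** 2 for x in obs)
    ss_est = sum((y - mean_est) ** 2 for y in est)
    return cov / math.sqrt(ss_obs * ss_est)

# Hypothetical observed and estimated PE values in mm/day.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [2.0, 3.0, 4.0, 5.0]
print(rmse(observed, estimated), mae(observed, estimated), pearson_r(observed, estimated))
```

In this example the estimates are offset from the observations by a constant 1 mm/day, so RMSE and MAE are both 1 while R equals 1; this is why R alone cannot reveal a systematic bias and is reported together with the error indices.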

Results and Discussion
Before the GPR, IBK, RF, and SVR models were applied to PE estimation, a preliminary statistical analysis was performed for the three stations; the results are presented in Table 2. According to Table 2, the trend and outlier tests confirmed the null hypothesis H₀, meaning that the data are free of trends and outliers. Then, to evaluate the possibility of using different combinations of meteorological data, seven scenarios with different meteorological inputs were defined for a more accurate estimation of PE (Table 3). These input combinations were established according to the correlation coefficients of the meteorological parameters with PE. T, which has the highest correlation coefficient, was considered a necessary parameter for PE estimation and is therefore present in all combinations, while the other parameters were added to it in different combinations. These input combinations were then used in the data mining methods to estimate evaporation at the three stations of Gonbad-e Kavus, Gorgan, and Bandar Torkaman. There is no straightforward guideline for splitting the data into training and testing sets in machine learning modeling [38][39][40][41][42][43][44][45][46]. For instance, Choubin [47] used 63% of the data for model development, Qasem et al. [48] used 67%, Asadi et al. [41], Samadianfard et al. [49,50], and Dodangeh et al. [51] used 70%, and Zounemat-Kermani et al. [52] used 80% of the total data to develop their models. Thus, to develop the models for PE estimation, we divided the data into training (70%) and testing (30%) sets: data from 2011-2015 were used for training, and the remaining data from 2016-2017 were used for testing the implemented models. The general results of the computations for the defined scenarios and the above-mentioned methods are presented in Table 4.
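
The chronological split described above (2011-2015 for training, 2016-2017 for testing) can be sketched as follows. The record layout, the field names, and the function name are hypothetical; one placeholder record per year stands in for the daily series, so the printed share reflects whole years (5 of 7) rather than the exact number of daily observations.

```python
def split_by_year(records, last_train_year=2015):
    """Chronological split: years up to last_train_year train, later years test."""
    train = [r for r in records if r["year"] <= last_train_year]
    test = [r for r in records if r["year"] > last_train_year]
    return train, test

# Hypothetical records: one placeholder entry per year of the 2011-2017 period.
records = [{"year": year, "PE": None} for year in range(2011, 2018)]
train, test = split_by_year(records)
train_share = len(train) / len(records)
print(len(train), len(test), round(train_share, 2))
```

Splitting by time rather than at random keeps the test period strictly after the training period, so the models are evaluated on years they have not seen.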
The results in Table 4 show that, for GPR at Gonbad-e Kavus station, GPR6 with R = 0.899, MAE = 1.128 mm/day, and RMSE = 1.521 mm/day has lower error than the other GPR combinations. GPR7, with R = 0.904, MAE = 1.134 mm/day, and RMSE = 1.530 mm/day, gave the next most accurate estimates of PE but requires more meteorological parameters. On the other hand, GPR3, which uses only the two parameters T and S, reached R = 0.894, MAE = 1.153 mm/day, and RMSE = 1.550 mm/day, an accuracy close to that of GPR6. Because it needs fewer inputs, GPR3 can be used in the case of data deficiency with an acceptable error and high reliability. At Gorgan station, GPR7 with the meteorological inputs T, RH, W, and S has the lowest error (RMSE = 1.244 mm/day, MAE = 0.958 mm/day, and R = 0.901) and was selected as the most accurate GPR model. GPR6 (RMSE = 1.265 mm/day, MAE = 0.965 mm/day, and R = 0.897) and GPR4 (RMSE = 1.265 mm/day, MAE = 0.980 mm/day, and R = 0.897), with slightly higher errors than GPR7, rank second. At Bandar Torkaman station, GPR7 with RMSE = 1.254 mm/day, MAE = 0.946 mm/day, and R = 0.912 was selected as the superior GPR model, followed by GPR6 with RMSE = 1.257 mm/day, MAE = 0.939 mm/day, and R = 0.912, which also gave precise estimates.
For the nearest-neighbor method at Gonbad-e Kavus station, IBK4 with R = 0.810, MAE = 1.513 mm/day, and RMSE = 1.991 mm/day performed better than the other models. This scenario has a higher MAE than IBK6; however, because of its lower RMSE and higher R, it can be regarded as the best nearest-neighbor model for evaporation estimation at Gonbad-e Kavus station. IBK6, with R = 0.809, MAE = 1.507 mm/day, and RMSE = 1.994 mm/day, also performed acceptably. At Gorgan station, IBK6 with RMSE = 1.775 mm/day, MAE = 1.34 mm/day, and R = 0.808 had the lowest error and was selected as the superior IBK model. Furthermore, IBK5 and IBK6 can be used as the second and third ranks, respectively. At Bandar Torkaman station, IBK7 with RMSE = 1.577 mm/day, MAE = 1.17 mm/day, and R = 0.885 gave the best result among the IBK models, followed by IBK4 with RMSE = 1.737 mm/day, MAE = 1.285 mm/day, and R = 0.833, which gave acceptable estimates.
For the RF method at Gonbad-e Kavus station, RF7 with the lowest RMSE = 1.614 mm/day, the lowest MAE = 1.999 mm/day, and the highest R = 0.886 gave the best results among the RF models. It was followed by RF6, with RMSE = 1.621 mm/day, MAE = 1.225 mm/day, and R = 0.879, which gave relatively precise estimates. At Gorgan station, RF7 with the lowest error (R = 0.885, MAE = 1.011 mm/day, and RMSE = 1.337 mm/day) performed best among the RF models. At Bandar Torkaman station, RF7 with RMSE = 1.316 mm/day, MAE = 0.980 mm/day, and R = 0.903 was also identified as the superior RF model because of its lowest error; however, as mentioned before, RF7 requires more meteorological data to develop an accurate relationship between PE and the meteorological inputs. As with the GPR method, RF6 with RMSE = 1.349 mm/day, MAE = 1.007 mm/day, and R = 0.90 gave the best results after RF7. Although RF6 has a slightly higher error than RF7, it is regarded as the optimum RF model because it uses fewer meteorological inputs.
According to the computations, among the kernel functions tested in the SVR models, the Pearson function provided the best results. At Gonbad-e Kavus station, SVR6 with the lowest error (R = 0.895, MAE = 1.129 mm/day, and RMSE = 1.55 mm/day) performed best among the SVR models. SVR7 has the next lowest error, but it is not recommended because it uses more meteorological parameters than the other models. On the other hand, SVR3, which uses only the two parameters T and S, remains close in accuracy (R = 0.892, MAE = 1.154 mm/day, and RMSE = 1.574 mm/day). At Gorgan station, SVR7 with RMSE = 1.262 mm/day, MAE = 0.958 mm/day, and R = 0.898 has the lowest error and was considered the superior SVR model; however, as previously mentioned, SVR7 requires more meteorological parameters to make accurate estimations of PE. At Bandar Torkaman station, SVR6 with the lowest RMSE = 1.275 mm/day and MAE = 0.943 mm/day and the highest R = 0.911 yielded the best result among the SVR models. On the other hand, SVR2, which uses only the two parameters T and W, can be used with acceptable reliability and error in case of data deficiency. The differences among the obtained results may be due to the data division, input variability, and model parameter optimization. Figure 2 compares the predicted evaporation of the superior models (GPR, IBK, RF, and SVR) with the observed evaporation over one year of the testing period, and the corresponding scatter plots are shown in Figure 3. Figure 2 shows that the estimates of GPR6 at Gonbad-e Kavus station and of GPR7 at both Gorgan and Bandar Torkaman stations are in better agreement with the observed PE. Similarly, Figure 3 indicates that the estimates of GPR6 and GPR7 are less scattered around the bisector line, so these models are preferred at the corresponding stations.
Moreover, the observed PE values and those estimated by the best GPR, IBK, RF, and SVR models were assessed individually for each station (Figure 4). Taylor diagrams, presented in Figure 4, are practical tools for understanding the different capabilities of the studied models. In a Taylor diagram, the most accurate model is indicated by the point with the lowest RMSE and the highest R. The Taylor diagrams confirmed that GPR6 at Gonbad-e Kavus and GPR7 at both Gorgan and Bandar Torkaman stations performed best and gave the most accurate predictions of PE. Taking the results and the above interpretations together, it can be concluded that the meteorological parameters T, W, and S play the most crucial role in increasing the accuracy of PE estimation. Finally, GPR6 with input parameters T, W, and S at Gonbad-e Kavus station, and GPR7 with input parameters T, RH, W, and S at Gorgan and Bandar Torkaman stations, perform best and are considered the most accurate models for estimating PE. It should be noted that these results are based on the meteorological parameters of the studied stations over a certain period and may change in different climate zones. A comparison with the findings of Ghorbani et al. [22] showed that the accuracy of GPR7 at Gorgan station, the most precise model with an RMSE of 1.244 mm/day, was better than that of the MLP-FFA model developed by Ghorbani et al. [22] at Manjil station and lower than that of the same MLP-FFA model at Talesh station, both located in the north of Iran.
As a concluding remark, the results indicated that the accuracy of the machine learning methods was satisfactory when using the meteorological parameters T, W, and S. Furthermore, although the accuracies of the different machine learning methods vary among the three studied stations, GPR and SVR performed better than the other examined models. Moreover, it should be noted that the accuracy and performance of these machine learning methods are not constant across different climates and regions. Therefore, to increase the applicability of machine learning methods in PE estimation, developing a general model for a homogeneous climatic region is recommended for future research.

Conclusions
Evaporation is of particular importance in agriculture, hydrology, and water and soil conservation studies. In this study, the GPR, IBK, RF, and SVR machine learning methods were used to simulate daily PE at three stations, Gonbad-e Kavus, Gorgan, and Bandar Torkaman, located in Golestan province (Iran). The results showed that GPR6 at Gonbad-e Kavus station and GPR7 at Gorgan and Bandar Torkaman stations have the lowest estimation errors and higher accuracy than the other studied models. In other words, GPR can estimate PE with high accuracy using the meteorological parameters T, W, and S at Gonbad-e Kavus station and the meteorological parameters T, RH, W, and S at Gorgan and Bandar Torkaman stations. Overall, the results demonstrated the superiority of the GPR method in PE estimation, and GPR is recommended for PE estimation with a high degree of reliability.
In their optimum configurations at the Gonbad-e Kavus, Gorgan, and Bandar Torkaman stations, respectively, GPR reached RMSE values of 1.521, 1.244, and 1.254 mm/day, KNN 1.991, 1.775, and 1.577 mm/day, RF 1.614, 1.337, and 1.316 mm/day, and SVR 1.55, 1.262, and 1.275 mm/day. GPR with input parameters T, W, and S at Gonbad-e Kavus station, and GPR with input parameters T, RH, W, and S at Gorgan and Bandar Torkaman stations, gave the most accurate estimates and are proposed for precise estimation of PE. The findings of the current study therefore indicate that PE may be estimated with suitable accuracy using a few easily measured meteorological parameters in similar climates. These conclusions should be verified in different climates to evaluate the accuracy of the considered models there.
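
The ranking implied by the error values reported above can be checked with a few lines of Python. The numbers are the best RMSE values (mm/day) of each method at each station as stated in this paper; the dictionary layout and the function name are illustrative assumptions.

```python
# Best RMSE (mm/day) per method and station, as reported in the text.
rmse_mm_per_day = {
    "Gonbad-e Kavus": {"GPR": 1.521, "KNN": 1.991, "RF": 1.614, "SVR": 1.55},
    "Gorgan": {"GPR": 1.244, "KNN": 1.775, "RF": 1.337, "SVR": 1.262},
    "Bandar Torkaman": {"GPR": 1.254, "KNN": 1.577, "RF": 1.316, "SVR": 1.275},
}

def rank_models(scores):
    """Return method names ordered from lowest to highest RMSE."""
    return sorted(scores, key=scores.get)

for station, scores in rmse_mm_per_day.items():
    print(station, rank_models(scores))
```

At all three stations the order from lowest to highest RMSE is GPR, SVR, RF, KNN, which agrees with the observation that GPR and SVR performed better than the other examined models.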