1. Introduction
Evapotranspiration is essential information in agriculture. The agriculture sector is a major water consumer in most countries. The proportion of water withdrawn for agriculture in developing countries is estimated at nearly 81%, while agriculture accounts for 71% of water withdrawal globally. Information on evapotranspiration is important for estimating crop water requirements and irrigation water requirements and for controlling several hydrological processes [1,2,3]. Evapotranspiration (ET) is an agrometeorological parameter that can be measured using a lysimeter or a water balance approach. These measurement methods are not always feasible: both are time-consuming and require precisely and carefully planned experiments [4]. Therefore, evapotranspiration estimation methods are very important, and an adequate meteorological database is necessary to achieve good estimates [5,6].
The concept of evapotranspiration is related to the transfer rate of water from the soil–plant system to the atmosphere. In this study, we focus on reference evapotranspiration (ET0), which is the rate of water consumption from a reference crop surface (grass or alfalfa). ET0 can be used for a large area, e.g., for climatic classification of a region [7,8], or for small areas, e.g., for obtaining crop water requirements or crop evapotranspiration (ETc) [9,10,11]. The standard model used today for reference evapotranspiration estimation is the Penman–Monteith evapotranspiration model. This model is considered more physically realistic, but it requires more meteorological variables than other methods [8]. This dependence on several meteorological variables, combined with the limitations of weather station networks and interruptions and errors in weather databases, makes it difficult to determine ET0 with this model. Thus, alternative models are used to estimate ET0. These models seek high predictive power with less dependence on weather inputs.
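To make the input burden of the Penman–Monteith model concrete, the following is a minimal Python sketch of the daily FAO-56 form of the equation. It is an illustration under stated assumptions, not the procedure used in this study: the function and parameter names are hypothetical, the inputs (net radiation, soil heat flux, wind speed at 2 m, vapour pressures, psychrometric constant) are assumed to be already available in the units noted in the comments, and the numeric values are purely illustrative.

```python
import math


def saturation_vapour_pressure(t_celsius):
    """Saturation vapour pressure (kPa) at air temperature t_celsius (deg C)."""
    return 0.6108 * math.exp(17.27 * t_celsius / (t_celsius + 237.3))


def slope_vapour_pressure_curve(t_celsius):
    """Slope of the saturation vapour pressure curve (kPa per deg C)."""
    return 4098.0 * saturation_vapour_pressure(t_celsius) / (t_celsius + 237.3) ** 2


def fao56_et0(t_mean, rn, g, u2, es, ea, gamma):
    """Daily reference evapotranspiration (mm/day), FAO-56 Penman-Monteith form.

    t_mean: mean air temperature (deg C)
    rn, g:  net radiation and soil heat flux (MJ m-2 day-1)
    u2:     wind speed at 2 m height (m/s)
    es, ea: saturation and actual vapour pressure (kPa)
    gamma:  psychrometric constant (kPa per deg C)
    """
    delta = slope_vapour_pressure_curve(t_mean)
    radiation_term = 0.408 * delta * (rn - g)
    aerodynamic_term = gamma * (900.0 / (t_mean + 273.0)) * u2 * (es - ea)
    return (radiation_term + aerodynamic_term) / (delta + gamma * (1.0 + 0.34 * u2))


# Illustrative inputs only (assumed values, not data from this study).
et0 = fao56_et0(t_mean=16.9, rn=13.28, g=0.0, u2=2.078, es=1.997, ea=1.409, gamma=0.0666)
print(f"ET0 = {et0:.2f} mm/day")
```

Even in this compact form, the calculation needs radiation, wind, and humidity terms in addition to temperature, which is the dependence that motivates models with fewer inputs.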
Among the models used in the literature, this study focused on the following: artificial neural network (ANN), random forest (RF), support vector machine (SVM), and multiple linear regression (MLR) models. These models show different levels of predictive capacity for different meteorological variables and in other fields of science [11,12,13,14]. ANN, RF, and SVM models can capture complex relationships between input and output data, which makes them powerful modeling tools. These machine-learning models have been successfully used to estimate ET0 with fewer input meteorological data [12,15,16]. Although some studies show that MLR cannot handle non-linear relationships between dependent and independent variables, MLR has also been successfully used to estimate ET0 [13,17].
Considering the models, ANN is a promising and effective tool for non-linear modeling and complex time series. An ANN’s architecture is composed of three layers (input, hidden, and output layers), and each layer includes an array of processing elements [6,12,16]. Several papers have shown the excellent predictive capacity of ANN models with different architectures in ET0 studies [14,15,18]. The RF model is a non-parametric, decision-tree-based statistical modeling method. RF is a classification and regression technique that has also been adopted to predict agrometeorological parameters such as ET0 [15,19,20]. RF has been found to be a more efficient prediction tool than other tools such as ANN [11,21]. SVM is a supervised machine-learning algorithm developed by [22]. SVM is used for regression, classification, pattern recognition, and forecasting. This model has been used in meteorological variable estimation and has shown high predictive power [23,24]. MLR aims at explaining the relationship between a dependent variable and more than one independent predictor variable by means of a linear combination of the predictors. This regression technique has been adopted in several fields of science, including climatology, hydrology, and irrigation, with varying performance [17].
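As an illustration of the MLR idea described above (a linear combination of several predictors), the sketch below fits an ordinary least-squares model in plain Python by solving the normal equations. The function names, the two predictors (Tmean and RH), and the synthetic data are assumptions made for the example; this is not the implementation or the data used in this study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring the row with the largest pivot candidate into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mlr(x_rows, y):
    """Return [intercept, b1, b2, ...] minimising the squared error."""
    design = [[1.0] + list(row) for row in x_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * target for r, target in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coefs, row):
    """Evaluate the fitted linear combination for one record."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic (Tmean, RH) records; the response follows an assumed exact linear law.
rows = [(18.0, 80.0), (22.0, 60.0), (25.0, 75.0), (20.0, 55.0), (27.0, 70.0)]
y = [1.0 + 0.2 * t - 0.03 * rh for t, rh in rows]

coefs = fit_mlr(rows, y)
print([round(c, 4) for c in coefs])
```

Because the synthetic response is exactly linear, the fit recovers the assumed coefficients; with real ET0 data, any non-linear dependence would remain in the residuals, which is the limitation of MLR noted in the text.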
The literature on evapotranspiration is so extensive that even a partial review is impractical in this context. Some remarkable recent contributions are due to [25,26,27,28,29,30,31,32]. This paper focuses on ET0 estimation in the Minas Gerais state, Brazil, using different models. Agriculture has an important role in this region, and ET0 estimation on a monthly scale is extremely important for the agricultural chain. Among its main applications are the following: (i) climatic classification of a region, which is fundamental in the zoning of climatic risk in agricultural regions; (ii) hydrological processes, since knowledge of evapotranspiration is fundamental to the hydrological cycle and, consequently, to all studies related to hydrology and water resources; (iii) crop water requirements or crop evapotranspiration (ETc), which is essential information in planning and implementing irrigation projects (i.e., determining the water demand of a given crop during the months of the year); and (iv) agrometeorological modeling, since several models use ET data as an input variable for estimating productivity and other important variables. This study also presents a relevant and innovative contribution by evaluating the evapotranspiration estimates under different climatic scenarios for the same state. For regions that cover an extremely large area (such as the Minas Gerais state), there may be a trade-off between the generalization capacity and the performance of the developed models. Therefore, spatial partitioning of the data, which aims to achieve the highest efficiency for the evaluated models, becomes relevant for the study of different climatic scenarios.
Considering that the presence of gaps or discontinuities in the meteorological data series can delay the state of development, this study proposes to analyze the use of different combinations of input data and climate scenarios for the accurate estimation of ET0, especially with the minimum possible use of input data, which can facilitate the estimation of ET0. The hypothesis of this study is that models based on machine learning are an efficient tool for estimating evapotranspiration, even under conditions of limited climatic data.
ET0 calculated by the FAO Penman–Monteith method requires several input data. This amount of input data makes it difficult to use this method. New technologies can make it easier to obtain ET0 reliably. In this context, the aim of this study was to develop, evaluate, and compare the performance of ANN, RF, SVM, and MLR models in estimating ET0 with four different combinations of input data in three climate scenarios.
3. Results and Discussion
The results presented in this study are essential for more adequate water management, since accurate estimation of ET0 is fundamental for water demand quantification. Moreover, the use of different estimation techniques and combinations of input data in the models allowed us to obtain important results at different spatial scales. It must be noted that while daily ET0 values are useful for conducting irrigation, monthly ET0 provides an overview of how much water is required to maintain plant health over a longer period, such as a month or growth cycle. Monthly ET0 is particularly valuable in irrigation planning, as it helps water managers, designers, development planners, and farmers estimate the total water requirements for a successful harvest and make accurate decisions.
According to the results, linear correlations were observed between the input data and ET0, with the variables Tmean, Tmax, and Tmin showing the best correlation (Figure 3). The other variables have a low (lat, alt, and RH) or no (lon and month) correlation with ET0. Behavior inversely proportional to ET0 was observed for the lat, alt, and RH variables. Higher latitudes tend to be cooler regions, with less energy available for the ET0 process. An increase in altitude also results in a decrease in temperature according to the vertical thermal gradient in the troposphere. An increase in RH reduces the vapour-pressure gradient between the surface and the air, decreasing the water transfer rate from the soil–plant system to the atmosphere. In contrast, directly proportional behavior was observed between the Tmean, Tmax, and Tmin variables and ET0. An increase in Tmean, Tmax, or Tmin results in more energy being available for ET0. The authors of [14] observed the same behavior in the variables Tmean, Tmax, Tmin, and RH when estimating ET0. The variables Tmean, Tmax, and Tmin were all highly correlated with ET0, and mean RH was the least correlated variable.
In this way, the capability of machine-learning approaches using the variables mentioned above was investigated in different conditions and scenarios. The ANN, RF, SVM, and MLR statistical performance indicators for estimating ET0 in any location within the Minas Gerais state (SI: data from the 56 climatological stations, 100% of the input data available) are presented in Table 3.
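The performance indicators compared in this section (r, MAE, and RMSE) can be computed from paired observed and estimated values as in the following minimal sketch. The function names and the sample values are illustrative assumptions and do not reproduce any entry of the tables.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical observed and estimated ET0 values (illustrative only).
observed = [3.0, 4.0, 5.0, 6.0]
predicted = [3.2, 3.9, 5.3, 5.8]

print(f"r = {pearson_r(observed, predicted):.3f}")
print(f"MAE = {mae(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
```

MAE and RMSE are in the units of ET0, and RMSE penalises large individual errors more heavily, so the two can rank models differently even when r is similar.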
All the models developed with the I8 and I6 input combinations exhibited better performance than versions developed with I3 and I2. The lowest predictive capacity was observed when the RF model was used with the I8 input combination. The greatest predictive capacity, in SI, was observed when the RF and ANN models were used with the I6 and I8 input combinations, respectively. The SVM and MLR models exhibited better performance than ANN and RF when only Tmean and RHmean (I2) were used as input data.
When comparing combination I8 with I6, the average r, MAE, and RMSE values for all models do not show high variation. Removal of the geographic coordinates (I6 to I3) resulted in greater performance reduction for the SVM and MLR models. The greatest impact on performance was observed for ANN and RF when the month variable was removed (I3 to I2). Average r decreased by 8%; MAE and RMSE increased by 52.2% and 43.9%, respectively. The removal of month did not impact the SVM and MLR models’ performance.
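The four input combinations can be written down as a small configuration. The membership below is reconstructed from the descriptions in this section (I6 drops Tmax and Tmin, I3 additionally drops the geographic variables lat, lon, and alt, and I2 keeps only Tmean and RH) and should be read as an assumption; the helper name and the sample record are hypothetical.

```python
# Input combinations as inferred from the text of this section (an assumption).
INPUT_COMBINATIONS = {
    "I8": ["lat", "lon", "alt", "month", "Tmean", "Tmax", "Tmin", "RH"],
    "I6": ["lat", "lon", "alt", "month", "Tmean", "RH"],
    "I3": ["month", "Tmean", "RH"],
    "I2": ["Tmean", "RH"],
}


def select_inputs(record, combination):
    """Return the feature vector of one record for the given combination."""
    return [record[name] for name in INPUT_COMBINATIONS[combination]]


# Hypothetical monthly record for one station (illustrative values only).
record = {
    "lat": -19.9, "lon": -43.9, "alt": 915.0, "month": 7,
    "Tmean": 19.1, "Tmax": 25.6, "Tmin": 13.9, "RH": 65.0,
}

for name in INPUT_COMBINATIONS:
    print(name, select_inputs(record, name))
```

Keeping the combinations in one mapping makes it straightforward to train each model once per combination and compare the indicators across I8, I6, I3, and I2.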
In the case of SII, a scenario in which the state of Minas Gerais was divided into two areas (Tho1 and Tho2), the statistical performance indicators of the models used in ET0 estimation are shown in Table 4.
Tho1 and Tho2 had 48.2% and 51.8%, respectively, of the data available as input data. The highest predictive capacities in the Tho1 and Tho2 areas were observed when the ANN model was used with the I8 input combination and the RF model was used with the I6 input combination, respectively. The removal of Tmax and Tmin input data (I6) did not increase the models’ predictive capacities in the Tho1 area, except for the RF model. This behavior is similar to that observed in SI. However, all models performed better in the Tho2 area when the I6 input combination was used.
Removal of the month variable (I3 to I2) resulted in the greatest impact on the ANN and RF models’ quality. When comparing combination I8 with I3, the average r values of the ANN and RF models decreased by 7.2% and 5.7%, respectively. The MAE values of the ANN and RF models increased by 36.4% and 31.6%, respectively. However, no significant variation was observed in the performance of the SVM and MLR models.
In the case of SIII, in which the Minas Gerais state was divided into areas K1 and K2, characterized by 62.5% and 37.5% of the climatological stations, respectively, the statistical performance indicators of the models are presented in Table 5.
In general, the ANN and RF models were better than the SVM and MLR models with the input combinations I8, I6, and I3. When the I2 combination was used, the SVM and MLR models were superior. The model with the highest predictive capacity in the K1 area was ANN with the I8 input combination. The RF model with the I6 input combination showed the highest predictive capacity in the K2 area.
In the K1 area, removal of the month variable resulted in the greatest impact on the ANN and RF models’ performance. Removal of the alt, lat, and lon variables resulted in the highest impact on the SVM and MLR models’ performance. In the K2 area, the behavior of RF, SVM, and MLR was similar to that observed in the K1 area. However, withdrawal of the alt, lat, and lon variables resulted in the highest impact on ANN in the K2 area.
The ANN and RF models showed greater predictive capacity in all scenarios when compared with the SVM and MLR models. This high capacity is achieved with the data input combinations I8 and I6. Both models had similar performance, but, on average, RF showed slight superiority. In [12,15], the authors evaluated the performance of different machine-learning models in ET0 estimation in Brazil. In these studies, it was observed that, in general, ANN performed slightly better than the other traditional machine-learning models (i.e., RF and extreme gradient boosting, XGBoost). However, in some studies, the RF model performed slightly better than other models (i.e., generalized regression neural networks, GRNN) in estimating ET0 [20,45]. Other papers suggest that different machine-learning models perform best in different situations and regions [24,46]. Therefore, there is a need for studies that address more than one model.
The SVM and MLR models showed similar statistical indices and responses in all scenarios. These results can be explained by the use of the linear kernel function in SVM, which probably presented behavior similar to an MLR. Tests with the nonlinear kernel function did not result in improvements in prediction. Possibly, the data used does not present complexity that justifies the use of SVM.
The SVM and MLR models showed greater predictive capacity in all scenarios when the input data were limited to only Tmean and RH (I2). This result may indicate a low predictive capacity of the ANN and RF models in situations of low variability in the input data. This low variability may hinder the search for patterns that justify variations in ET0.
In some scenarios, the removal of Tmax and Tmin improved the ET0 estimation results. The authors of [14] observed an increase in the accuracy of the support vector regression (SVR) and Gaussian process regression (GPR) models with the removal of some input data, including Tmax and Tmin.
Although Tmax and Tmin showed a good correlation with ET0 (Figure 3), the weight of Tmax and Tmin is diluted in the calculation of Tmean used in the calculation of ET0. Thus, adding Tmax and Tmin can make ET0 estimation more complex or confusing. This fact can decrease the accuracy of the models, and the removal of this input data can improve the prediction. Determining the input data is critical to the success of the models. This selection can facilitate the training and testing process, improving the understanding of the system [47,48]. However, this result shows that linear regression alone is not sufficient to decide which input data should be removed in order to increase predictive performance.
When the independent variables lat, lon, and alt were removed (I3), a reduction in the statistical indexes of all models was observed. These variables are related to the spatial location of the observed data. Although the correlation observed between these variables and ET0 is low (Figure 3), the joint removal of these data negatively impacted the models’ performance. The air temperature and solar radiation variables are among the main data impacting ET0 [1,46]. Several studies have indicated the influence of lat, lon, and alt variables on air temperature and solar radiation [49,50]. Therefore, variations in lat, lon, and alt may indirectly impact ET0. This can explain the observed results.
The division of the input data into two areas with climatic similarity aimed to increase the performance of the models. The division presented in SII and SIII managed to slightly increase the capacities of the models in relation to SI. However, this increase was only observed in the Tho1 and K1 areas. Thus, we can infer that, although the division into areas with climatic similarity can reduce the amount of input data for training, in some situations this division is valid, and the models can respond more accurately. Machine-learning models developed for broader scenarios (e.g., SI) typically have reduced predictive capacity due to the high nonlinearity and low similarity of their input data; however, these models have greater ability to generalize [24]. According to [12], although the models developed locally perform better, these models may have low predictive capacity when used in other regions, since they may be highly specific to the location.
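The spatial partition discussed above amounts to grouping stations by a climate-area label before training one model per group. The sketch below shows that grouping and the resulting shares. The station records, the `area` key, and the function names are hypothetical. The 35/21 split is simply what 62.5% and 37.5% of 56 stations would imply, assuming those shares refer to the 56 stations used in SI.

```python
from collections import defaultdict


def partition_by_area(stations, area_key="area"):
    """Group station records by their climate-area label."""
    groups = defaultdict(list)
    for station in stations:
        groups[station[area_key]].append(station)
    return dict(groups)


def area_shares(groups):
    """Percentage of stations that falls in each area."""
    total = sum(len(members) for members in groups.values())
    return {area: 100.0 * len(members) / total for area, members in groups.items()}


# Hypothetical station list: 56 stations, the first 35 labelled K1, the rest K2.
stations = [{"id": i, "area": "K1" if i < 35 else "K2"} for i in range(56)]

groups = partition_by_area(stations)
shares = area_shares(groups)
print(shares)
```

Each group would then be used to train and test its own models, which reduces the data available per model in exchange for more homogeneous climatic conditions.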
Regarding the importance of each input variable to the response variable of the evaluated algorithms, WEKA was used to select the attributes (Figure 4, Figure 5, and Figure 6). Attributes were selected using the “ClassifierAttributeEval” tool associated with the “Ranker” method. These tools rank attributes by their individual evaluations. The correlation coefficient was the measure used to evaluate the performance of attribute combinations in the Ranker configuration. The same ranking method in WEKA was used by [51] to verify the importance of each input variable in solar radiation prediction.
Different ANN settings were used for different input data (Table 2). These ANN settings resulted in different weights for each input attribute (Figure 4). However, similar behavior was observed in the different configurations. In all scenarios, Tmean, Tmax, and Tmin had greater weight in producing the estimate. In SIII K2, the relative importance of Tmax surpassed that of Tmean (Figure 4). This result may explain the decrease in ANN’s performance in this scenario when Tmax and Tmin were removed (Table 5). The variables lat and month had a similar weight in all scenarios. Despite this similarity, the removal of the month variable resulted in a greater reduction in ANN’s performance than the removal of the variables lat, lon, and alt.
The ranked values of each input variable in RF are shown in Figure 5. The Tmean and month variables had a higher weight in the ET0 estimate. In SII Tho1 and SIII K2, the month variable was more important than the Tmean variable. This result may explain the drop in the RF model’s performance when the month variable was removed (I3 to I2). The Tmax and Tmin variables also had a high weight in the ET0 estimate. However, the removal of these variables increased the capacity of the RF model, as observed (Table 3, Table 4, and Table 5) and discussed previously.
The relative importance of each input variable in SVM is shown in Figure 6. The Tmean, Tmax, and Tmin variables had a higher weight in the ET0 estimate, followed by RH and lat. The month variable was of low importance in the ET0 estimate. In SI, the month variable showed a negative weight. Therefore, this input can negatively impact the ET0 estimate. In the performance results for the SVM model (Table 3, Table 4, and Table 5), there was no significant variation in performance when the month variable was removed. Both results indicate that, for this region, the month variable does not contribute to the performance of the SVM model.
Although each model has a different pattern in the ranking of the input variables (Figure 4, Figure 5, and Figure 6), air temperature was the most important attribute. The observed correlation between air temperature and ET0 (Figure 3) may explain the importance of air temperature in this estimate. This behavior was not observed in SIII K2 or SII Tho2. However, in these scenarios, no significant difference was observed between the month and Tmean variables. In a study ranking the importance of meteorological variables with the RF method, the three most important variables were insolation (n), Tmax, and RH [20]. The high relative importance of Tmax in that study corroborates the results of the present study.
The other variables presented different weights according to each model applied. These results indicate a peculiarity of the models. Hence, new research and applications can be based on these results, choosing the best method to suit the conditions of the input data. However, it is recommended that the models be tested beforehand with different input data; as noted, some variables may have a relatively high weight in the ET0 estimate, but their use can decrease the predictive performance of the model. This behavior was observed when using the RF model. In this model, removal of the variables Tmax and Tmin increased predictive capacity, although these variables showed high relative importance.
It is important to note that the month variable was highly important in estimation with RF. However, low importance was observed when the SVM model was used, since this variable was not correlated with ET0 (Figure 3). These results highlight the need for more techniques to select the meteorological variables used in modeling. Linear regression alone is not sufficient to identify the relevance of the input data. Furthermore, different models may present different behaviors regarding classification of the importance of the input variable and still present satisfactory results.
Unlike the evaluation of attribute importance for ANN, RF, and SVM, for the MLR method an attribute selection method (the M5 method) was applied, which indicates the importance of each input attribute in the generated model. The adjusted coefficients are shown in Table 6. In some models, the M5 method excluded the month variable. This behavior indicates a low importance of this variable in the MLR estimate. This result was similar to that observed in the analysis of the importance of the input variables in SVM. The exclusion of lat and Tmax was also observed in some cases.
The results presented in this study reveal that, for locations in Minas Gerais state, these models can be used safely. The ANN and RF models are recommended to estimate ET0 when considering a wider range of input data, as they have better predictive capacity in this situation. The SVM and MLR models are recommended in situations where only temperature and relative humidity data are available. However, between these two models, MLR is recommended because it requires less computational effort. These models, although they have a high predictive capacity, cannot be perfect. Other meteorological variables not considered as input data (e.g., solar radiation, wind speed, and vapour-pressure deficit) and other factors (e.g., data recorded in error) contributed to a decrease in the predictive capacity of these models.
No statistical method or machine-learning method can produce results that are the same as the observed and/or recorded data. There will always be some error, no matter how small. Therefore, it is important that the meteorological stations function continuously. As in all studies, some limitations were noted in this study. One of the main limitations is related to difficulties in the availability of quality meteorological data. The malfunction and limited collection of meteorological data has been a limitation in several countries. Another limitation that can be observed is the difficulty and complexity of using some models. In this context, it is recommended to evaluate and use models with good results and that present greater simplicity in their use.
The models developed in this study are expected to help decision-making by different professionals, mainly farmers. Agricultural companies are responsible for a considerable part of the Brazilian gross domestic product [52], and the Minas Gerais state had the third-largest gross domestic product in Brazil in 2018 [33]. The results of these models can assist in irrigation management, climatic zoning, and the construction of productivity models, among other applications. In addition, the approaches used in the present study have the potential to benefit the development of other types of models and studies from other regions.