AgroML: An Open-Source Repository to Forecast Reference Evapotranspiration in Different Geo-Climatic Conditions Using Machine Learning and Transformer-Based Models

Accurately forecasting reference evapotranspiration (ET0) values is crucial to improve crop irrigation scheduling, enabling anticipatory planning decisions and optimized water resource management and agricultural production. In this work, a recent state-of-the-art architecture, the transformer, has been adapted and deployed for multivariate input time series forecasting, using past values of ET0 and temperature-based parameters (28 input configurations) to forecast daily ET0 up to a week ahead (1 to 7 days). Additionally, it has been compared to standard machine learning models such as the multilayer perceptron (MLP), random forest (RF), support vector machine (SVM), extreme learning machine (ELM), convolutional neural network (CNN), and long short-term memory (LSTM), and to two baselines (the historical monthly mean value and a moving average of the previous seven days), at five locations with different geo-climatic characteristics in the Andalusian region, Southern Spain. In general, the machine learning models significantly outperformed the baselines. Furthermore, accuracy dropped dramatically when forecasting ET0 for any horizon longer than three days. SVM, ELM, and RF using configurations I, III, IV, and IX outperformed, on average, the rest of the configurations in most cases. The best NSE values ranged from 0.934 in Córdoba to 0.869 in Tabernas, using SVM. The best RMSE values, on average, ranged from 0.704 mm/day for Málaga to 0.883 mm/day for Conil, using RF. In terms of MBE, most models and cases performed very accurately, with a total average performance of 0.011 mm/day. We also found that performance was related to the aridity index and the distance to the sea: the higher the aridity index at inland locations, the better the forecasts, whereas for coastal sites, the higher the aridity index, the higher the error.
Due to the good performance and the availability as an open-source repository of these models, they can be used to accurately forecast ET0 in different geo-climatic conditions, helping to increase efficiency in tasks of great agronomic importance, especially in areas with low rainfall or where water resources are limiting for the development of crops.


Introduction
The world population is growing at an alarming rate and is projected to require almost 50% more food to meet demand in 2050 [1]. Therefore, research into new methodologies to improve agroclimatic forecasts (solar radiation, precipitation, or evapotranspiration) is a relevant task that allows the optimization of water resource management, the improvement of irrigation scheduling, and, indeed, contributes to the great challenge of increasing food production. This is especially relevant in arid and semiarid areas such as the Andalusian region (Southern Spain), where crop water demand is high and scarce precipitation limits growth and agricultural yield.
Crop evapotranspiration measures the crops' water demand and is affected by atmospheric parameters (such as temperature, wind speed, or solar radiation), the specific crop type, soil characteristics, and management and environmental conditions. The evapotranspiration rate from a reference surface with no shortage of water is called the reference evapotranspiration (ET0), which characterizes the evaporative demand of the atmosphere independently of the surface, the crop type, its development stage, and the management practices. It can be accurately determined using physics-based methods such as FAO56-PM [2], which has been assessed globally under different climatic conditions and in different countries, including Korea [3], Argentina [4], and Tunisia [5], among others. However, measuring all the required parameters (air temperature, relative humidity, wind speed, and solar radiation) entails high installation and maintenance costs. Moreover, Automated Weather Stations (AWS) often contain unreliable long-term records, mainly for wind speed and solar radiation, due to a lack of maintenance or miscalibration [6]. These are the reasons why the geographical density of complete AWS is generally low, especially in rural areas and developing countries [7,8].
Therefore, developing new algorithms with fewer climatic input parameters is of high interest. In this context, Hargreaves and Samani [9] introduced an empirical equation (HS) that uses maximum and minimum daily air temperature values (Tx and Tn, respectively) and extraterrestrial solar radiation (Ra). Different studies have assessed HS under different aridity conditions and in different countries, such as Iran [10], Italy [11], Bolivia [11], and China [12], among others. More recently, advances in computation over the last few decades have enabled the application of new, computationally intensive methodologies based on Artificial Intelligence (AI). Thanks to progress in CPU and GPU computation, the time spent training these models has dropped significantly, allowing scientists to apply them without needing a vast CPU/GPU farm and obtaining promising results in many sectors, especially agriculture. For example, Karimi et al. [13] evaluated the performance of random forest (RF) and other empirical methods to estimate ET0 when several meteorological data were missing. RF surpassed the other models for temperature-based data availability when using Tx, Tn, Ra, and relative humidity (RH) as input features. Ferreira and da Cunha [14] assessed RF, extreme gradient boosting (XGB), multilayer perceptron (MLP), and convolutional neural network (CNN) models to estimate daily ET0 through different approaches, using hourly temperature and relative humidity as features at different AWS in Brazil. CNN outperformed the rest of the models for most statistics and locations in both local and regional approaches; however, no optimization algorithm was used during hyperparameter tuning. Yan et al. [15] evaluated XGB to estimate daily ET0 in two regions of China (one arid and one humid) using seven meteorological input combinations based on maximum and minimum daily temperature (Tx and Tn, respectively), extraterrestrial solar radiation (Ra), relative humidity (RH), wind speed (U2), and sunshine hours (n). The Whale Optimization Algorithm (WOA) was used to tune the different hyperparameters. Their results showed that combining local and external (neighboring station) datasets in some cases performed even better than using only local data; this strategy is therefore very promising when long-term records are lacking. Wu et al. [16] studied the performance of extreme learning machines (ELM) at different locations in China, analyzing the use of the K-means clustering algorithm and the Firefly Algorithm (FFA) to estimate monthly mean daily ET0 using Tx, Tn, Ra, and Tm (mean daily temperature). Nourani et al. [17] assessed support vector regression (SVR), the Adaptive Neuro-Fuzzy Inference System (ANFIS), MLP, and multiple linear regression (MLR) to forecast monthly ET0 in Turkey, North Cyprus, and Iraq. Moreover, three ensemble methods (simple averaging, weighted averaging, and a neural ensemble) were applied to improve the performance and reliability over single models. The neural ensemble clearly outperformed single models in all cases, whereas simple and weighted averaging did not perform significantly better. Ferreira and da Cunha [18] evaluated daily ET0 forecasts (up to 7 days ahead) using CNN, long short-term memory (LSTM), CNN-LSTM, RF, and MLP models with hourly data from weather stations with heterogeneous aridity characteristics in Brazil. In all cases, the machine learning (ML) models outperformed the baselines, and CNN-LSTM performed best in both local and regional scenarios using Tx, Tn, maximum and minimum relative humidity (RHx and RHn, respectively), wind speed, solar radiation (Rs), Ra, the day of the year (DOY), and past ET0 values from a lag window (up to 30 days). A random search algorithm with 30 epochs was used to tune the different hyperparameters.
In addition to these well-known standard ML models, new architectures called transformers [19] have recently been developed to deal with natural language processing (NLP) problems, with outstanding results. The transformer is an encoder-decoder architecture based on a self-attention mechanism that looks at an input sequence and decides which timesteps are valuable. The promising results of transformers have fostered their use on time series problems, given the apparent analogy between the two tasks: in both, words/parameter values are more or less meaningful based on their position. Therefore, several scientists have evaluated attention-based architectures on forecasting problems. For example, Wu et al. [20] proposed an Adversarial Sparse Transformer (AST) based on generative adversarial networks (GAN). They assessed it on several public datasets: (I) an hourly electricity consumption time series, (II) hourly traffic levels from the San Francisco dataset, (III) an hourly solar power production dataset, and (IV) an hourly time series dataset from the M4 competition. Furthermore, [21] analyzed a transformer-based architecture to forecast influenza-like illness (ILI), obtaining promising results. Finally, Li et al. [22] evaluated the performance of transformers in time series forecasting using the same public datasets as Wu et al. [20] and obtained more accurate modeling of long-term dependencies.
This work is motivated by the need to minimize error in daily ET0 forecasts, one of the main shortcomings in the reviewed literature, as well as by the outstanding and promising performance of transformers and transformer-based models in different fields. Thus, this work is the first to use a multivariate input transformer-based architecture to forecast daily ET0 (from one to seven days ahead). The development and assessment were carried out using past values of ET0 and temperature-based measured variables as features at five sites in Andalusia (Córdoba, Málaga, Conil, Tabernas, and Aroche) with different geo-climatic characteristics. Moreover, standard ML models such as RF, MLP, SVR, ELM, CNN, and LSTM have also been evaluated, in conjunction with Bayesian optimization to tune all their hyperparameters. Thus, the main objectives of this work are (a) to assess the performance of the proposed transformer model to forecast ET0 and to compare it to standard ML models and two simple baselines (the historical monthly mean value and the mean of the previous seven days); (b) to evaluate different input feature configurations based on past ET0 values and several temperature-based features; and (c) to analyze forecast efficiency depending on the different geo-climatic characteristics of the sites.

Study Area and Dataset
Andalusia is located in the southwest of Europe, ranging from 37° to 39° N and from 1° to 7° W, and occupying an area of 87,268 km². This work was carried out with data from five locations in Andalusia (Figure 1) with different geo-climatic characteristics, representing the great variability of the UNEP aridity index [23] in this region (ranging from 0.555, dry subhumid, in Aroche, to 0.177, arid, in Tabernas). The coordinates and other characteristics of the AWS are reported in Table 1, while Table 2 shows the minimum, mean, maximum, and standard deviation values of minimum, mean, and maximum daily air temperature (Tn, Tm, and Tx, respectively), relative humidity (RHn, RHm, and RHx, respectively), wind speed (U2), solar radiation (Rs), and reference evapotranspiration (ET0). The dataset used in this study belongs to the Agroclimatic Information Network of Andalusia (RIA), which can be downloaded at https://www.juntadeandalucia.es/agriculturaypesca/ifapa/ria/servlet/FrontController (accessed on 1 February 2022). In this work, because the accurate estimation of ET0 using limited meteorological data has improved in recent years [14,24] and because of the high availability of temperature records, only temperature-based variables and past ET0 values have been used as input features to forecast ET0. Specifically, two different lag windows have been evaluated: 15 and 30 days into the past. Moreover, several temperature-based variables have been calculated, such as EnergyT (the area below the intraday temperature curve over a whole day), HourminTx (the time when Tx occurs), HourminTn (the time when Tn occurs), HourminSunset (the time when sunset occurs), HourminSunrise (the time when sunrise occurs), es (mean saturation vapor pressure), ea (actual vapor pressure), VPD (vapor pressure deficit), Tx-Tn, HourminSunset-HourminTx, and HourminSunrise-HourminTn.
All the configurations assessed in this work contained Tx, Tn, Tx-Tn, and Ra as features due to their very high Pearson correlation (Figure 2), and the rest of the configurations were selected based on their Pearson correlation values and the previous results on these same locations regarding ET0 and solar radiation [24][25][26] estimations. The 27 different assessed configurations are shown in Table 3.

Table 3. Assessed input feature configurations (columns: Conf., Tx, Tn, Tx-Tn, Ra, EnergyT, ea, es, VPD, HTx, HTn, HSs-HTx).
In machine learning applications, a vital prerequisite to guarantee accurate modeling is the use of reliable datasets. In this work, the control guidelines reported by Estévez et al. [6] have been followed to identify erroneous and questionable data from sensor measurements by applying different tests (range, internal consistency, step, and persistence) and a spatial consistency test [27]. These quality assurance procedures have been successfully employed in different countries [4,28,29]. Afterward, the input and output matrices had to be built depending on the number of lag days from the past (15 or 30), the features to use (up to 27 input configurations), and the number of days to forecast (up to 7 days). In Figures 3 and 4, a mind map with all the possibilities is shown. It is worth noting that a MIMO (Multiple Input Multiple Output) approach was used in models that allowed it, whereas a direct approach was considered in the others according to the results of Ferreira and da Cunha [18].
Consequently, taking configuration I and 15 lag days as an example, the values from day t down to day t-14 of Tx, Tn, Tx-Tn, Ra, ea, and ET0 are used as input features (a total of 90 values) for all the ML models (except for transformers; see Section 2.5.7), where Tx and Tn are directly given by the AWS, and Ra and ea can be calculated from Tx, Tn, and the latitude, as stated by [2]. Finally, ET0 is calculated using the well-known FAO56-PM method. Later, in order to train, tune all the hyperparameters, and assess the final performance of the model, the dataset for each location was split into training (70% of the entire dataset), validation (20% of the training dataset), and testing (30% of the entire dataset) sets using a holdout technique. Next, the Bayesian optimization algorithm was used to tune all the hyperparameters (the hyperparameter space can be seen in Table S1 of the Supplementary Data). Eventually, after the best hyperparameter set was found, the final model was trained using the entire training dataset (70% of the whole dataset) and evaluated using the testing dataset. Figure 5 shows an overview of this methodology.
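As an illustration of the matrix-building step described above, the following sketch builds the flattened input matrix and MIMO output matrix from a lag window (the helper `build_windows` is ours for illustration, not the repository's actual code):

```python
import numpy as np

def build_windows(features: np.ndarray, target: np.ndarray,
                  n_lags: int = 15, horizon: int = 7):
    """Build (X, y) matrices for MIMO forecasting.

    features: (n_days, n_features) array of daily inputs (e.g., Tx, Tn, Tx-Tn, Ra, ea, ET0)
    target:   (n_days,) array of daily ET0
    Each sample uses days [t-n_lags+1 .. t] as input and days [t+1 .. t+horizon] as output.
    """
    X, y = [], []
    for t in range(n_lags - 1, len(target) - horizon):
        X.append(features[t - n_lags + 1: t + 1].ravel())  # flatten the lag window
        y.append(target[t + 1: t + horizon + 1])           # the 7 days to forecast
    return np.asarray(X), np.asarray(y)

# Example: 6 features (configuration I) over a 15-day window -> 90 input values per sample
rng = np.random.default_rng(0)
feats = rng.random((100, 6))
et0 = rng.random(100)
X, y = build_windows(feats, et0, n_lags=15, horizon=7)
print(X.shape, y.shape)  # (79, 90) (79, 7)
```

With 30 lag days the same helper yields 180 input values per sample for configuration I.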

Reference Evapotranspiration Calculation
In this work, the ET0 (FAO56-PM) values were used as input and target values. They were determined following the procedure of [2] and can be mathematically expressed as Equation (1):

ET0 = [0.408 ∆ (Rn - G) + γ (900 / (T + 273)) U2 (es - ea)] / [∆ + γ (1 + 0.34 U2)]    (1)

where ET0 is the reference evapotranspiration (mm day−1), 0.408 corresponds to a coefficient (MJ−1 m2 mm), ∆ is the slope of the saturation vapor pressure versus temperature curve (kPa °C−1), Rn is the net radiation calculated at the crop surface (MJ m−2 day−1), G is the soil heat flux density at the soil surface (MJ m−2 day−1), γ is the psychrometric constant (kPa °C−1), T is the mean daily air temperature (°C), U2 is the mean daily wind speed at 2 m height (m s−1), and es and ea are the saturation vapor pressure and the mean actual vapor pressure, respectively (kPa).
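The FAO56-PM calculation of Equation (1) can be sketched directly in Python (a minimal illustration; the function name and input values are ours, not AgroML's):

```python
def et0_fao56_pm(delta, Rn, G, gamma, T, u2, es, ea):
    """Reference evapotranspiration (mm/day), FAO56 Penman-Monteith (Equation (1)).

    delta: slope of the saturation vapor pressure curve (kPa/degC)
    Rn, G: net radiation and soil heat flux density (MJ m-2 day-1)
    gamma: psychrometric constant (kPa/degC)
    T: mean daily air temperature (degC); u2: wind speed at 2 m (m/s)
    es, ea: saturation and actual vapor pressure (kPa)
    """
    num = 0.408 * delta * (Rn - G) + gamma * (900.0 / (T + 273.0)) * u2 * (es - ea)
    den = delta + gamma * (1.0 + 0.34 * u2)
    return num / den

# Typical warm-day values (illustrative only)
print(round(et0_fao56_pm(delta=0.2, Rn=15.0, G=0.0, gamma=0.066,
                         T=25.0, u2=2.0, es=3.17, ea=2.0), 2))  # 5.44
```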

Baselines
In order to compare the performance of the developed models and configurations, it is crucial to have a baseline performance as a starting point. In this sense, two empirical baselines have been proposed in this work, following the methodology proposed by Ferreira and da Cunha [18]. In the first place, a moving average from the last 7 days was used. Secondly, the historical average monthly values from the training dataset were used for the corresponding forecast day.
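A minimal sketch of the two baselines, assuming daily ET0 arrays and calendar-month labels are available (helper names are illustrative):

```python
import numpy as np

def baseline_moving_average(et0_history, horizon=7):
    """B1: every day of the horizon is forecast as the mean of the last seven observed ET0 values."""
    return np.full(horizon, np.mean(et0_history[-7:]))

def baseline_monthly_mean(train_et0, train_months, forecast_months):
    """B2: each forecast day uses the historical mean ET0 of its calendar month (training data only)."""
    monthly = {m: train_et0[train_months == m].mean() for m in np.unique(train_months)}
    return np.array([monthly[m] for m in forecast_months])

history = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5])
print(baseline_moving_average(history))  # seven copies of the mean of the last 7 values (5.0)
```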

Multilayer Perceptron
The multilayer perceptron (MLP) is one of the most used AI models in agronomy and hydrology [14,30,31]. Its popularity is based on its similarity to neurons in the biological nervous system, easy coding, and promising results in most cases. MLPs are structured in three kinds of layers: the input and output layers, representing the inputs and outputs of the model, respectively, and the hidden layers, where all the neurons are located. The neurons work together to create stimuli (reference evapotranspiration forecast values) based on different inputs (the input matrix containing features from the past). A back-propagation algorithm makes the neurons learn (automatically updating all weights and biases) and improve with every mini-batch of every epoch. A single-neuron architecture can be seen in Figure 6. Figure 6. Single-neuron architecture. I1, I2, I3, and I4 represent the inputs of the neuron; W1, W2, W3, and W4 correspond to the weights of each input; B is the bias, and O represents the output of the neuron after passing through an activation function.
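The single-neuron computation of Figure 6 amounts to a weighted sum plus a bias passed through an activation function; a minimal sketch (the input and weight values are illustrative):

```python
import numpy as np

def neuron(inputs, weights, bias, activation=np.tanh):
    """Single neuron (Figure 6): O = activation(sum(Ii * Wi) + B)."""
    return activation(np.dot(inputs, weights) + bias)

# Four inputs I1..I4 with weights W1..W4 and bias B
I = np.array([1.0, 0.5, -0.2, 0.3])
W = np.array([0.4, -0.1, 0.7, 0.2])
print(neuron(I, W, 0.1))  # tanh(0.37)
```

An MLP chains many such neurons in layers, with back-propagation adjusting all W and B values.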

Extreme Learning Machine
Extreme learning machine (ELM) models were first introduced by Huang et al. [32] as single-hidden-layer feed-forward neural networks with the following main characteristics: (I) the input weights and biases are randomly generated, and (II) the output weights are analytically determined. As a result, these models do not require an iterative training process and have a very low computational cost, with promising results for ET0 [24,33,34]. On the other hand, when the model works with massive datasets, the amount of random access memory (RAM) required is enormous.
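A minimal numpy sketch of the two ELM characteristics, assuming a tanh hidden layer and a least-squares (pseudoinverse) solution for the output weights (function names are ours):

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """ELM: random input weights/biases, output weights solved analytically."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # (I) random input weights
    b = rng.normal(size=n_hidden)                 # (I) random biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                  # (II) analytical output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(1)
X = rng.random((200, 6))
y = X.sum(axis=1)                                 # toy target
W, b, beta = elm_fit(X, y)
pred = elm_predict(X, W, b, beta)
print(float(np.mean(np.abs(pred - y))))           # small training error on the toy problem
```

The pseudoinverse step is also why RAM usage grows quickly with dataset size: the whole hidden-activation matrix H must be held in memory.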

Support Vector Machine for Regression
Support vector machine (SVM) models for regression tasks, also known as support vector regression (SVR) models, are supervised AI models based on a different principle than neuron-based architectures such as MLP and ELM. They search for the best hyperplane (and its margins) that contains all data points. Thus, they can be loosely related to linear regression, with the flexibility of defining how much error is considered acceptable. Moreover, one of their most important features is the use of kernels, which allow the model to operate in a high-dimensional feature space. SVMs can be mathematically expressed as the minimization problem of Equation (2) subject to the constraints in Equation (3):

min (1/2) ||w||^2 + C Σi (ξi + ξi*)    (2)

subject to  yi - (w · xi + b) ≤ ε + ξi,  (w · xi + b) - yi ≤ ε + ξi*,  ξi, ξi* ≥ 0    (3)

where w corresponds to the weight vector, xi to the input vector, yi to the output value, b to the bias, ε defines the margins, ξi and ξi* represent the deviations of values beyond the margins, C is a coefficient that penalizes deviations beyond the margins, and n is the length of the training dataset. For further details, the work of [35] can be consulted.
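The "acceptable error" idea corresponds to the ε-insensitive loss underlying Equations (2) and (3): deviations inside the ε tube cost nothing, and only the excess (the ξ terms) is penalized. A small illustration with an assumed ε = 0.1:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR loss: zero inside the epsilon tube, linear growth outside it (the xi terms)."""
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

errors = np.array([0.05, 0.10, 0.30])  # absolute deviations from the hyperplane
print(epsilon_insensitive_loss(errors, np.zeros(3)))  # [0.  0.  0.2]
```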

Random Forest
A random forest (RF) is composed of the conjunction of multiple tree-based models in order to improve the overall result (ensemble model). The general idea is that different models are trained on different data samples (bootstrap) and feature sets. Instead of searching for the best features when splitting nodes, it searches among a random subset of the features. Thus, it results in greater diversity and better final performance.

Convolutional Neural Network
Convolutional neural network (CNN) models were first developed for image classification problems, where the convolution operation captures local patterns to learn a representation of the images to be classified. This process can be extrapolated to 1D data sequences such as time series. One of the advantages of convolutions is that they can capture relationships among local features without requiring extensive preprocessing, and they have obtained outstanding results for ET0 [14,36,37] and other agro-climatic parameters [25,38,39].
Typically, such CNNs are composed of three layers: the convolutional layer, the pooling layer, and a fully connected layer. The convolutional layer is used to extract local relationships between the different features and timesteps. The pooling layer is added after the convolutional layer, and it gradually reduces the feature map. Finally, a fully connected layer is used to forecast the seven-day horizon ET0 values (in this work). For further details, the work of Aloysius et al. [40] can be reviewed.
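A toy illustration of how a 1D convolutional layer extracts local relationships from a daily ET0 sequence (the kernel here is a hypothetical day-to-day difference detector, not a learned filter):

```python
import numpy as np

def conv1d(series, kernel):
    """Valid 1D convolution (cross-correlation, as in CNN layers): slide the kernel over the series."""
    n, k = len(series), len(kernel)
    return np.array([np.dot(series[i:i + k], kernel) for i in range(n - k + 1)])

et0_series = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 3.0])
diff_kernel = np.array([-1.0, 1.0])  # responds to day-to-day changes, a simple local pattern
print(conv1d(et0_series, diff_kernel))  # [ 1.  1. -1. -1. -1.]
```

In a trained CNN, many such kernels are learned, and pooling then reduces the resulting feature maps before the fully connected forecasting layer.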

Long Short-Term Memory
Long short-term memory (LSTM) models were first introduced by Hochreiter et al. [41] as a recurrent neural network (RNN)-based model that could deal with long-term dependencies and address the vanishing gradient problem. In order to control the information flow, the LSTM block contains an input gate, an output gate, a forget gate, a cell state, and a hidden state. The gates are in charge of deciding which information is allowed on the cell state, i.e., whether a piece of information is relevant to keep or forget during training. The cell and hidden state can be seen as the memory of the network, used to carry relevant information throughout the sequence.
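The gate mechanics described above can be sketched as a single numpy LSTM step (biases omitted for brevity; the weights are random stand-ins, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM timestep: input (i), forget (f), and output (o) gates plus a candidate (g)."""
    Wi, Wf, Wo, Wg = params      # each maps concat([x, h]) to the hidden size
    z = np.concatenate([x, h])
    i = sigmoid(Wi @ z)          # input gate: how much new information enters the cell
    f = sigmoid(Wf @ z)          # forget gate: how much of the old cell state is kept
    o = sigmoid(Wo @ z)          # output gate: how much of the cell state is exposed
    g = np.tanh(Wg @ z)          # candidate values
    c_new = f * c + i * g        # cell state: the long-term memory
    h_new = o * np.tanh(c_new)   # hidden state: the short-term memory
    return h_new, c_new

rng = np.random.default_rng(0)
hidden, n_feats = 4, 6
params = [rng.normal(scale=0.1, size=(hidden, hidden + n_feats)) for _ in range(4)]
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.random((15, n_feats)):  # e.g., a 15-day lag window of features
    h, c = lstm_step(x, h, c, params)
print(h.shape)  # (4,)
```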

Transformers
A new state-of-the-art architecture, the transformer [19], has recently been presented for NLP problems; see Figure 7. One of the main motivations of transformers is to deal with the vanishing gradient problem of LSTMs when working with long sequences. Although LSTMs can theoretically propagate crucial information over infinitely long sequences, due to the vanishing gradient problem, they pay more attention to recent tokens and eventually forget earlier ones. In contrast, transformers use an attention mechanism, which learns the relevant subset of the sequence for the specific task. For a single head, the operation can be expressed as Equation (4):

Attention(Q, K, V) = softmax(Q K^T / √dk) V    (4)

where Q, K, and V represent the query, key, and value, respectively, as an analogy to a database, and dk corresponds to the dimension of the queries and keys. As stated by Yıldırım and Asgari-Chenaghlu (2021), the attention mechanism can be defined as follows: "This can also be seen as a database where we use the query and keys in order to find out how much various items are related in terms of numeric evaluation. Multiplication of attention score and the V matrix produces the final result of this type of attention mechanism". In particular, transformers use a multi-head attention mechanism, which can be mathematically expressed as Equation (5):

MultiHead(Q, K, V) = Concat(Head1, ..., Headh) W^O    (5)
where Headi = Attention(Q Wi^Q, K Wi^K, V Wi^V) and the W matrices denote learnable parameters. Generally, the transformer is an encoder-decoder architecture. Considering a translation task from English to Spanish, the encoder takes an input sequence ('I am from Spain') and maps it into a higher-dimensional space using multi-headed attention, addition, normalization, and fully connected feed-forward layers. The abstract vector obtained in the encoder module is fed into the decoder, which uses it to obtain the translated sentence ('Soy de España'). It is worth noting that both the encoder and decoder are composed of modules that can be stacked on top of each other multiple times. However, before carrying out any mathematical operation on the input data, words must be converted into numbers. The embedding layer is used for this purpose, transforming words into vectors of numbers that can easily be processed by the model.
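A minimal numpy sketch of the single-head operation in Equation (4) over a 15-timestep window (the dimensions are illustrative, and Q, K, and V are random stand-ins for projected inputs):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Equation (4)): softmax(Q K^T / sqrt(dk)) V."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)       # how related each query is to each key
    return softmax(scores, axis=-1) @ V  # attention-weighted sum of the values

rng = np.random.default_rng(0)
T, dk = 15, 8                            # 15 timesteps (a lag window), dimension 8
Q, K, V = [rng.normal(size=(T, dk)) for _ in range(3)]
out = attention(Q, K, V)
print(out.shape)  # (15, 8)
```

Multi-head attention (Equation (5)) runs several such heads on different learned projections and concatenates their outputs.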
Another vital aspect to consider is that transformers need to learn the temporal dependencies of the different timestamps through positional encoding, because they do not inherently capture order. In this work, the positional encoding was achieved using Equations (6) and (7) for monthly and daily values (Figure 8):

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))    (6)

PE(pos, 2i + 1) = cos(pos / 10000^(2i/dmodel))    (7)

In this way, 31 January and 2 February are close, but 5 May and 26 July are not.
where pos represents the position, dmodel is the input dimension, and i represents the index in the vector. It is worth noting that this temporal dependency information is shared with the rest of the models as new features in this work to make the comparison between models as fair as possible. Thus, new features are included in all configurations. For example, in configuration I, the input features would be Tx, Tn, Tx-Tn, Ra, ea, ET0, Sin_day (sine of the day over a 31-day period), Cos_day (cosine of the day over a 31-day period), Sin_month (sine of the month over a 12-month period), and Cos_month (cosine of the month over a 12-month period). The architecture used in this work can be seen in Figure 9. It is based on the original transformer architecture of Vaswani et al. [19] and the attention-based architecture of Song et al. [42]. Several aspects were modified. First, since the input data already have numerical values, the embedding layer was omitted. Then, the positional encoding was included as new features in the input matrix instead of adding its values to the "embedded vector". Consequently, four more features were used in this model (sine and cosine positional encodings for days in a month, and sine and cosine positional encodings for months in a year). Finally, the SoftMax layer was also removed because we are dealing with a regression problem (forecasting ET0). Thus, the processing of data in the proposed transformer-based model can be described as follows. First, the input matrix passes through the positional encoding mechanism, and the positional encoding features are added to the input matrix. Then, the data go to an attention-based block containing multi-head attention, dropout, normalization, addition, and feed-forward layers. Two different variations have been tested depending on the model used in the feed-forward layer: TransformerCNN, where a convolutional approach has been used, and TransformerLSTM, where an LSTM approach has been implemented.
Eventually, the processed data go to an MLP model to carry out the regression task. The following works provide further details [19,21,43,44], and the code can be checked in the AgroML GitHub repository. Figure 9. The architecture of the proposed multi-head attention-based model.
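The cyclic calendar features described above (Sin_day, Cos_day, Sin_month, Cos_month) can be sketched as follows; the check at the end mirrors the 31 January / 2 February example (the helper name is ours):

```python
import numpy as np

def cyclic_features(day, month):
    """Cyclic calendar features appended as extra input columns.
    The day uses a 31-day period and the month a 12-month period, so dates near
    a month boundary (e.g., 31 January and 2 February) stay close in feature space."""
    return np.array([np.sin(2 * np.pi * day / 31), np.cos(2 * np.pi * day / 31),
                     np.sin(2 * np.pi * month / 12), np.cos(2 * np.pi * month / 12)])

jan31 = cyclic_features(31, 1)
feb02 = cyclic_features(2, 2)
may05 = cyclic_features(5, 5)
jul26 = cyclic_features(26, 7)
# Nearby dates are closer than distant ones
print(np.linalg.norm(jan31 - feb02) < np.linalg.norm(may05 - jul26))  # True
```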

Bayesian Optimization
The most critical aspect of obtaining accurate performance from machine learning models is choosing the best hyperparameter set; the results can change dramatically from outstanding to very poor. A prevalent practice in the agronomy and hydrology communities is a trial-and-error approach [14,18,36], evaluating from dozens to hundreds of sets. However, this is not an efficient approach: if the hyperparameter space is large, the process is too slow, spending a significant amount of time on non-promising configurations; if the hyperparameter space is made small, a suboptimal model may be obtained. Several optimization algorithms have been assessed to solve this problem, for example, Particle Swarm Optimization (PSO), the Grey Wolf Optimizer (GWO), Genetic Algorithms (GA), Bayesian Optimization (BO), and the Whale Optimization Algorithm (WOA), among others [31,45-47].
In this work, the BO algorithm has been chosen due to its high sample efficiency and its popularity in automated machine learning libraries such as Auto-Weka 2.0 [48], Auto-Keras [49], and Auto-Sklearn [50]; further details can be consulted in Hutter et al. [51]. Part of its popularity is related to its close relationship to human behavior when carrying out this same process [52,53], where prior results are considered when choosing the next set. BO is based on Bayes' theorem and can be explained in the following four steps: (I) the hyperparameter space is defined; (II) the algorithm first tries several random sets; (III) the algorithm takes the previously assessed sets into account when choosing and evaluating the next one, balancing exploitation (sampling regions known to perform well) and exploration (sampling regions with higher uncertainty); (IV) if the process has not finished yet, it returns to step (III).
In this work, BO has been implemented using Scikit-Optimize (gp_minimize) and Python 3.8. In all cases, the process was configured with 50 Bayesian epochs (80% of them chosen randomly), selected by trial and error among 50, 100, 150, and 200 Bayesian epochs, with the mean absolute error (MAE) as the objective function and the rest of the parameters left at their defaults. The hyperparameter space can be found in Table S1 of the Supplementary Material and the corresponding results in Table S2.

Evaluation Metrics
The models' performance has been evaluated using the following metrics: mean bias error (MBE), root mean square error (RMSE), and the Nash-Sutcliffe model efficiency coefficient (NSE), defined as Equations (8)-(10):

MBE = (1/n) Σi (yi - xi)    (8)

RMSE = √[(1/n) Σi (yi - xi)^2]    (9)

NSE = 1 - [Σi (yi - xi)^2] / [Σi (xi - x̄)^2]    (10)

where x and y correspond to the observed and forecasted ET0 values, respectively, n represents the number of records in the testing dataset, and the bar denotes the mean.
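The three metrics can be sketched as follows (a minimal illustration; the MBE sign convention, forecast minus observed, is an assumption):

```python
import numpy as np

def mbe(obs, pred):
    """Mean bias error (Equation (8)); sign convention: forecast minus observed."""
    return np.mean(pred - obs)

def rmse(obs, pred):
    """Root mean square error (Equation (9))."""
    return np.sqrt(np.mean((pred - obs) ** 2))

def nse(obs, pred):
    """Nash-Sutcliffe efficiency (Equation (10)): 1 is perfect; 0 matches the observed mean."""
    return 1.0 - np.sum((pred - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

obs = np.array([3.0, 4.0, 5.0, 6.0])    # observed ET0 (mm/day)
pred = np.array([3.2, 3.9, 5.1, 6.2])   # forecasted ET0 (mm/day)
print(round(mbe(obs, pred), 3), round(rmse(obs, pred), 3), round(nse(obs, pred), 3))  # 0.1 0.158 0.98
```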

Results and Discussion
It is worth noting that the code developed in this work is available on GitHub in the public repository called AgroML, which can be found at https://github.com/Smarity/agroML (accessed on 1 February 2022). This new library aims to help scientists work with state-of-the-art machine learning models, mainly focused on agronomic estimation and forecasting but easily extrapolated to other sectors and problems. It lets new scientists test these models on their own datasets, and experienced scientists contribute new features and architectures. The code has been written in standard Python using Tensorflow, Scikit-Learn, Scikit-Optimize, Pandas, and Numpy.

Tables 4 and 5 show the RMSE and NSE performance of the baselines along the different forecast horizons (up to 1 week), where B1 refers to the moving average of the last seven ET0 values and B2 to the mean historical monthly ET0 values (mean ET0 for each month of the year). Generally, B2 outperformed B1 for all forecast horizons except one day ahead, where B1 performed better at all sites. Moreover, B1 obtained its most accurate forecasts at the one-day-ahead horizon, and its accuracy gradually dropped as the forecast horizon increased. In Aroche, the most humid site, the best performance in both RMSE and NSE was obtained (NSE = 0.9038 and RMSE = 0.6390), followed by Córdoba, Málaga, Conil, and Tabernas (the most arid site), in this order. This suggests a relationship between the aridity index, the distance to the sea, and the performance of the models: at inland locations, the higher the aridity index, the lower the forecasting errors, whereas at coastal locations, the opposite occurs. The higher the aridity index and the farther from the sea, the more precise the ET0 modeling. Finally, Table 6 shows the MBE values for the different stations and forecast horizons; in this case, B1 outperformed B2 in most cases. Table 4. RMSE values for the ET0 forecasts over seven forecast horizons and the two empirical baselines (B1, using the average value of the last seven days, and B2, using the mean monthly value from the training dataset).

Table 7 shows the minimum, mean, and maximum NSE, RMSE, and MBE values for all sites and models using the two different lag intervals (15 and 30 days). Generally, in terms of NSE and RMSE, using 15 lag days slightly outperformed using 30 lag days in almost all cases, while the MBE performance was very similar for all models, locations, and lag windows. Additionally, the ML approaches highly outperformed the baselines, although the CNN and the transformer-based models gave the worst results at all sites. In Figures 10-12, the RMSE and NSE values for all forecasts at the different sites are shown as boxplots. Firstly, no significant performance distinctions were observed between the two approaches depending on the number of lag days (15 and 30 days). However, the first approach (15 lag days) slightly outperformed the second (30 lag days) on mean values, with more precision (a lower interquartile range), and the number of outliers with non-accurate modeling was much higher using the second approach. As a general rule, therefore, using daily values from the past 15 days is recommended over using 30 days. Furthermore, regarding the efficiency of the different models, SVM, RF, and ELM were predominantly better than the rest according to the NSE and RMSE values, giving more precise results, whereas CNN and both transformer models were at the bottom of the ranking. Finally, the MBE results are plotted in a boxplot; the results were very accurate for both approaches and for all models and sites, but CNN gave more outliers, especially with the 30-lag-day approach.
To further analyze these results, Figures 13-15 show the best statistic values (NSE, RMSE, and MBE, respectively) of all models and sites for the different forecast horizons. In terms of NSE (Figure 13), all ML models highly outperformed B1 and B2 at all forecast horizons and locations, except in Conil, where only SVM, RF, and ELM outperformed both baselines in all cases. On the other hand, the transformer, CNN, and MLP models underperformed B1 and B2 for horizons longer than 3 days. Regarding RMSE (Figure 14), the results were similar to those for NSE, although a more significant improvement of the ML models can be appreciated for most models and horizons. In terms of MBE (Figure 15), B2 obtained significantly worse results in Aroche, Córdoba, Málaga, and Tabernas, where the ML models performed very accurately in all cases; in Conil, there were no major performance differences between the models. Based on these results, the use of ML models to forecast ET0 up to a week ahead is highly recommended, especially the SVM, RF, and ELM models. Generally, B1 highly outperformed B2 when forecasting ET0 one day ahead, but its performance decreased sharply for longer horizons, obtaining even worse results than B2. This denotes a low autocorrelation of daily ET0 values beyond one day but a stronger relationship with historical monthly values. Moreover, SVM generally showed the best performance in terms of NSE and RMSE, whereas all models performed very accurately in terms of MBE. Finally, it is worth noting that in Conil (a coastal site with an aridity index close to a dry sub-humid climate), the best ML models (SVM, RF, and ELM) could not outperform B2 as clearly as at the other locations when forecasting more than two days ahead, due to the short distance to the sea and the higher aridity index.
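For completeness, the three statistics used throughout this comparison can be computed as follows. This is a minimal sketch; the sign convention assumed here for MBE is predicted minus observed, so positive values indicate overestimation of ET0.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error (same units as the variable, mm/day for ET0)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mbe(obs, pred):
    """Mean bias error; positive values indicate overestimation."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(pred - obs))

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better
    than predicting the mean of the observations."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(1.0 - np.sum((pred - obs) ** 2) / np.sum((obs - obs.mean()) ** 2))
```

For example, a model that simply predicts the observed mean yields NSE = 0, which is why NSE values above ~0.85, as obtained at the best sites here, indicate a substantial gain over a climatological predictor.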

Assessing the Different Configurations
In order to evaluate the performance of the different configurations at all locations, Table 8 shows the average and best RMSE values of each configuration at the different sites. In Tabernas, configurations III, XXII, IV, and IX obtained the most accurate results on average, whereas configurations XVI, XII, and XXIV were the worst. In Conil, the best configurations in terms of mean RMSE were XXV, VI, and XX, and configuration XXVI obtained the best value in absolute terms, whereas configurations XIII, XI, and XII performed the worst on average. In Córdoba, in terms of mean values, configurations XVII, XXIV, and V were at the bottom of the ranking, whereas configurations III, XXVII, and II were at the top. In Aroche, configuration V obtained the lowest RMSE value (RMSE = 0.598 mm/day). Moreover, considering the mean values, all configurations performed very similarly, starting at RMSE = 0.764 mm/day (configurations I and IV), followed closely by configurations III and IX (RMSE = 0.767 mm/day) and XXII (RMSE = 0.768 mm/day), up to RMSE = 0.788 mm/day (configurations XIII and XVII). Thus, although there were no significant differences in mean performance between the best and worst configurations, the use of configurations I, III, IV, and IX is recommended.

Overall Discussion
In this work, several aspects of forecasting daily ET0 were evaluated at five locations in the Andalusia region (Southern Spain) with different geo-climatic conditions. Firstly, a recent state-of-the-art architecture from NLP, the transformer, was assessed for forecasting daily ET0. Specifically, two approaches were evaluated, TransformerCNN and TransformerLSTM, and they were compared with standard machine learning models such as MLP, SVM, RF, and CNN, among others. In general, the results obtained using standard machine learning approaches such as RF, SVM, and ELM highly outperformed the rest of the models assessed in this work. The transformer-based models did not perform as well as expected when compared with the standard ML models, although their results were better than the baselines for most sites and cases (except Conil). Secondly, another critical finding is that, even with a self-attention mechanism (the transformer-based models), using 30 lag days instead of 15 was not beneficial for forecasting daily ET0; on the contrary, slightly better results and fewer serious outliers were obtained with 15 lag days. Moreover, when comparing the different input feature configurations proposed in this study, none of them predominantly outperformed the rest, although configurations XIII, XIV, XX, and XXI were better on average. Figures 16-18 show scatter plots of measured vs. predicted ET0 values using the best ML model and configuration for 1 and 7 days ahead. Furthermore, the results of the proposed models were significantly better, in terms of RMSE and NSE, than those reported by Ferreira and da Cunha [18], who used different deep learning approaches in Brazil at automatic weather stations with aridity indices ranging from 0.3 to 1.6. The best NSE performances in Brazil ranged from approximately 0.35 to 0.62, whereas in this work the best NSE values ranged from approximately 0.60 to 0.95.
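Although the exact training pipeline lives in the AgroML repository, the general scheme of lag-based, multi-horizon forecasting with one of the best-performing model families (SVM) can be sketched as follows. This assumes a direct forecasting strategy (one model per horizon); the synthetic series, the 15-lag/3-day-ahead setting, and the SVR hyperparameters are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def make_lagged(series, n_lags, horizon):
    """Build (X, y) where each row of X holds the previous n_lags values
    and y is the value `horizon` days ahead (direct strategy)."""
    series = np.asarray(series, float)
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

# Synthetic seasonal ET0-like series, for illustration only
rng = np.random.default_rng(1)
t = np.arange(2000)
et0 = 3.5 + 2.5 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.3, t.size)

# 15 lag days as inputs, forecasting 3 days ahead
X, y = make_lagged(et0, n_lags=15, horizon=3)
split = int(0.8 * len(X))

# Scaling matters for SVR; hyperparameters here are arbitrary placeholders
model = make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.1))
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
```

Forecasting the full 1-to-7-day range under this strategy simply means repeating the loop above with `horizon` set to each value from 1 to 7 and fitting one model per horizon.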
Moreover, this work also obtained slightly better NSE values than those reported by Nourani et al. [17] using ensemble modeling at different weather stations in Iran, Turkey, and Cyprus. These previous works used temperature, relative humidity, solar radiation, and wind speed values as input features, whereas all the configurations in this work relied only on temperature-based variables. Additionally, compared with the results obtained by de Oliveira and Lucas et al. [54], the models assessed in the present work outperformed their CNN and ensemble CNN results in Brazil.
In summary, the models developed in this work, especially SVM, ELM, and RF, can accurately forecast ET0 up to one week ahead using only temperature-based parameters and past ET0 values. This capability is vital for improving crop irrigation scheduling, allowing adequate and anticipated planning, and contributing to agricultural production. Furthermore, providing reliable future ET0 values positively impacts the current challenge of optimizing water resource management, especially in arid and semiarid locations.

Conclusions
In this work, several machine learning models have been developed and assessed for daily ET0 forecasting from 1 to 7 days ahead using different input configurations and different numbers of lag days. In general, all the ML approaches outperformed the baselines for all forecast horizons and most locations, and SVM, RF, and ELM highly outperformed the rest of the models evaluated at most sites, except Conil de la Frontera, a site with unusually low wind speed values. On the other hand, the transformers were, on average, at the bottom of the ranking. Moreover, all configurations obtained very similar results in terms of RMSE, although configurations I, III, IV, and IX slightly outperformed the rest. For the best models, the NSE values were above 0.85 for Conil, Tabernas, and Málaga, and above 0.9 for Córdoba and Aroche. In terms of RMSE, the average performance was 0.92 mm/day for Tabernas, 1.00 mm/day for Conil, 0.81 mm/day for Córdoba, 0.80 mm/day for Aroche, and 0.78 mm/day for Málaga. This denotes a relationship between performance, the aridity index, and the distance to the sea: for inland locations, the higher the aridity index, the lower the ET0 forecasting error, whereas for coastal sites, the higher the aridity index, the higher the error. Regarding MBE, most stations and models obtained very accurate values on average, with a mean performance of 0.011 mm/day.
Further studies could explore the use of these models in new regions with different geo-climatic conditions, in different scenarios (a different time interval or a regional scenario), and for other parameters, such as solar radiation or precipitation. Moreover, feature selection or dimensionality reduction deserves further research because, based on the present results, the configurations containing the features least related to ET0 according to the Pearson correlation (HTx, HTn, HSr-HTn) still obtained very accurate minimum and mean RMSE values (Table 8 and Figure 2). The approaches proposed in this work may lead to greater efficiency in optimizing water resources, improving irrigation scheduling, and anticipating decision-making for agricultural goals. Finally, the creation of an open-source repository will allow new scientists to apply these models to their own datasets and experienced scientists to contribute improvements with new features and architectures. Overall, the ultimate aim is to democratize the use of machine learning to solve today's agricultural problems more efficiently.

Supplementary Materials:
The following supporting information can be downloaded at: www.mdpi.com/article/10.3390/agronomy12030656/s1. Table S1: hyperparameter space for all the models assessed in this work. MLP: multilayer perceptron; RF: random forest; SVR: support vector regression; ELM: extreme learning machine; CNN: convolutional neural network; LSTM: long short-term memory; Transformer CNN: transformer using a CNN in the feed-forward layer; Transformer LSTM: transformer using an LSTM in the feed-forward layer.