Modeling Soil Water Content and Reference Evapotranspiration from Climate Data Using Deep Learning Method

Abstract: In recent years, deep learning algorithms have been successfully applied in the development of decision support systems in various areas of agriculture, such as yield estimation, crop disease detection, and weed detection. Agriculture is the largest consumer of freshwater, and due to challenges such as scarce natural resources and climate change, an efficient decision support system for irrigation is crucial. Evapotranspiration and soil water content are the most critical factors in irrigation scheduling. In this paper, the ability of Long Short-Term Memory (LSTM) and Bidirectional LSTM (BLSTM) networks to model daily reference evapotranspiration and soil water content is investigated. The application of these techniques to predict these parameters was tested for three sites in Portugal. A single-layer BLSTM with 512 nodes was selected, and Bayesian optimization was used to determine the hyperparameters, such as the learning rate, decay, batch size, and dropout size. The model achieved mean square error values in the range of 0.014 to 0.056 and R² values ranging from 0.96 to 0.98. A Convolutional Neural Network (CNN) model was added to the LSTM to investigate potential performance improvements, but performance dropped on all datasets due to the complexity of the model. The performance of the models was also compared with CNN and the traditional machine learning algorithms Support Vector Regression and Random Forest; LSTM achieved the best performance. Finally, the impact of the loss function on the performance of the proposed models was investigated: the model with the mean square error as loss function performed better than models with other loss functions.


Introduction
The world's population living in urban areas will increase by 13% by 2050 [1]. On the other hand, with current agricultural production methods, the capacity of the Earth is already being exceeded [2]. Therefore, improving agricultural production and feeding the world without exhausting natural resources such as water remains the main challenge in agriculture.
Agricultural production depends on water. Agriculture is also the largest consumer of water, as it consumes on average 70% of all freshwater [3]. Therefore, it is essential to optimize irrigation and make irrigation systems more efficient, especially in arid regions. Many factors need to be considered when scheduling irrigation; the amount of water available in the soil and the reference evapotranspiration (ETo) are two critical ones [4,5]. Evapotranspiration is the amount of water that evaporates from the soil and plant surfaces. Solar radiation, wind speed, temperature, and relative humidity all influence daily ET, and this effect is highly nonlinear. The Penman-Monteith equation recommended by the Food and Agriculture Organization (FAO) is the most commonly used equation for calculating ETo [6].
Soil Water Content (SWC) is the volume of water per unit volume of soil. It can be measured by weighing a soil sample, drying it to remove the water, and weighing the dried soil again, or by remote sensing methods. Measuring SWC is important for irrigation scheduling. In this paper, LSTM-based models are used to predict daily ETo and SWC from climate data. CNN models are capable of extracting features from raw data [20]. A CNN model is stacked on the LSTM model to investigate the performance of CNN-LSTM models compared with LSTM alone. The performance of the model is also compared with CNN and with traditional machine learning algorithms such as Support Vector Regression (SVR) and Random Forest [21,22]. Finally, the impact of the loss function on the training of the model is investigated. The most commonly used loss functions for regression problems are selected, including MSE, RMSE, MAE, and the Huber function. The Huber function is essentially the MAE, which becomes the MSE when the error is small. Unlike the MAE, the Huber function is differentiable at 0, and it is less sensitive to outliers in the data than the MSE [23]. Each of these functions yields a different error for the same prediction and affects the performance of the model on the test set.

Data Collection
Data for three sites in Portugal, collected over a period of 7 to 10 years, were used for training. The datasets were retrieved from the stations Estação Póvoa de Atalaia and Estação Borralheira and from the Direção Regional de Agricultura e Pescas do Centro, Portugal. Table 1 shows the details of these sites. The soils of Loc. 1 and Loc. 3 are of light texture (sandy) or, more often, medium texture (sandy loam); they are permeable, with low to medium organic matter content, a slightly acid to neutral reaction, rich in phosphorus and potassium, and without salt effects. Climatically, these are areas with mild winters, with precipitation falling mostly in autumn and winter and occasionally in spring. Temperatures usually increase significantly from May onwards, peaking in July and occasionally in August, and several days with maximum temperatures above 40 °C are common. Registered temperatures have also been reaching progressively higher values in September, extending the summer period into October. At Loc. 2, the soils are similar, although soils with medium texture (loam) are more common. Climatically, this region differs mainly in spring and summer, with lower temperatures and higher relative humidity; precipitation values are always higher, as is the number of days with recorded precipitation. These plots are covered with fruit trees such as pear, nectarine, cherry, and peach trees. The Loc. 1 and Loc. 2 datasets were recorded daily, and the Loc. 3 dataset was recorded every 15 min. These datasets include the minimum, maximum, and average temperature; minimum, maximum, and average relative humidity; average solar radiation; minimum and average wind speed; and precipitation. Table A1 in Appendix A shows the details of these variables. Figure 1 shows the amount of precipitation at each site, which translates into the soil water content of the surface layer of the soil.
The ETo calculator is software developed by the FAO Land and Water Division. It estimates ETo from meteorological data using the FAO Penman-Monteith equation [6] and was used to calculate the daily ETo at the three locations. The soil water content at the end of each day was collected from the ECMWF ERA5-Land dataset [24]. ERA5-Land is a reanalysis dataset that provides a consistent view of the evolution of land variables over several decades at an improved resolution compared to ERA5. It was created by reproducing the land component of the ECMWF ERA5 climate reanalysis and runs at a higher resolution (9 km compared with 31 km for ERA5). A reanalysis combines model data with observations from around the world to produce a globally complete and consistent dataset using the laws of physics. The temporal frequency of the output is hourly. ERA5-Land is generated in a single simulation without coupling to the atmospheric module of the ECMWF Integrated Forecasting System (IFS) or to the IFS ocean wave model, and it runs without data assimilation, making it computationally affordable for relatively rapid updates. The core of ERA5-Land is the Tiled ECMWF Scheme for Surface Exchanges over Land, which incorporates land surface hydrology (H-TESSEL); it uses the CY45R1 version of the IFS [25].
Soil moisture values refer to the top 7 cm of soil as modeled using the Hydrology Tiled ECMWF Scheme for Surface Exchanges over Land (H-TESSEL) [26], which uses the Richards equation for water flow through the unsaturated soil profile [27]. H-TESSEL uses a variety of soil texture classes with specific properties, such as infiltration capacity and wilting point, and has a four-layer representation of the soil (0-7 cm, 7-28 cm, 28-100 cm, 100-289 cm). The volumetric soil water is associated with the soil texture, soil depth, and underlying water table [24]. Since layer four is very deep and its role in agricultural applications is limited, it was discarded from the dataset, and the three layers between 0 and 100 cm were considered in this study. Figure 2 shows the ETo and SWC at the different levels for each site.

Data Preprocessing
If a dataset contains values that are missing, invalid, or difficult to process, the algorithm may end up with overfitted, less accurate, or even misleading results. Therefore, before a model can be trained, data preparation procedures such as data transformation and statistical testing must be performed. The missing data were filled in using the moving average method [28].
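A moving-average imputation can be sketched as follows. This is an illustrative sketch only: the paper does not specify the exact variant [28], so the window size and the use of preceding values are assumptions.

```python
import numpy as np

def fill_missing_moving_average(x, window=3):
    """Fill NaN gaps with the mean of the preceding `window` valid values.

    The window size and backward-looking direction are assumptions; the
    paper's exact moving-average variant [28] is not specified.
    """
    x = x.astype(float).copy()
    for i in np.where(np.isnan(x))[0]:
        prev = x[max(0, i - window):i]
        prev = prev[~np.isnan(prev)]
        if prev.size:               # average of the most recent valid observations
            x[i] = prev.mean()
    return x

series = np.array([20.0, 21.0, np.nan, 23.0, np.nan])  # e.g., daily temperature
filled = fill_missing_moving_average(series)
```

Because gaps are filled sequentially, an imputed value can itself contribute to filling a later gap.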
The other preprocessing steps in this work are the removal of collinear parameters and normalization. Collinear parameters carry the same information about the variables and can lead to redundancies in the calculations. For each pair of parameters x and y, the correlation coefficient is calculated by Equation (1) [29]:
ρ(x, y) = cov(x, y) / (σ_x σ_y), (1)
where cov(x, y), σ_x, and σ_y are the covariance of x and y and the standard deviations of x and y, respectively. Correlation coefficients of 1 and −1 indicate linear and inverse linear correlation, respectively, and a value of 0 means no correlation. In this work, if ρ(x, y) exceeded a threshold, one of the two parameters was removed from the dataset. In the end, the parameters minimum and maximum temperature (T_min, T_max), average humidity (HR_Avg), average wind speed (WS_Avg), average solar radiation (SR_Avg), precipitation (Prec), ETo, and SWC were used as inputs to the models. Tables A2 and A3 in Appendix A show the correlation coefficients between the variables in the different datasets.
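The collinearity filter can be sketched as follows. The threshold value and the synthetic variables are assumptions for illustration; the paper does not state the threshold it used.

```python
import numpy as np

def drop_collinear(X, names, threshold=0.95):
    """Drop one feature of each pair whose Pearson correlation coefficient
    (Equation (1)) exceeds `threshold` in absolute value.
    The 0.95 threshold is an assumption, not the paper's value."""
    corr = np.corrcoef(X, rowvar=False)   # rho = cov(x, y) / (sigma_x * sigma_y)
    keep = []
    for j in range(X.shape[1]):
        # keep column j only if it is not too correlated with any kept column
        if all(abs(corr[j, k]) <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

rng = np.random.default_rng(0)
t_min = rng.normal(10, 3, 200)
t_avg = t_min + rng.normal(0, 0.1, 200)   # nearly collinear with t_min
rain = rng.normal(5, 2, 200)
X, kept = drop_collinear(np.column_stack([t_min, t_avg, rain]),
                         ["Tmin", "Tavg", "Prec"])
```

Here the near-duplicate temperature column is removed while the independent precipitation column survives.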
The goal of normalization is to bring the values of the dataset to a common scale without distorting the differences between value ranges. The standardization formula given by Equation (2) was used to normalize the data [30]:
z = (x − x̄) / σ, (2)
where x̄ and σ are the mean and standard deviation of the data, respectively.
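Column-wise standardization is a one-liner; a minimal sketch:

```python
import numpy as np

def standardize(X):
    """Standardization of Equation (2): z = (x - mean) / std, per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize(X)
```

After standardization, each column has zero mean and unit standard deviation regardless of its original scale.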

Deep Learning Methods
The LSTM, Bidirectional LSTM, and CNN-LSTM models were used in this paper.

LSTM Architectures
Recurrent Neural Networks (RNN) are specifically designed for processing sequential data. In an RNN, the output of the previous step is fed into the current step as input. Mathematically, an RNN is a nonlinear function that takes t steps of a time series as input and outputs a vector at time t + 1 [31]. The output of an RNN unit is determined using a tanh function by Equation (3):
h_t = tanh(W_x x_t + W_h h_{t−1} + b_t), (3)
where h_{t−1} is the recurrent output from the previous step, x_t is the input at the current time, and W_x, W_h, and b_t are the weights and bias of the network, learned during training. The main problem with RNNs is the vanishing gradient, i.e., the gradient of the loss function approaches zero [31]. LSTM [32] is a special type of RNN, created as a solution to the vanishing gradient problem. A single-layer LSTM is shown in Figure 3, with one unit for each time step. A common LSTM unit consists of a block input z_t with weight matrices W_zx, W_zh, b_z; an input gate i_t with weight matrices W_ix, W_ih, b_i; a forget gate f_t with weight matrices W_fx, W_fh, b_f; an output gate o_t with weight matrices W_ox, W_oh, b_o; and a memory cell c_t (see Figure 4).
The forget gate f_t determines what information remains in the cell by outputting a number between 0 and 1 using Equation (4); 0 means "forget all information completely" and 1 means "keep all information". The input gate i_t protects the unit from irrelevant inputs and decides which values to update using Equation (5). The output gate o_t uses the sigmoid function given by Equation (6) to decide which information in the cell is used to compute the output of the LSTM unit. The block input z_t creates a new candidate vector that can be added to the state using the tanh function given by Equation (7). The memory cell c_t combines the information retained from previous inputs with the current candidate, as given by Equation (8):
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f), (4)
i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i), (5)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o), (6)
z_t = tanh(W_zx x_t + W_zh h_{t−1} + b_z), (7)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ z_t, (8)
where σ is the sigmoid function and ⊙ denotes elementwise multiplication.
The output of the current block h_t is calculated using Equation (9):
h_t = o_t ⊙ tanh(c_t). (9)
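A single LSTM step following Equations (4)-(9) can be sketched in NumPy as below. The weights are randomly initialized here purely for illustration; the dimensions (seven input variables, eight time steps) mirror this paper's setup, while the hidden size is an arbitrary choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM unit step implementing Equations (4)-(9).
    W maps each gate name to its (W_x, W_h, b) triple."""
    f = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev + W["f"][2])   # forget gate, Eq. (4)
    i = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev + W["i"][2])   # input gate,  Eq. (5)
    o = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev + W["o"][2])   # output gate, Eq. (6)
    z = np.tanh(W["z"][0] @ x_t + W["z"][1] @ h_prev + W["z"][2])   # block input, Eq. (7)
    c = f * c_prev + i * z                                          # memory cell, Eq. (8)
    h = o * np.tanh(c)                                              # block output, Eq. (9)
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 7, 4                       # seven input variables; hidden size is arbitrary
W = {g: (rng.normal(size=(n_hid, n_in)),
         rng.normal(size=(n_hid, n_hid)),
         np.zeros(n_hid)) for g in "fioz"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(8):                       # eight days of history, as in this paper
    h, c = lstm_step(rng.normal(size=n_in), h, c, W)
```

Because h_t = o_t ⊙ tanh(c_t) with both factors bounded by 1 in magnitude, the block output always lies in (−1, 1).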

Bidirectional LSTM
Bidirectional LSTM (BLSTM) is an extension of the LSTM model that can improve results [33]. Each BLSTM layer is a stack of two LSTM layers, called the forward and backward layers (see Figure 5). The forward layer takes the sequence in positive order and computes a sequence of forward hidden states →h_1, …, →h_t. The backward layer receives an inverted copy of the sequence and computes a sequence of backward hidden states ←h_1, …, ←h_t. The final representation of the observation is obtained by concatenating the forward hidden states with the backward hidden states.
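The bidirectional wiring can be sketched as follows. For brevity, a simple tanh recurrent layer (Equation (3)) stands in for the LSTM cell; the point of the sketch is the forward pass, the pass over the inverted copy, and the concatenation.

```python
import numpy as np

def rnn_states(xs, Wx, Wh, b):
    """Hidden states of a simple tanh recurrent layer (Equation (3));
    it stands in for the LSTM cell to keep the sketch short."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        out.append(h)
    return np.array(out)

rng = np.random.default_rng(2)
seq = rng.normal(size=(8, 7))                   # eight time steps, seven variables
Wx, Wh, b = rng.normal(size=(4, 7)), rng.normal(size=(4, 4)), np.zeros(4)

fwd = rnn_states(seq, Wx, Wh, b)                # positive-order pass
bwd = rnn_states(seq[::-1], Wx, Wh, b)[::-1]    # inverted copy, re-aligned in time
blstm_out = np.concatenate([fwd, bwd], axis=1)  # forward ++ backward states
```

Each time step thus carries context from both the past (forward states) and the future (backward states) of the window.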

CNN-LSTM Model
CNN is a supervised learning model specifically designed to handle classification, detection, and segmentation [34]. A CNN model consists of a stack of convolutional layers, nonlinear activation functions (e.g., tanh, ReLU), pooling layers (e.g., max pooling, average pooling), and fully connected layers [31]. A convolutional layer contains a set of kernels whose parameters must be learned, and it is used to extract features from the data. Each kernel slides across the height and width of the input, and the dot product between the kernel entries and each position of the input is computed (see Figure 6). A nonlinear activation function is used to introduce nonlinear features into a model. Rectified linear units (ReLU) have become the state of the art and are the most commonly used activation function in deep learning models [31]. The output of a ReLU function is max(x, 0).
Figure 6. The left-hand side is the general CNN architecture, and the right-hand side is the convolutional operation.
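The three building blocks described above, convolution, ReLU activation, and average pooling, can be sketched in NumPy. Kernel count and sizes here are arbitrary illustration values, not the architecture of Table 7.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1-D convolution: each kernel slides along the time axis and
    the dot product with every window of the input is computed."""
    n_f, k, _ = kernels.shape
    steps = x.shape[0] - k + 1
    out = np.empty((steps, n_f))
    for s in range(steps):
        for f in range(n_f):
            out[s, f] = np.sum(kernels[f] * x[s:s + k]) + bias[f]
    return out

def relu(a):                     # ReLU activation: max(x, 0)
    return np.maximum(a, 0.0)

def avg_pool(a, size=2):         # non-overlapping average pooling over time
    trimmed = a[: (a.shape[0] // size) * size]
    return trimmed.reshape(-1, size, a.shape[1]).mean(axis=1)

rng = np.random.default_rng(3)
x = rng.normal(size=(8, 7))      # 8 time steps, 7 input variables
features = avg_pool(relu(conv1d(x, rng.normal(size=(16, 3, 7)), np.zeros(16))))
```

An 8-step window convolved with kernel size 3 yields 6 positions, which average pooling of size 2 reduces to 3 feature vectors.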
A CNN-LSTM model is a combination of a convolutional neural network and an LSTM model. The CNN layers are used to extract features from the input sequence, and the LSTM model is used to support sequence prediction [35].

Dropout
Overfitting occurs when the training loss is small but the generalization of the model to unseen data is unreliable. To prevent overfitting, regularization techniques such as dropout are recommended [31]. Dropout is a computationally inexpensive way to regularize a model during training by randomly removing units from the network. In LSTM models, dropout can be applied to the input layer, the outputs, and the recurrent outputs [36]. The dropout size is a hyperparameter that must be set during training. Table 2 shows the effect of dropout on the validation loss. In this work, dropout was applied to the recurrent outputs.
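The mechanism can be sketched as inverted dropout: surviving activations are rescaled so their expected value is unchanged. This is a generic sketch; in Keras the same idea is applied to recurrent outputs through the LSTM layer's `recurrent_dropout` argument.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of the units and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return h                         # dropout is disabled at inference time
    mask = rng.random(h.shape) >= rate   # keep each unit with probability 1 - rate
    return h * mask / (1.0 - rate)

rng = np.random.default_rng(4)
h = np.ones(10_000)
dropped = dropout(h, rate=0.3, rng=rng)
```

Roughly 30% of the units are zeroed, and the mean activation stays close to its original value.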

Bayesian Optimization
Bayesian optimization is a strategy for the global optimization of black-box functions. A black-box function is a function whose analytic expression is unknown and whose evaluation is restricted to sampling it at individual points [37]. The core components of Bayesian optimization are the objective function, the surrogate model, and the acquisition function. The objective function is the black-box function to be optimized with respect to some parameters; in a classification problem, for example, the accuracy can serve as the objective function. The surrogate model approximates the objective function; Gaussian processes and the Tree-structured Parzen Estimator are the most popular choices [38]. The acquisition function determines where to evaluate the objective function next. There are many options for acquisition functions, such as the Probability of Improvement (PI), Expected Improvement (EI), and the upper confidence bound [38].
The Bayesian optimization algorithm proceeds as follows [39]:
1. Choose a surrogate function to approximate the objective function and define its prior. This choice is flexible; the algorithm converges regardless, but a well-chosen prior can speed up convergence.
2. Given a set of observations, use Bayes' rule to obtain the posterior distribution over the objective function. Bayes' rule is given by Equation (10):
P(A|B) = P(B|A) P(A) / P(B), (10)
where A and B are the model and the new observations, respectively; P(A) is the prior probability that A is true; P(B) is the probability that B is observed; P(A|B) is the posterior distribution; and P(B|A) is the likelihood of observing B if model A is true.
3. Use an acquisition function α, which is a function of the posterior distribution, to determine the next sample point as x_next = argmax_x α(x).
4. Add the new sample to the observation set and return to step 2 until convergence or until the budget is exhausted.
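The four steps above can be sketched in a minimal 1-D loop: a Gaussian-process surrogate with an RBF kernel and the Expected Improvement acquisition, maximized over a grid. The toy objective stands in for the expensive validation loss, and every constant (length scale, grid, budget, seed points) is an illustrative assumption, not a setting of this paper's implementation [40].

```python
import numpy as np
from math import erf, sqrt, pi

def objective(x):                              # pretend validation loss to minimize
    return (x - 0.6) ** 2 + 0.05 * np.sin(15 * x)

def rbf(a, b, length_scale=0.15):              # surrogate prior covariance (step 1)
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and std of the GP surrogate at query points Xs (step 2)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # diag of posterior cov
    return mu, np.sqrt(np.maximum(var, 1e-12))

norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def expected_improvement(mu, sd, best):
    z = (best - mu) / sd                       # improvement over current best (minimization)
    return (best - mu) * norm_cdf(z) + sd * np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)

grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.1, 0.9])                       # random-exploration seed points
y = objective(X)
for _ in range(15):                            # Bayesian-optimization iterations
    mu, sd = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, y.min()))]  # step 3
    X, y = np.append(X, x_next), np.append(y, objective(x_next))     # step 4

best_x, best_loss = X[np.argmin(y)], y.min()
```

In hyperparameter tuning, each evaluation of `objective` would instead train the network and return its validation loss, which is exactly why keeping the number of evaluations small matters.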
In this paper, Bayesian optimization [40] is used to minimize the loss function in terms of batch size, dropout size, learning rate, and decay. Figure 7 shows the validation loss during Bayesian optimization for Loc. 3.

Evaluating the Models
Once the model is trained, the critical question is how good it is. The following metrics are used to evaluate the models:
• Mean Square Error (MSE): the average of the squared differences between predictions and actual observations; in other words, the variance of the error [41]. It is calculated by Equation (11):
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)². (11)
• Root Mean Square Error (RMSE): the square root of the MSE; in other words, the standard deviation of the error [41]. Because the errors are squared before averaging, large errors are penalized heavily, so the RMSE is useful when large errors are undesirable.
• Mean Absolute Error (MAE): the average of the absolute differences between predictions and actual observations, calculated by Equation (12) [41]:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|. (12)
• Mean Bias Error (MBE): captures the average bias in the prediction. A positive bias means that the model underestimates the observed data, while a negative bias means that the model overestimates them. It is calculated by Equation (13):
MBE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i). (13)
• R²: a statistical measure of the variance explained by the model relative to the total variance. The higher R² is, the smaller the differences between the actual observations and the predicted values. It is calculated by Equation (14) [19]:
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)², (14)
where y_i, ŷ_i, and ȳ are the actual observations, the values predicted by the model, and the mean of the actual observations, respectively.
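The metrics of Equations (11)-(14) can be computed directly; a minimal sketch with hypothetical values:

```python
import numpy as np

def metrics(y_true, y_pred):
    """Evaluation metrics of Equations (11)-(14)."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MBE": np.mean(err),                  # > 0 means the model underestimates
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

y = np.array([3.0, 4.0, 5.0, 6.0])            # hypothetical observations
p = np.array([2.5, 4.5, 5.0, 6.5])            # hypothetical predictions
m = metrics(y, p)
```

As the Results section notes, no single number tells the whole story: two models can have similar MSE yet different R², so reporting several metrics together is informative.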

Results and Discussion
The experimental environment consists of an Intel Core i5-4200H CPU, an NVIDIA GeForce GTX 950M graphics card, and 6.144 GB of RAM. The models were implemented using the TensorFlow framework and the Keras library [42,43].
The dataset was divided into training, validation, and test sets (70%, 20%, and 10%, respectively). The model predicted SWC and ETo one day ahead from eight days of historical data, including the minimum temperature (T_min), maximum temperature (T_max), average humidity (HR_Avg), average wind speed (WS_Avg), average solar radiation (SR_Avg), daily ETo, and SWC. The mean square error was used as the loss function, and Adam [44] was used as the optimization method.
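The windowing scheme, eight days of history in, one day ahead out, can be sketched as follows. The column order (ETo and SWC as the last two columns) is an assumption for the sketch.

```python
import numpy as np

def make_windows(data, history=8):
    """Split a multivariate daily series into (8-day input, next-day target)
    pairs. The targets (ETo and SWC) are assumed to be the last two columns."""
    X = np.array([data[i:i + history] for i in range(len(data) - history)])
    y = data[history:, -2:]            # ETo and SWC one day ahead of each window
    return X, y

days, n_vars = 100, 7                  # 7 variables, as in this paper's input set
series = np.arange(days * n_vars, dtype=float).reshape(days, n_vars)
X, y = make_windows(series)
```

Each sample X[i] contains days i..i+7, and its target y[i] is day i+8, so the shapes are (samples, 8, 7) and (samples, 2), which is the input layout an LSTM layer expects.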
Various machine learning algorithms have values that must be set before the learning process starts. These values, called hyperparameters, configure the training process and are set using the validation set. The hyperparameters of the proposed model are the learning rate, batch size, dropout size, number of layers, and number of nodes per layer. The number of hidden layers and the number of nodes per layer were selected manually, and Bayesian optimization was then used to set the remaining hyperparameters. Unlike deep feed-forward networks, LSTM does not require a deep network to obtain accurate results [20], and changing the number of nodes per layer has been shown to have a greater impact on results than changing the number of layers [20]. In this work, the number of LSTM layers was kept between two and four, the number of BLSTM layers below three, and the number of nodes per layer below 513. The network architectures were tested with two, three, and four LSTM layers, with one and two BLSTM layers, and with 64, 128, 256, and 512 nodes per layer. The MSE was calculated on the validation dataset for all combinations of layer and node counts, and the results for the different sites are shown in Table 3. Increasing the number of LSTM layers to four and the number of BLSTM layers to two increased the MSE on the validation dataset, indicating overfitting. With two and three LSTM layers for Loc. 1 and Loc. 2, increasing the number of nodes per layer first increased and then decreased the network accuracy on the validation dataset, with the best accuracy achieved at 256 nodes. The best accuracy for Loc. 3 was achieved with a single-layer BLSTM with 512 nodes. Finally, the mean error of each architecture across the three datasets was calculated, and the single-layer BLSTM with 512 nodes performed best, with a mean of 0.01536.
Among the simple LSTM models, the double-layer LSTM with 256 nodes performed better than the others. Optimizing the hyperparameters of a machine learning model is a minimization problem that seeks the hyperparameters with the least validation loss, and tuning them with Bayesian optimization can reduce the time required to find the optimal set. In this paper, the validation loss with respect to the learning rate, batch size, decay, and dropout size was minimized using Bayesian optimization [40]. Ten steps of random exploration and 60 iterations of Bayesian optimization were performed; random exploration helps diversify the exploration space [40]. The learning rate and decay were kept between 10^−7 and 10^−2, and the batch size and dropout size were kept in the ranges (32, 128) and (0, 0.5), respectively. For each iteration of the Bayesian optimization algorithm, the network was run for 50 epochs. In the first ten iterations, the algorithm selected the hyperparameters randomly and calculated the validation loss; it then attempted to minimize the validation loss with respect to the hyperparameters (see Figure 7). The validation loss and the selected value of each hyperparameter were stored after each iteration, and at the end of the 70 iterations, the set of hyperparameters with the minimal validation loss was selected. These selected values are shown in Table 4; for simplicity, the decay and learning rates were rounded to powers of 10. The model was trained in two ways: once on all datasets (using 70% of each dataset) and once on two datasets while testing on the third. Each model was trained for a maximum of 500 epochs. Early stopping was used to avoid overfitting; it is a regularization strategy that determines how many epochs can be run before overfitting begins [31].
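The early stopping rule can be sketched as a simple patience counter over the validation-loss history; the patience of 10 epochs here is an example value, and Keras exposes the same behavior through its EarlyStopping callback.

```python
def train_with_early_stopping(val_losses, patience=10):
    """Return (stop_epoch, best_epoch): training stops once the validation
    loss has not improved for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:                      # plateau exceeded: stop
                break
    return epoch, best_epoch

# Synthetic history: loss improves for 20 epochs, then plateaus.
history = [1.0 / (e + 1) for e in range(20)] + [0.05] * 30
stop_epoch, best_epoch = train_with_early_stopping(history)
```

With this history, the best loss occurs at epoch 19 and training stops ten epochs later, well before the full 50-epoch budget is spent.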
In our implementation, the validation error was monitored during training, and if it did not improve within ten epochs, training was stopped. Figure 8 shows the R² and RMSE on the training and validation sets during training. The error kept improving after 200 epochs on the training set but no longer improved on the validation set. Tables 5 and 6 and Figure 9 show the evaluation of the model on the test set and the predicted values compared with the true values of SWC and ETo, respectively. The results indicate that although the MSE (the variance of the error) at Loc. 2 was lower than at Loc. 3, the R²-score (the variance explained by the model over the total variance) was better at Loc. 3; it is therefore not sufficient to use only one metric to evaluate a model. As Figure 9 and Table 5 show, the best accuracy in predicting ETo and SWC was achieved with the datasets of Loc. 2 and Loc. 3, respectively. ETo was underestimated by the model at Loc. 2 and Loc. 3 but overestimated at Loc. 1. The results also show that the accuracy of the SWC prediction increases with soil depth, which may be because the range and standard deviation of SWC decrease as the soil depth increases. Adding further input variables, such as the irrigation amount, and adding real measurements to the simulated dataset may improve the accuracy of the predicted SWC in the upper layer (see Table A1).
As the results show, the models trained with two datasets performed similarly to the model trained with all datasets, indicating that a model trained with datasets from different locations can generalize to other locations. CNN models are capable of extracting features from raw data. Convolutional layers were added to the proposed models to investigate the performance of the CNN-LSTM model for predicting ETo and SWC. Table 7 shows the CNN architectures used to stack the LSTM models. First, the input sequence was divided into two subsequences, each with four time steps. The CNN interprets each four-step subsequence, and the LSTM uses the time series of these interpretations as input. A time-distributed layer wrapped the CNN model, allowing each layer to be applied to each subsequence [43]. The CNN model contained two convolutional layers with kernel sizes of 3 and 2, respectively, followed by an average pooling layer with a pooling size of 2. The number of filters was the same as the number of nodes in the proposed BLSTM model, i.e., 512 filters in each convolutional layer.
In this paper, the performance of the models was improved by using the tanh activation function and average pooling layers instead of ReLU and max pooling (see Table 8). The performance and computational efficiency of the CNN-BLSTM models are shown in Tables 8 and 9. The BLSTM model outperformed the CNN-BLSTM model in both performance and computation time. As shown in Table 9, the number of trainable parameters tripled in the CNN-BLSTM model compared with the BLSTM; consequently, the model began to overfit due to its complexity. Figure 10 shows the RMSE and R²-score during the training of the CNN-LSTM models.
Table 9. Computation efficiency of the LSTM and CNN-LSTM models.
The performance of the LSTM model was also compared with the CNN model and two traditional machine learning models: SVR and Random Forest. The CNN architecture used for the comparison was the same as the CNN architecture built on top of the LSTM, with the exception that the time-distributed layers were removed and a dense layer was added in place of the LSTM layers to make the prediction (see Table 10).

Models
RF is one of the most powerful machine learning methods. A Random Forest consists of several Decision Trees. Each tree is a very simple model with branches and nodes; at each node, a condition is checked, and the flow follows one branch if it is satisfied and the other branch otherwise, continuing until a leaf is reached [22]. Support Vector Regression (SVR) is a supervised learning algorithm used to predict continuous values. The basic idea behind SVR is to find the best-fitting line: the hyperplane that contains the maximum number of input points within its margin [21]. A kernel in the SVR model is a mathematical function that takes data as input and transforms it into the desired form; kernels are generally used to find a hyperplane in a higher-dimensional space. The most commonly used kernels include the Linear, Nonlinear, Polynomial, Radial Basis Function (RBF), and Sigmoid kernels, each chosen according to the dataset. In this work, the RBF kernel was used [45]. Table 11 shows the performance of each model on the test set. As shown, the LSTM models performed best on the datasets, and the CNN model achieved the second-best performance. One application of this model is a decision system that decides when and how much to irrigate in order to avoid wasting water without compromising productivity. In future work, a reinforcement learning agent can be trained to select the best irrigation amount. In deep reinforcement learning, an environment interacts with an agent: during training, the agent chooses an action based on the current state of the environment, and the environment returns the reward and the next state to the agent. The agent tries to choose the actions that maximize the reward [46].
In the agricultural domain, the state of the environment can be defined as the climate data and the water content of the soil and ETo; the action is the amount of irrigation, and the reward is the net yield. An agent can be trained to choose the irrigation amount based on the state of the field, and the SWC and ET prediction model can be used as part of the environment of Deep Reinforcement Learning to calculate the next state of the environment. By using the trained model in this paper, the training of the agent can be end-to-end and does not require manual processing.
Finally, the importance of the loss function for training the LSTM models was investigated for Loc. 3. The model was trained using the MSE, RMSE, MAE, and Huber loss functions. The Huber loss function (H_δ) is a composition of the MAE and MSE and can be calculated by Equation (15):
H_δ(e) = (1/2) e² if |e| ≤ δ, and δ(|e| − δ/2) otherwise, (15)
where e is the prediction error and δ is a hyperparameter that must be set. The Huber function is essentially the MAE, which becomes the MSE when the error is small; unlike the MAE, it is differentiable at 0. Table 12 shows the performance of the model on the validation set using the Huber, MAE, RMSE, and MSE loss functions for Loc. 3. The model with MSE as the loss function outperformed the others.
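Equation (15) can be sketched directly; the sample errors below are hypothetical illustration values.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss of Equation (15): quadratic (MSE-like) for errors up to
    delta, linear (MAE-like) beyond it, and differentiable at 0."""
    e = np.abs(y_true - y_pred)
    quad = 0.5 * e ** 2                    # small-error branch: (1/2) e^2
    lin = delta * (e - 0.5 * delta)        # large-error branch: delta (|e| - delta/2)
    return np.mean(np.where(e <= delta, quad, lin))

y = np.array([0.0, 0.0])
p = np.array([0.5, 3.0])                   # one small error, one outlier
loss = huber(y, p, delta=1.0)
```

The outlier of 3.0 contributes only linearly (2.5) rather than quadratically (4.5), which is why the Huber loss is less sensitive to outliers than the MSE.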

Conclusions
The advantage of recurrent neural networks is their ability to process time-series data such as agricultural datasets. In this paper, the ability of recurrent neural networks, including LSTM and its extension BLSTM, to model ETo and SWC was investigated. A drawback of deep learning methods is that they require a large amount of data for training; to overcome this problem, a simulated dataset was used for ETo and SWC. The BLSTM model outperformed the LSTM model on the validation set, and hence a single-layer BLSTM model with 512 nodes was trained. The proposed model achieved an MSE in the range of 0.014 to 0.056, and the results show that the BLSTM model has good potential for modeling ETo and SWC. CNN models are able to extract features from the data, so a CNN model was added to the BLSTM model to extract features, which the BLSTM then used to predict ETo and SWC. When the CNN model was added, the number of parameters in the CNN-BLSTM roughly tripled, the model began to overfit after 100 epochs, and the BLSTM showed better results than the CNN-BLSTM. Therefore, increasing the number of trainable parameters in deep learning algorithms may cause the model to overfit.
The performance of BLSTM was also compared with Random Forest, SVR, and CNN. The best performance was obtained with BLSTM. The second-best performance was obtained with the CNN model. Among the machine learning methods, RF outperformed the SVR model. One disadvantage of BLSTM compared to CNN, SVR, and RF was that the training time of BLSTM was higher than the other models, but when trained, it was more accurate, and the computation time for prediction was almost the same as the other models.
Finally, the importance of choosing a loss function was investigated. The model with MSE as the loss function performed better than the other models, with an MSE of 0.056, while the model with H_δ=0.0001 as the loss function had the worst performance. These results show that the choice of loss function depends on the problem and must be made carefully.
In future work, variables such as the irrigation amount and groundwater level will be added to the input of the model to make the prediction more accurate. As mentioned earlier, one application of the SWC and ETo prediction model is an end-to-end decision support system that automatically decides when and how much to irrigate. Deep reinforcement learning models can be used to build such a system: an agent can be trained to select the amount of irrigation based on the condition of the field. Deep reinforcement learning algorithms require an environment that interacts with the agent and returns the next state and reward; the SWC and ETo prediction model can serve as part of that environment, determining the next state, i.e., the SWC and ETo one day ahead.
Funding: This work is supported by the project Centro-01-0145-FEDER000017-EMaDeS-Energy, Materials, and Sustainable Development, co-funded by the Portugal 2020 Program (PT 2020), within the Regional Operational Program of the Center (CENTRO 2020) and the EU through the European Regional Development Fund (ERDF). Fundação para a Ciência e a Tecnologia (FCT-MCTES) also provided financial support via project UIDB/00151/2020 (C-MAST).

Acknowledgments:
We would like to express our sincere gratitude for the support provided by AppiZêzere and DRAP-Centro with the data from the meteorological stations near Fadagosa. P.D.G. and T.M.L. acknowledge Fundação para a Ciência e a Tecnologia (FCT-MCTES) for its financial support via the project UIDB/00151/2020 (C-MAST).

Conflicts of Interest:
The authors declare no conflict of interest.