Crop Yield Estimation Using Deep Learning Based on Climate Big Data and Irrigation Scheduling

Abstract: Deep learning has already been used successfully to develop decision support systems in various domains, so there is an incentive to apply it to other important domains such as agriculture. Fertilizers, electricity, chemicals, human labor, and water are the components of total energy consumption in agriculture. Yield estimates are critical for food security, crop management, irrigation scheduling, and estimating labor requirements for harvesting and storage; estimating product yield can therefore reduce energy consumption. Two deep learning models, Long Short-Term Memory and Gated Recurrent Units, have been developed for the analysis of time-series data such as agricultural datasets. In this paper, the capabilities of these models and their extensions, called Bidirectional Long Short-Term Memory and Bidirectional Gated Recurrent Units, to predict end-of-season yields are investigated. The models use historical data, including climate data, irrigation scheduling, and soil water content, to estimate end-of-season yield. The approach was tested on tomato and potato yields at a site in Portugal. The Bidirectional Long Short-Term Memory outperformed the Gated Recurrent Units network, the Long Short-Term Memory, and the Bidirectional Gated Recurrent Units network on the validation dataset. The model was able to capture the nonlinear relationship between irrigation amount, climate data, and soil water content and predict yield with an MSE of 0.017 to 0.039. The performance of the Bidirectional Long Short-Term Memory on the test set was compared with the most commonly used deep learning method, the Convolutional Neural Network, and with machine learning methods including a Multi-Layer Perceptron model and Random Forest Regression. The Bidirectional Long Short-Term Memory outperformed the other models with an R² score between 0.97 and 0.99.
The results show that analyzing agricultural data with the Long Short-Term Memory model improves the performance of the model in terms of accuracy. The Convolutional Neural Network model achieved the second-best performance. Therefore, the deep learning model has a remarkable ability to predict the yield at the end of the season.


Introduction
Agriculture is in a state of flux, and obstacles are emerging, such as climate change, environmental impacts, and a lack of labor, resources, and land. Annual population growth and the increasing demands on agriculture to produce more from the same amount of agricultural land while protecting the environment are among the significant challenges of this century [1]. This scenario reinforces the constant need to seek alternatives to these challenges to ensure higher productivity and better quality. Sustainable production of sufficient, safe, and high-quality agricultural products will be achievable if new technologies and innovations are adopted. Smart farms rely on data and information generated in the field.
In one study, a standard error (SE) of 6.59% between the visual fruit count and the fruit detection by the model was determined. An LSTM model was trained for per-tree yield estimation and total yield estimation. Actual and estimated yields per tree were compared, yielding an approximate error of SE = 4.53% and a standard deviation of SD = 0.97 kg. Maimaitijiang et al. [12] used Partial Least Squares Regression (PLSR), Random Forest Regression (RFR), Support Vector Regression (SVR), a DNN based on input-level feature fusion (DNN-F1), and a DNN based on mid-level feature fusion (DNN-F2) to estimate soybean yield. The results showed that multimodal data fusion improved the accuracy of yield prediction. DNN-F2 achieved the highest accuracy, with an R² score of 0.720 and a relative root mean square error (RMSE) of 15.9%. Yang et al. [13] proposed a CNN architecture for predicting rice grain yield from low-altitude remote sensing images at the maturity stage. The proposed model consisted of two separate branches for processing RGB and multispectral images. In a large rice-growing region of Southern China, a 160-hectare area with over 800 cultivation units was selected to investigate the ability of the model to estimate rice grain yield.
The network was trained with different datasets and compared with the traditional vegetation index-based method. The results showed that the CNNs trained with the RGB and multispectral datasets performed much better than the VI-based regression model in estimating rice grain yield at the maturity stage. Chen et al. [14] proposed a Faster Region-based Convolutional Neural Network (Faster R-CNN) for detecting and counting the numbers of flowers, mature strawberries, and immature strawberries. The model achieved a mean average precision of 0.83 for all detected objects at a height of 2 m and 0.72 for all detected objects at a height of 3 m. Zhou et al. [15] implemented an SSD model with two lightweight backbones, MobileNetV2 and InceptionV3, to develop an Android app called KiwiDetector for detecting kiwis in the field. The results showed that MobileNetV2, quantized MobileNetV2, InceptionV3, and quantized InceptionV3 achieved true detection rates of 90.8%, 89.7%, 87.6%, and 72.8%, respectively.
The disadvantages of estimating the yield from images are:
• Pictures of the entire field must be collected each year to identify the crop in the pictures and then estimate the yield.
• To train the model, a large number of labeled images is needed, which is very time-consuming.
• Illumination variance, foliage cover, overlapping fruits, shaded fruits, and scale variations affect the images [16].
Ma et al. [17] used climate data, remote sensing data, and rice information to estimate rice yield. A Stacked Sparse Auto-Encoder (SSAE) was trained and achieved a percent mean square error of 33.09 kg (10 a)^−1. Han et al. [18] applied machine learning methods, including Support Vector Machine (SVM), Gaussian Process Regression (GPR), Neural Network (NN), K-Nearest Neighbor Regression, Decision Tree (DT), and Random Forest (RF), to integrate climate data, remote sensing data, and soil data to predict winter wheat yield on the Google Earth Engine (GEE) platform. SVM, RF, and GPR, with an R² > 0.75, were the three best yield prediction methods. They also found that different agricultural zones and temporal training settings affected the prediction accuracy. Kim et al. [19] developed an optimized deep neural network for crop yield prediction using optimized input variables from satellite products and meteorological datasets. The input data were extracted from satellite-based vegetation indices and meteorological and hydrological data, and a matchup database was created on the Cropland Data Layer (CDL), a high-resolution map for classifying plant types. Using the optimized input dataset, they implemented six major machine learning models, including multivariate adaptive regression splines (MARS), SVM, RF, extremely randomized trees (ERT), ANN, and DNN. The DNN model outperformed the other models in predicting corn and soybean yields, with mean absolute errors of 21-33% and 17-22%, respectively. Abbas et al. [20] used four machine learning algorithms, namely Linear Regression (LR), Elastic Net (EN), K-Nearest Neighbor (k-NN), and Support Vector Regression (SVR), to predict the tuber yield of potato (Solanum tuberosum) from soil and plant trait data acquired by proximal sensing. Four datasets were used to train the models. The SVR models outperformed all the other models on each dataset, with RMSEs ranging from 4.62 to 6.60 t/ha.
The performance of k-NN remained poor in three out of four datasets.
In these papers, however, the amount of irrigation was not considered as an input to the model. Yield is highly dependent on the amount of irrigation, and a change in the amount of water can make a big difference in the yield. Considering irrigation scheduling as an input to the model can help to create an intelligent system that selects the best irrigation schedule to save water consumption without affecting production. To optimize irrigation based on productivity, the irrigation amount must be considered as an input to the model.
In this work, Recurrent Neural Networks (RNNs), including the LSTM and Gated Recurrent Units (GRU) models and their extensions, Bidirectional LSTM (BLSTM) and Bidirectional GRU (BGRU), were implemented to estimate tomato yield based on climate data, irrigation amount, and the water content in the soil profile. Agricultural datasets are time series, and agricultural forecasting relies heavily on historical datasets. The advantage of RNNs is their ability to process time-series data and make decisions for the future based on historical data. The proposed models predict the yield at the end of the season given historical data from the field such as temperature, wind speed, solar radiation, ETo, the water content in the soil profile, and irrigation scheduling during a season (the codes are available at the following links: https://github.com/falibabaei/tomato-yield-estimation/blob/main/main (accessed date 22 May 2021), https://github.com/falibabaei/potato-yield-estimation/tree/main (accessed date 22 May 2021)). The performance of the models was evaluated using the mean square error and R² score. In addition, the performance of these models was compared with a CNN, an MLP model, and Random Forest Regression (RF). The advantages of the yield estimation model are:
• Using RNN models to extract features from past observations in the field and predict yield at the end of the season.
• Using climatic data collected in the field as input to the model, which is easier than using images collected from the field.
• Using the irrigation amount as an input to the model; it is shown that the model can capture the relationship between irrigation amount and yield at the end of the season.
• Showing that the model can be used as part of an end-to-end irrigation decision-making system, which can be trained to decide when and how much to irrigate and to maximize net return without wasting water.

Deep Learning Algorithms
Machine learning (ML) is a subfield of artificial intelligence that uses computer algorithms to transform raw data from the real world into valuable models. ML techniques include Support Vector Machines (SVM), Decision Trees, Bayesian learning, K-Means clustering, association rule learning, regression, neural networks, and many others [2].
Deep learning is a subfield of ML. The word "deep" refers to the number of hidden layers in DL algorithms, making them more complex than ML algorithms. Deep neural networks can learn the features from data with multiple hidden layers and solve more complex problems. Unlike ML methods, DL models automatically extract useful features from the raw data through training and do not require feature engineering. The training time of DL models is longer than that of ML methods, and they require a large amount of data to train, but when trained, they are more accurate and faster [21]. For these reasons, they have been widely used in recent years.
The most widely used DL algorithms are the CNN and the Recurrent Neural Network (RNN). CNN models are widely used for classification, recognition, and localization. In 2012, AlexNet [22] won the ILSVRC competition for classification. Sermanet et al. [23] showed that DL algorithms could be used for classification, recognition, and localization and achieved excellent results. However, CNN models make predictions based on the current input data and do not use past observations to make future decisions.
Unlike in a CNN, information in recurrent neural networks goes through a loop that allows the network to remember previous outputs [24]. This enables the analysis of sequences and time series. RNNs are commonly used for natural language processing and other sequential tasks. A recurrent network can be thought of as multiple copies of the same network, each passing information to the next (see Figure 1). The output of an RNN unit is calculated by Equation (1):

h_t = tanh(W_x x_t + W_h h_{t−1} + b_t), (1)

where h_{t−1} is the recurrent output from the previous step, x_t is the input at the current time, and W_x, W_h, and b_t are the weights and bias of the network that are adjusted during training. The problem with RNNs is that for long input sequences the gradient of the loss function can vanish to zero, effectively preventing the weights of the model from being updated [24].
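The recurrence in Equation (1) can be sketched in a few lines of NumPy; the dimensions, weight scales, and random inputs below are illustrative assumptions rather than the configuration used in this work:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
n_in, n_hidden = 7, 4                       # e.g., 7 daily input variables
W_x = rng.normal(size=(n_hidden, n_in)) * 0.1
W_h = rng.normal(size=(n_hidden, n_hidden)) * 0.1
b = np.zeros(n_hidden)

# Unroll the same cell over a short sequence of daily observations
h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):      # 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, b)
```

Because the same weights are reused at every step, the final state h summarizes the whole sequence, which is what makes the recurrence suitable for season-long inputs.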

LSTM and GRU Structures
The LSTM [25] and Gated Recurrent Units (GRU) [26] were developed to address the problems of the RNN.
The LSTM contains a forget gate that can be used to train individual neurons on what is important and how long it remains important. An ordinary LSTM unit consists of a block input z_t, an input gate i_t, a forget gate f_t, an output gate o_t, and a memory cell c_t (see Figure 2). The forget gate f_t removes information that is no longer useful from the cell state using Equation (2): the input at the current time x_t and the previous cell output h_{t−1} are multiplied by weight matrices, the bias is added, and the result is passed through a sigmoid function that returns a number between 0 and 1. If the output is 0, the information is forgotten for the given cell state; if the output is 1, the information is retained for future use. Adding useful information to the cell state is performed by the input gate i_t using Equation (3). First, the information is controlled by the sigmoid function, which filters the values to be stored, similar to the forget gate. Then, a vector of new candidate values of h_{t−1} and x_t is generated by the block input z_t using Equation (4), whose output ranges from −1 to +1. The candidate values and the controlled values are multiplied, and the cell state is updated, using Equation (5). The output gate o_t decides which information in the cell is used to calculate the output of the LSTM unit using Equation (6):

f_t = σ(W_f x_t + U_f h_{t−1} + b_f), (2)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i), (3)
z_t = tanh(W_z x_t + U_z h_{t−1} + b_z), (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ z_t, (5)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o). (6)

First, a vector is created by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be stored based on the inputs h_{t−1} and x_t. The vector values and the regulated values are multiplied by Equation (7) to produce the output, which is also fed as input to the next cell:

h_t = o_t ⊙ tanh(c_t). (7)
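As a sketch of how the gates fit together, the following NumPy function steps a single LSTM cell over one season of inputs. The gate parameterization follows the standard LSTM formulation, and all shapes and random values are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold per-gate parameters keyed 'f', 'i', 'z', 'o'."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    z_t = np.tanh(W['z'] @ x_t + U['z'] @ h_prev + b['z'])   # block input
    c_t = f_t * c_prev + i_t * z_t                           # cell state update
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    h_t = o_t * np.tanh(c_t)                                 # unit output
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 7, 8
W = {g: rng.normal(size=(n_hid, n_in)) * 0.1 for g in 'fizo'}
U = {g: rng.normal(size=(n_hid, n_hid)) * 0.1 for g in 'fizo'}
b = {g: np.zeros(n_hid) for g in 'fizo'}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(110, n_in)):    # one 110-day season of inputs
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The separate cell state c is what lets the unit carry information across many days without the gradient vanishing through repeated tanh squashing.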
GRUs discard the cell state and use the hidden state to transmit information. This architecture contains only two gates: the update gate z_t and the reset gate r_t. Like LSTM gates, GRU gates are trained to selectively filter out irrelevant information while preserving the useful information, and can be calculated using Equations (8)-(11):

z_t = σ(W_z x_t + U_z h_{t−1} + b_z), (8)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r), (9)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h), (10)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t. (11)

The functions tanh and σ add nonlinearity to the network and allow the model to capture the nonlinear relationships between the inputs and outputs. At the beginning of training, the weights and biases in Equations (2)-(4), (6) and (8)-(10) are set randomly. During training, the model tries to set the weights and biases in such a way that the loss function is minimized. Therefore, training an RNN model is an optimization problem.
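A minimal sketch of the GRU update, with illustrative shapes and random parameters (the candidate state below plays the role usually written h̃_t):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step with update gate z_t and reset gate r_t."""
    z_t = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])   # update gate
    r_t = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])   # reset gate
    h_cand = np.tanh(W['h'] @ x_t + U['h'] @ (r_t * h_prev) + b['h'])  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_cand               # blended hidden state

rng = np.random.default_rng(2)
n_in, n_hid = 7, 8
W = {g: rng.normal(size=(n_hid, n_in)) * 0.1 for g in 'zrh'}
U = {g: rng.normal(size=(n_hid, n_hid)) * 0.1 for g in 'zrh'}
b = {g: np.zeros(n_hid) for g in 'zrh'}

h = np.zeros(n_hid)
for x_t in rng.normal(size=(110, n_in)):
    h = gru_step(x_t, h, W, U, b)
```

Note that the new hidden state is a convex combination of the old state and the candidate, so the GRU exposes its full memory at every step, unlike the LSTM's gated output.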
The LSTM and GRU models have similar units, but they also differ. For example, in the LSTM unit, the amount of memory content seen or used by other units in the network is controlled by the output gate. In contrast, the GRU releases all of its content without any control [27]. From these similarities and differences alone, it is difficult to conclude which model performs better on one problem than another. In this paper, both models were implemented to see which model performs better on the yield estimation problem.

Bidirectional LSTM Structure
Bidirectional LSTM (BLSTM) is an extension of the LSTM model that can improve the results [28]. Its architecture consists of a stack of two separate intermediate LSTM layers that send a sequence forward and backward to the same output layer, using contextual information from both sides of the sequence (see Figure 3).
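The bidirectional idea can be illustrated with a generic recurrent step run once forward and once backward over the sequence, with the two final states concatenated for the output layer; the simple tanh step used here stands in for a full LSTM cell, and all shapes are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def bidirectional(seq, params_fwd, params_bwd, n_hid):
    """Run one forward and one backward pass, then concatenate the final states."""
    h_f = np.zeros(n_hid)
    for x_t in seq:                       # forward in time
        h_f = rnn_step(x_t, h_f, *params_fwd)
    h_b = np.zeros(n_hid)
    for x_t in seq[::-1]:                 # backward in time
        h_b = rnn_step(x_t, h_b, *params_bwd)
    return np.concatenate([h_f, h_b])     # both contexts feed the output layer

rng = np.random.default_rng(3)
n_in, n_hid = 7, 4
make = lambda: (rng.normal(size=(n_hid, n_in)) * 0.1,
                rng.normal(size=(n_hid, n_hid)) * 0.1,
                np.zeros(n_hid))
out = bidirectional(rng.normal(size=(10, n_in)), make(), make(), n_hid)
```

The concatenated vector is twice the hidden size, which is why a bidirectional layer doubles the features seen by the next layer.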

Dropout and Early Stopping
Overfitting occurs when the model learns noise in the training data; the generalization of the model to unseen data then becomes unreliable, and it cannot make accurate estimates. Regularization techniques are used to prevent overfitting. One of them is "dropout", which consists of randomly selecting some neurons of a hidden layer and blocking their output so that they are not evaluated in the learning algorithm, and then, after a while, releasing the outputs of the blocked neurons and blocking other neurons [29]. This makes the neural network more general, so it does not depend on only one group of neurons to make certain decisions. In LSTM models, dropout can be added to the inputs of the layers, the outputs of the layers, and the recurrent outputs [29]. The dropout size is a hyperparameter that should be set during training (see Figure 4). Table 1 shows the effect of the dropout on the validation loss.
Another solution is early stopping, which consists of splitting the dataset into three sets: a training set, a validation set, and a test set [30]. In this method, the validation loss is evaluated at each epoch, and if it does not improve for a certain number of epochs, training is stopped. This technique keeps the network from becoming overly specialized to the training set.
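The early-stopping rule described above can be written as a small loop; the synthetic validation curve and patience value below are illustrative, not taken from the experiments:

```python
def train_with_early_stopping(val_losses, patience):
    """Stop when the validation loss has not improved for `patience` epochs.
    Returns the epoch at which training stopped and the best loss seen."""
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0            # improvement: reset the counter
        else:
            wait += 1                       # no improvement this epoch
            if wait >= patience:
                return epoch, best          # patience exhausted: stop
    return len(val_losses) - 1, best        # ran through every epoch

# Synthetic validation curve: improves quickly, then plateaus
curve = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.36, 0.38, 0.37, 0.39]
stopped_at, best_loss = train_with_early_stopping(curve, patience=3)
```

With a patience of 3, training stops at the third consecutive epoch without a new best loss, which is exactly the behavior controlled by the patience hyperparameter discussed later.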
In this work, dropout and early stopping were used to prevent overfitting.

Data Collection
The climate big data were collected by an agricultural weather station at a site in Portugal. They were retrieved from the government agency of the Ministries of Agriculture and the Sea, the Direção Regional de Agricultura e Pescas do Centro, Portugal (www.drapc.gov.pt (accessed date 22 May 2021)). The soil type in Fadagosa is either sandy or sandy loam, permeable, with low to medium organic matter content, with a low-acid to neutral reaction, rich in phosphorus and potassium, and without salt effects. The climate type of Fadagosa is the Mediterranean hot-summer climate (Csa). These are areas with mild winters, with precipitation falling mainly in autumn and winter and occasionally in spring. Summers are hot and dry, with maximum temperatures above 40 °C. Figure 5 shows the Fadagosa region from Google Earth, and Table 2 shows location details.
The big data included minimum, maximum, and average temperature; minimum, maximum, and average relative humidity; average solar radiation; minimum and average wind speed; and precipitation. They were recorded every 15 min and converted to daily variables in the data preprocessing step. Table 3 shows the details of these variables. Figure 6 shows the daily variables from 2010 to 2019 (10 years of data).
To train a deep learning model on a dataset, a sufficient amount of data is needed. However, in reality, recording crop yields at the end of the season under different irrigation schedules is extremely slow and sometimes impossible (e.g., for a season without irrigation). The simulation model package AquaCrop was used to overcome this problem. AquaCrop is a crop growth model developed by the Food and Agriculture Organization (FAO) to ensure food security and assess the impact of environmental and management influences on crop production [31]. The structure of the model was designed to be applicable across different locations, climates, and seasons.
To achieve this goal, AquaCrop distinguishes between conservative (fixed) and non-conservative (case-specific) parameters. The conservative parameters do not change with geographic location, crop type, management practices, or time; they are intended to be determined with data from favorable and non-stressful conditions but remain applicable under stressful conditions through their stress response functions [31]. The AquaCrop model was calibrated with climate data collected over the past ten years at Fadagosa. The crops selected for model calibration were tomato and potato. In the crop menu of AquaCrop, a planting/sowing date is generated by automatically evaluating the rainfall or temperature data prior to seeding. The temperature criterion was selected to determine the planting/sowing date: April 7 was generated for tomato and February 17 for potato. The experiments were conducted on sandy loam soil typical of Fadagosa with no salinity. The irrigation method chosen was sprinkler irrigation, which is adjustable in AquaCrop. A fixed interval of four days and a fixed value between 0 and 60 mm were used as the time and depth criteria for calculating the irrigation regimes. Different irrigation treatments were applied in each experimental year, including no irrigation and a fixed irrigation depth of 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 60 mm every four days or when the allowable depletion reached the thresholds of 90%, 80%, and 70%, respectively. Other parameters were kept unchanged. Figure 7 shows tomato yield under a fixed irrigation depth of 20 mm and the water content throughout the soil profile from 2010 to 2018. As can be seen in the figure, the yield varies under different climatic conditions. Evapotranspiration is the combined loss of water from the soil by evaporation and from the crop by transpiration, and soil water content is the volume of water per unit volume of soil.
Evapotranspiration (ETo) and the water content in the total soil profile (WCTot) were also produced during the simulation and used as inputs to the models. AquaCrop estimated ETo from meteorological data using the FAO Penman-Monteith Equation (12) [32]:

λET = [Δ(R_n − G) + ρ c_p (e_s − e_a)/r_a] / [Δ + γ(1 + r_s/r_a)], (12)

where R_n is the net radiation, G is the soil heat flux, (e_s − e_a) represents the vapor pressure deficit of the air, ρ is the mean air density at constant pressure, c_p is the specific heat of the air, Δ represents the slope of the saturation vapor pressure-temperature relationship, γ is the psychrometric constant, and r_s and r_a are the (bulk) surface and aerodynamic resistances. Figure 8 shows the ETo from 2010 to 2019. The advantage of using deep learning models is that they do not require manual adjustments once trained and can be run automatically. Such models can form part of an end-to-end decision support system for irrigation scheduling.
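The Penman-Monteith form of Equation (12) translates directly into a small function; the input values below are illustrative mid-season magnitudes in SI units, not measurements from this study:

```python
def penman_monteith(Rn, G, delta, rho, cp, es, ea, ra, rs, gamma):
    """Latent heat flux lambda*ET (W/m^2) from the Penman-Monteith equation.
    Rn, G in W/m^2; delta, gamma in kPa/degC; rho in kg/m^3; cp in J/(kg K);
    es, ea in kPa; ra, rs in s/m."""
    num = delta * (Rn - G) + rho * cp * (es - ea) / ra   # energy + aerodynamic terms
    den = delta + gamma * (1.0 + rs / ra)                # resistance correction
    return num / den

# Illustrative midday values for a warm, dry summer day
lam_et = penman_monteith(Rn=400.0, G=40.0, delta=0.145, rho=1.2, cp=1013.0,
                         es=3.0, ea=1.8, ra=50.0, rs=70.0, gamma=0.066)
```

Dividing the resulting latent heat flux by the latent heat of vaporization would convert it to an equivalent depth of water, which is how ETo is usually reported in mm/day.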

Data Preprocessing
The quality of the features extracted from the data by a deep learning model is determined by the quality of the dataset provided as input. The actions performed in the preprocessing phase aim to prepare the data so that the feature extraction phase is more effective. In this work, the missing big data were filled using the moving average method [33].
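A minimal sketch of moving-average imputation, assuming missing readings are filled with the mean of the few most recent known values (the window size and the exact averaging rule used in [33] may differ):

```python
def fill_missing(series, window=3):
    """Replace each None with the mean of up to `window` preceding values."""
    filled = []
    for v in series:
        if v is None:
            recent = filled[-window:]          # most recent (possibly filled) values
            v = sum(recent) / len(recent)      # moving-average estimate
        filled.append(v)
    return filled

# Illustrative daily temperatures with two gaps
temps = [21.0, 22.0, None, 23.0, None]
filled = fill_missing(temps)
```

Because filled values feed back into later windows, long gaps are smoothed toward the local average rather than left constant.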
The other preprocessing steps in this paper involve normalization and the removal of multicollinear parameters. Multicollinear parameters contain the same information about the variables and can lead to redundancy in the calculations. One way to measure multicollinearity is the Variance Inflation Factor (VIF), which assesses how much the variance of an estimated regression coefficient increases when its predictors are correlated. If the VIF is equal to 1, there is no multicollinearity between the factors, but the predictors may be moderately correlated if the VIF is greater than 1. A VIF between 5 and 10 indicates a high correlation, which may be problematic [34]. If the VIF is greater than 10, it can be assumed that the regression coefficients are poorly estimated due to multicollinearity. The VIF was used to determine the multicollinear variables, and the variables with a VIF greater than five were removed. Finally, the average temperature (T_Avg), average humidity (HR_Avg), average wind speed (WS_Avg), reference evapotranspiration (ETo), water content in the total soil profile (WCTot), irrigation, and precipitation (Prec) were used as inputs.
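The VIF can be computed directly from its definition, VIF_j = 1/(1 − R²_j), where R²_j comes from regressing variable j on the remaining predictors; this NumPy sketch uses synthetic data with one nearly collinear pair:

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the others; VIF_j = 1 / (1 - R^2_j)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Remaining predictors plus an intercept column
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)   # nearly collinear with a
vifs = vif(np.column_stack([a, b, c]))
```

Columns a and c inflate each other's VIF far beyond the threshold of five used in this work, while the independent column b stays near one.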
The normalization aims to bring the values of the dataset into a common range without distorting the differences in the ranges of values. In this work, min-max normalization was used. For each feature in the dataset, the minimum value of that feature is converted to 0, the maximum value is converted to 1, and every other value is converted to a decimal between 0 and 1 using Equation (13):

x_norm = (x − x_min)/(x_max − x_min), (13)

where x_min and x_max are the minimum and maximum of each variable in the dataset.
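Equation (13) amounts to a one-line transformation per feature; a small NumPy sketch with illustrative temperature values:

```python
import numpy as np

def min_max_normalize(x):
    """Scale a feature to [0, 1]: (x - x_min) / (x_max - x_min)."""
    return (x - x.min()) / (x.max() - x.min())

temps = np.array([12.0, 20.0, 36.0, 28.0])   # illustrative daily temperatures
scaled = min_max_normalize(temps)
```

Note that the per-feature minima and maxima must come from the training set and be reused on the validation and test sets, otherwise information leaks across the splits.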

Metric Evaluation
In regression problems, the evaluation of the model can be based on the distance between the actual value and the value predicted by the model. In this work, the mean square error (MSE) and R² score were used to evaluate the model's performance. The MSE measures the average squared difference between the values predicted by the model and the actual values, and the R² score is a statistical measure of the proportion of the total variance explained by the model [35]. The higher the R² score, the smaller the differences between the observed data and the fitted values [36]. The MSE and R² score can be calculated using Equations (14) and (15):

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)², (14)
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)², (15)

where y_i, ŷ_i, and ȳ are, respectively, the true value, the value predicted by the model, and the mean of the true values.
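Equations (14) and (15) translate directly into NumPy; the yield values below are illustrative, not from the dataset:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error, Equation (14)."""
    return np.mean((y_true - y_pred) ** 2)

def r2_score(y_true, y_pred):
    """R^2: one minus residual sum of squares over total sum of squares, Eq. (15)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([8.2, 9.1, 7.5, 8.8])       # illustrative yields, e.g., in t/ha
y_hat = np.array([8.0, 9.0, 7.7, 8.9])   # illustrative predictions
```

An R² close to 1 means the residuals are tiny relative to the spread of the true yields, which is the sense in which the paper reports scores between 0.97 and 0.99.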

Results and Discussion
The predictions were made using a computing system consisting of an Intel Core i7-8550 CPU, an NVIDIA GeForce MX130 graphics card, and 8.0 GB of RAM. Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. TensorFlow is an open-source software library for numerical computation with computational graphs. It was developed by the Google Brain Team for machine learning and deep neural networks. Keras is a deep neural network library written in Python that runs on top of TensorFlow or Theano. It was developed to enable rapid experimentation [37,38]. The Anaconda environment was used to implement the models with Python version 3.7.9, the TensorFlow GPU 2.1.0 framework, and Keras library version 2.2.4-tf.
The model received daily historical data for a season, including average temperature (T_Avg), average humidity (HR_Avg), average wind speed (WS_Avg), reference evapotranspiration (ETo), water content in the total soil profile (WCTot), irrigation amount, and precipitation (Prec), and estimated the yield at the end of the season. The tomato season was a sequence of 110 days, and the potato season was a sequence of 121 days.
The data were divided into a training set, a validation set, and a test set. The training and test datasets are used to train the model and evaluate the performance of the trained network, respectively. The test set is ignored during training and hyperparameter selection, and its errors determine how well the model generalizes to unseen data. The validation data are used to compare different models, select hyperparameters, and prevent overfitting during the training phase. Here, 30% of the data were selected as the test set, and the rest were split into 60% and 10% of the data as the training and validation sets, respectively. The tomato prediction model was trained on 77 samples (seasons) and tested on 34 samples, and the potato prediction model was trained on 45 samples and tested on 20 samples.
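The 60/10/30 split can be sketched as follows; the shuffling seed and the total of 111 seasons are illustrative assumptions consistent with the tomato dataset sizes reported above:

```python
import numpy as np

def split_dataset(n_samples, seed=0):
    """Shuffle sample indices and split 60% / 10% / 30% into train / val / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.6 * n_samples)
    n_val = int(0.1 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_dataset(111)   # e.g., 111 simulated tomato seasons
```

With 111 seasons this yields 66 training, 11 validation, and 34 test samples, matching the 77 train+validation / 34 test counts above.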
The error of a model is defined as the difference between the actual value and the value predicted by the model and must be minimized in the training step. The function used to calculate this error is called the loss function. Training of DL models is based on backpropagation [39]. Backpropagation uses the chain rule to compute the gradient of the loss function with respect to the weights and biases of the network for a single input-output example, and these parameters are then adjusted accordingly. An optimization algorithm must be chosen to minimize the loss function, and the choice of optimization algorithm and loss function is crucial. In this paper, the mean square error and the Adam optimizer were used [2].
When training a neural network, some specific values must be initialized; these are called hyperparameters. Hyperparameters such as the number of hidden layers, the activation function, etc., determine the structure of a model, while hyperparameters such as the learning rate, decay, etc., determine how the model is trained [24]. Results from multiple runs of the same algorithm with different hyperparameters will vary. The hyperparameters considered in this work are the number of layers, the number of hidden units, the batch size, the dropout size, the learning rate, and the learning rate decay. The learning rate is a hyperparameter that controls how much the weights of the model are adjusted with respect to the gradient of the loss function. It has a significant impact on the training process of deep learning models. A very low learning rate makes the learning of the network very slow, while a very high learning rate causes fluctuations in training and prevents the convergence of the learning process [24]. Learning rate decay is another hyperparameter, by which the learning rate is reduced at each update.
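One common form of learning rate decay, and a plausible reading of the schedule used here (an assumption; the exact rule is not spelled out), is time-based decay, where the rate shrinks a little at every update:

```python
def time_based_lr(lr0, decay, step):
    """Time-based decay: lr = lr0 / (1 + decay * step)."""
    return lr0 / (1.0 + decay * step)

# Illustrative values in the ranges explored in this work
lr0, decay = 1e-3, 1e-5
lrs = [time_based_lr(lr0, decay, s) for s in range(0, 200001, 50000)]
```

With these values the rate falls to a third of its initial value after 200,000 updates, giving large early steps and progressively finer adjustments near convergence.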
In this paper, the hyperparameters were selected manually. Unlike feed-forward neural networks, recurrent networks do not require a deep stack of layers to obtain accurate results [40]. Changing the number of nodes per layer has been shown to have a more significant impact on the results than changing the number of layers [40]. Since choosing LSTM and GRU models with four layers and BLSTM models with three layers significantly increases the number of trainable parameters, leading to overfitting, the number of LSTM and GRU layers was kept below four and the number of BLSTM and BGRU layers below three. Moreover, the LSTM and GRU with one layer showed poor performance and were excluded from the experimental results. Finally, network architectures with two and three LSTM and GRU layers and one and two BLSTM and BGRU layers were tested.
For hardware reasons, the number of nodes per layer is usually chosen as a power of two, which can speed up the training of the model [41]. Since the time dimension of the model input is more than 110 days, the number of nodes per layer was kept less than or equal to 512 to avoid overfitting due to the number of trainable parameters and to make training faster and more efficient [24]. The models with 16 and 32 nodes performed very poorly and were excluded from the experimental results. In the end, the network architectures were tested with 64, 128, 256, and 512 nodes per layer.
The MSE was calculated on the validation dataset for all combinations of the number of layers and the number of nodes. The results for the different crops are shown in Table 4. As can be seen in Table 4, the BLSTM model improves the validation loss, but the GRU model performs worse than the LSTM model. Removing the cell state from the LSTM therefore reduces the performance of the model on the yield prediction problem. Finally, a two-layer BLSTM with 128 nodes and a two-layer BLSTM with 512 nodes were selected to predict potato and tomato yield, respectively. Tables 5 and 6 show the architecture of the models selected for crop yield prediction. Adding an additional dense layer after the BLSTM layers improved the results. The tanh function was used as the activation function after each BLSTM layer to capture the nonlinear relationship between input and output [24].
The batch size was also chosen as a power of two for hardware reasons. The number of samples in the training datasets is less than 77, so the batch size was selected from {16, 32, 64}. For simplicity, the learning rate and decay were chosen as negative powers of ten. With a learning rate and decay of 10^−5, the model trains very slowly, and even after 500 epochs the validation loss is very high, while a learning rate and decay of 10^−2 cause fluctuations in training. Therefore, the learning rate and decay were kept above 10^−6 and below 10^−2. Table 7 shows the loss on the validation set for the different crops when the hyperparameters are chosen differently. As can be seen, the same model with different hyperparameters achieves different results. The models with a learning rate of 10^−4 and 10^−3 (potato and tomato, respectively), a batch size of 64, and a decay of 10^−5 had the best performance. The dropout size is the percentage of nodes that randomly drop out during training.
A dropout rate of 0.4 did not improve the validation loss, so the rate was kept at or below 0.4 and chosen from the set {0.1, 0.2, 0.3, 0.4}. In LSTM models, dropout can be applied to the input layer, the outputs, and the recurrent outputs [29]. Applying dropout to the recurrent outputs with a rate of 0.1 for the potato yield estimation model and to the layer outputs with a rate of 0.3 for the tomato yield estimation model improved the validation loss, so these settings were selected for each model (see Table 1). Each model was trained for a maximum of 500 epochs, with each epoch lasting about two seconds. As mentioned earlier, early stopping was used to prevent overfitting: training stops when the validation loss has not improved for a certain number of epochs. Patience is the early-stopping hyperparameter that sets how many such epochs are tolerated before training stops [30]. The appropriate patience varies by model and problem and can be determined by examining plots of model performance. In this work, patience was set to 30 and 50 for the tomato and potato models, respectively, based on such plots. As shown in Figure 9, training of the tomato and potato yield prediction models completed after 360 and 250 epochs, respectively. The tomato model was trained with more samples because more experimental data were available, which may make its training more stable than that of the potato model. Since the test dataset was selected randomly, each test dataset contained different seasons with different irrigation amounts. Table 8 shows the performance of the models on the test dataset, and Figure 10 compares the actual yields with the values predicted by the models.
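The setup described above can be sketched in Keras as follows. This is an illustrative sketch, not the authors' code: the input shape (110 daily time steps, 5 features) and the training arrays are assumptions, and the recurrent dropout rate of 0.1 follows the potato model's setting.

```python
# Sketch (assumed, not the authors' exact code) of a two-layer BLSTM
# regressor with tanh activations, recurrent dropout, and early stopping.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

def build_blstm(timesteps=110, n_features=5, units=128):
    model = Sequential([
        Input(shape=(timesteps, n_features)),
        # return_sequences=True passes the full sequence to the next layer;
        # recurrent_dropout=0.1 mirrors the potato model's dropout setting
        Bidirectional(LSTM(units, activation="tanh",
                           recurrent_dropout=0.1, return_sequences=True)),
        Bidirectional(LSTM(units, activation="tanh", recurrent_dropout=0.1)),
        Dense(units, activation="tanh"),  # extra dense layer that improved results
        Dense(1),                         # scalar end-of-season yield
    ])
    # the paper pairs Adam with a learning rate of 1e-4 (potato) or 1e-3
    # (tomato) and a decay of 1e-5; the default Adam is used here
    model.compile(optimizer="adam", loss="mse")
    return model

# early stopping: halt when val_loss has not improved for `patience` epochs
early_stop = EarlyStopping(monitor="val_loss", patience=30,
                           restore_best_weights=True)

model = build_blstm()
pred = model.predict(np.zeros((2, 110, 5)), verbose=0)
```

In actual training, `early_stop` would be passed via the `callbacks` argument of `model.fit` together with a validation set.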
The model predicting tomato yield (MSE of 0.017) performed better on the test dataset than the model predicting potato yield (MSE of 0.039). This may be because the standard deviation of tomato yield is smaller than that of potato yield (see Table 3) and, as mentioned earlier, because the tomato model was trained with a larger training set.
As shown in Figure 10, the tomato test dataset included the 2010 season under four different irrigation amounts (0, 10, 20, and 60), for which the model achieved an MSE of 0.02. The potato estimation model behaved similarly, achieving an MSE of 0.09 when predicting the 2012 crop under four irrigation amounts. These results show that the model not only captures the relationship between climate data and yield but can also accurately predict yield under different irrigation amounts within a season. A soil's water-holding capacity depends on its type. As mentioned in the Data Collection section, the soil at Fadagosa is a sandy loam, and these models were designed for that soil type; they can therefore be expected to work well only on sandy loam soils. Moreover, the models were trained with simulated data; combining simulated and real data may give better results.
The performance of BLSTM models was compared with CNN, MLP, and the traditional machine learning algorithm Random Forest (RF).
MLP is a computational model inspired by the human nervous system that detects patterns in large amounts of data in order to classify them or regress a value. An MLP consists of an input layer, a stack of fully connected (hidden) layers, and an output layer. A fully connected layer connects every neuron in one layer to every neuron in the next; mathematically, each layer is a linear function that takes the output of the previous layer and applies weights and biases to it. To add nonlinearity to the model, an activation function (e.g., ReLU, tanh) is applied after each fully connected layer [42]. Again, to avoid overfitting, the number of nodes in each MLP layer was kept at or below 512, and the model with 512 neurons achieved the best performance on each dataset. MLPs with 512 neurons and varying numbers of layers were implemented. Performance stopped improving at five layers on both datasets, so the number of layers was kept below five. Table 9 shows the performance of these models. The models with three and four layers achieved the best performance in predicting tomato and potato yields, with R^2 scores of 0.89 and 0.71, respectively.
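An MLP of this shape can be sketched as below. The flattened input dimension (110 days × 5 features = 550) and the training data are assumptions for illustration, not the paper's configuration.

```python
# Sketch of a 512-neuron MLP regressor with three hidden layers, as used
# for tomato yield; input dimension 550 is an assumed flattened season.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

def build_mlp(input_dim=550, units=512, n_hidden=3):
    model = Sequential([Input(shape=(input_dim,))])
    for _ in range(n_hidden):
        # fully connected layer followed by a ReLU nonlinearity
        model.add(Dense(units, activation="relu"))
    model.add(Dense(1))  # regressed end-of-season yield
    model.compile(optimizer="adam", loss="mse")
    return model

mlp = build_mlp()
out = mlp.predict(np.zeros((2, 550)), verbose=0)
```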
CNN is a deep learning algorithm. A CNN model consists of a stack of convolutional layers, nonlinear activation functions, pooling layers (e.g., max pooling, average pooling), Flatten layers, and fully connected layers [24]. Like a fully connected layer, a convolutional layer is a linear function; it contains a set of kernels, where a kernel is a matrix used in a multiplication operation that is applied repeatedly over different input regions to extract feature maps. Pooling is a simple reduction process that shrinks the dimension of the feature maps and hence the number of parameters trained by the network. A Flatten layer usually sits between the convolutional layers and the fully connected layers; it reshapes the output of the convolutional layers into an array. The fully connected layer takes the feature maps from the Flatten layer and applies weights and biases to predict the correct label or regress a value [24]. The number of kernels in each convolutional layer and the number of convolutional layers were chosen manually. For the same hardware reason, the number of kernels per layer was kept a power of two. Models with four or more convolutional layers, or with more than 512 kernels per layer, start to overfit, so the number of kernels was kept at or below 512 and the number of layers at or below three. All combinations of the number of layers and kernels were implemented, and the CNN models with 512 kernels and two or three layers achieved the best validation loss. The batch size, learning rate, and decay from the BLSTM model were reused for the CNN models. Padding was used in each convolutional layer so that the output had the same shape as the input, and the kernel size was chosen as an odd number.
Models with a kernel size greater than 11 start to overfit, and models with a kernel size below three performed poorly, so the kernel size was chosen from {5, 7, 11}. Table 9 shows the architecture of the CNN model. The CNNs with two and three layers and kernel sizes of 5 and 11 (tomato and potato, respectively), with R^2 scores of 0.96 and 0.933, predicted yield better than models with other combinations of layer count and kernel size.
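A 1D CNN in the best tomato configuration described above (two convolutional layers, 512 kernels, kernel size 5, "same" padding) can be sketched as follows; the input shape is an assumption for illustration.

```python
# Sketch of a 1D CNN regressor: Conv1D -> pooling -> Conv1D -> Flatten ->
# Dense, with "same" padding so each convolution preserves the time length.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Conv1D, MaxPooling1D,
                                     Flatten, Dense)

def build_cnn(timesteps=110, n_features=5, kernels=512, kernel_size=5):
    model = Sequential([
        Input(shape=(timesteps, n_features)),
        Conv1D(kernels, kernel_size, padding="same", activation="relu"),
        MaxPooling1D(pool_size=2),   # pooling halves the time dimension
        Conv1D(kernels, kernel_size, padding="same", activation="relu"),
        Flatten(),                   # bridge to the fully connected head
        Dense(1),                    # regressed end-of-season yield
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

cnn = build_cnn()
out = cnn.predict(np.zeros((1, 110, 5)), verbose=0)
```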
RF is one of the most powerful machine learning methods. A Random Forest consists of several Decision Trees. Each tree is a very simple model made of nodes and branches: at each node a condition is checked, and the flow follows one branch if the condition is satisfied and the other branch otherwise, continuing from node to node until a leaf is reached. As Table 8 shows, the RF model outperformed the MLP model in predicting yield, with R^2 scores of 0.87 to 0.90.
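A minimal Random Forest regression baseline of this kind can be set up with scikit-learn; the synthetic features and target below are placeholders, not the paper's dataset.

```python
# Minimal Random Forest regression sketch on synthetic data; the six
# features stand in for aggregated climate and irrigation inputs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 6))                      # placeholder feature matrix
y = X @ rng.random(6) + 0.1 * rng.standard_normal(100)  # placeholder target

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)
score = rf.score(X, y)                        # R^2 on the training data
```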
Training time per epoch was two, one, and less than one second for BLSTM, CNN, and MLP, respectively. Although BLSTM training took longer than the other models, BLSTM achieved the best accuracy on the test dataset.
As mentioned earlier, one application of this model is a decision-making system that decides when and how much to irrigate in order to avoid wasting water without affecting productivity. The yield prediction model is used to calculate the net return at the end of the season. The net return in agriculture is calculated using Equation (16):

Net return = Y × P_y − W × P_w, (16)

where Y is the yield at the end of the season, P_y is the price of the yield, W is the total amount of water used for irrigation, and P_w is the price of the water.
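The net return formula can be written as a one-line helper; the example values are illustrative only.

```python
# Net return helper following the symbols defined above:
# Y = yield, P_y = yield price, W = irrigation water, P_w = water price.
def net_return(Y, P_y, W, P_w):
    """Net return = Y * P_y - W * P_w."""
    return Y * P_y - W * P_w

# e.g. 10 tons of tomato at 728.20 USD/ton with 300 ha-mm/ha at 0.5 USD:
value = net_return(10, 728.20, 300, 0.5)  # 7282.0 - 150.0 = 7132.0
```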
To show an application of the model, the net return for tomato yield in 2018 and 2019 was calculated under random irrigation every five days. An RF model was employed to predict WcTot after each irrigation: it receives five days of climate data and the irrigation amount and predicts WcTot for the next day, with an R^2 score of 0.80. Algorithm 1 was used to calculate the net return at the end of the season under random irrigation, initialized with the data from the first five days of the season. To calculate the net return, the cost of irrigation was assumed to be nearly 0.5 USD per 1 ha-mm/ha [43], and the tomato price was assumed to be nearly 728.20 USD/ton (www.tridge.com (accessed on 22 May 2021)). Tables 10 and 11 show the net returns and the parameters used in Algorithm 1. In this example, the irrigation amount was selected randomly; in future work, a reinforcement learning agent will be trained to select the best irrigation amount, replacing the random action (water amount). In Deep Reinforcement Learning algorithms, an environment interacts with an agent: during training, the agent chooses an action based on the current state of the environment, and the environment returns the reward and the next state. The agent tries to choose the action that maximizes the reward [44]. In the agricultural domain, the state can be defined as the climate data and the soil water content, the action is the irrigation amount, and the reward is the net return. The function Env() in Algorithm 1 serves as the environment. An agent can thus be trained to select the irrigation amount based on field conditions, with the yield estimation model used as part of the Deep Reinforcement Learning environment to calculate the agent's reward. This system can be trained end-to-end.
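The overall loop of Algorithm 1 can be sketched as below. This is a heavily simplified stand-in: `predict_wctot` and `predict_yield` are hypothetical stubs replacing the trained RF soil-water model and the BLSTM yield model, climate inputs are omitted, and the irrigation choices {0, 10, 20, 60} follow the amounts mentioned earlier.

```python
# Hedged sketch of Algorithm 1: a random irrigation amount every five
# days, soil water tracked by a stub predictor, and the season scored
# with Equation (16). Both predictors below are hypothetical placeholders.
import random

def predict_wctot(window):           # stub for the RF WcTot predictor
    return sum(window) / len(window)

def predict_yield(season_features):  # stub for the BLSTM yield model
    return 0.05 * sum(season_features)

def season_net_return(days=110, p_y=728.20, p_w=0.5, seed=0):
    random.seed(seed)
    features, total_water = [], 0.0
    for day in range(days):
        if day % 5 == 0:             # irrigate every five days
            water = random.choice([0, 10, 20, 60])
            total_water += water
        features.append(predict_wctot([water]))
        # in the full algorithm, five days of climate data would also
        # feed the WcTot predictor here
    yield_est = predict_yield(features)
    return yield_est * p_y - total_water * p_w  # Equation (16)

ret = season_net_return()
```

In the reinforcement learning extension discussed above, the random choice of `water` would be replaced by the agent's action and the returned value would serve as the reward.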

Conclusions
Irrigation and storage are closely related to energy use. Efficient irrigation minimizes unnecessary water use, which contributes to energy conservation. Yield estimation models help reduce energy consumption, increase productivity, and estimate labor requirements for harvesting and storage. RNN models offer several advantages over other deep learning models and traditional machine learning approaches, most importantly their ability to process time-series data such as agricultural datasets. In this work, the ability of RNN models to predict tomato and potato yields based on climate data and irrigation amount was investigated. The LSTM and GRU models and their bidirectional extensions, BLSTM and BGRU, were trained on data from a sandy loam soil for crop yield prediction. The results show that the BLSTM models outperformed the simple LSTM, GRU, and BGRU models on the validation set. In addition, the LSTM model performed better than the GRU model on the validation set; removing the cell state from the LSTM nodes could therefore be problematic in our context.
The BLSTMs achieved an R^2 score of 0.97 to 0.99 on the test set. The results show that BLSTM models can automatically extract features from raw agricultural data, capture the relationship between climate data and irrigation amount, and convert it into a valuable model for predicting future field yields. One drawback of these models is that a sufficient amount of clean data is needed to train them; with more data, the model can make better predictions. In this work, the simulated yield dataset was used to overcome this disadvantage.
The performance of the BLSTM was compared with the CNN, MLP, and RF models, and the BLSTM outperformed all three in yield prediction. The CNN achieved the second-best performance and the MLP the worst. The results show that past observations within a season are important for yield prediction: the BLSTM model can capture the relationship between past and new observations and predict yield more accurately. One disadvantage of the BLSTM model was its longer training time compared to the other implemented models.
One application of the yield prediction model is to develop an end-to-end decision support system that automatically decides when and how much to irrigate. Deep Reinforcement Learning models can be used to build such a system: an agent is trained to select the irrigation amount based on the condition of the field. To train such a model, a reward function must be developed; in the agricultural domain, the net return is used as the reward for the agent. Since it is difficult and time-consuming to train a deep learning model against a simulation system such as AquaCrop, the yield estimation model is used to determine the net return for the agent at the end of each season, and the agent can decide based on this reward. This system can help farmers decide when and how much to irrigate, reducing water consumption without affecting productivity.

Funding: This work is supported by the project Centro-01-0145-FEDER000017-EMaDeS-Energy, Materials, and Sustainable Development, co-funded by the Portugal 2020 Program (PT 2020), within the Regional Operational Program of the Center (CENTRO 2020) and the EU through the European Regional Development Fund (ERDF). Fundação para a Ciência e a Tecnologia (FCT-MCTES) also provided financial support via project UIDB/00151/2020 (C-MAST).