Comparison of On-Policy Deep Reinforcement Learning A2C with Off-Policy DQN in Irrigation Optimization: A Case Study at a Site in Portugal

Precision irrigation and optimization of water use have become essential factors in agriculture because water is critical for crop growth. The proper management of an irrigation system should enable the farmer to use water efficiently to increase productivity, reduce production costs, and maximize the return on investment. Efficient water application techniques are essential prerequisites for sustainable agricultural development based on the conservation of water resources and preservation of the environment. In a previous work, an off-policy deep reinforcement learning model, Deep Q-Network, was implemented to optimize irrigation. The performance of the model was tested for tomato crop at a site in Portugal. In this paper, an on-policy model, Advantage Actor–Critic, is implemented to compare irrigation scheduling with Deep Q-Network for the same tomato crop. The results show that the on-policy model Advantage Actor–Critic reduced water consumption by 20% compared to Deep Q-Network with a slight change in the net reward. These models can be developed to be applied to other crops with high production in Portugal, such as fruit, cereals, and wine, which also have large water requirements.


Introduction
Water deficiency directly or indirectly affects all physiological processes in plants, some of which have a major impact on crop growth, development, and productivity [1,2]. The effect of water stress on transpiration, photosynthesis, and the subsequent absorption of water and nutrients by plants has a profound impact on crops and their potential productivity [1,2].
The Food and Agriculture Organization (FAO) reports that agriculture is the sector with the greatest need for action to reduce water consumption, as about 60% of the water used for irrigation is lost as waste [3]. The same studies indicate that reducing this loss by 10% would be enough to supply twice the current world population, based on statistical averages [3]. Hence, an efficient water management system is essential. With the use of the Internet of Things (IoT) [4] in agriculture, systems are being developed to effectively manage fields [5,6]. IoT sensors enable monitoring of light, humidity, temperature, soil moisture, and analysis of water, among other parameters [5-9].
The huge amount of data provided by these sensors must be analyzed to develop an automated decision-making system. Machine-learning methods are an important tool to analyze this data [10-12]. Machine learning is a topic that has become increasingly important recently. The learning that gives the term "machine learning" its name consists of running algorithms that automatically build knowledge representation models based on a dataset [13,14]. The idea behind this learning is that the machines are trained by giving them access to historical data and one or more performance measures and letting the algorithm "learn", i.e., iteratively adjust the knowledge representation model so that it improves its performance [13,14]. After this training, the model has the potential to make high-quality predictions in future situations related to historical patterns [10-14].
Zia et al. [15] compared traditional irrigation calculations by farmers with an IoT-based irrigation method on a lemon farm. IoT sensor data were collected wirelessly through the cloud and a mobile application, and a Decision Support System (DSS) provided irrigation recommendations. The DSS is based on temperature and humidity, real-time sensor data from the IoT device used in the farm, and plant data (Kc and plant type). The results show water savings of over 50% while increasing yields by 35% compared to the traditional irrigation method.
Tseng et al. [16] estimated the soil moisture from images using a convolutional neural network (CNN). The results showed that the CNN outperformed traditional machine-learning methods, such as support vector machine (SVM), random forest (RF), and two-layer neural networks (ANN). Song et al. [17] combined the macroscopic cellular automata (MCA) model with a deep belief network (DBN) to estimate soil water content in the field. The DBN-MCA model performed better than the multilayer perceptron model, reducing the mean square error by 18%.
Saggi and Jain [18] estimated daily reference evapotranspiration using a deep-learning multi-layer perceptron (MLP) for Hoshiarpur and Patiala districts in Punjab. The MLP outperformed traditional machine-learning methods, such as RF, Generalized Linear Models (GLM), and Gradient Boosting Machine (GBM), with a mean square error of 0.0369 to 0.1215.
De Oliveira and Lucas et al. [19] employed three CNN models to predict daily reference evapotranspiration. Performance was compared between CNN, Autoregressive Integrated Moving Average (ARIMA), and the seasonal naive model. The CNN model performed better in terms of accuracy. Ahmed et al. [20] developed a deep-learning approach for two-stage daily surface soil moisture (SSM) prediction using a gated recurrent unit (GRU)-based recurrent neural network. The model was built by integrating MODIS sensors (satellite-based data), ground-based observations, and climate indices tested at stations in Australia's Murray-Darling Basin. The model achieved low Mean Absolute Error (MAE) values between 0.013 and 0.113 kg m$^{-2}$ for the first, fifth, and seventh-day predictions.
Adab et al. [21] used four different types of machine-learning models to predict near-surface (5 cm) soil moisture in the field on different plots. The Random Forest model performed better than the other three methods (ANN, SVM, and elastic net regression algorithm) in predicting soil moisture in the test cases. Although the prediction of soil water content and reference evapotranspiration is critical for irrigation scheduling, further analysis is needed to predict the exact timing and amount of water for irrigation.
Jimenez et al. [22] used two recurrent neural network (RNN) models to estimate irrigation effort. Data were collected from 2017 to 2019 on a corn farm in Samson, Alabama. Hourly weather data and soil matric potential (SMP) data measured at three soil depths from 13 sensor probes installed on a loamy fine sand soil and a sandy clay loam soil were used for the study. Two neural network methods and Long Short-Term Memory (LSTM) models were used to predict irrigation schedules. The results showed that both RNN models performed well in predicting irrigation prescriptions for the soil types studied, with a coefficient of determination (R$^2$) greater than 0.94 and a root mean square error (RMSE) less than 1.2 mm.
Bu and Wang [23] mentioned that Deep Reinforcement Learning is a promising model for building smart farms. Deep Reinforcement Learning is a combination of reinforcement learning and deep learning (DL). DL is a sub-field of machine learning where the algorithms are deeper in terms of the number of hidden layers [13,14]. DL algorithms are created and function similarly to machine learning.
However, these algorithms have numerous layers, each providing a different interpretation of the data. Neural networks attempt to mimic the function of human neural networks in the brain [13,14]. Reinforcement learning is an area of machine learning that involves taking appropriate actions to maximize reward in a given situation. It is used by various software programs and machines to find the best possible behavior or path in a given situation [23].
Chen et al. [24] used a Deep-Q Learning (DQN) model for an irrigation decision strategy based on short-term weather forecasts for rice. The results of the DQN irrigation strategy compared with those of the conventional irrigation strategy showed a significant reduction in irrigation water volume, irrigation timing, and drainage water without yield loss. In Alibabaei et al. [25], the DQN was used to estimate the timing and amount of water for irrigation. The objective of the paper was to minimize water consumption without affecting the net return of the farmer.
The model was trained using the environmental conditions, such as the temperature, humidity, reference evapotranspiration, soil moisture, and the last irrigation amount, and decided the timing and amount of water for the next irrigation. The DQN agent model increased productivity by 11% and avoided water waste by 20-30% compared to a fixed irrigation amount and threshold methods.
The DQN model is an off-policy method, i.e., in the DQN algorithm, the updating policy (strategy to select the best action) is different from the behavioral policy. The on-policy Advantage Actor-Critic (A2C) method outperformed the DQN method in the Atari domain and on a variety of continuous motor control problems as well as for navigating random 3D mazes with visual input [26].
To determine how well the A2C model performs in this task of irrigation scheduling for a tomato field, this paper compares the performance of the model with that of DQN. The model simply estimates and tells farmers when and how much water is needed for the next irrigation. The performance of the model is compared in terms of productivity and water use with the DQN model and threshold method. Moreover, the total soil water content is compared when using the DQN and A2C models for irrigation scheduling.
The remainder of this paper is organized as follows. Section 2, Materials and Methods, explains the general framework of the work, and its subsections explain each step in detail: data set collection (Section 2.1), data processing (Section 2.2), an overview of the models used in the work (Section 2.3), and the experimental setup (Section 2.4). In Section 3, the results are described and discussed. The summary of the work is included in Section 4.

Materials and Methods
The framework of this paper is shown in Figure 1. The first two steps are the same as in [25], and, in the last step, the DQN algorithm is replaced by the A2C algorithm to investigate the potential of this model for irrigation scheduling. In the first step of the framework, the big data are collected from the weather station at a site in Portugal and simulated using Decision Support System for Agrotechnology Transfer (DSSAT) software [27,28]. In the second step, two DL models, called Long Short-Term Memory (LSTM), are trained to estimate the total soil water in the soil profile (mm) (SWTD) and tomato yield at the end of the season. In the third step, these trained models are used as the environment for the agent. The agent acts (chooses the amount of water), and the environment responds to this action by calculating the SWTD and tomato yield. The agent and the environment interact until the best strategy for irrigation is found. Each step of the framework is explained in the following subsections.

Data Collection
To compare the potential of DQN and A2C models in irrigation scheduling, the same data from [25] were used in this work. Climate big data were collected by the government agency of the Ministry of Agriculture and the Sea, Direção Regional de Agricultura e Pescas do Centro, Portugal (www.drapc.gov.pt (accessed on 4 March 2020)) for the Fadagosa site in Portugal from 2010 to 2019. The soil texture of Fadagosa is either sandy or sandy loam, and the climate type is Mediterranean hot summer climate (Csa). Figure 2 shows the Fadagosa region from Google Earth. Table 1 shows the details of the climate variables retrieved from the weather station, and Figure 3 shows the daily climate variables from 2010 to 2019. The abbreviations stand for the following: SD: standard deviation, Min: minimum, Max: maximum, Avg: average, HR: relative humidity, T: temperature, WS: wind speed, Prec: precipitation, SR: solar radiation, and ET0: reference evapotranspiration [29].
As it is difficult to record the tomato yield at different irrigation rates (e.g., recording the yield with no irrigation), the Decision Support System for Agrotechnology Transfer (DSSAT) [27,28] was used to estimate the tomato yield at different irrigation rates and so ensure data availability for training the model. The study was conducted using the cropping simulation model, and the same calibration of DSSAT software was used as in [25]. For the calculation of the irrigation regime, we considered a fixed interval of four days as a time parameter. The depth criterion was also considered as a fixed value in the interval between 0 and 60 mm. The ET0 calculator developed by the Land and Water Division of FAO was used to estimate reference evapotranspiration (ET0) [30].
Fadagosa daily dataset [29]. The abbreviations are the same as in Table 1.

Data Pre-Processing
As in [25], the moving average was used to fill in the missing values in the data [31]. Then, the multicollinear parameters were removed from the data set using the variance inflation factor (VIF) [32]. Multicollinear parameters carry largely the same information, which leads to calculation and interpretation problems.
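As an illustration of the gap-filling step, the following is a minimal sketch of a trailing moving-average fill. The window length, the function name `fill_missing_with_moving_average`, and the fallback for gaps at the very start of a series are assumptions made for this example; the paper does not state these details.

```python
from typing import List, Optional


def fill_missing_with_moving_average(
    values: List[Optional[float]], window: int = 3
) -> List[float]:
    """Replace each missing value (None) with the mean of up to `window`
    preceding values, where previously filled values count as history."""
    filled: List[float] = []
    for value in values:
        if value is not None:
            filled.append(float(value))
            continue
        history = filled[-window:]
        if history:
            filled.append(sum(history) / len(history))
        else:
            # Assumption: a gap at the start is filled with the first
            # observed value (or 0.0 if the series has no observations).
            first_known = next((v for v in values if v is not None), 0.0)
            filled.append(float(first_known))
    return filled


# Hypothetical daily temperatures with two gaps.
temperatures = [20.0, 22.0, None, 24.0, None]
print(fill_missing_with_moving_average(temperatures, window=2))
```

A centered window or interpolation would be equally plausible choices; the trailing window is used here only because it needs no future values.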
Normalization is a technique generally applied as part of data preparation for machine learning. The purpose of normalization is to change the variables in the data set to use a single scale without distorting differences in value ranges or losing information. The values of the scaling coefficients must be calculated for the training data set and used to rescale the test data set and the predictions. This avoids contaminating the experiment with knowledge about the test data set. In this work, the min-max normalization method was used, which scales the variables in the data set between zero and one using Equation (1).

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (1)$$

where $x_{max}$ and $x_{min}$ are the maximum and minimum of each variable.
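The point about computing the scaling coefficients on the training set only can be sketched as follows. The function names and the sample numbers are hypothetical; the sketch simply applies min-max scaling with coefficients fitted on the training split and reused on the test split.

```python
from typing import List, Tuple


def fit_min_max(train: List[float]) -> Tuple[float, float]:
    """Compute the scaling coefficients from the training data only."""
    return min(train), max(train)


def min_max_scale(values: List[float], x_min: float, x_max: float) -> List[float]:
    """Apply (x - x_min) / (x_max - x_min) with previously fitted coefficients."""
    span = x_max - x_min
    if span == 0:
        # Constant variable: map everything to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - x_min) / span for v in values]


def min_max_inverse(scaled: List[float], x_min: float, x_max: float) -> List[float]:
    """Map scaled predictions back to the original units."""
    return [s * (x_max - x_min) + x_min for s in scaled]


train = [10.0, 20.0, 30.0]   # hypothetical training values
test = [15.0, 35.0]          # hypothetical test values

x_min, x_max = fit_min_max(train)
scaled_test = min_max_scale(test, x_min, x_max)
print(scaled_test)
```

Because the coefficients come from the training data, a test value outside the training range (35.0 here) falls outside [0, 1] after scaling; that is expected and is the price of keeping test information out of the fit.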
Recurrent Neural Networks (RNNs) are a family of neural networks designed to process sequential data [13,33]. A variant of conventional RNNs is Long Short-Term Memory (LSTM), which addresses a well-known problem called the vanishing gradient [13,33]. This occurs when the weights computed in the initial parts of the sequence lose influence as iterations progress and they respond to new inputs. As a result, the range of contextual information that can be captured by conventional RNNs is usually quite limited. Such a limitation causes these architectures to perform very poorly for longer sequences [13,33]. The LSTM was developed to address this problem. It is an RNN with substructures that help to manage the memory of the recurrent neural network. Figure 4 shows the cell of an LSTM. First, the LSTM must decide what information to disregard in the cell. This is done by the forget gate using the sigmoid function (σ). It analyses the output of the previous cell $h_{t-1}$ and the input of the current cell $x_t$ and produces an output of numbers between 0 and 1, where 1 represents the complete retention of that information (Equation (2)).

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (2)$$

It then decides what new information it will retain. To this end, it performs two processes: first, the input gate $i_t$, formed by the sigmoid function, decides which values will be updated, and then the activation function tanh generates a vector of new candidates $\tilde{C}_t$ (Equation (3)) that can be added to the state.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \qquad (3)$$

Then, the forget gate $f_t$ is multiplied by $C_{t-1}$ so that the information deemed unnecessary is forgotten, and $i_t$ is multiplied by $\tilde{C}_t$ to retain the new information that is useful. The results of these two multiplications are then added (Equation (4)).

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad (4)$$

Finally, the output is determined. To do this, the sigmoid layer (Equation (5)) decides which parts of the cell state to output, and its result is multiplied by $\tanh(C_t)$ so that only the information the network has learned to be important is passed on (Equation (6)).

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (5)$$

$$h_t = o_t \odot \tanh(C_t) \qquad (6)$$

W and b are the weights and biases of the specific gate in Equations (2)-(6), which should be adjusted when training the model to minimize the loss function.
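The gate computations of Equations (2)-(6) can be traced in a few lines of NumPy. This is a minimal sketch of a single forward step with randomly initialised weights, assuming the common formulation in which each gate acts on the concatenation of $h_{t-1}$ and $x_t$; the dictionary keys, layer sizes, and the function name `lstm_cell_step` are choices made for this example, not the configuration used in the paper.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])    # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                           # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                                     # new hidden state
    return h_t, c_t


# Hypothetical sizes: 5 input features, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hidden = 5, 4
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{gate}"] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(3, n_in)):   # a sequence of three time steps
    h, c = lstm_cell_step(x_t, h, c, params)
print(h.shape, c.shape)
```

In practice a framework layer would be used instead of hand-written gates; the sketch is only meant to make the data flow through the forget, input, and output gates explicit.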
Bidirectional LSTMs (BLSTM) [34] are an extension of regular LSTMs that are used to improve model performance in sequence classification problems. BLSTMs use two LSTMs to train on sequential inputs. The first LSTM processes the input sequence as it is. The second LSTM processes a reversed copy of the input sequence.

Advantage Actor-Critic Network
A Markov decision process (MDP) is defined by a tuple (S, A, R, P), where [35]:
• States (S): the set of environment states.
• Actions (A): the set of all possible actions.
• Reward function (R): a real-valued function on S × A × S, which is an incentive mechanism that tells the agent which action is more valuable.
• Transition function (P): a function from S × A × S to [0, 1], where P(s, a, s') captures the probability of changing from state s to state s' after executing action a.
In an MDP model, the conditional probability distribution of the next states of the process depends only on the current state, not on the sequence of events that preceded it [23,35].
The policy π is a mapping from the states S to the set of actions A and determines the action based on the current state [23,35]. The objective of reinforcement learning is for the agent to learn an optimal or near-optimal policy that maximizes the reward function. The long-term reward at time t is defined by Equation (7) [23,35]:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \qquad (7)$$

where $R_T$ indicates the reward of the final state and γ is a real number between 0 and 1, called the discount factor, added because of the uncertainty of the future states. Equation (8) results from the definition of $G_t$ and Equation (7).

$$G_t = R_{t+1} + \gamma G_{t+1} \qquad (8)$$

The value function (V-function) measures how good it is for an agent to be in a state s following a policy and is calculated by Equation (9) [35]:

$$V^{\pi}(s) = E_{\pi}[G_t \mid s_t = s] \qquad (9)$$

where $E_{\pi}$ denotes the expected value if the agent follows strategy π, and $s_t$ is the state at time t. The optimal policy is defined using the value function as:

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \qquad (10)$$

The optimal policy $\pi^{*}$ is computed using an iteration algorithm over the V function. The Bellman Equation (11) is applied to V(s) for any state s until V(s) reaches the maximum value, denoted as $V^{*}(s)$.

$$V(s) = \max_{a} \sum_{s'} T(s' \mid s; a)\,[R(s; a; s') + \gamma V(s')] \qquad (11)$$

where T(s'|s; a) is the transition probability from state s to state s' when the agent chooses an action a, R(s; a; s') is the immediate reward from state s to state s' when the agent chooses an action a, and γ is the discount factor. DL algorithms estimate a nonlinear function between the dependent variables and the independent variables. If the transition or reward function for an MDP problem is not known, a deep-learning model can be used to estimate it and solve the MDP. The Advantage Actor-Critic (A2C) [26] is a DRL method that uses two different deep-learning models to perform the learning (see Figure 5). The first model is the actor, which is used to define the policy of the applied actions. The output of the actor is the probability of executing each action from state $s_t$ at time t. The second model is the critic, which estimates the value function V and evaluates all actions performed by the actor [26,36].
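To make the iteration over the V function concrete, the following is a minimal sketch of value iteration on a toy MDP with a known transition and reward function. The two states ("dry", "wet"), the two actions, and all rewards and probabilities are invented purely for illustration and have no connection to the irrigation environment of this paper.

```python
from typing import Dict, List, Tuple

# transitions[state][action] -> list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def value_iteration(
    transitions: Transitions, gamma: float = 0.9, tol: float = 1e-8, max_iters: int = 10_000
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Repeatedly apply the Bellman update until the values stop changing."""
    values = {s: 0.0 for s in transitions}

    def action_value(outcomes: List[Tuple[float, str, float]]) -> float:
        return sum(p * (r + gamma * values[s_next]) for p, s_next, r in outcomes)

    for _ in range(max_iters):
        delta = 0.0
        for state, actions in transitions.items():
            best = max(action_value(outcomes) for outcomes in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = {
        state: max(actions, key=lambda a: action_value(actions[a]))
        for state, actions in transitions.items()
    }
    return values, policy


# Hypothetical deterministic toy problem.
toy_mdp: Transitions = {
    "dry": {"irrigate": [(1.0, "wet", -1.0)], "wait": [(1.0, "dry", 0.0)]},
    "wet": {"irrigate": [(1.0, "wet", 1.0)], "wait": [(1.0, "dry", 3.0)]},
}

values, policy = value_iteration(toy_mdp, gamma=0.5)
print(values, policy)
```

This only works because the toy transition and reward functions are given explicitly; when they are unknown, as in the irrigation problem, they have to be learned or replaced by interaction with an environment.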

The value function V is the basis for choosing the optimal policy; however, this function is unknown to the agent, and thus it must learn to estimate it from the rewards it receives from interactions with the environment. The temporal difference (TD) method uses the rewards received to estimate the V function and allows the agent to learn and improve its behavior with each action performed [26]. The TD error can be calculated using Equation (13):

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \qquad (13)$$

where $s_{t+1}$ represents the state reached after performing an action starting from a state $s_t$ and receiving a corresponding reward $r_{t+1}$. After receiving this new reward, the value of the new state is used to update that of the previous state. An important parameter of learning is the discount factor, which is limited to the range between 0 and 1 and determines the importance of the future rewards; the lower the discount factor, the more important the short-term rewards are and the less important the future rewards are. When the TD error is positive, it indicates that the tendency to choose the action $a_t$ should be strengthened for the future, while when the TD error is negative, it indicates that the tendency should be weakened [26]. Because it learns from experience gathered through interactions, the temporal difference method eliminates the need for an explicit model of the system, so it can be applied to systems with unknown parameters or dynamics [37].
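The TD update described above can be sketched with a tabular value estimate. The step size `alpha`, the state labels, and the function names are assumptions for this example; in the A2C setting the table would be replaced by the critic network.

```python
from typing import Dict


def td_error(reward: float, value_s: float, value_next: float,
             gamma: float, terminal: bool = False) -> float:
    """TD error: reward plus discounted next-state value, minus current value."""
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s


def td0_update(values: Dict[str, float], state: str, reward: float, next_state: str,
               alpha: float = 0.1, gamma: float = 0.99, terminal: bool = False) -> float:
    """Move V(state) a small step in the direction of the TD error."""
    delta = td_error(reward, values[state], values.get(next_state, 0.0), gamma, terminal)
    values[state] += alpha * delta
    return delta


# Hypothetical two-state example.
values = {"a": 0.0, "b": 1.0}
delta = td0_update(values, "a", reward=0.5, next_state="b", alpha=0.1, gamma=0.9)
print(delta, values["a"])
```

A positive `delta` means the outcome was better than the current estimate predicted, which is the signal the actor uses to strengthen the chosen action.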
The critic model uses the TD error, Equation (13), to improve its own estimate and the actor's policy. The critic error is defined as the square of $\delta_t$. The actor loss is defined by Equation (14). The log probability of the action is scaled by the advantage (TD error), which makes the variance of the error smaller and the learning process more stable [38].

$$L_{actor}(\theta) = -\log \pi_{\theta}(a_t \mid s_t)\,\delta_t \qquad (14)$$

where θ denotes the weights of the actor model. A learning agent is subject to the trade-off between exploration and exploitation. Exploitation means repeating the actions that are known to give good results, and exploration means trying new actions to learn new things [35]. Exploitation without exploration leads to a sub-optimal solution. In the A2C model, an entropy bonus is usually added to the loss to improve exploration [39]. The role of entropy regularization is to promote exploration through various actions. A more uniform action distribution of a policy has higher entropy and, as a result, more random actions; the lower the entropy, the more ordered the actions. In the case of the A2C model, the entropy for the Softmax policy action $\pi(a_t \mid s_t)$ is calculated at the neural network output according to Equation (15).

$$H(\pi(\cdot \mid s_t)) = -\beta \sum_{a} \pi(a \mid s_t) \log \pi(a \mid s_t) \qquad (15)$$

where β > 0 is the entropy regularization weight that determines the trade-off between exploration and exploitation. The entropy regularization weight is a hyperparameter that should be determined before training. In this paper, β was chosen to be equal to 0.001. The differences between DQN and A2C are:
• The DQN model is an off-policy method, and the A2C model is an on-policy method, i.e., unlike A2C, in the DQN algorithm, the updated policy is different from the behavioral policy [26,40].
• Unlike DQN, A2C does not use a replay buffer but learns the model using the state, action, reward, and next state obtained at each step [26,40].
• In DQN, the function Q is estimated; however, in A2C, the value function V and the policy π are estimated [26,40].
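The loss terms discussed in this section (actor loss scaled by the TD error, squared TD error for the critic, and an entropy bonus weighted by β = 0.001) can be computed for a single transition as in the sketch below. It is a plain-Python illustration without automatic differentiation; the way the three terms are summed into one total, the function names, and the example numbers are assumptions, since the paper does not spell out the combined loss.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_losses(logits: List[float], action: int, reward: float,
               value_s: float, value_next: float,
               gamma: float = 0.99, beta: float = 0.001) -> Dict[str, float]:
    probs = softmax(logits)
    delta = reward + gamma * value_next - value_s      # TD error used as the advantage
    actor_loss = -math.log(probs[action]) * delta      # log-probability scaled by advantage
    critic_loss = delta ** 2                           # squared TD error
    entropy_bonus = beta * entropy(probs)              # encourages exploration
    # Assumption: terms are combined with unit weights, entropy subtracted.
    total = actor_loss + critic_loss - entropy_bonus
    return {"delta": delta, "actor_loss": actor_loss, "critic_loss": critic_loss,
            "entropy_bonus": entropy_bonus, "total": total}


# Hypothetical transition with four equally likely actions.
losses = a2c_losses(logits=[0.0, 0.0, 0.0, 0.0], action=2,
                    reward=1.0, value_s=0.0, value_next=0.0)
print(losses)
```

In a framework implementation the TD error inside the actor loss is normally treated as a constant (no gradient flows through it), and the critic term often carries its own weighting coefficient.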

Experimental Setup
A computer system with an Intel Core i7-9700 CPU, 32.0 GB RAM, and an NVIDIA GeForce RTX 2080 graphics card was used for the work. The models were implemented using the Python language. TensorFlow [41] and Keras [42] libraries were used to implement the deep-learning models. TensorFlow is an open-source library developed for machine learning, numerical computation, and many other tasks. Keras is also an open-source neural network library written in Python. It can be built on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. It is designed to enable rapid experimentation with deep neural networks and focuses on being easy to use, modular, and extensible.

States and Actions Setup
The state and actions were selected as in [25] to ensure comparability of the models. Table 2 shows the action and state sets. Two BLSTM models were implemented in [29,43] to predict tomato yield using climate big data, irrigation amount, and soil profile water content, and to estimate SWTD and ET0 from climate data, respectively. Before training a neural network, hyperparameters should be established. The hyperparameters determine the model structures and training strategy [44]. For the BLSTM model, the following hyperparameters were set: the number of layers, the number of hidden units, the dropout size, the learning rate, the learning rate decay, and the batch size. These parameters are set during training based on what is called a validation set, which is used to evaluate the model during training, selecting the hyperparameters with the best validation metric [44]. Table 3 shows the parameters used for each BLSTM model. As in [25], these BLSTM models were used to set up the agent's environment. The BLSTM1 was used as a function to estimate the net return at the end of the season using Equation (16):

$$r = \text{yield} \cdot p_{yield} - \text{water} \cdot p_{water} \qquad (16)$$

where $p_{water}$ is the price of 1 mm of water over 1 ha and $p_{yield}$ is the price of 1 kg of yield. The price of irrigation per 1 mm over 1 ha and the tomato price were nearly 0.5 USD and 728.2 USD/tonne, respectively ([45] and www.tridge.com (accessed on 15 July 2021)). The BLSTM2 was deployed to estimate the soil water content for the next day and to determine the next state of the agent. Algorithm 1 shows the environment created for the A2C agent, and Table 4 explains the parameters used in Algorithm 1.
2) after the last irrigation chosen by the agent. During these four days, the action is zero for the first three days and is selected by the agent on the fourth day. Since the states are time series, a two-layer LSTM with 256 nodes was used to estimate the value function V and the policy. The LSTM model receives the current state of the environment and outputs two values. One is the probability of executing each action from the current state, and the other is the value of the action executed by the agent.
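The net-return reward of Equation (16) and the four-day action pattern can be sketched as below. The prices are the approximate figures quoted in the text; the conversion of the tomato price to a per-kilogram value (1 tonne = 1000 kg), the function names, and the example yield and water amounts are assumptions for illustration. The yield itself would come from the trained yield model, which is not reproduced here.

```python
from typing import List

PRICE_WATER_PER_MM_HA = 0.5          # approx. price of 1 mm of water over 1 ha
PRICE_YIELD_PER_KG = 728.2 / 1000.0  # approx. tomato price per tonne, converted to per kg


def net_return(yield_kg_per_ha: float, water_mm: float) -> float:
    """Net return per hectare: income from yield minus cost of irrigation water."""
    return yield_kg_per_ha * PRICE_YIELD_PER_KG - water_mm * PRICE_WATER_PER_MM_HA


def expand_to_daily(agent_actions_mm: List[float], interval_days: int = 4) -> List[float]:
    """Turn one agent decision per interval into a daily schedule:
    zero on the first days of each interval, the chosen amount on the last day."""
    daily: List[float] = []
    for amount in agent_actions_mm:
        daily.extend([0.0] * (interval_days - 1) + [amount])
    return daily


# Hypothetical season: 100 t/ha of tomatoes with 300 mm of irrigation.
print(net_return(yield_kg_per_ha=100_000.0, water_mm=300.0))
print(expand_to_daily([20.0, 0.0]))
```

Keeping the reward in monetary units lets a single scalar trade off yield against water use, which is what allows the two agents to be compared on the same footing.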
For the training set, the first seven years of data were selected, and for the test set, the last two years of data were selected.

BLSTM Models Evaluation
In this work, the trained BLSTM models of [29,43] were used as features in the agent's environment. The BLSTM model for tomato yield achieved an R$^2$-score of 0.97 and a Root Mean Square Error (RMSE) of 366 kg/ha on the test data set, and the BLSTM model for predicting SWTD achieved an RMSE of 6.841 mm and an R$^2$-score of 0.98 on the test data set for tomato yield. The logarithm of the Equation (2) was used to accelerate the convergence of the agents. Figure 7 shows the average rewards of the A2C and DQN agents during training. The average is calculated every fifty episodes. Whenever the average reward improves, the weights of the Actor-Critic network are saved. As can be seen in Figure 6, the training of the A2C agent is more stable than that of the DQN agent. This is because the DQN agent selects the action based on the epsilon-greedy method, and the training of the Q function is independent of the action selected by the agent, while the A2C agent is an on-policy model, and during training, the policy is also trained to select the best possible action. The agent was tested with the 2018 and 2019 datasets. Figures 8 and 9 show the SWTD when the A2C model is used for irrigation and the volume of water selected by the agent for irrigation in 2018 and 2019 compared to the DQN agent. The A2C agent skipped irrigation early in the season and began irrigating in mid-season. In the results, the SWTD predicted by the environment of the A2C model is lower than the SWTD predicted by the environment in the DQN model.
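The checkpointing rule described above (average the reward every fifty episodes and save the weights when the average improves) can be sketched as follows. The callback `on_improve` stands in for whatever saving routine is used, for example the Keras method `model.save_weights`; the function name, the callback signature, and the short window in the example are assumptions.

```python
from typing import Callable, List, Optional


def track_best_average(
    episode_rewards: List[float],
    window: int = 50,
    on_improve: Optional[Callable[[int, float], None]] = None,
) -> List[float]:
    """Average rewards over consecutive blocks of `window` episodes and
    call `on_improve(episode_count, average)` whenever a new best is reached."""
    best = float("-inf")
    averages: List[float] = []
    for end in range(window, len(episode_rewards) + 1, window):
        block = episode_rewards[end - window:end]
        average = sum(block) / window
        averages.append(average)
        if average > best:
            best = average
            if on_improve is not None:
                on_improve(end, average)   # e.g., save the network weights here
    return averages


# Hypothetical rewards; a window of 2 keeps the example short.
saved = []
averages = track_best_average(
    [1.0, 3.0, 2.0, 2.0, 5.0, 7.0],
    window=2,
    on_improve=lambda episode, avg: saved.append((episode, avg)),
)
print(averages, saved)
```

Saving only on improvement means the weights kept at the end of training are those of the best-performing block, not necessarily the final one.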

Evaluation of the A2C Agent
Table 5 compares the trained DQN agent, the A2C agent, and the best result, in terms of net return, of the threshold irrigation method with a fixed amount from [25]. In the threshold method, SWTD is calculated every four days, and if it is below a threshold, a fixed amount of water is used for irrigation. Although the productivity in the case of the DQN model is higher than the productivity in the case of the A2C model, water consumption is on average 21.5% lower with the A2C model. In addition, the net return with the DQN method is on average 3.5% higher than with the A2C model. Thus, the A2C method uses less water; however, the net return is slightly lower.
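For reference, the threshold baseline described above reduces to a simple rule, sketched below. The threshold value, the fixed amount, and the SWTD series are hypothetical. The sketch also treats the SWTD series as given, whereas in the setup of this paper the soil-water model would update SWTD after each irrigation.

```python
from typing import List


def threshold_irrigation(
    swtd_series_mm: List[float],
    threshold_mm: float,
    fixed_amount_mm: float,
    interval_days: int = 4,
) -> List[float]:
    """Check SWTD every `interval_days` days and apply a fixed amount of
    water whenever it is below the threshold; otherwise apply nothing."""
    schedule: List[float] = []
    for day, swtd in enumerate(swtd_series_mm, start=1):
        is_check_day = day % interval_days == 0
        if is_check_day and swtd < threshold_mm:
            schedule.append(fixed_amount_mm)
        else:
            schedule.append(0.0)
    return schedule


# Hypothetical eight-day SWTD series (mm).
swtd = [120.0, 110.0, 100.0, 90.0, 95.0, 92.0, 88.0, 85.0]
print(threshold_irrigation(swtd, threshold_mm=100.0, fixed_amount_mm=20.0))
```

Unlike the learned agents, this rule cannot vary the amount of water or anticipate weather, which is one reason it serves as a baseline.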
As mentioned in the introduction, the objective of the work [25] was to minimize water consumption without affecting the net return of the farmer. The trained model increased the farmer's net yield by 11% and reduced water consumption by 20-30% compared to a fixed and a threshold method. The main objective of this work was to compare the on-policy model with the off-policy model with the same goal as [25], namely to reduce water consumption without affecting the net return to the farmer. As can be seen from Table 5, both automatic methods (DQN and A2C) performed better in terms of water consumption and net return compared with the threshold method. Moreover, the average rainfall in 2018 was lower than the average rainfall in 2019, and both automatic models learned to irrigate more when rainfall was lower and to adjust the irrigation to climatic changes.

Conclusions
Most water is used in agriculture, and much of that water is wasted due to the lack of efficient irrigation systems. Water has become a scarce resource. Therefore, it is important to create an efficient irrigation system without compromising productivity.
In [25], the ability of the DQN model to schedule the irrigation of an agricultural field was studied. In this work, the A2C model was trained in the same way to compare the performance of A2C with DQN in scheduling irrigation. The goal was to train the agent to achieve high crop productivity with efficient water use. The agent decided when and how much water was needed for irrigation. The models were trained with seven years of data and tested with two years of data. A disadvantage of the deep-learning method is that a large amount of data is needed to train a model. To overcome this problem, the simulation software DSSAT was used. The tomato yield was simulated based on the different irrigation schedules.
The same environment for DQN was used for the A2C model. The trained A2C agent reduced water consumption by up to 20% compared to the DQN agent; however, productivity decreased slightly. These results show that an on-policy model, such as the A2C model, can achieve better performance in terms of water savings compared with an off-policy model, such as the DQN model. Therefore, the on-policy model is more appropriate than the off-policy DQN model for regions with limited water sources. Both automatic models learn to irrigate more when there is less rain in a year, and they adjust irrigation to the changing climate.
They also both outperformed the threshold method. However, training the models with both simulations and real data sets makes the model more reliable. For a real-world application, real-world data should be collected to re-train the models. In addition, the evaluation of the net return is based on the prices of these years. In this sense, these results will be incorporated into the BioD'Agro project (https://biodagro.com (accessed on 6 May 2022)). The main objective of the BioD'Agro project is to develop an information system for remote monitoring of a vineyard and to assist producers in making decisions that promote agrobiodiversity.
To this end, BioD'Agro will work by combining data from in situ sensors that integrate the parameters of the vine (health and water status), the environment (climate and soil), and functional biodiversity (flora, arthropods, and bats) with earth observation imagery and, through the use of machine learning and artificial intelligence, provide a web platform where growers can monitor the water status of their vines or the presence of pests in real time while evaluating and deciding how to manage water efficiently or control pests in an environmentally friendly way. Thus, the data collected by the IoT sensors in the vineyard will be used to train the model developed in this paper to improve irrigation scheduling.

Figure 2. Map of the Fadagosa region.

Figure 5. The A2C model interacting with the environment.

Figure 6 shows the actor and critic loss during training. Each episode lasted 7 s. The actor's loss at the end of the training tends to zero, and the agent chooses the action that has the best reward.

Figure 7. From left to right, the A2C and DQN rewards during training.

Figure 8. Comparison of the irrigation amount of the trained DQN and A2C models. The time step runs from the beginning of the season to the end of the season in four-day intervals.

Figure 9. Comparison of SWTD of the trained DQN and A2C models.

Table 3. Selected hyperparameters for the tomato yield estimation model (BLSTM1) and soil moisture estimation model (BLSTM2).

Table 4. The parameters used in Algorithm 1.

Table 5. Comparison of net return of the DQN and A2C agent. The arrow next to each value indicates the increase or decrease of that value compared to the A2C method.