Appendix A
In this section, more details about MDPs, the DP value iteration method, and the RL Q-Learning algorithm implementation are given.
Given an MDP with a set of states and actions 
, a reward function 
, a state transition probability distribution 
 (from all states 
 to all their successor 
), and a discounted factor 
; the random variables 
 and 
 depend on the preceding state and action. That is, the probability of those variables occurring at the same time 
 is conditioned on the previous state 
 and the action 
 taken, as defined in Equation (A1).
        
        for all 
, 
 and 
. Therefore, the function 
 defines de dynamics of the MDP.
Recall that the objective of an MDP is to maximize the expected discounted cumulative reward in a trajectory. Equation (A2) defines the expected reward in an episodic MDP.
        
        where 
T corresponds to the last time-step of the episode. Time-step 
T leads to terminal states. Once a terminal state is reached, the episode is ended. In this formal context, a policy 
 establishes a mapping from states to actions. Hence, 
 expresses the probability of selecting an action 
 given a state 
 and following the policy 
. The value function under a policy 
 indicates how good is to be in each state 
 based on the expected discounted cumulative reward and following policy 
 to choose the sequent actions. The value of a terminal state is always 0. The Bellman equation for the value function 
 is presented in Equation (A3). This expression is directly derived from Equations (A1) and (A2). It expresses the recursive relationship between the value of a state and the values of its successor states and represents a basis for many more complex derivations in MDPs. These equations are widely applied to solve control and decision-making problems.
        
        for all 
. A state-value function 
 is closely dependent on the policy followed. Therefore, an optimal value function 
 is obtained by following an optimal policy 
, thus taking the action that maximizes the future rewards. The 
 function is defined in Equation (A4) and it is known as the Bellman optimality equation.
        
        for all 
. Similarly, the action-value function 
 defined for a policy 
 is the expected reward of taking action 
 in a state 
 and following the policy 
. Equations (A5) and (A6) express the action-value function and its Bellmann optimality equation versions, respectively.
        
        for all 
 and 
.
The Bellman equation for the value function [
24] gives a recursive decomposition that can be used to iteratively improve this function, and which at some point will converge to the optimal value function 
. The V-table stores and reuses solutions.
A DP value iteration method [
24] is used to approximate the optimal value function 
 in the V-table. Equation (A7) presents the recursive version of the Bellman equation to improve the value function. In the experiments presented, the environment is episodic and the discount factor 
 is set to 1. Moreover, the environment is deterministic and non-stochastic, due to the assumption of constant variables. Therefore, the recursive function considered for this task is the simplified version presented in the Equation (A8).
        
        for all 
and 
. When no more improvements can be made (or very small improvements happen), the value iteration algorithm returns an approximation to the optimal value function 
 that maps states to expected cumulative reward values in a tabular form (V-table). A threshold 
 is set to define stopping criteria for the algorithm since infinite iterations would be needed to converge exactly to 
. The V-table is used to compute the (near) optimal policy 
 for decision-making by choosing the action that maximizes expected transition values. Equation (A9) presents the policy derived from the value function approximation.
        
        for all 
and 
. 
The pseudo-code of the presented DP value iteration algorithm is illustrated in Algorithm A1.
		
| Algorithm A1: Value iteration | 
| 1. | 
| 2. | 
| 3. | 
| 4. | 
| 5. | 
| 6. | 
| 7.    end | 
| 8. end | 
| 9. | 
Reinforcement Learning: Q-Learning
Q-Learning is a Temporal Difference (TD) control algorithm for TD-Learning control used in RL [
24]. The learned state-action value function 
 directly approximates the optimal state-action value function 
, regardless of the followed policy. In Q-Learning, a value is learned per each state-action pair and stored in the Q-table. Q-table updates are made after every interaction step by modifying previous beliefs with the new transition information. In this case, the learned action-value table, that directly approximates 
, is obtained by a sequence of 
 functions. The state-action value sequential updates are presented in Equation (A10) by doing repetitive updates in the Bellman equation. For the sake of comparison of the methodologies, in this work we are working with an episodic problem, such as in the DP case, and 
 is set to 1.
        
        for all 
 , and 
. The 
 value is the learning rate or step size. The learning rate is used for weighting the importance of new updates in the value function. A factor of 0 would give all the weight to the past knowledge and the agent learns nothing from new experiences, while a factor of 1 gives all the weight to new experiences forgetting all past knowledge. Any intermediate value balances the importance of past knowledge with the new one. 
Q-learning is an off-policy algorithm that uses two different policies in the state-action values updates: the behaviour policy and the target policy.
The behaviour policy 
 is used to manage the trade-off between exploitation and exploration. In this work, the behaviour policy is an 
-greedy decay exploration strategy with respect to 
 where the 
 value is decayed after each learning step. Therefore, the behaviour 
 is converging to the target policy by decreasing the 
 value. The behaviour policy is expressed in Equation (A11).
        
        for all 
. The 
 is linearly decayed until a minimum value of 0.01 is reached at the last training step. Equations (A12) and (A13) presents the 
 updates.
        
        where 
, 
episodes, and 
batch_size are parameters of the process and 
 represents the number of training steps.
The target policy 
 is greedy with respect to 
, hence it is deterministic. It is used to select the action with the higher expected cumulative reward. After training, the Q-table is used to compute the (near) optimal policy 
 for decision-making by choosing the action that maximizes expected learned values. Equation (A14) presents the target policy 
 derived from the state-action value function approximation.
        
        for all 
and 
. 
The pseudo-code of the presented Q-Learning algorithm is illustrated in Algorithm A2.
		
| Algorithm A2: Q-Learning | 
| 1. | 
| 2. For each episode do: | 
| 3.    environment initialization | 
| 4.   While
				 is not terminal do: | 
| 5.      Sample action
				-greedy policy from | 
| 6.      Sample transition | 
| 7.      Update | 
| 8.      
				 decay ε value | 
| 9. | 
| 10.    end | 
| 11. end | 
| 11. Return |