Analysis of Explainable Goal-Driven Reinforcement Learning in a Continuous Simulated Environment

Abstract: Currently, artificial intelligence is in an important period of growth. Due to the technology boom, it is now possible to solve problems that could not be resolved previously. For example, through goal-driven learning, intelligent machines or agents may be able to perform tasks without human intervention. However, this also leads to the problem of understanding the agent's decision-making. Explainable goal-driven learning attempts to close this gap. This work focuses on the adaptability of two explainability methods in continuous environments. The methods, based on learning and introspection, propose a probability of success to explain the agent's behavior; both had previously been tested in discrete environments. The continuous environment used in this study is the car racing problem, a simulated car racing game that forms part of the Python OpenAI Gym library. The agents in this environment were trained with the Deep Q-Network algorithm, and the explainability methods were implemented in parallel. This research includes a proposal for adapting and implementing these methods in continuous states. The adaptation of the learning-based method required major changes, implemented through an artificial neural network. The probabilities obtained by both methods were consistent throughout the experiments, with the learning-based method producing the greater probability. In terms of computational resources, the introspection-based method was slightly better than its counterpart.
Author contributions: …, B.F.; investigation, E.P., F.C. and A.A.; methodology, F.C. and B.F.; project administration, F.C.; software, E.P.; supervision, F.C. and B.F.; validation, F.C. and A.A.; visualization, E.P.; writing—original draft, E.P.; writing—review and editing, F.C., A.A. and B.F.


Introduction
Machine learning has become widespread in the daily life of individuals. Without our realizing it, more and more algorithms are analyzing data and learning to help us with tasks such as recommending personalized movies or music [1], searching for deleted emails [2], facial recognition [3], medical applications [4], and autonomous vehicles [5]. Therefore, in some circumstances, it is important that users understand the information these machine systems or robots are providing; otherwise, the systems would be unreliable. For example, it would not be useful to have an algorithm that recommended movies we do not like, and it would be serious if a medical system made a faulty negative diagnosis during the evaluation of a patient with cancer. In this context, explainable artificial intelligence plays an important role, since it comprises tools, techniques, and algorithms that provide an agent with the ability to explain its actions to humans intuitively [6,7].
In human-agent interaction scenarios, it is possible that a non-expert user does not understand why the agent makes a specific decision. As a result, some questions may arise on the part of the user under these circumstances: why?; why not?; how?; what?; and what if? [8].
As systems may be represented by white-, black-, and gray-box models [9,10], explainability aims to provide explanations to non-expert users so that they can understand decisions taken by an agent represented by a black-box model [11,12]. In the context of Explainable Goal-driven Reinforcement Learning (XGDRL) [13], we might consider a robot in a laboratory that turns to the right at intersection A, while the user does not understand why or how the robot arrived at that solution. The user could then ask: Why have you turned to the right? The robot could give the following answer: "I have turned to the right because it is the option with the most possibilities for reaching the goal". The problem focuses on the understanding and trust established between an autonomous system and a non-expert human at the time of the interaction; that is, whether the user can understand the reasons provided by the agent [14]. In this situation, the problem highlights the understanding of agent decisions, an area where, according to Sado et al. [15], research is still required. Therefore, increasing the explainability of agents would benefit these systems, since any user would be capable of understanding the actions taken, which in turn facilitates trust.

Explainable Reinforcement Learning
Reinforcement learning (RL) is a machine learning paradigm that solves goal-oriented tasks through iterative interactions with the environment [16]. In each iteration step, the RL agent must choose an action for its current state that should increase the final expected reward. RL has been widely used in human-agent interaction (HAI) studies [17,18], where autonomous systems work with or for humans. In HAI scenarios where trust is important, the explainability of artificial intelligence becomes relevant. Through transparency in the systems, users can become capable of understanding and trusting the decisions made by an intelligent agent [14]. Explainable artificial intelligence is fundamental for cooperative systems between humans and machines, increasing effectiveness between them. Thus, for example, if an agent recommends that an investor select option "X" to buy or invest, the investor can be sure that what the agent suggests is trustworthy, since the agent is capable of explaining the selection. Various studies have been carried out in the field of explainable artificial intelligence. For instance, Adadi and Berrada [19] surveyed this area, collecting common terms and classifying the existing explainability methods. Its applicability is recognized in research areas such as transportation, health, law, finance, and the military. On the other hand, Lamy et al. [20] proposed an explainable artificial intelligence method through Case-Based Reasoning (CBR) using the Weighted k-Nearest Neighbor (WkNN) algorithm, applied to the diagnosis of breast cancer.
With regard to Explainable Reinforcement Learning (XRL), it seeks to provide the agent with a method to explain itself within the learning process. Different studies approach this from recommendation systems [21], enriching the suggestions presented to users. Other studies apply it to controlling the flight of a drone [22], where explainability involved visual and textual explanations of the characteristics obtained. Further studies presented explainability in a friendlier way for the user. For example, Madumal et al. [23] proposed a causal structural learning model used during the goal-driven learning process. The model is used to generate explanations through counterfactual analysis, resulting in an explanation of the chain of events. This was used with participants who watched agents play a strategy game (StarCraft II) in real time. While the video was playing, the participants could ask why, or why not, a particular action was taken. Sequeira et al. [24] proposed a method for goal-driven learning agents through introspective analysis, suggesting three levels of analysis: first, the collection of information of interest about the surroundings; second, an analysis of the interaction with the surroundings; and third, combining the results obtained and carrying out a final analysis of them.
Cruz et al. [25] worked on memory-based explainable goal-driven learning. This method uses an episodic memory that allows explaining decisions in terms of a probability of success that depends on the number of steps needed to reach the goal. However, problems occurred in complex scenarios because the memory was finite. In this sense, Cruz et al. [26] expanded this research by increasing the number of approaches, maintaining the memory-based approach and adding approaches based on learning and introspection. These used an episodic scenario with deterministic and stochastic transitions. However, they were not directly applicable to continuous situations in the real world.
Recent work from Milani et al. [27] summarizes different explanation methods for RL algorithms under a new taxonomy based on three main groups:
• Feature importance (FI), which explains the context of an action or what feature influenced the action.
• Learning process and MDP (LPM), which explains the influence of experience over the training or the MDP components that led to a specific action.
• Policy level (PL), which explains the long-term behavior as a summary of transitions.
Regarding these categories, our work proposes an LPM explanation, using model-domain information in a continuous-state problem. The proposed method allows the justification of actions selected by the agent in order to improve the trust existing in human-agent environments [28].

Methods and Proposed Architecture
Taking into account the methods based on learning and introspection presented by Cruz et al. [26], an adaptation for a continuous state space was proposed for this study. These methods use the probability of success, i.e., the probability of reaching the goal, in order to provide an explanation of the agent's decision-making. For instance, an agent executing a task might perform different actions a_1 and a_2 from a particular state s, leading to different probabilities p_1 and p_2 of completing the task successfully. The probability is linked to the experience the agent has collected during training and can be estimated using the learning-based and introspection-based methods. The learning-based method relies on continuously learning the probability of success through the ongoing execution of the goal-driven learning algorithm. The introspection-based method uses the Q-values to infer the probability of success.

Learning-Based Method
The learning-based method uses the same idea as goal-driven learning, shown in Equation (1) and related to temporal-difference algorithms, where Q(s_t, a_t) is the state-action value for a state s and action a at time t, α is a constant step-size parameter, r_{t+1} is the observed reward, and γ is the discount rate parameter. Cruz et al. [26] proposed Equation (2) as a possible solution for calculating the probability of success within discrete environments, where P(s_t, a_t) is the probability of success for a state s and an action a at time t, α is a constant step-size parameter, and φ_{t+1} is a binary value representing whether the task is completed successfully at time t + 1. However, for continuous environments, this equation causes problems because the state space presents infinite possibilities, making it impossible to use a table of state-action pairs. As a result, we propose working with a parallel artificial neural network, with the same structure as that used by the goal-driven learning algorithm, to predict the probability of success.
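As a minimal illustration of the discrete case that Equation (2) addresses, the sketch below nudges a tabular estimate P(s, a) toward the success flag φ with a TD-style update. This is only a sketch in the spirit of the equation; every name here (update_success_probability, ALPHA) is illustrative rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical tabular sketch (names are illustrative, not from the paper):
# P maps a state-action pair to its estimated probability of success.
P = defaultdict(float)
ALPHA = 0.1  # constant step-size parameter (alpha in the text)

def update_success_probability(s, a, phi, s_next, a_next, done):
    """TD-style update toward the success flag phi, in the spirit of Equation (2)."""
    if done:
        target = float(phi)            # phi = 1 if the task succeeded, else 0
    else:
        target = P[(s_next, a_next)]   # bootstrap from the next state-action estimate
    P[(s, a)] += ALPHA * (target - P[(s, a)])
    return P[(s, a)]
```

In a continuous state space such a table cannot be enumerated, which is exactly what motivates replacing P with a parallel neural network.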
The artificial neural network focused on the probability of success used the same structure as that of the DQN algorithm [29], as illustrated in Figure 1. Moreover, the network was updated using the same loss function as the DQN algorithm. The only difference was that the last dense layer used a sigmoid activation function in order to restrict the values to the probability range between 0 and 1. Algorithm 1 shows the implementation of the DQN method with the learning-based approach. In the algorithm, y_j and y_pj are the approximate target values for the DQN algorithm and for the network used in this method, respectively. θ denotes the parameters of the networks. C is the number of steps after which each target network is updated with the parameters of its main network; this number is chosen arbitrarily and, for this work, was empirically set to 5. M is the number of episodes. The notation x_j, where x ∈ {s, a, r, done}, refers to the transition elements within the memory D. The calculation of the probability using this method is illustrated between lines 16 and 18.
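The effect of the sigmoid output layer can be illustrated with a toy forward pass. This is not the paper's Keras network; the layer sizes and weights below are arbitrary. The same dense head produces unbounded Q-like values under a linear activation, while a sigmoid squashes the outputs into (0, 1) so they can be read as probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(12, 64))   # one output per each of the 12 actions
b = rng.normal(size=12)
features = rng.normal(size=64)  # stand-in for the shared hidden-layer activations

q_values = W @ features + b              # linear head: Q-values, unbounded
probs = 1.0 / (1.0 + np.exp(-q_values))  # sigmoid head: values in (0, 1)
```

The shared hidden structure is identical in both cases; only the last activation differs, which is why the two networks can share the DQN architecture and loss.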

Figure 1. Diagram of the neural network used to calculate the Q-values using the DQN algorithm and the probability of success using the learning-based method. The network ends in a dense layer (64×216) followed by an output layer (64×12), with a linear activation function for the Q-values and a sigmoid activation function for the probability of success. When computing the probability of success, the sigmoid activation in the last dense layer restricts the output values between 0 and 1. Three consecutive 96 × 96 gray images of the car racing game are used as inputs.

Introspection-Based Method
Cruz et al. [26] proposed Equation (3) to compute the probability of success according to the introspection-based method. This solution consists of estimating the probability from the Q-value through a mathematical transformation, so no additional memory is needed. In the equation, σ is a parameter for stochastic transitions (in this case 0), Q*(s, a) is the estimated Q-value, R_T is the total reward obtained at the terminal state, and [·]_{P_S ≥ 0}^{P_S ≤ 1} is the rectification used to restrict the value to the interval [0, 1].
For continuous environments, a normalization of the results was proposed instead of limiting them with the rectification. The change is illustrated in Equation (4), where P̂_S relates to the normalized probability and p̂ computes the probability as shown in Equation (3) without the rectification. Algorithm 2 shows an adaptation of the introspection-based method in addition to the DQN algorithm [29]. In the algorithm, the notation is the same as in Algorithm 1. The calculation of the probability for this method is carried out in line 17.
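To make the normalization step concrete, the following sketch applies a Q-to-probability transformation and then min-max normalizes the result instead of clipping it. The function q_to_prob is an assumed log-ratio form standing in for Equation (3) with σ = 0 (the exact expression is the one in the cited equation); normalized_success_probs, R_T = 1000.0, and all other names are illustrative.

```python
import numpy as np

R_T = 1000.0  # total reward obtained at the terminal state (illustrative value)

def q_to_prob(q):
    # Assumed log-ratio transformation standing in for Equation (3):
    # maps Q = R_T to 1 and decays logarithmically for smaller Q-values.
    return 0.5 * np.log10(np.maximum(q, 1e-8) / R_T) + 1.0

def normalized_success_probs(q_values):
    """Replace rectification with min-max normalization, as in Equation (4)."""
    raw = q_to_prob(np.asarray(q_values, dtype=float))
    lo, hi = raw.min(), raw.max()
    if hi == lo:                   # degenerate case: all estimates equal
        return np.ones_like(raw)
    return (raw - lo) / (hi - lo)  # normalized into [0, 1]
```

Unlike clipping, the normalization preserves the ordering of all Q-values while still producing outputs in [0, 1], which is the motivation stated in the text.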
Algorithm 1 Explainable goal-driven learning approach to calculate the probability of success using the learning-based method.
1: Initialize the Q and P functions
2: Initialize the target functions Q̂ and P̂
3: Initialize replay memory D with capacity N
4: for episode = 1 to M do
5:   Initialize state queue S with the initial state
6:   repeat
7:     ε ← update with ε-decay
8:     s_t ← dequeue current state from S
9:     Select an action a_t according to s_t using the ε-greedy policy
10:    Take action a_t; observe reward r_t and next state s_{t+1}
11:    Enqueue next state s_{t+1} into S
12:    Store the transition (s_t, a_t, r_t, s_{t+1}, done_t) in D
13:    Take a random sample of transitions (s_j, a_j, r_j, s_{j+1}, done_j) from D
14:    Compute the target y_j
15:    Update Q with respect to y_j
16:    Compute the target y_pj
17:    Update P with respect to y_pj
18:  until s_t is terminal or r_t has been negative 25 times
19:  Every C steps, update the target networks Q̂ ← Q and P̂ ← P
20: end for
21: Return Q, P

Algorithm 2 Explainable goal-driven reinforcement learning approach for computing the probability of success using the introspection-based method. The algorithm is mainly based on [29] and includes the probabilistic introspection-based method.
1: Initialize the function Q
2: Initialize the target function Q̂
3: Initialize replay memory D with capacity N
4: for episode = 1 to M do
5:   Initialize state queue S with the initial state
6:   repeat
7:     ε ← update with ε-decay
8:     s_t ← dequeue current state from S
9:     Select an action a_t according to s_t using the ε-greedy policy
10:    Take action a_t; observe reward r_t and next state s_{t+1}
11:    Enqueue next state s_{t+1} into S
12:    Store the transition (s_t, a_t, r_t, s_{t+1}, done_t) in D
13:    Take a random sample of transitions (s_j, a_j, r_j, s_{j+1}, done_j) from D
14:    Compute the target y_j
15:    Update Q with respect to y_j
16:    Obtain the Q-values for the sampled states
17:    Compute the probability of success using Equation (4)
18:  until s_t is terminal or r_t has been negative 25 times
19:  Every C steps, update the target network Q̂ ← Q
20: end for
21: Return Q

Experimental Scenario
The experimental scenario was implemented using the OpenAI Gym library [30], which facilitates the development of goal-driven learning algorithms. Specifically, the car racing game environment (see Figure 2) was selected since it is a continuous environment. Deep Q-Network (DQN) was used for learning, and Keras was used to create the neural network. Furthermore, the decaying ε-greedy method was chosen for action selection. The environment consists of a race track where a car is controlled by an autonomous agent; the objective is to complete a lap around the track. The images of the environment were preprocessed by converting them to gray scale. The state was then represented by three consecutive images of the game (Figure 3). Each image consists of a matrix of 96 × 96 pixels, so the state was represented by a 96 × 96 × 3 matrix. The initial state of the environment corresponds to a randomly generated race track where the car can start at any point on the track. Actions were formed by combining movements of the steering wheel (−1, 0, 1), acceleration (0 or 1), and braking (0 or 1), resulting in a total of 3 × 2 × 2 = 12 possible actions. Therefore, in this work, we have used a continuous state space and a discrete action space. The reward within the environment is displayed in Equation (5), where N_B represents the number of boxes on the track and f the number of frames used. It is important to take into account that the implementation included a 50% bonus if the agent used the accelerator without braking.
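The state and action representation described above can be sketched as follows, assuming CarRacing's (steering, gas, brake) action format; the grayscale conversion (a simple channel mean) and the helper names are assumptions, since the paper does not specify them.

```python
from itertools import product
import numpy as np

# Discretization described in the text: each discrete action is a
# (steering, gas, brake) triple in the continuous action format.
ACTIONS = [np.array(a, dtype=float)
           for a in product((-1.0, 0.0, 1.0),  # steering wheel
                            (0.0, 1.0),        # acceleration
                            (0.0, 1.0))]       # brake

def to_gray(frame_rgb):
    """Collapse a 96x96x3 RGB frame to grayscale (simple channel mean; assumed)."""
    return frame_rgb.mean(axis=2)

def stack_state(frames):
    """Stack the three most recent grayscale frames into a 96x96x3 state."""
    return np.stack([to_gray(f) for f in frames[-3:]], axis=2)
```

The 3 × 2 × 2 product yields exactly the 12 discrete actions mentioned in the text, while the frame stack gives the agent short-term motion information that a single image cannot provide.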

Results
Two experiments were conducted for this research. The first experiment focused on adapting the methods for calculating the probability of success: one agent was trained with the DQN algorithm five times, and the learning-based and introspection-based methods were used to compute the probability of success within the experimental scenario. In the second experiment, the use of resources was analyzed; in this regard, one agent was trained using each of the proposed probabilistic methods for three runs. The parameters used during this process included: initial ε = 1.0, ε-decay = 0.9999, and learning rate α = 0.001.
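The ε schedule used above can be sketched as follows; the floor eps_min and the helper names are assumptions, while the initial value 1.0 and the decay 0.9999 are the reported parameters.

```python
import random

def make_epsilon_schedule(eps=1.0, decay=0.9999, eps_min=0.01):
    """Multiplicative epsilon decay; eps_min is an assumed floor, not from the paper."""
    def step():
        nonlocal eps
        eps = max(eps_min, eps * decay)
        return eps
    return step

def epsilon_greedy(q_values, eps):
    """With probability eps explore a random action; otherwise pick the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With a decay of 0.9999 per step, exploration falls off slowly: after 1000 steps ε is still roughly 0.9, so early training remains strongly exploratory.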

Learning-Based Method
Results were obtained as shown in Figure 6. These consist of the averages of the probabilities of success for each episode during five training processes. For each agent, an average of the probabilities over all of the actions was computed using the dedicated artificial neural network. In the first 75 episodes, the average probability remained relatively constant at P ≈ 0.5. Around episode 75, the agents showed an improvement in probability, which stopped at approximately episode 100 with a value of P ≈ 0.75. The probability of success then fluctuated around this value for the remainder of the training. This can be interpreted as the agent having, around episode 500, a 75% probability of completing the task. Note that determining the best action to perform is not the role of the explainability method; that decision is made by the DQN algorithm. Therefore, the results obtained show the probability of success assuming the optimal action for each possible state. For example, if a car racing player asked the agent why it turned left instead of right on the first curve of the track, using the probability of success the agent might respond: "I have a 75% probability of finishing the track if I select this action". However, if it were to respond with the Q-value ("I have a Q-value ≈ 80"), this would make no sense to the player.
Even though the proposed adaptation was directly related to the Q-values, a stepped behavior was observed for the probability, whereas the Q-values showed an approximately linear increase.

Introspection-Based Method
The results for the introspection-based method are displayed in Figure 7. These consist of the average of the probabilities of success for five training processes during 500 episodes, similar to the previous method. For each agent, the average of the probabilities for the twelve actions was obtained using Equation (4). The average probability of success increased during the first 75 episodes, reaching a value of P̂_S = 0.5. It then continued to increase subtly, ending with a probability of success of P̂_S = 0.62. This can be interpreted as the agent having, at episode 500, a 62% probability of completing the task. A logarithmic behavior was observed, consistent with the function presented in Equation (3). Similarly, the agent could explain to the user that it had, for example, a 62% probability of finishing the track if it continued going straight in the last section instead of turning right. What is important here is that the user understands the decision-making.

Use of Resources
An additional experiment was carried out to measure the use of resources of the methods adapted to the continuous environment. This experiment was performed with one agent for each method; each agent was trained three times during 200 episodes. The goal of the experiment was to measure RAM memory and processor (CPU) usage. For this, the Python psutil library was used.
In the training environment, the agent processed and stored images, so the amount of memory used in this process was significant. As a result, both experiments were carried out in periods of 40 episodes until the total number of episodes for each experiment was completed. Under these conditions, the memory used and the CPU usage were recorded for each period and each agent. For memory, all of the recorded periods were added together; for the CPU, the periods were averaged. The computer used for these experiments had a Windows Pro operating system, an Intel® Core™ i5-8500 @3 GHz processor, and two HyperX® FURY 8 GB DDR4 2666 MHz RAM modules.
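The aggregation just described (memory readings summed over the 40-episode periods, CPU readings averaged) can be sketched in pure Python; in the actual experiments the readings would come from psutil (e.g., psutil.virtual_memory() and psutil.cpu_percent()), and the sample values below are purely illustrative.

```python
def aggregate_usage(memory_per_period_gb, cpu_percent_per_period):
    """Sum memory readings across periods; average the CPU readings."""
    total_memory = sum(memory_per_period_gb)
    mean_cpu = sum(cpu_percent_per_period) / len(cpu_percent_per_period)
    return total_memory, mean_cpu

# Five periods of 40 episodes each = 200 episodes, as in the experiment.
# These readings are made-up placeholders, not the paper's measurements.
memory_gb = [10.1, 10.3, 10.4, 10.4, 10.5]
cpu_pct = [55.0, 60.0, 58.0, 61.0, 59.0]
total_gb, avg_cpu = aggregate_usage(memory_gb, cpu_pct)
```

Summing memory while averaging CPU reflects the asymmetry of the two resources in this setup: the replay memory grows across a run, while CPU load is a rate that is meaningful per period.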
The results are illustrated in Figure 8, where both methods show similar values for CPU usage. With regard to the memory used, a slight difference occurred in favor of the introspection-based method, with an average of 93% of memory occupied versus 98% for the learning-based method, as well as a lower standard deviation, which indicates greater stability in the use of this resource.

Discussion
In the learning-based method, changes were made during the tests to adapt it to a continuous state space. Initially, the neural network used for the probability of success was trained to target Equation (2) directly. Thus, a fixed discount factor of 1 was used, as it appeared in the equation. In addition, the reward was not used to update the values; instead, the binary variable φ was used, corresponding to 0 if the task had not been completed and 1 if it had.
After a few experiments, some inconsistencies were observed in the results, traceable to the structure of the neural network: its layers used rectified linear unit (ReLU) activation functions, so the probability was not limited between 0 and 1. As a result, the activation of the last layer was changed to a sigmoid with the goal of obtaining values in the interval [0, 1]. This change alone did not produce good results: throughout the episodes, the probability remained static at a value of 0.6.
Finally, the adaptation method proposed in Section 3, similar to the DQN algorithm, used a network to learn the probability of success, maintaining the discount factor and the reward for calculating it. The difference was that the last layer used a sigmoid activation function.
With regard to the introspection-based method, an attempt was similarly made to implement it as shown in Equation (3). However, the results suggested that the upper limit cut off part of the results. Normalization of the data was then proposed so that it would remain in the interval [0, 1]. Therefore, in Equation (3), the notation [·]_{P_S ≥ 0}^{P_S ≤ 1}, which represented the rectification, was replaced with the normalization function presented in Equation (4), where P̂_S corresponds to a set of success-probability data calculated with the transformed values of Q and R_T.
The learning-based method underwent the most changes due to the replacement of the (discrete) probability table with an artificial neural network for learning the probabilities in the continuous case. In addition, the probability calculation was proposed to be similar to that of the DQN algorithm. For the introspection-based method, the proposal was to replace the rectification of the probability with a normalization of the obtained values.
On the other hand, in the context of the second experiment, it was estimated a priori that the learning-based method would be more expensive in terms of resources. However, with the proposed adaptation, its RAM usage remained close to that of the introspection-based method. Since the algorithm proposed for discrete settings occupied a table with the same dimensions as the Q-value table, the adaptation of the learning method replaced it with an artificial neural network, trading memory use for processing use. Even though this was an improvement in memory use, the method still needed more memory than the introspection-based method.
Given the obtained results, including the estimated probability of success using the learning-based and introspection-based methods as well as the use of resources, we hypothesize that the learning-based method would be preferred in simpler scenarios in which running a parallel neural network training is not expensive, as it gives a better estimation of the probability (as in the car racing game scenario). Conversely, in more complex scenarios, or when the computational resources are too limited to run an additional neural network, the introspection-based method should be considered, as the estimation of the probability can be computed directly from the Q-values and, therefore, no additional memory is needed.

Conclusions
In the present study, two explainability methods, one based on learning and the other on introspection, were adapted and tested in a continuous environment. The advantage of estimating the probability of success with the learning-based and introspection-based methods, in comparison to a memory-based method [25], is that the latter implies saving the agent's transitions into a memory in order to compute the probability of success later on. Although the memory-based method is an effective and reliable way to compute the probability of success, it becomes very inefficient as the number of transitions grows, which is especially relevant in continuous or real-world scenarios. Therefore, computing an estimation of the probability of success using either the learning-based or the introspection-based method allows the proposed explainability framework to scale to more complex scenarios.
Two experiments were conducted, focused on the explainability methods and their use of resources. In the first case, the learning-based method kept the idea of learning the probability of success through training, but with a change in how this was performed, obtaining an average of P = 0.75 at the end of training. On the other hand, the introspection-based method kept the Q-value transformation but replaced the rectification with a normalization of the data, resulting in an average of P̂_S = 0.62. From these advances, human-agent interaction methods can be improved to become closer to reality; in essence, this could create the trust necessary for them to function fully. In the second experiment, with respect to the use of resources, CPU usage did not present any significant differences between the two methods. However, the learning-based method used more memory, with an average of 51.66 GB for 200 episodes in comparison to 48.48 GB for the introspection-based method, due to the operation of the second artificial neural network used by the learning-based method.
In this work, we have used the well-known DQN algorithm; however, more efficient RL algorithms should be considered in the future. Although agent performance was not the aim of this work, we hypothesize that a better base RL method may also lead to better explanations. Interesting algorithms to explore further are Soft Actor-Critic [31] and Rainbow [32], the latter combining previous deep RL methods such as double Q-learning, prioritized replay, dueling networks, multi-step learning, distributional RL, and noisy nets, including recommended parameters. Moreover, as we have only used the car racing game scenario with a continuous state space, additional experiments are needed for a comprehensive evaluation in continuous environments, for instance, including the environments evaluated by Mnih et al. [33] or environments with continuous action spaces [34]. Nevertheless, our work presents a baseline for further evaluation of explainable goal-driven reinforcement learning methods in continuous scenarios.