# Iterative Oblique Decision Trees Deliver Explainable RL Models


## Abstract


## 1. Introduction

- It is conceptually simple, as it translates the problem of explainable RL into a supervised learning setting.
- DTs are fully transparent and (at least for limited depth) offer a set of easily understandable rules.
- The approach is oracle-agnostic: it does not rely on the agent being trained by a specific RL algorithm, as only a training set of state–action pairs is required.

## 2. Related Work

## 3. Methods

#### 3.1. Environments

#### 3.2. Deep Reinforcement Learning

#### 3.3. Decision Trees

#### 3.4. DT Training Methods

#### 3.4.1. Episode Samples (EPS)

- A DRL agent (“oracle”) is trained to solve the problem posed by the studied environment.
- The oracle acting according to its trained policy is evaluated for a set number of episodes. At each time step, the state of the environment and action of the agent are logged until a total of ${n}_{s}$ samples are collected.
- A decision tree (CART or OPCT) is trained from the samples collected in the previous step.
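The three EPS steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `env` is assumed to follow a simple `reset()`/`step(a)` protocol, `oracle_policy` is any callable mapping an observation to an action, and we use sklearn's CART (the paper also trains OPCTs, for which sklearn has no implementation).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def collect_episode_samples(env, oracle_policy, n_s):
    """EPS, step 2: run the oracle in the environment and log
    (state, action) pairs until n_s samples are collected."""
    states, actions = [], []
    obs = env.reset()
    while len(states) < n_s:
        a = oracle_policy(obs)
        states.append(obs)
        actions.append(a)
        obs, done = env.step(a)  # illustrative protocol: (next_obs, done)
        if done:
            obs = env.reset()
    return np.array(states), np.array(actions)

def train_dt_from_samples(states, actions, depth):
    """EPS, step 3: fit a depth-limited CART on the logged pairs."""
    tree = DecisionTreeClassifier(max_depth=depth)
    tree.fit(states, actions)
    return tree
```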

#### 3.4.2. Bounding Box (BB)

- The oracle is evaluated for a certain number of episodes. Let ${L}_{i}$ and ${U}_{i}$ be the lower and upper bound of the visited points in the ith dimension of the observation space and ${\mathrm{IQR}}_{i}$ their interquartile range (difference between the 75th and the 25th percentile).
- Based on the statistics of visited points in the observation space from the previous step, we draw ${n}_{s}$ samples from the uniform distribution over a hyperrectangle whose extent in dimension $i$ is the interval $[L_i - 1.5 \cdot \mathrm{IQR}_i,\ U_i + 1.5 \cdot \mathrm{IQR}_i]$. Each interval is clipped wherever it exceeds the predefined boundaries of the environment.
- For all investigated depths d, OPCTs are trained from the dataset consisting of the ${n}_{s}$ samples and the corresponding actions predicted by the oracle.
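The BB sampling step can be sketched as follows. Argument names are our own; the oracle is assumed to be a callable mapping a single observation to an action, and `env_low`/`env_high` stand for the predefined environment boundaries used for clipping.

```python
import numpy as np

def bb_sample(visited_obs, oracle_policy, n_s, env_low, env_high, rng=None):
    """BB sampling sketch: expand the per-dimension bounds of the visited
    points by 1.5 * IQR, clip to the environment limits, and draw n_s
    uniform samples labeled by the oracle."""
    rng = np.random.default_rng(rng)
    lo = visited_obs.min(axis=0)                      # L_i
    hi = visited_obs.max(axis=0)                      # U_i
    q25, q75 = np.percentile(visited_obs, [25, 75], axis=0)
    iqr = q75 - q25                                   # IQR_i
    low = np.maximum(lo - 1.5 * iqr, env_low)         # clip to env bounds
    high = np.minimum(hi + 1.5 * iqr, env_high)
    samples = rng.uniform(low, high, size=(n_s, visited_obs.shape[1]))
    labels = np.array([oracle_policy(s) for s in samples])
    return samples, labels
```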

#### 3.4.3. Iterative Training of Explainable RL Models (ITER)

**Algorithm 1** Iterative Training of Explainable RL Models (ITER)

**Require:** Oracle policy $\pi_O$, maximum iteration $I_{max}$, number $n_b$ of base samples, number of trees $T_{max}$, evaluation episodes $n_{eps}$, DT depth $d$, number of samples added in each iteration $n_{samp}$. $\mathcal{O}$: observations, $\mathcal{A}$: actions.

1. $S_c \leftarrow (\mathcal{O}_b, \mathcal{A}_b)$ ▹ collect $n_b$ samples by running $\pi_O$ in the environment
2. $\mathcal{M} \leftarrow \varnothing$ ▹ the set $\mathcal{M}$ collects triples for each tree
3. **for** $i = 0, \dots, I_{max}$ **do**
4. &nbsp;&nbsp;**for** $j = 1, \dots, T_{max}$ **do**
5. &nbsp;&nbsp;&nbsp;&nbsp;$T_{i,j} \leftarrow \mathtt{train\_tree}(S_c, d)$ ▹ tree $T_{i,j}$
6. &nbsp;&nbsp;&nbsp;&nbsp;$\mathcal{M}_j \leftarrow \left(T_{i,j}, \mathtt{eval\_tree}(T_{i,j}, n_{eps})\right)$ ▹ triple $\mathcal{M}_j$ (eval_tree returns $\mathcal{O}_{T_{i,j}}, \mathcal{R}_{T_{i,j}}$)
7. &nbsp;&nbsp;&nbsp;&nbsp;$\mathcal{M}.\mathtt{append}\left(\mathcal{M}_j\right)$
8. &nbsp;&nbsp;**end for**
9. &nbsp;&nbsp;$\left(T_{i,*}, \mathcal{O}_{T_{i,*}}, \mathcal{R}_{T_{i,*}}\right) \leftarrow \mathtt{pick\_best}\left(\mathcal{M}\right)$ ▹ pick the triple $\mathcal{M}_j$ with the highest reward $\mathcal{R}_{T_{i,j}}$
10. &nbsp;&nbsp;$\widehat{\mathcal{O}}_{T_{i,*}} \leftarrow \mathtt{choice}\left(\mathcal{O}_{T_{i,*}}, n_{samp}\right)$ ▹ pick $n_{samp}$ random observations
11. &nbsp;&nbsp;$S_{new} \leftarrow \left(\widehat{\mathcal{O}}_{T_{i,*}}, \pi_O\left(\widehat{\mathcal{O}}_{T_{i,*}}\right)\right)$
12. &nbsp;&nbsp;$S_c \leftarrow S_c \cup S_{new}$
13. **end for**
14. **return** $T_{I_{max},*}$ ▹ return the best tree from the last iteration
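Algorithm 1 can be sketched in Python. This is our own illustration, not the authors' implementation: `train_tree` is stood in for by sklearn's depth-limited CART (sklearn has no OPCT), and the environment is assumed to follow a `reset()`/`step(a) -> (obs, reward, done)` protocol.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def iter_train(env_factory, oracle_policy, *, I_max=3, T_max=3, n_b=200,
               n_eps=5, depth=2, n_samp=50, seed=0):
    """Sketch of Algorithm 1 (ITER). env_factory() returns a fresh
    environment; all names here are illustrative."""
    rng = np.random.default_rng(seed)

    def rollout(policy, episodes):
        """Run `policy` for some episodes; return visited obs and mean return."""
        obs_log, total = [], 0.0
        for _ in range(episodes):
            env = env_factory()
            obs, done = env.reset(), False
            while not done:
                obs_log.append(obs)
                obs, r, done = env.step(policy(obs))
                total += r
        return np.array(obs_log), total / episodes

    # Line 1: base dataset S_c from oracle episodes, labeled by the oracle.
    O, _ = rollout(oracle_policy, 1)
    while len(O) < n_b:
        O = np.vstack([O, rollout(oracle_policy, 1)[0]])
    O = O[:n_b]
    X, y = O, np.array([oracle_policy(o) for o in O])

    best = None
    for _ in range(I_max + 1):                       # lines 3-13
        trees = []
        for _ in range(T_max):                       # lines 4-8
            t = DecisionTreeClassifier(max_depth=depth,
                                       random_state=int(rng.integers(1 << 30)))
            t.fit(X, y)
            obs_t, ret_t = rollout(lambda o: int(t.predict(o[None])[0]), n_eps)
            trees.append((t, obs_t, ret_t))
        best = max(trees, key=lambda m: m[2])        # line 9: pick_best
        idx = rng.choice(len(best[1]), size=min(n_samp, len(best[1])),
                         replace=False)              # line 10: choice
        new_obs = best[1][idx]
        X = np.vstack([X, new_obs])                  # lines 11-12: oracle-labeled union
        y = np.concatenate([y, [oracle_policy(o) for o in new_obs]])
    return best[0]                                   # line 14
```

The key design point mirrored here is that the tree itself generates the trajectories, while the oracle supplies the labels, so the dataset grows exactly in the regions the tree actually visits.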

#### 3.5. Experimental Setup

## 4. Results

#### 4.1. Solving Environments

#### 4.2. Computational Complexity

#### 4.3. Decision Space Analysis

#### 4.4. Explainability for Oblique Decision Trees in RL

#### 4.4.1. Decision Trees: Compact and Readable Models

#### 4.4.2. Distance to Hyperplanes

- **Equilibrium nodes** (or attracting nodes): nodes with distance near zero for all observation points after a transient period. These nodes occur only in those environments where an equilibrium has to be reached.
- **Swing-up nodes**: nodes that do not stabilize at a distance close to zero. These nodes are responsible for swing-up, i.e., for bringing energy into a system. The trees for the environments MountainCar, MountainCarContinuous, and Acrobot consist only of swing-up nodes because, in these environments, the goal state to reach is not an equilibrium.
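The distance tracked in this analysis is the standard point-to-hyperplane distance. A minimal sketch, assuming an oblique node tests $\mathbf{w} \cdot \mathbf{x} \le b$ (the paper's exact sign and normalization conventions may differ):

```python
import numpy as np

def hyperplane_distance(x, w, b):
    """Signed Euclidean distance of observation x to the hyperplane
    w . x = b of an oblique split (standard formula)."""
    return (np.dot(w, x) - b) / np.linalg.norm(w)
```

Tracking this quantity along a trajectory is what distinguishes the two node types: for an equilibrium node the distance approaches zero after the transient, while for a swing-up node it keeps oscillating.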

#### 4.4.3. Sensitivity Analysis

## 5. Discussion

- To align our work in the context of XAI, we may use the guidance of the taxonomy proposed by Schwalbe and Finzel ([35], Section 5):
- The **problem definition**: The **task** consists of a classification or regression (for discrete or continuous actions, respectively) of tabular, numerical data (observations). We intend a **post hoc** explanation of DRL oracles by finding **decision trees** that are inherently transparent and simulatable, in that a human can follow the simple mathematical and hierarchical structure of a DT's decision-making process.
- Our algorithm as an **explanator** requires a model (the oracle and also the environment's dynamics) as **input**, while still being portable by not relying on a specific type of oracle. The **output** of our algorithm are shallow DTs as surrogate models mimicking the decision-making process of the DRL oracles. This leads to a drastic reduction of model parameters (see Table 4). Given their simple and transparent nature, DTs lend themselves to rule-based output, to further explanations by example or by counterfactuals, and to analysis of feature importance (as we showed in Section 4.4.3).
- As the fundamental **functionally grounded metric** of our algorithm, we compare the DT's average return in the RL environment with the oracle's performance. This measures the fidelity to the oracle in the regions relevant to successfully solving the environment, and the accuracy with respect to the original task of the environment. We do not evaluate **human-grounded** metrics. (Human-grounded metrics are defined by Schwalbe and Finzel [35] as metrics involving human judgment of the explainability of the constructed explanator, here the DT: how easy is it for humans to build a mental model of the DT, how accurately does the mental model approximate the explanator model, and so on.) Instead, we trust that the simple structure and mathematical expressions used in the DTs generated by our algorithms, as well as the massively reduced number of model parameters, lead to an interpretability and effectiveness of predictions that could not be achieved with DRL models. In Section 4.4, we have explained aspects connected to this in more detail.

- Can we understand why the combination ITER + OPCT is successful, while the combination EPS + OPCT is not? We have no direct proof, but from the experiments conducted, we can share these insights: Firstly, equilibrium problems such as CartPole or Pendulum exhibit a severe oversampling of some states (e.g., the upright pendulum state). If, as a consequence, the initial DT does not perform well, ITER makes it possible to add the correct oracle actions for problematic samples and to improve the tree in critical regions of the observation space. Secondly, for the Pendulum environment we observed that the initial DT was "bad" because it had learned only from near-perfect oracle episodes: it had not seen any samples outside the near-perfect episode region and hypothesized the wrong actions there (see Figure 3). Again, ITER helps; a "bad" DT is likely to visit these outside regions and then adds those samples, combined with the correct oracle actions, to its sample set. We conclude that algorithm ITER is successful because it strikes the right balance between being too exploratory, like algorithm BB (which might sample from constraint-violating regions or from regions unknown to the oracle), and not exploratory enough, like algorithm EPS (which might sample only from the narrow region of successful oracle trajectories). Algorithm ITER solves the sampling challenges mentioned in Section 1. It works well because of its iterative nature, which results in self-correction of the tree: at every iteration, the trajectories generated by the tree form a dataset of state–action pairs, the oracle corrects the wrong actions, and the tree improves its behavior based on the enlarged dataset.
- We investigated another possible variant of an iterative algorithm. In this variant, observations were sampled while the oracle was training (and probably also visited unfavorable regions of the observation space). After the oracle had completed its training, the samples were labeled with the action predictions of the fully trained oracle. These labeled samples were presented to DTs in a similar iterative algorithm as described above. However, this algorithm was not successful.
- A striking feature of our results is that DTs (predominantly those trained with ITER) can outperform the oracle they were trained from. This seems paradoxical at first glance, but the decision space analysis for MountainCar in Section 4.3 has shown the likely reason. In Figure 5, the decision spaces of the oracles are overly complex (similar to overfitted classification boundaries, although "overfitted" is not the proper term in the RL context). With their significantly reduced degrees of freedom, the DTs provide simpler, better-generalizing models. Since they do not follow the "mistakes" of the overly complex DRL models, they exhibit a better performance. Consistent with this picture, the MountainCarContinuous results in Figure 2 are best at $d=1$ (better than the oracle) and slowly degrade towards the oracle performance as $d$ increases; the DT obtains more degrees of freedom and mimics the (slightly) non-optimal oracle better and better. The fact that DTs can be better than the DRL model they were trained on is compatible with the reasoning of Rudin [20], who stated that transparent models are not only easier to explain but often also outperform black box models.
- We applied algorithm ITER to OPCTs. Can CARTs also benefit from ITER? The solved-depths shown in column ITER + CART of Table 2a add up to ${35}^{+}$, which is nearly as bad as EPS + CART (${38}^{+}$). Only the solved-depth for Pendulum improved somewhat from ${10}^{+}$ to 7. We conclude that the expressive power of OPCTs is important for being successful in creating shallow DTs with ITER.
- What are the limitations of our algorithms? In principle, the three algorithms EPS, BB, and ITER are widely applicable, since they are model-agnostic concerning the DRL and DT methods used and applicable to any RL environment. However, a limitation of the current work is that, so far, we have only tested environments with one-dimensional action spaces (discrete or continuous). There is no principled reason why the algorithms should not be applicable to multi-dimensional action spaces (OPCTs can be trained for multi-dimensional outputs as well), but the empirical demonstration of successful operation in such spaces remains a topic for future work. One obvious limitation of all three algorithms is that they need a working oracle. A less obvious difference concerns the kind of oracle access required: algorithm EPS only needs static, pre-collected samples from oracle episodes (in other applications, these might be "historical" samples), whereas algorithms BB and ITER need an oracle function that can be queried for the right action at arbitrary points in the observation space. Which points are needed is not known until the algorithms BB and ITER are run.
- A last point worth mentioning is the intrinsic trustworthiness of DTs. They partition the observation space into a finite (and often small) set of regions in which the decision is the same for all points. (This feature includes smooth extrapolation to yet-unknown regions.) DRL models, on the other hand, may have arbitrarily complex decision spaces: if such a model predicts action $a$ for point $x$, it is not known which points in the neighborhood of $x$ will receive the same action.

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Reproduction of PIRL via NDPS Results

`v0`), and MountainCar in ([13], Appendix B, Figures 8–10), as well as returns obtained by optimal policies found by two versions of their algorithm in rows Ndps-SMT and Ndps-BOPT of ([13], Appendix A, Table 6). We implemented the provided policies in Python and evaluated them on 100 episodes in each respective environment. (In the case of MountainCar, there must have been a typographic error, as the reported policy produced the lowest possible return of $\overline{R}=-200\pm 0$. Therefore, we switched the actions, leading to the results reported below.)

**Table A1.** Results of two versions of NDPS (SMT and BOPT) as published in [13], and the average return $\overline{R}$ and standard deviation of our implementations of the reported policies in 100 episodes.

| Environment | Verma et al. [13]: ${\overline{R}}_{SMT},\ {\overline{R}}_{BOPT}$ | Our reproduction: $\overline{R} \pm \sigma$ | ${\overline{R}}_{solved}$ |
|---|---|---|---|
| Acrobot-v1 | $-84.16, -127.21$ | $-93.29 \pm 25.19$ | $-100$ |
| CartPole-v0 | $183.15, 143.21$ | $46.66 \pm 15.34$ | $195$ |
| MountainCar-v0 | $-108.06, -143.86$ | $-163.02 \pm 3.48$ | $-110$ |

## Appendix B. Effect of Observation Space Normalization on the Distances to Hyperplane

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| BB | Bounding Box algorithm |
| CART | Classification and Regression Trees |
| DRL | Deep Reinforcement Learning |
| DQN | Deep Q-Network |
| DT | Decision Tree |
| EPS | Episode Samples algorithm |
| ITER | Iterative Training of Explainable RL models |
| OPCT | Oblique Predictive Clustering Tree |
| PPO | Proximal Policy Optimization |
| RL | Reinforcement Learning |
| SB3 | Stable-Baselines3 |
| TD3 | Twin Delayed Deep Deterministic Policy Gradient |

## References

1. Engelhardt, R.C.; Lange, M.; Wiskott, L.; Konen, W. Sample-Based Rule Extraction for Explainable Reinforcement Learning. In Proceedings of Machine Learning, Optimization, and Data Science, Certosa di Pontignano, Italy, 18–22 September 2022; Springer: Cham, Switzerland, 2023; Volume 13810, pp. 330–345.
2. Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access **2018**, 6, 52138–52160.
3. Molnar, C.; Casalicchio, G.; Bischl, B. Interpretable Machine Learning—A Brief History, State-of-the-Art and Challenges. In Proceedings of the ECML PKDD 2020 Workshops, Ghent, Belgium, 14–18 September 2020; Springer: Cham, Switzerland, 2020; pp. 417–431.
4. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2022. Available online: https://christophm.github.io/interpretable-ml-book (accessed on 25 May 2023).
5. Puiutta, E.; Veith, E.M.S.P. Explainable Reinforcement Learning: A Survey. In Proceedings of Machine Learning and Knowledge Extraction, Dublin, Ireland, 25–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 77–95.
6. Heuillet, A.; Couthouis, F.; Díaz-Rodríguez, N. Explainability in deep reinforcement learning. Knowl.-Based Syst. **2021**, 214, 106685.
7. Milani, S.; Topin, N.; Veloso, M.; Fang, F. A survey of explainable reinforcement learning. arXiv **2022**, arXiv:2202.08434.
8. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. **2020**, 2, 56–67.
9. Liu, G.; Schulte, O.; Zhu, W.; Li, Q. Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees. In Proceedings of Machine Learning and Knowledge Discovery in Databases, Dublin, Ireland, 10–14 September 2018; Springer: Cham, Switzerland, 2019; Volume 11052, pp. 414–429.
10. Mania, H.; Guy, A.; Recht, B. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31.
11. Coppens, Y.; Efthymiadis, K.; Lenaerts, T.; Nowé, A.; Miller, T.; Weber, R.; Magazzeni, D. Distilling deep reinforcement learning policies in soft decision trees. In Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 1–6.
12. Frosst, N.; Hinton, G.E. Distilling a Neural Network Into a Soft Decision Tree. In Proceedings of the First International Workshop on Comprehensibility and Explanation in AI and ML, Bari, Italy, 16–17 November 2017.
13. Verma, A.; Murali, V.; Singh, R.; Kohli, P.; Chaudhuri, S. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Volume 80, pp. 5045–5054.
14. Qiu, W.; Zhu, H. Programmatic Reinforcement Learning without Oracles. In Proceedings of the Tenth International Conference on Learning Representations, ICLR, Virtual, 25–29 April 2022.
15. Ross, S.; Gordon, G.; Bagnell, D. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; PMLR: Volume 15, pp. 627–635.
16. Bastani, O.; Pu, Y.; Solar-Lezama, A. Verifiable Reinforcement Learning via Policy Extraction. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31.
17. Zilke, J.R.; Loza Mencía, E.; Janssen, F. DeepRED—Rule Extraction from Deep Neural Networks. In Proceedings of Discovery Science, Bari, Italy, 19–21 October 2016; Springer: Cham, Switzerland, 2016; Volume 9956, pp. 457–473.
18. Schapire, R.E. The strength of weak learnability. Mach. Learn. **1990**, 5, 197–227.
19. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. **1997**, 55, 119–139.
20. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. **2019**, 1, 206–215.
21. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv **2016**, arXiv:1606.01540.
22. Lovatto, A.G. CartPole Swingup—A Simple, Continuous-Control Environment for OpenAI Gym. 2021. Available online: https://github.com/0xangelo/gym-cartpole-swingup (accessed on 25 May 2023).
23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv **2017**, arXiv:1707.06347.
24. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv **2013**, arXiv:1312.5602.
25. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Volume 80, pp. 1587–1596.
26. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. **2021**, 22, 12348–12355.
27. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification And Regression Trees; Routledge: New York, NY, USA, 1984.
28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
29. Stepišnik, T.; Kocev, D. Oblique predictive clustering trees. Knowl.-Based Syst. **2021**, 227, 107228.
30. Alipov, V.; Simmons-Edler, R.; Putintsev, N.; Kalinin, P.; Vetrov, D. Towards practical credit assignment for deep reinforcement learning. arXiv **2021**, arXiv:2106.04499.
31. Woergoetter, F.; Porr, B. Reinforcement learning. Scholarpedia **2008**, 3, 1448.
32. Roth, A.E. (Ed.) The Shapley Value: Essays in Honor of Lloyd S. Shapley; Cambridge University Press: Cambridge, UK, 1988.
33. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE **2015**, 10, e0130140.
34. Montavon, G.; Binder, A.; Lapuschkin, S.; Samek, W.; Müller, K.R. Layer-Wise Relevance Propagation: An Overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer: Cham, Switzerland, 2019; pp. 193–209.
35. Schwalbe, G.; Finzel, B. A comprehensive taxonomy for explainable artificial intelligence: A systematic survey of surveys on methods and concepts. Data Min. Knowl. Discov. **2023**, 1–59.

**Figure 2.** Return $\overline{R}$ as a function of DT depth $d$ for all environments and all presented algorithms. The solved-threshold is shown as a dash-dotted green line, the average oracle performance as a dashed orange line, and the DTs' performances as solid lines, with mean $\pm 1\sigma$ over ten repetitions as shaded area. Note: (i) for Acrobot we show two plots to include the BB curve; (ii) good performance in CartPole leads to overlapping curves: oracle, BB, and ITER are all constantly at $\overline{R}=500$.

**Figure 3.** Evolution of the training dataset with iterations for an OPCT of depth 5 for Pendulum (observables mapped back to 2D). The plots show, from left to right, the observations added to the dataset in each iteration (by the oracle in iteration 0 and by the tree in iterations 1–3).

**Figure 4.** Evolution of OPCT performance with iterations. Shown are mean $\pm 1\sigma$ of 10 repetitions. Generally, the return increases notably in the first few iterations. For most environments, a significant improvement after 10 iterations is not observed. (The OPCTs' depths are chosen according to Table 2a.)

**Figure 5.** Decision surfaces for MountainCar (MC, discrete actions) and MountainCarContinuous (MCC, continuous actions). Column 1 shows the DRL agents. The OPCTs in column 2 (depth 1) and column 3 (depth 2) were trained with algorithm ITER. The number in brackets in each subplot title is the average return $\overline{R}$. The black dots represent trajectories of 100 evaluation episodes with the same starting conditions across the different agents. They indicate the regions actually visited in the observation space.

**Figure 6.** OPCT of depth 1 obtained by ITER, solving MountainCar with a return of $\overline{R}=-107.47$ in 100 evaluation episodes.

**Figure 7.** Distance to hyperplanes for LunarLander. The diagrams show root node, left child, and right child from top to bottom. Small points: the node is "inactive" for this observation (the observation is not routed through this node). Big points: the node is "active".

**Figure 8.** Distance plots for Pendulum. Three of the diagrams show the distance to the hyperplanes of root node 0, left child node 1, and right child node 2. The lower left plot shows the distance to the equilibrium point $(1,0,0)$ in observation space. The vertical line at $t=64$ marks the step where the distance for node 0 approaches zero.

**Figure 10.** OPCT sensitivity analysis for CartPole: (**a**) including all weights, (**b**) with velocity weight ${w}_{v}$ set to zero. Parameter `angle` is the most important, `velocity` the least important.

**Figure 11.** OPCT sensitivity analysis for LunarLander: (**a**) including all weights, (**b**) with the weights for the legs set to zero. ${v}_{y}$ is most important; the ground contacts of the legs are least important.

**Table 1.** Overview of the environments used. A task is considered solved when the average return in 100 evaluation episodes is $\ge {R}_{solved}$. Environments with an undefined official threshold ${R}_{solved}$ have been assigned a reasonable one; this number is then reported in parentheses.

| Environment | Observation Space ($\mathcal{O}$) | Action Space ($\mathcal{A}$) | ${R}_{solved}$ |
|---|---|---|---|
| Acrobot-v1 | First angle $\cos(\theta_1), \sin(\theta_1) \in [-1,1]$; second angle $\cos(\theta_2), \sin(\theta_2) \in [-1,1]$; angular velocities $\dot{\theta}_1 = \omega_1 \in [-4\pi, 4\pi]$, $\dot{\theta}_2 = \omega_2 \in [-9\pi, 9\pi]$ | Apply torque $a \in \{-1\ (0),\ 0\ (1),\ 1\ (2)\}$ | $-100$ |
| CartPole-v1 | Cart position $x \in [-4.8, 4.8]$; cart velocity $v \in (-\infty, \infty)$; pole angle $\theta \in [-0.42, 0.42]$; pole velocity $\omega \in (-\infty, \infty)$ | Accelerate $a \in \{\text{left}\ (0),\ \text{right}\ (1)\}$ | 475 |
| CartPole-SwingUp-v1 | Cart position $x \in [-2.4, 2.4]$; cart velocity $v \in (-\infty, \infty)$; pole angle $\cos(\theta), \sin(\theta) \in [-1,1]$; pole angular velocity $\omega \in (-\infty, \infty)$ | Accelerate $a \in \{\text{left}\ (0),\ \text{not}\ (1),\ \text{right}\ (2)\}$ | undef. (840) |
| LunarLander-v2 | Position $x, y \in [-1.5, 1.5]$; velocities $v_x, v_y \in [-5, 5]$; angle $\theta \in [-\pi, \pi]$; angular velocity $\dot{\theta} = \omega \in [-5, 5]$; contact left leg $l_l \in \{0,1\}$; contact right leg $l_r \in \{0,1\}$ | Fire $a \in \{\text{not}\ (0),\ \text{left engine}\ (1),\ \text{main engine}\ (2),\ \text{right engine}\ (3)\}$ | 200 |
| MountainCar-v0 | Position $x \in [-1.2, 0.6]$; velocity $v \in [-0.07, 0.07]$ | Accelerate $a \in \{\text{left}\ (0),\ \text{not}\ (1),\ \text{right}\ (2)\}$ | $-110$ |
| MountainCarContinuous-v0 | Same as MountainCar-v0 | Accelerate $a \in [-1, 1]$ | 90 |
| Pendulum-v1 | Angle $\cos(\theta), \sin(\theta) \in [-1,1]$; angular velocity $\omega \in [-8, 8]$ | Apply torque $a \in [-2, 2]$ | undef. ($-170$) |

**Table 2.** The numbers show the minimum depth of the respective algorithm for (**a**) solving an environment and (**b**) surpassing the oracle. ${(\dots)}^{+}$ means no DT of the investigated depths could reach the threshold. See the main text for the explanation of (1) or (3). Bold numbers mark the best result(s) in each row.

**(a) Solving an environment**

| Environment | Model | EPS CART | EPS OPCT | BB OPCT | ITER CART | ITER OPCT |
|---|---|---|---|---|---|---|
| Acrobot-v1 | DQN | 1 | 1 | ${10}^{+}$ | 1 | 1 |
| CartPole-v1 | PPO | 3 | 3 | 1 | 3 | 1 |
| CartPole-SwingUp-v1 | DQN | ${10}^{+}$ | ${10}^{+}$ | ${10}^{+}$ | ${10}^{+}$ | 7 |
| LunarLander-v2 | PPO | ${10}^{+}$ | ${10}^{+}$ | 2 | ${10}^{+}$ | 2 |
| MountainCar-v0 | DQN | 3 | 3 | 5 | 3 | 1 |
| MountainCarContinuous-v0 | TD3 | 1 | 1 | (1) | 1 | 1 |
| Pendulum-v1 | TD3 | ${10}^{+}$ | 9 | 6 | 7 | 5 |
| Sum | | ${38}^{+}$ | ${37}^{+}$ | ${35}^{+}$ | ${35}^{+}$ | 18 |

**(b) Surpassing the oracle**

| Environment | Model | EPS CART | EPS OPCT | BB OPCT | ITER CART | ITER OPCT |
|---|---|---|---|---|---|---|
| Acrobot-v1 | DQN | 3 | 3 | ${10}^{+}$ | 2 | 1 |
| CartPole-v1 | PPO | 5 | 6 | 1 | (3) | 1 |
| CartPole-SwingUp-v1 | DQN | ${10}^{+}$ | ${10}^{+}$ | ${10}^{+}$ | ${10}^{+}$ | ${10}^{+}$ |
| LunarLander-v2 | PPO | ${10}^{+}$ | ${10}^{+}$ | 2 | ${10}^{+}$ | 3 |
| MountainCar-v0 | DQN | 3 | 5 | 6 | 3 | 4 |
| MountainCarContinuous-v0 | TD3 | ${10}^{+}$ | 1 | 4 | 2 | 1 |
| Pendulum-v1 | TD3 | ${10}^{+}$ | 9 | 8 | 10 | 6 |
| Sum | | ${51}^{+}$ | ${44}^{+}$ | ${41}^{+}$ | ${40}^{+}$ | ${26}^{+}$ |

**Table 3.** Computation times of our algorithms for some environments (we selected those with different observation space dimensions $D$). Shown are the averaged results from 10 runs. ${t}_{run}$: average time elapsed for one run covering all depths $d=1,\dots,10$. ${t}_{OPCT}$: average time elapsed to train and evaluate 10 OPCTs. ${n}_{s}$: total number of samples used. Note how the bad performance of BB in the Acrobot environment leads to longer episodes and therefore longer evaluation times ${t}_{OPCT}$.

| Environment | dim $D$ | Algorithm | ${t}_{run}$ [s] | ${t}_{OPCT}$ [s] | ${n}_{s}$ |
|---|---|---|---|---|---|
| Acrobot-v1 | 6 | EPS | 124.82 ± 1.96 | 11.71 ± 0.95 | 30,000 |
| | | BB | 477.70 ± 7.89 | 47.56 ± 2.92 | 30,000 |
| | | ITER | 1189.14 ± 28.70 | 10.76 ± 1.02 | 30,000 |
| CartPole-v1 | 4 | EPS | 187.39 ± 3.64 | 18.06 ± 2.29 | 30,000 |
| | | BB | 202.65 ± 1.26 | 19.11 ± 1.13 | 30,000 |
| | | ITER | 2040.16 ± 21.46 | 18.50 ± 1.64 | 30,000 |
| MountainCar-v0 | 2 | EPS | 85.39 ± 1.50 | 7.93 ± 0.57 | 30,000 |
| | | BB | 104.92 ± 2.05 | 10.27 ± 0.54 | 30,000 |
| | | ITER | 849.90 ± 19.79 | 7.68 ± 0.61 | 30,000 |

**Table 4.** Number of parameters of the DRL oracles compared with the ITER-trained OPCTs of depth $d$.

| Environment | dim $D$ | Model | Oracle | ITER | Depth $d$ |
|---|---|---|---|---|---|
| Acrobot-v1 | 6 | DQN | 136,710 | 9 | 1 |
| CartPole-v1 | 4 | PPO | 9155 | 7 | 1 |
| CartPole-SwingUp-v1 | 5 | DQN | 534,534 | 890 | 7 |
| LunarLander-v2 | 8 | PPO | 9797 | 31 | 2 |
| MountainCar-v0 | 2 | DQN | 134,656 | 5 | 1 |
| MountainCarContinuous-v0 | 2 | TD3 | 732,406 | 5 | 1 |
| Pendulum-v1 | 3 | TD3 | 734,806 | 156 | 5 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Engelhardt, R.C.; Oedingen, M.; Lange, M.; Wiskott, L.; Konen, W.
Iterative Oblique Decision Trees Deliver Explainable RL Models. *Algorithms* **2023**, *16*, 282.
https://doi.org/10.3390/a16060282
