# Improved Exploration in Reinforcement Learning Environments with Low-Discrepancy Action Selection


## Abstract


## 1. Introduction

#### Literature Review

## 2. Markov Decision Process Preliminaries

## 3. Low-Discrepancy Sequences

#### 3.1. Classic Discrepancy Measure
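As a concrete reference point for the classic measure, the star discrepancy of a one-dimensional point set has a well-known closed form over the sorted points (Niederreiter's formula). The sketch below computes it for the van der Corput sequence; treating star discrepancy as the "classic" measure of this section is an assumption of the sketch, not a reproduction of the paper's text.

```python
def van_der_corput(n, base=2):
    """First n terms of the van der Corput sequence in the given base."""
    seq = []
    for k in range(1, n + 1):
        q, denom, x = k, 1.0, 0.0
        while q > 0:
            q, r = divmod(q, base)   # peel off the next digit of k
            denom *= base
            x += r / denom           # mirror the digit about the radix point
        seq.append(x)
    return seq


def star_discrepancy_1d(points):
    """Closed-form star discrepancy D*_N of a point set in [0, 1)."""
    xs = sorted(points)
    n = len(xs)
    # Niederreiter's formula: D*_N = 1/(2N) + max_i |x_(i) - (2i - 1)/(2N)|
    return 1.0 / (2 * n) + max(
        abs(x - (2 * i - 1) / (2 * n)) for i, x in enumerate(xs, start=1)
    )
```

A quick check of the formula: the perfectly centered set {0.25, 0.75} attains the optimal value 1/(2N) = 0.25, while the van der Corput sequence's discrepancy shrinks toward zero as more terms are taken — the behavior a low-discrepancy sequence is designed to have.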

#### 3.2. Coefficient of Variation Measure
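The general idea behind a coefficient-of-variation measure is that evenly spread points produce nearly equal counts across cells of a partition, so the coefficient of variation (standard deviation over mean) of the per-cell counts is near zero, while clustered points inflate it. The bin-count construction below is an illustrative assumption, not necessarily the authors' exact definition.

```python
import math


def coefficient_of_variation(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return sd / mu


def cell_counts(points, bins):
    """Count how many points of [0, 1) fall in each of `bins` equal cells."""
    counts = [0] * bins
    for x in points:
        counts[min(int(x * bins), bins - 1)] += 1
    return counts


# Evenly spread points give identical cell counts, hence CV = 0;
# clustered points give unequal counts and a larger CV.
even = [(2 * i + 1) / 20 for i in range(10)]   # 10 evenly spaced points
clustered = [0.05 * i for i in range(10)]      # all points in [0, 0.45]
```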

## 4. Low-Discrepancy Action Selection

- The new state–action pair is as dissimilar from previous ones as possible;
- Boundaries of the space have small but non-zero selection probabilities so that extreme actions can be tested without being overrepresented;
- Action selection is reasonably computationally efficient.

**Algorithm 1.** The LDAS algorithm.
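Algorithm 1 itself is not reproduced in this excerpt. As a minimal sketch of how the three design goals above could be met, the code below scales successive points of a multidimensional Halton sequence (the low-discrepancy sequence discussed in Section 3) into a box-constrained action space: successive points are maximally spread from earlier ones, boundary actions occur but are not overrepresented, and each step costs only a few digit operations. The function names and the direct step-to-index mapping are illustrative assumptions, not the authors' exact algorithm.

```python
def halton(index, base):
    """index-th term (1-based) of the one-dimensional Halton sequence."""
    f, x = 1.0, 0.0
    while index > 0:
        f /= base
        x += f * (index % base)  # radical-inverse digit expansion
        index //= base
    return x


PRIMES = [2, 3, 5, 7, 11, 13]  # one coprime base per action dimension


def ldas_action(step, low, high):
    """Scale the step-th multidimensional Halton point into [low, high]."""
    return [
        lo + halton(step, PRIMES[d]) * (hi - lo)
        for d, (lo, hi) in enumerate(zip(low, high))
    ]
```

For example, with the one-dimensional action space [-1, 1] (as in MountainCarContinuous-v0), the first selected action is `ldas_action(1, [-1.0], [1.0])`, i.e., 0.0, since the first Halton term in base 2 is 0.5.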

## 5. Experimental Results

#### 5.1. MountainCarContinuous-v0

#### 5.2. LunarLanderContinuous-v2

#### 5.3. CarRacing-v0

#### 5.4. Results

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| MDP | Markov Decision Process |
| RL | Reinforcement Learning |
| LDAS | Low-Discrepancy Action Selection |
| PR | Pseudo-Random |
| OU | Ornstein–Uhlenbeck |
| RGB | Red, Green, Blue |

## References


**Figure 1.** A visual comparison of 300 points from a Halton sequence and a pseudo-randomly generated sequence.

**Figure 2.** A comparison of discrepancy measures for a Halton sequence and a pseudo-randomly generated sequence.

**Figure 4.** A comparison of discrepancy for pseudo-random and LDAS sequences for varying numbers of state and action dimensions.

**Figure 5.** A visual representation of the MountainCarContinuous-v0 environment. Using the position and velocity of the car as inputs, the agent must decide how to apply force with the goal of reaching the flag at the top of the hill.

**Figure 6.** A visual representation of the LunarLanderContinuous-v2 environment. The agent uses information on the position and velocity of the lander (seen on the right of the image) and fires thrusters with the goal of landing safely between the flags.

**Figure 7.** A visual representation of the CarRacing-v0 environment. The agent controls the throttle, brake, and steering of the car (seen in the bottom-center of the image) with the goal of covering as much of the track as quickly as possible.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Carden, S.W.; Lindborg, J.O.; Utic, Z.
Improved Exploration in Reinforcement Learning Environments with Low-Discrepancy Action Selection. *AppliedMath* **2022**, *2*, 234-246.
https://doi.org/10.3390/appliedmath2020014
