Improved Exploration in Reinforcement Learning Environments with Low-Discrepancy Action Selection

Abstract: Reinforcement learning (RL) is a subdomain of machine learning concerned with achieving optimal behavior by interacting with an unknown and potentially stochastic environment. The exploration strategy for choosing actions is an important component for enabling the decision agent to discover how to obtain high rewards. If constructed well, it may reduce the learning time of the decision agent. Exploration in discrete problems has been well studied, but there are fewer strategies applicable to continuous dynamics. In this paper, we propose a Low-Discrepancy Action Selection (LDAS) process, a novel exploration strategy for environments with continuous states and actions. This algorithm focuses on prioritizing unknown regions of the state-action space with the intention of finding ideal actions faster than pseudo-random action selection. Results of experimentation with three benchmark environments elucidate the situations in which LDAS is superior and introduce a metric for quantifying the quality of exploration.


Introduction
In most statistics and machine learning applications, a greater value is placed on new, unseen information. There is a strong intuition that obtaining new data will be more valuable than data that has already been observed. In discrete settings, without-replacement sampling prioritizes the selection of unobserved individuals, and is superior in many aspects to with-replacement sampling. In machine learning, applications of without-replacement sampling include choosing ensemble classifiers [1], constructing low-rank approximations of large matrices [2], kernel embeddings [3], computational learning theory [4], and as a general purpose tool in evaluating data analysis applications [5]. For example, a strategy called random reshuffling has been shown to improve convergence of stochastic gradient descent [6].
Reinforcement learning is a subdomain of machine learning concerned with learning optimal behavior in a potentially stochastic environment from data obtained by interacting with that environment. A key element of the reinforcement learning problem is exploration: selecting actions that are not currently considered optimal for the given state in order to learn more about the environment in the hope of finding superior actions. Exploration strategies make up a large portion of the reinforcement learning literature, and new approaches continue to be introduced. A common component of many is uniform selection of actions using pseudo-random numbers that are independent of previous actions. This selection method may choose an action similar to previously selected actions, when intuition suggests something new and as dissimilar as possible is desirable. Some strategies, such as optimistic value initialization [7], try to avoid this and force selection of previously unseen actions, but many strategies do not.
When the action space is discrete, variants of without-replacement sampling can be applied in a straightforward way [8]. Surprisingly, recent research has shown that, when using a simple metric such as expected time to reach a goal state, it is possible to construct environments for which without-replacement sampling is inferior [9]. However, these environments are somewhat pathological and unlikely to be encountered by chance, so without-replacement sampling continues to be a promising avenue for improving reinforcement learning.
When state and action spaces are continuous, without-replacement sampling does not have the same meaning, but it is still heuristically desirable to try actions that are as different from previously seen actions as possible. With this in mind, a reinforcement learning agent that prioritizes obtaining information that is dissimilar to previously obtained information during the exploration stage may produce superior results during the learning stage.
In this paper, we introduce a new method for environments in which both states and actions are continuous: Low-Discrepancy Action Selection (LDAS). LDAS takes account of previously visited state-action pairs and tries to select an action that is as dissimilar from previous state-actions as possible. We also introduce a metric for comparing the discrepancy of exploration strategies relative to each other, and perform experiments in three reinforcement learning environments that compare three action selection methods.
Section two introduces the mathematical framework that underlies reinforcement learning. Section three defines the concept of discrepancy and introduces metrics for comparison. Section four combines the two previous concepts and defines an algorithm for selecting actions in a low-discrepancy manner. Section five presents experimental results, and section six concludes the paper.

Literature Review
A review of four action-selection strategies for continuous dynamics [10] includes one method based on discretization of the action space and three optimization techniques based on gradient descent [11,12] and Newton's method [13]. By becoming progressively finer-grained in selective regions of the action space, discretization strategies can approximate continuous actions efficiently. For example, Sequential Monte Carlo (SMC) methods [14] sample actions from a prior distribution and resample according to an estimated utility, thus promoting a thorough exploration of the most favorable action-space regions and permitting SMC-learning to locate real continuous actions. The method is model-free and may learn to follow stochastic policies.
The previous methods are based on action-value functions and value iteration. The other paradigm in reinforcement learning, policy iteration, is also represented in literature for continuous processes. Continuous-action Approximate Policy Iteration (CAPI) [15] can search the set of policies (functions from the state to action space) directly. Two adaptive models are suggested to approximate the value function, and simulations demonstrate that the model converges to a near-optimal policy in a few iterations.
Recent papers approach continuous dynamics in innovative ways, such as sampling based on the likelihood ratio between the action probabilities of a target policy and the data-producing behavior policy [16]. The behavior policy action probabilities are replaced with their maximum likelihood estimators. These estimators reduce Monte Carlo sampling variance, result in faster learners for policy gradient algorithms, and are shown to be consistent.

Markov Decision Process Preliminaries
Reinforcement learning is formalized using the mathematical structure of a Markov Decision Process (MDP) [7]. A decision-making entity called an agent observes a system or environment in discrete time. We call each point in time a "step". At time t, the agent observes the environment to be in state S_t ∈ S, where S represents the set of all possible states. The agent makes a decision by choosing an action A_t ∈ A(S_t), where A(S_t) is the set of all allowable actions from state S_t and is a subset of the set of all possible actions A. Then, the system transitions to a new state S_{t+1} and the agent receives a numeric reward R_t ∈ R, where R ⊂ ℝ is the set of all possible rewards. Consider first the case that S, A, and R are discrete. Transition probabilities and reward distributions are governed by the dynamics function p : S × R × S × A → [0, 1], where p(s', r | s, a) is the probability of transitioning to state s' and receiving reward r when the environment is in state s and the agent chooses action a. The objective is to find a policy: a function π : S → A that maximizes some measure of long-term reward.
A fundamental issue is how the agent should select actions, and it is commonly known as the exploration-exploitation dilemma. Should the agent exploit its current knowledge of the system dynamics by selecting the action thought to be optimal in the current state, or should the agent explore new actions in hope of finding one better than what is already known? This paper will focus on exploration and how to select actions so that the state-action space is investigated evenly and completely.
If states are not discrete but continuous, the state space needs to be equipped with a sigma-algebra σ(S), and the resulting dynamics function needs to be measurable. Since the primary objective of this paper is efficient exploration of the environment, we limit the measure-theoretic treatment to this remark.

Low-Discrepancy Sequences
Intuitively, the discrepancy of a sequence is a measure of deviation from an ideal uniformly distributed sequence. High-discrepancy sequences exhibit clustering and large empty regions, while low-discrepancy sequences spread their points evenly without letting them fall too close together. Having a measure of discrepancy is essential for comparing action selection strategies: it helps one determine whether selection is sufficiently uniform (as would be compatible with the "without-replacement" heuristic) or whether some state-actions are repeated (as could happen under "with-replacement" sampling). A simple and well-known low-discrepancy sequence is the van der Corput sequence [17], constructed by reversing the base-b digits of the natural numbers across the radix point. The multidimensional extension, the Halton sequence [18], is obtained by letting each dimension be a van der Corput sequence with pairwise co-prime bases. Figure 1 displays 300 points from a Halton sequence and a pseudo-random uniform sequence side-by-side. Under pseudo-random generation, the observed density is uneven, with some regions containing many points and some containing few. In contrast, the Halton sequence is closer to an ideal uniform density.
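The constructions just described can be sketched in a few lines of Python (a minimal sketch; the function names are ours):

```python
def van_der_corput(n, base=2):
    """n-th term of the base-b van der Corput sequence: reflect the
    base-b digits of n across the radix point."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def halton(n, bases=(2, 3)):
    """n-th point of a Halton sequence: one van der Corput coordinate
    per dimension, using pairwise co-prime bases."""
    return [van_der_corput(n, b) for b in bases]
```

For example, the first base-2 terms are 1/2, 1/4, 3/4, 1/8, ..., progressively filling the unit interval.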

Figure 1. A Halton point set and a pseudo-random point set.

Unfortunately, these sequences are not readily applicable to action selection for reinforcement learning. They require that each dimension of a point can be chosen freely. However, in reinforcement learning, only the action dimensions are freely choosable, while the state dimensions are a consequence (either stochastic or deterministic) of the system dynamics. Before discussing the development of a method for choosing actions in a low-discrepancy manner, it will be beneficial to consider how to measure discrepancy.

Classic Discrepancy Measure
A classic measure of discrepancy is based on the difference between the proportion of a sequence falling in an interval and the length of that interval. It is particularly useful for theoretical analysis of sequences [19]. We first give the definition for one dimension. Let {x_n}_{n=1}^{N} be a finite sequence taking values in the interval [0, 1] and E ⊆ [0, 1]. Let A(E; N) count the number of points of {x_n}_{n=1}^{N} contained in E. Then, the discrepancy of the sequence is

    D_N = sup_{0 ≤ a < b ≤ 1} | A([a, b); N) / N − (b − a) |.

For multiple dimensions, the definition is generalized as follows. Let I^k = [0, 1]^k represent the k-dimensional unit hypercube, and let J run through all sub-intervals of I^k of the form J = ∏_{i=1}^{k} [a_i, b_i). Then

    D_N = sup_J | A(J; N) / N − λ_k(J) |,    (1)

where λ_k(J) denotes the k-dimensional volume of J. The intuition behind this concept is that discrepancy is high when there are large hyper-rectangles that contain a small proportion of the total points, or small hyper-rectangles with a high proportion of points.
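For intuition, the one-dimensional quantity can be approximated in code. The sketch below computes the star variant, which restricts the intervals to the form [0, x); this restriction is our choice, made because it admits a simple closed form over sorted points:

```python
def star_discrepancy_1d(xs):
    """Star discrepancy of points in [0, 1]: the largest difference between
    the empirical proportion of points in [0, x) and the interval length x.
    For sorted points the supremum occurs adjacent to a sample point."""
    xs = sorted(xs)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))
```

A single point at 0.5 gives discrepancy 0.5, while the evenly spread pair {0.25, 0.75} gives 0.25, matching the intuition that even spacing lowers discrepancy.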

Coefficient of Variation Measure
An alternative method for calculating discrepancy is based on distance from each point to its nearest neighbor. On average, this distance should be as large as possible, and it should vary as little as possible. This ensures the density in a region is somewhat constant and points are spread evenly. Then discrepancy can be measured by the ratio of standard deviation to the mean, which is the well-known coefficient of variation (CV) applied to minimum distances.
One property of the coefficient of variation is scale invariance. For example, suppose a new sequence is created by multiplying every point in an existing sequence by one-half. In that case, the two sequences will have the same coefficient of variation, even though the second one does not fill the space as fully. This property is undesirable for measuring discrepancy. For this reason, we make a slight modification and square the mean, defining CV_N = s_d / d̄², where d̄ and s_d are the mean and standard deviation of the nearest-neighbor distances. Under this change, if two sequences have identical distributions except for scale, the one with greater distance between the points will have lower discrepancy and be preferred. To keep terminology simple, we still refer to this as the CV measure of discrepancy.
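As a concrete sketch, the modified measure can be computed from pairwise distances (the function name is ours; this is a straightforward reading of the definition above, not the paper's exact code):

```python
import numpy as np

def cv_discrepancy(points):
    """Modified CV discrepancy: standard deviation of nearest-neighbor
    distances divided by the SQUARED mean, so a shrunken copy of a
    sequence scores worse (higher) than the original."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]   # pairwise differences
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)             # ignore self-distances
    nn = dist.min(axis=1)                      # nearest-neighbor distances
    return nn.std() / nn.mean() ** 2
```

Halving every coordinate halves all nearest-neighbor distances, so the ratio doubles exactly, which is the intended break from scale invariance.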
For the experiments forthcoming in this paper, we use the CV measure. We found it easier to calculate and more closely related to our intuition of discrepancy for sequences when visualized. Figure 2 shows that the difference between low- and high-discrepancy sequences under the CV measure grows rather than shrinks as the number of points increases. To create this figure, ten Halton and ten pseudo-random two-dimensional sequences of length 500 were generated. The discrepancies D_N and CV_N of the first N points were calculated for each sequence from N = 5 to N = 500. The average discrepancy of the ten sequences was then calculated and plotted.

Low-Discrepancy Action Selection
This section combines the concepts of exploration in an MDP and low-discrepancy sequences, with the goal of exploring so that discrepancy in the state-action space is minimized. This is not as simple as directly using a Halton or similar sequence to choose actions, because such sequences do not account for the state dimensions, which are not free for the agent to choose. Instead, a new method is needed which can observe the history of state-actions visited and the present state, and use this information to select an action such that:
• The new state-action pair is as dissimilar from previous ones as possible;
• Boundaries of the space have small but non-zero selection probabilities, so that extreme actions can be tested without being overrepresented;
• Action selection is reasonably computationally efficient.
The basic idea behind our approach is to place a candidate action randomly, but move it so that some measure of distance from previously visited state-actions is maximized via numerical methods. Many variations and specific measures were tested, such as each previous point emitting a field modeled on electrical charges to repel the new point, or gradient descent to minimize a function defined as the cumulative sum of multivariate Gaussian densities centered at all previous points. These approaches produced unsatisfactory discrepancies both visually and numerically.
The method we eventually arrived at is simple: the agent finds the previous state-action that is closest to the present one and moves the action component directly away from it according to a slowly decaying learning rate. This continues until the action stabilizes or the boundary of the action space is reached. This procedure, termed LDAS (Low-Discrepancy Action Selection), is described in Algorithm 1. Notice that when determining the nearest point, both states and actions are used, but only the action component is updated.
This pseudocode represents the core of the algorithm. In practice, a few minor alterations were made to improve performance. For example, early testing revealed that too many points were placed at the boundaries of the action dimensions. We reasoned that new actions were being pushed outwards from points in the interior of the space, with no counterbalancing pressure from the exterior of the space. The issue was corrected by temporarily placing so-called "phantom points" just beyond the boundaries. The distance of phantom points from the edge is a decreasing function of the sample size. Another simplification was normalizing the action space to the unit hypercube before applying the algorithm, then transforming back to the original scale at the end. Finally, to limit memory and computational burden, we placed a cap on the number of state-actions stored in sa_hist. The default value is 500, and was used in all experiments in this paper. Exact details are available in the complete code in the GitHub repository at https://github.com/swcarden/LDAS (accessed on 10 May 2022).
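The normalization step mentioned above might look as follows (a sketch assuming box-shaped action bounds; the helper names are ours, not from the repository):

```python
import numpy as np

def to_unit(a, low, high):
    """Map an action from per-dimension bounds [low, high] into [0, 1]^k."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    return (np.asarray(a, float) - low) / (high - low)

def from_unit(u, low, high):
    """Map a unit-hypercube point back to the original action scale."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    return low + np.asarray(u, float) * (high - low)
```

Working in the unit hypercube lets the algorithm use one set of learning-rate and threshold parameters regardless of each environment's native action scale.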

Algorithm 1: The LDAS algorithm.
Input: a threshold ε > 0; an integer number of iterations n ≥ 1; a learning rate r > 0; a learning rate decay 0 < rd < 1; an initial state s_0; an integer window w ≥ 1 of recent state-actions against which to compare.
Output: a state-action history matrix sa_hist.

Initialize pseudo-random a_0;
Initialize sa_hist with (s_0, a_0);
for i = 1, ..., n do
    Get state s_i from interacting with the environment from (s_{i−1}, a_{i−1});
    Generate pseudo-random a_i;
    r_temp ← r;
    keep_going ← TRUE;
    while keep_going do
        Sweep the most recent w entries of sa_hist, finding the distance between (s_i, a_i) and each stored (s_j, a_j);
        Set a* to be the action from the closest state-action pair;
        Move a_i directly away from a* by step size r_temp, clipping to the boundary of the action space if necessary;
        r_temp ← r_temp · rd;
        if a_i moved less than ε then keep_going ← FALSE;
    end
    Append (s_i, a_i) to sa_hist;
end

Visual evaluation of LDAS in higher dimensions is difficult, so we turn to Equation (1). Data were generated according to the same conditions as in Figure 3, but allowing for an increased number of state and action dimensions, each running from one to three, with 100 points generated. After each timestep beginning with the fifth (as discrepancy is less meaningful for such short sequences), the discrepancy of the sequence was calculated and recorded. This entire process was repeated 100 times. Finally, the average discrepancy over the 100 runs was calculated at each timestep. The results are presented in Figure 4. Several things can be learned from this figure. First, LDAS produces lower-discrepancy results than PR for any number of dimensions. Second, as the number of dimensions increases, discrepancy tends to decrease regardless of selection method. This is sensible, as points are more spread out in larger spaces and are less likely to cluster. Third, the difference between LDAS and PR decreases as the dimensionality increases, for the same reason as in the previous observation. Finally, while discrepancy for PR depends only on the total number of dimensions and not on whether they are state or action dimensions, LDAS can attain lower discrepancy when more dimensions are assigned to actions. Observe that discrepancy under LDAS is lower for three action and one state dimension than for three state and one action dimension.
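The iterative core of Algorithm 1 might be sketched as follows, assuming a normalized one-dimensional action space and omitting the phantom points and history cap discussed earlier (names and default values here are illustrative, not the repository's exact choices):

```python
import numpy as np

def ldas_action(state, sa_hist, eps=1e-3, r=0.1, rd=0.95, w=500, rng=None):
    """One LDAS step: start from a pseudo-random candidate action and
    repeatedly push it away from the nearest stored state-action pair,
    with a decaying step size, until it stabilizes."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(0.0, 1.0, size=1)        # candidate action in [0, 1]
    hist = np.asarray(sa_hist[-w:], float)   # compare against recent history
    step = r
    while True:
        sa = np.concatenate([state, a])
        d = np.linalg.norm(hist - sa, axis=1)      # distances to stored pairs
        nearest = hist[np.argmin(d)]
        direction = a - nearest[len(state):]       # repel along action only
        norm = np.linalg.norm(direction)
        if norm == 0.0:                            # tie-break with a random push
            direction = rng.uniform(-1.0, 1.0, size=a.shape)
            norm = np.linalg.norm(direction)
        a_new = np.clip(a + step * direction / norm, 0.0, 1.0)
        moved = np.linalg.norm(a_new - a)
        a, step = a_new, step * rd                 # decay the learning rate
        if moved < eps:                            # stabilized (or stuck on an edge)
            return a
```

Because the step size decays geometrically, the loop always terminates: either the candidate settles in the interior or clipping pins it at a boundary.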

Experimental Results
The experiments in the previous section can be seen as a trivial MDP in which states are randomly selected independent of actions. In this section, experiments are run on three common benchmark RL environments that allow for continuous actions.
In addition to comparing LDAS against pseudo-random uniform action selection, we also tested the Ornstein-Uhlenbeck (OU) method, which is inspired by Brownian motion and in discrete time is equivalent to an autoregressive time series with drift [20]. It purposely correlates actions and so, in a sense, is the opposite of LDAS; it is beneficial in some motion-based environments because it introduces a sort of inertia. Letting µ be a drift parameter, σ > 0 a noise parameter, θ > 0 an inertia parameter, dt a time increment, and Z a standard normal random variable, the change in action a is governed by

    Δa = θ(µ − a) dt + σ √dt Z.

The default values in the Python library are µ = 0, σ = 0.3, and θ = 0.15, and these are what we used.
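In discrete time, this update can be sketched directly (a minimal illustration of the update rule, not Garage's AddOrnsteinUhlenbeckNoise implementation):

```python
import numpy as np

def ou_noise(n_steps, mu=0.0, sigma=0.3, theta=0.15, dt=1.0, a0=0.0, rng=None):
    """Simulate a one-dimensional Ornstein-Uhlenbeck action sequence:
    each step drifts toward mu and adds Gaussian noise scaled by sqrt(dt),
    producing temporally correlated actions."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.empty(n_steps)
    a[0] = a0
    for t in range(1, n_steps):
        da = theta * (mu - a[t - 1]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        a[t] = a[t - 1] + da
    return a
```

With σ = 0 the sequence decays deterministically toward µ at rate θ per step, which makes the "inertia" interpretation concrete.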
Experiments were run in Python version 3.8.4 [21]. For reinforcement learning environments, we used the Gym library [22], version 0.17.2. For exploration strategies, we used the Garage library [23], version 2021.3.0. Data analysis and visuals were produced in R 4.1.1 [24] with the aid of the tidyverse package [25]. Simulations and analysis were completed on our university's high-performance computing cluster running CentOS Linux release 7.9.2009. The Garage library contains the classes EpsilonGreedyPolicy and AddOrnsteinUhlenbeckNoise. For uniform pseudo-random action selection, we used EpsilonGreedyPolicy with epsilon equal to one. To implement LDAS, we modified the EpsilonGreedyPolicy class to perform LDAS on exploration and set epsilon equal to one. The modification can be found on our GitHub page.

MountainCarContinuous-v0
The first Gym environment tested is MountainCarContinuous-v0, or simply Mountain Car, one of the classic RL environments. A car at the bottom of a valley is trying to reach the top of a mountain, but it does not have enough power to drive directly up the mountain and must learn to build momentum by alternating directions. Observe Figure 5. The state space is two-dimensional, consisting of position and velocity. The simulation technically allows both components of the state space to be any real numbers, though realistically only values near the origin are feasible. The action space is the one-dimensional interval [−1, 1], representing force from the engine.

LunarLanderContinuous-v2
The LunarLanderContinuous-v2 environment simulates a spacecraft that needs to land on the surface of the moon between two goal posts. Observe Figure 6.

CarRacing-v0
The CarRacing-v0 or Race Car environment randomly generates a track for a car to race on from a top-down perspective. Observe Figure 7.
The state space is a pixel representation, and so it is much larger than the previous environments. An RGB value gives three components for each pixel in a 96 × 96 resolution, resulting in 96 × 96 × 3 = 27,648 dimensions.
The action space is three-dimensional and consists of the steering in [−1, 1], gas in [0, 1], and brake in [0, 1]. For steering, a negative value turns the car to the left, a positive value turns it to the right, and a value of 0 keeps it moving straight ahead. For gas and brake, the value is the proportion of maximum acceleration or braking applied.

Results
For each environment, each of the three action selection methods (PR, LDAS, OU) was tested for 100 episodes of 100 steps each. After each timestep beginning with the fifth (as discrepancy is less meaningful for such short sequences), the discrepancy of the sequence was calculated and recorded. This entire process was repeated 100 times. Finally, the average discrepancy over the 100 runs was calculated at each timestep. The results are presented in Figure 8. The results confirm that in the two lower-dimensional problems, LDAS exploration results in a more even coverage of the state-action space. However, in the Race Car environment, the discrepancy estimates are almost identical for all three methods, with OU at a slight advantage. Because the 27,648-dimensional state space is so large, state-action pairs have a low probability of being near each other until many steps have occurred. The purposefully correlated actions of OU can be advantageous for this problem when they result in the car consistently accelerating forward, whereas PR and LDAS may sporadically alternate between low and high acceleration and braking. If the experiment were run for a larger number of steps, a difference in the discrepancy estimates might appear. This and Figure 4 suggest, agreeing with intuition, that the advantages of LDAS weaken in high-dimensional spaces, where points will naturally be distant with high probability.
Lower discrepancy in exploration comes at a computational cost. Figure 9 displays the average algorithm run time per step for the three methods. As expected, LDAS incurs a higher computational cost for all environments, and increasingly so for the more complex environments.
Remember that these experiments only consider exploration and data collection, and do not yet take into account the effect on learning a policy from the data via a neural network or similar model. Time used in data generation is typically dwarfed by time spent learning a policy. Therefore, while LDAS will take longer to generate the same amount of data as PR or OU, the data should be of higher quality in the sense that they represent more of the state-action space, and the learning phase may require less data or converge to an optimal policy more quickly. We conjecture that for certain environments, LDAS exploration will be a net benefit, and increased computation time in the exploration phase will be offset by a greater decrease in computation time in the learning phase. Investigating this conjecture will be the target of future research projects.

Conclusions
This paper has presented LDAS, an algorithm for exploring MDPs with continuous states and actions. It combines the concepts of low-discrepancy sequence generation with action selection in RL. We performed experiments on three benchmark RL environments to demonstrate improved uniformity of state-action sampling. For two of the three environments, Mountain Car and Lunar Lander, LDAS had a significantly lower discrepancy than PR and OU. For the Race Car environment, all of the methods had roughly the same discrepancy, with OU at a slight advantage. The reason lies with the number of dimensions of the environments. In the lower-dimensional problems, there is a high probability that pseudo-random actions will result in clusters, so there is much room for improvement by LDAS. On the other hand, for the high-dimensional problem, state-actions will naturally be more sparse and spread out, so the benefit of LDAS is reduced. This suggests that LDAS is a more attractive exploration strategy for low-dimensional problems than for high-dimensional ones.
One limitation of LDAS is increased computation time. The average time per step increased roughly five-fold for low-dimensional problems, and roughly twenty-fold for high-dimensional problems. However, for the low-dimensional problems, the improved uniformity of exploration may be worth it. Future experiments will train an agent on data collected by LDAS, PR, and OU exploration methods to investigate the effect on quality of the resulting policy and computation time in the learning phase.
Several avenues of research remain open for the future. The foremost is discovering whether agents trained on low-discrepancy data learn to complete their task better, faster, or with fewer data requirements. This will be the ultimate test of whether LDAS should become a standard exploration method. If so, its computational efficiency merits additional investigation, to determine whether LDAS can be improved so that the increased computation time is less of a drawback.
At its core, LDAS is a gradient method for optimizing distance from the nearest point, but it is basic in the sense of deterministically and exhaustively calculating all distances, and using constant, prespecified parameters for learning rate and learning rate decay. The last decade has seen advances in adaptive gradient methods for stochastic objective functions, such as AdaGrad (adaptive gradients) [26] and Adam (adaptive moments) [27].
A more advanced version of LDAS could calculate distances from a sample of previous points and use an adaptive gradient method to quickly approximate the action optimal for exploration. With this change, LDAS could potentially be significantly faster at the cost of a small performance decrease.
This paper took the perspective of completely bifurcated exploration and learning phases, but many algorithms integrate the tasks, attempting to balance exploration and exploitation simultaneously. A worthy question is how to modify LDAS to assign higher selection probabilities to actions with a higher estimated value. The field of low-discrepancy exploration is still ripe for research.