Digital Commons@Georgia Southern

Abstract: In many statistical and machine learning applications, without-replacement sampling is considered superior to with-replacement sampling. In some cases, this has been proven, and in others the heuristic is so intuitively attractive that it is taken for granted. In reinforcement learning, many count-based exploration strategies are justified by reliance on the aforementioned heuristic. This paper will detail the non-intuitive discovery that when measuring the goodness of an exploration strategy by the stochastic shortest path to a goal state, there is a class of processes for which an action selection strategy based on without-replacement sampling of actions can be worse than with-replacement sampling. Specifically, the expected time until a specified goal state is first reached can be provably larger under without-replacement sampling. Numerical experiments describe the frequency and severity of this inferiority.


Introduction
The idea that learning is more efficient or estimates will be more precise when new, previously unseen information is processed has a strong intuitive appeal, and is often verified by theory. Consider the following elementary examples likely to be encountered by an undergraduate student. In sampling theory, consider estimating the mean of a collection of $N$ values with variance $\sigma^2$ by taking a sample of size $n$, $n < N$. The variance of the sample mean when sampling with-replacement is $\sigma^2/n$, whereas the variance when sampling without-replacement is $\frac{N-n}{N-1} \cdot \frac{\sigma^2}{n}$, which is strictly smaller for $n > 1$. For a probabilistic example, consider an urn with $N$ marbles, exactly one of which is white. When searching for the white marble by drawing from the urn with-replacement, the number of draws is a geometric random variable and has expected value $N$. When searching by drawing without-replacement, the number of draws is a discrete uniform random variable and has expected value $(N+1)/2$. In more complex situations, it is generally difficult to analyze the behavior of without-replacement sampling due to the dependence between chosen items, so the argument in favor of without-replacement sampling or its variants is often made with numerical evidence. Examples include choosing ensemble classifiers [1], constructing low-rank approximations of large matrices [2,3], kernel embeddings [4], computational learning theory [5], and as a general purpose tool in evaluating data analysis applications [6]. A well-studied case is stochastic incremental gradient descent, for which there is a growing body of theoretical analysis [7][8][9][10] in favor of random reshuffling, which makes multiple passes through the data set, sampling without-replacement at each epoch.
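The two elementary examples above can be checked with exact arithmetic. The following is a minimal sketch; the function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def fpc(N, n):
    """Finite population correction (N - n)/(N - 1) that multiplies sigma^2/n."""
    return Fraction(N - n, N - 1)

def expected_draws_with_replacement(N):
    """Mean of a geometric random variable with success probability 1/N."""
    return 1 / Fraction(1, N)

def expected_draws_without_replacement(N):
    """The white marble is equally likely to appear at each position 1..N."""
    return sum(Fraction(k, N) for k in range(1, N + 1))

# Urn with 10 marbles: 10 expected draws versus 11/2.
with_repl = expected_draws_with_replacement(10)
without_repl = expected_draws_without_replacement(10)
```

For a sample of size n = 1 the correction factor equals one, so the two sampling schemes give the same variance; for any larger sample the factor is below one.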
Consider specifically the subdomain of reinforcement learning, in which an agent must learn how to behave optimally in an unknown Markovian environment [11]. Formally, the problem is cast as a Markov Decision Process [12], a sequential decision model in which a system can reside in one of a given set of states, S. At discrete points in time called epochs, the agent observes the state and chooses from among a set of actions, A. The system then transitions to the next state according to transition probabilities that depend on the current state and action, and a numeric reward is received. In the most general case, the reward is a random variable with a distribution that depends on the current state and action. The agent collects data by observing the state, choosing an action, and observing the resulting transition and reward. This data is used to construct a function from the state space to the action space, known as a policy π : S → A, so that using action π(s) when in state s will optimize a given measure of reward.
In order to guarantee that the learned policy converges to the optimal policy, it is required that the number of observations goes to infinity [13]. In the case of discrete states and actions, each state-action pair must be observed an infinite number of times [14]. In the case of continuous states and actions, the distribution of observed states and actions must have a density that is positive everywhere [15]. A practical corollary is that for an algorithm to learn well from a finite amount of data, this data should contain observations that represent the state-action space as completely as possible. Therefore, a key component of a reinforcement learning algorithm is how it explores, that is, how actions are chosen so that the data has enough variety for meaningful learning.
Exploration in reinforcement learning has been an active research topic since the inception of the field, and remains one today. Some of the most common strategies include $\epsilon$-greedy [16], in which the agent chooses the best known action with probability $1 - \epsilon$ and a random action with probability $\epsilon$; softmax exploration [17], in which a distribution (usually the Boltzmann distribution) links the probability of selecting an action to its estimated value; "optimism in the face of uncertainty" [18], which assumes that actions with unknown or uncertain rewards are better than known actions, encouraging their use by a greedy agent; and statistical approaches such as maintaining a confidence interval for the value of each state-action pair, and choosing the action with the greatest upper confidence bound [19]. Another class of exploration strategies keeps a history of the actions used in each state. Upon visiting a state, recency-based exploration [20,21] incentivizes the use of actions that have not been recently used from that state, which is especially helpful for continued learning in changing environments. Similarly, count-based exploration [22][23][24] incentivizes actions that have been used less frequently.
Of particular interest to this paper is a count-based strategy used by Kearns and Singh [25] in their "Explicit Explore or Exploit" ($E^3$) algorithm. This algorithm is notable for achieving near-optimal performance in a time with a proven polynomial bound. The "Explore" part of the algorithm is a simple yet appealing strategy that the authors call balanced wandering. Under this strategy, a record is kept of how many times each action has been used for each state. When a state is visited, the action is chosen uniformly from among the actions that have been used in that state the least number of times. Note that this is equivalent to the random reshuffling used in without-replacement stochastic incremental gradient descent, but on a per-state basis. Upon subsequent visits to a particular state, actions will be sampled without-replacement until all actions have been tried, at which point sampling begins anew.
There are several reasons to believe that balanced wandering may provably be a uniform improvement over exploration using purely random, with-replacement action selection. First, balanced wandering is a relatively simple exploration strategy which has already been proven amenable to theoretical analysis [25]. Second, it can be viewed as an application of random reshuffling to reinforcement learning, which has been proven to be an improvement for stochastic gradient descent. Third, as argued in the opening paragraph, the intuition that without-replacement sampling yields more and better information than with-replacement sampling is undeniably strong.
The research presented in this paper is the result of an investigation initially intending to prove that balanced wandering is uniformly superior to with-replacement sampling of actions. However, we were surprised to discover that the hypothesis in general is false, and there exists a class of Markov decision processes for which balanced wandering is worse in terms of expected time until a goal state is reached.
The latter half of the project then became a search for understanding how this seeming paradox can occur, and finding a counter-proof to the hypothesis that balanced wandering is always superior.
The remainder of this paper is organized as follows. Section 2 sets up the mathematical preliminaries, describes the two exploration strategies under investigation, defines the metric by which they are compared, and states the hypothesis under investigation. Section 3 proves that for the smallest decision processes, balanced wandering is indeed a strict, uniform improvement over with-replacement action selection. Section 4 finds and proves conditions on transition probabilities that are sufficient for balanced wandering to be worse, and presents an intuitive explanation as to how the paradox may occur. Section 5 contains the result of a numerical experiment to investigate the frequency and magnitude of the paradox. Section 6 concludes with ideas for future research.

Preliminaries
Throughout this paper, N will denote the number of states and M the number of actions of a Markov Decision Process (hereafter MDP). Let S = {1, 2, . . . , N} be a finite set of states in which the system may reside. Let state 1 be an initial state the system begins in, and state N a goal state the agent is trying to reach. Let A = {1, 2, . . . , M} be a finite set of actions from which the agent may choose. Associated with each state-action pair (s, a) ∈ S × A is a probability distribution on the states, Pr(s, ·, a) : S → [0, 1], which governs the transition to the next state. For example, Pr(i, j, k) denotes the probability of transitioning to state j when action k is used from state i. The transition probabilities are assumed to be stationary in time and have the Markov property: they are conditionally independent of past states and actions given the current state and action.
MDPs typically include a reward received at each epoch. This paper focuses on the exploration portion of the learning problem, so rewards can be ignored. If this bothers the reader, we suggest supposing that all states emit a reward of zero except for the goal state N, which emits a positive reward. This would be pertinent to a learning algorithm for a process with a single reward state; no meaningful learning can take place until the first epoch at which a state with a positive reward is encountered, so it is desirable to find a state with positive rewards as quickly as possible.
Consider the following two methods for selecting actions from a finite set:
• Random Action Selection (hereafter denoted RAS): Whenever a state is visited, the agent chooses the action randomly and uniformly from the set of all actions.
• Balanced Wandering (hereafter denoted BW): For each state, a history is kept of the number of times each action has been tried. Whenever a state is visited, the agent inspects the history of actions tried from that state, and creates a subset of the actions that have been tried the least number of times. The agent chooses the action randomly and uniformly from the actions in that subset.
There is an equivalent formulation of BW which will be helpful in a forthcoming analysis. Notice that under BW, for a given state and any positive integer n, each action must be tried n times before any action can be tried n + 1 times. Then, as M represents the number of actions, each state is initially assigned a random permutation of the M actions to be executed in the first M visits to that state. After visit M, a new random permutation of the M actions is assigned to the next M visits to that state, and so on.
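The two selection rules can be sketched in a few lines. This is an illustration only, not the authors' implementation; the function names `select_ras` and `select_bw` are hypothetical, and actions are indexed from 0 rather than 1.

```python
import random

def select_ras(num_actions, rng=random):
    """RAS: choose uniformly from all actions on every visit."""
    return rng.randrange(num_actions)

def select_bw(counts, rng=random):
    """BW: choose uniformly among the actions tried least often in this state.

    `counts` holds the usage count of each action for one state and is
    updated in place.
    """
    fewest = min(counts)
    candidates = [a for a, c in enumerate(counts) if c == fewest]
    action = rng.choice(candidates)
    counts[action] += 1
    return action

# Three visits to a state with M = 3 actions under BW.
counts = [0, 0, 0]
first_pass = [select_bw(counts) for _ in range(3)]
```

Because an action leaves the candidate list as soon as its count exceeds the minimum, the first M visits to a state always use each action exactly once, which is the permutation view described above.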
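Define $\tau_R$ and $\tau_B$ to be the hitting times for the goal state, that is, the first time at which the system reaches state $N$, under RAS and BW respectively. That is, writing $s_t$ for the state occupied at epoch $t$,
$$\tau = \min\{t \ge 0 : s_t = N\},$$
with $\tau = \tau_R$ when actions are selected by RAS and $\tau = \tau_B$ when they are selected by BW. In Sections 3 and 4, we assume that state $N$ is accessible from all states, so $P(\tau_R = \infty) = P(\tau_B = \infty) = 0$. Because the state space is finite, standard Markov chain theory implies $E[\tau_R] < \infty$ and $E[\tau_B] < \infty$. In Section 5, we will consider the possibility of an absorbing class of states or dead ends [26] from which the goal state cannot be reached.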

Hypothesis 1.
For an MDP with an arbitrary number of states $N$ and arbitrary number of actions $M$, the expected hitting time for the goal state under BW is less than or equal to the expected hitting time for the goal state under RAS. That is,
$$E[\tau_B] \le E[\tau_R].$$
The following two sections will show that this hypothesis is true for $N = 2$ states, but false in general for $N \ge 3$.
By using expected hitting time to measure the effectiveness of an exploration strategy, we are essentially casting the problem as a special case of a stochastic shortest path (SSP) scenario [27]. An SSP is a type of MDP, often considered as a weighted directed graph, with the goal of selecting actions so that the journey from an initial state to a target state has the minimum possible expected sum of weights for traversed edges. Because we are seeking the strategy with the minimum expected hitting time, this corresponds to an SSP where each edge weight is the transition time between states, so all edges have a weight of one.

A Proof for N = 2
Consider the simple MDP which consists of a finite state space S = {1, 2} and a finite action space A = {1, 2, . . . , M}. Let state 1 be the initial state of the system, and state 2 be the goal state. Then, τ R is the random variable denoting the number of epochs until the first time the system transitions to state 2. Similarly, τ B denotes the number of epochs until first reaching state 2 under BW.
Before stating and proving the theorem, we review some established definitions and results that will be needed in the proof.
Definition 1. Let $x_1, \ldots, x_n$ be real numbers. For integer $k \le n$, the elementary symmetric polynomial of degree $k$ in $x_1, \ldots, x_n$, denoted $e_k(x_1, \ldots, x_n)$, is the sum of all products of $k$ distinct elements from $x_1, \ldots, x_n$. That is, $e_0(x_1, \ldots, x_n) := 1$, and for $1 \le k \le n$,
$$e_k(x_1, \ldots, x_n) := \sum_{1 \le i_1 < \cdots < i_k \le n} x_{i_1} x_{i_2} \cdots x_{i_k}.$$
It is the sum of $\binom{n}{k}$ terms, where each term is the product of an unordered sample without-replacement of size $k$. The polynomials corresponding to $k = 1$ and $k = n$ are the sum of the terms and the product of the terms respectively, $e_1 = \sum x_i$ and $e_n = \prod x_i$.

Definition 2.
The elementary symmetric mean of degree $k$ of $x_1, \ldots, x_n$, denoted $E_k(x_1, \ldots, x_n)$, is the mean of the terms in the elementary symmetric polynomial of the same degree:
$$E_k(x_1, \ldots, x_n) := \frac{e_k(x_1, \ldots, x_n)}{\binom{n}{k}}.$$
Notice that E 1 (x 1 , . . . , x n ) is the arithmetic mean, and E n (x 1 , . . . , x n ) is the geometric mean raised to the power n.
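Theorem 1 (Maclaurin Inequality). Let $x_1, \ldots, x_n$ be non-negative real numbers. Then
$$E_1(x_1, \ldots, x_n) \ge E_2(x_1, \ldots, x_n)^{1/2} \ge \cdots \ge E_n(x_1, \ldots, x_n)^{1/n}.$$
See Biler [28] for a proof. Notice that this is a refinement of the well-known inequality between the arithmetic mean and the geometric mean, which can be recovered by comparing the first and last terms in the sequence of inequalities. Theorem 2 can now be proved.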
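The elementary symmetric means and the chain of inequalities attributed to Maclaurin can be explored numerically. The sketch below uses hypothetical function names and floating-point arithmetic, and it assumes non-negative inputs.

```python
from itertools import combinations
from math import comb, prod

def elementary_symmetric_mean(xs, k):
    """E_k(xs): average of the products over all size-k subsets of xs."""
    return sum(prod(c) for c in combinations(xs, k)) / comb(len(xs), k)

def maclaurin_chain(xs):
    """Return [E_1, E_2^(1/2), ..., E_n^(1/n)] for non-negative xs."""
    n = len(xs)
    return [elementary_symmetric_mean(xs, k) ** (1 / k) for k in range(1, n + 1)]

# For xs = (1, 2, 3): E_1 = 2, E_2 = 11/3, E_3 = 6.
chain = maclaurin_chain([1, 2, 3])
```

The first entry of the chain is the arithmetic mean and the last is the geometric mean, so a non-increasing chain contains the AM-GM inequality as its two endpoints.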
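Theorem 2. For an MDP with $N = 2$ states and an arbitrary number of actions $M$,
$$E[\tau_B] \le E[\tau_R].$$
This inequality is strict unless every action has an equal probability of transitioning to state 2.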
Proof. The first step is recognizing that tail probabilities for the hitting times can be expressed as elementary symmetric means of the transition probabilities under each action. For compactness of notation, denote the probability of transitioning from state 1 to state 2 using action $m$ by $p_m := \Pr(1, 2, m)$, and the probability of remaining in state 1 under action $m$ by $q_m = 1 - p_m$. First, consider RAS. Because every action has probability $1/M$ of being chosen, the overall probability of transitioning from state 1 to state 2 is given by the Law of Total Probability as
$$p := \frac{1}{M} \sum_{m=1}^{M} p_m,$$
which is simply the average of the probabilities of transitioning from state 1 to state 2 under each action. Then $\tau_R$, the number of epochs until the first successful transition under random action selection, is a geometric random variable with parameter $p$, and has tail probabilities
$$\Pr(\tau_R > k) = (1 - p)^k = \Big( \frac{1}{M} \sum_{m=1}^{M} q_m \Big)^k = E_1(q_1, \ldots, q_M)^k.$$
Now consider tail probabilities under BW. Recall that this action selection scheme is equivalent to assigning a permutation of the $M$ actions to each state to be executed sequentially. For a value $k \le M$, the event $\{\tau_B > k\}$ means that the first $k$ actions that were tried did not result in a successful transition to state 2. Partitioning this event according to the first $k$ actions in the permutation and applying the Law of Total Probability shows that this probability can also be expressed in terms of elementary symmetric means. Let $C_k$ denote the set of all combinations of $k$ integers out of the first $M$ integers, representing the first $k$ actions used. Each combination of $k$ actions is equally likely, so the probability of each set in the partition is $1/\binom{M}{k}$. Formally,
$$\Pr(\tau_B > k) = \sum_{C \in C_k} \frac{1}{\binom{M}{k}} \prod_{m \in C} q_m = E_k(q_1, \ldots, q_M).$$
A direct application of the Maclaurin Inequality shows that, for $k \le M$,
$$\Pr(\tau_B > k) = E_k(q_1, \ldots, q_M) \le E_1(q_1, \ldots, q_M)^k = \Pr(\tau_R > k).$$
The next step is to use this inequality to obtain an inequality between conditional expectations for the hitting times.
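Finally, consider the unconditional expectation under each action selection scheme. Under BW, if the goal has not been reached after the first $M$ epochs, every action has been tried exactly once, a new permutation is assigned, and the search begins anew. Conditioning on whether the goal is reached within the first $M$ epochs gives
$$E[\tau_B] = E\big[\tau_B \mathbf{1}\{\tau_B \le M\}\big] + \big(M + E[\tau_B]\big) \Pr(\tau_B > M).$$
Solve this expression for $E[\tau_B]$, using $E[\tau_B \mathbf{1}\{\tau_B \le M\}] = \sum_{k=0}^{M-1} [\Pr(\tau_B > k) - \Pr(\tau_B > M)]$:
$$E[\tau_B] = \frac{\sum_{k=0}^{M-1} \Pr(\tau_B > k)}{1 - \Pr(\tau_B > M)}.$$
The same logic is used to show
$$E[\tau_R] = \frac{\sum_{k=0}^{M-1} \Pr(\tau_R > k)}{1 - \Pr(\tau_R > M)}.$$
A term-by-term comparison, using $\Pr(\tau_B > k) \le \Pr(\tau_R > k)$ for each $k \le M$, shows that $E[\tau_B] \le E[\tau_R]$, which completes the proof.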
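For a concrete two-state process, both expectations can be computed exactly. The sketch below is an illustration under stated assumptions: it takes $\Pr(\tau_B > k) = E_k(q_1, \ldots, q_M)$ for $k \le M$ from the proof, and it relies on BW restarting with a fresh permutation after every $M$ unsuccessful epochs, so that $E[\tau_B]$ is the sum of the first $M$ tail probabilities divided by $1 - \Pr(\tau_B > M)$. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, prod

def sym_mean(xs, k):
    """Exact elementary symmetric mean E_k(xs), returned as a Fraction."""
    total = sum(prod(c) for c in combinations(xs, k))
    return Fraction(total) / comb(len(xs), k)

def expected_times_two_state(p):
    """Given p[m] = Pr(1, 2, m), return (E[tau_R], E[tau_B]) for N = 2."""
    M = len(p)
    q = [1 - pm for pm in p]
    # RAS: geometric with success probability equal to the mean of p.
    e_ras = 1 / (Fraction(sum(p)) / M)
    # BW: tails[k] = Pr(tau_B > k) for k = 0..M, then the renewal formula.
    tails = [sym_mean(q, k) for k in range(M + 1)]
    e_bw = sum(tails[:M]) / (1 - tails[M])
    return e_ras, e_bw

# Two actions with success probabilities 1/2 and 1/4.
e_ras, e_bw = expected_times_two_state([Fraction(1, 2), Fraction(1, 4)])
```

In this example RAS needs 8/3 epochs on average and BW needs 13/5, consistent with the theorem. When all entries of `p` are equal the two values coincide.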

A Counter-Proof for N ≥ 3
The argument used in the preceding proof does not extend to N ≥ 3 states. After several other proof strategies also failed, we began to wonder if the hypothesis is true in general. This section states and proves conditions that show analytically that Hypothesis 1 is false in general. Because there are now an arbitrary number of states, we return to the general notation for transition probabilities, in which Pr(i, j, k) represents the probability of transitioning to state j when action k is used from state i. First, a quantity is defined which will be useful in the proof.

1.
If there exists $c \in (0, 1)$ such that
$$\Pr(n, N, m) \ge c \quad \text{for all } n \in S,\ m \in A \tag{1}$$
Furthermore, there exist values of $c$ in a neighborhood of one such that an MDP satisfying (1) and (2) exists.
Before starting the formal proof, see that Equation (2) has an interpretation that helps with an intuitive understanding of how the paradox may happen. BW is inferior relative to RAS when the left side is large, which happens when the actions having the greatest probability to transition directly from the initial state to the goal state also are likely to remain in the initial state. Therefore, if the use of one of these actions results in the system remaining in the initial state, it would be desirable to use the same action again. Under RAS, it is possible to use the same action from the initial state multiple times in a row, but BW forces the use of other, inferior actions. When this effect is large enough, the expected hitting time under BW exceeds that under RAS.

Proof.
A well-known fact from probability theory states that the expected value of a non-negative integer-valued random variable $X$ can be found as a sum over tail probabilities, $E[X] = \sum_{k=0}^{\infty} \Pr(X > k)$ ([29], p. 3). Applying this to the hitting times under RAS and BW, we have
$$E[\tau_B] - E[\tau_R] = \sum_{k=2}^{\infty} \big[ \Pr(\tau_B > k) - \Pr(\tau_R > k) \big].$$
Notice that the first two differences in the sum are zero and have been dropped. For $k = 0$, both tail probabilities are one. For $k = 1$, using the rule of complements, $\Pr(\tau_B > 1) - \Pr(\tau_R > 1) = \Pr(\tau_R = 1) - \Pr(\tau_B = 1) = 0$ because all actions are available, thus the probabilities of transitioning directly to the goal are equivalent. Then, moving the $k = 2$ term out of the sum, $E[\tau_B] > E[\tau_R]$ if and only if
$$\Pr(\tau_B > 2) - \Pr(\tau_R > 2) > \sum_{k=3}^{\infty} \big[ \Pr(\tau_R > k) - \Pr(\tau_B > k) \big].$$
The strategy will be to choose transition probabilities so that the left side is positive and large, and the right side has a small absolute value.
Consider first the left side. Elementary but tedious calculations, which are deferred to Appendix A, give a closed form for the left side in terms of the transition probabilities out of the initial state.
Now consider the right side. Assumption (1) says that no matter what the current state and action are, there is always a probability of at least $c$ of transitioning directly to the goal. Therefore, $\Pr(\tau_R > k)$ and $\Pr(\tau_B > k)$ are each at most $(1 - c)^k$, as is the absolute value of their difference. Using this with well-known formulas for sums and partial sums of geometric series, we obtain an upper bound on the right side that depends only on $c$. Combining the expressions for the left and right sides, we see that Equation (2) is sufficient for $E[\tau_B] > E[\tau_R]$.
It still remains to show existence, that is, that there are probabilities and a value of $c$ satisfying Equations (1) and (2). Now consider the MDP with probabilities defined in Equation (3). Clearly, these probabilities satisfy Equation (1). The left side of Equation (2) only has one non-zero term, so combining with $p_{-1} = c$, it can be simplified. It remains to show there is a value $c \in (0, 1)$ for which the resulting inequality holds. Move all terms to the left hand side and call it $f(c)$, so that the inequality reads $f(c) > 0$. Note that $f(1) = 0$. The first and second derivatives at $c = 1$ satisfy $f'(1) = 0$ and $f''(1) > 0$. A twice-differentiable function with $f(1) = 0$, $f'(1) = 0$ and $f''(1) > 0$ is necessarily positive for some value $c < 1$.
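To make the role of the $k = 2$ term concrete, the difference $\Pr(\tau_B > 2) - \Pr(\tau_R > 2)$ can be worked out directly from the two selection rules. The following is a sketch derived here for illustration, not a reproduction of the Appendix A calculation. It assumes $M \ge 2$, and the symbols $p_m := \Pr(1, N, m)$, $\bar{p} := \frac{1}{M} \sum_{l} p_l$ and $\bar{p}_{-m} := \frac{1}{M-1} \sum_{l \ne m} p_l$ are notation introduced only for this sketch. The two strategies choose the first action identically, and they differ at the second epoch only if the system remained in state 1, in which case BW must avoid the action $m$ just used.

```latex
\begin{align*}
\Pr(\tau_B > 2) - \Pr(\tau_R > 2)
  &= \frac{1}{M} \sum_{m=1}^{M} \Pr(1, 1, m)
     \Big[ (1 - \bar{p}_{-m}) - (1 - \bar{p}) \Big] \\
  &= \frac{1}{M} \sum_{m=1}^{M} \Pr(1, 1, m) \, (\bar{p} - \bar{p}_{-m}) \\
  &= \frac{1}{M^2} \sum_{m=1}^{M} \Pr(1, 1, m) \, (p_m - \bar{p}_{-m}),
\end{align*}
% The last step uses \bar{p} = \big( (M-1)\,\bar{p}_{-m} + p_m \big) / M.
```

The final expression is large when an action that is better than the average of the others at reaching the goal ($p_m > \bar{p}_{-m}$) is also likely to leave the system in state 1, which matches the intuitive explanation given before the proof.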

Numerical Experiments
Now that Hypothesis 1 is known to be false in general, several questions are immediately raised, in particular how frequently the hypothesis fails and how severe the failure can be. To investigate them numerically, we first need a procedure for constructing a random MDP, which essentially reduces to repeatedly generating discrete distributions where the probabilities are themselves random. This is accomplished by generating independent uniform random variables, and standardizing so that they sum to one. Algorithm 1 gives pseudocode for this procedure, which the reader can find implemented in the language R [30] in the Supplementary Material as the function makeProbArray().
Algorithm 1: Pseudocode for randomly generating the probabilities for an MDP with a specified number of states and actions.
Input: N, the number of states; M, the number of actions.
Output: Pr(·, ·, ·) : S × S × A → [0, 1], a randomly generated probability transition structure for an MDP.
for i ← 1 to N do
    for j ← 1 to M do
        Generate N independent observations from a Uniform(0,1) distribution, x_1, . . . , x_N.
        for k ← 1 to N do
            Pr(i, k, j) ← x_k / (x_1 + · · · + x_N)
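A Python rendering of this procedure might look as follows. It is a sketch of the idea described in the text (uniform draws standardized to sum to one) and is not the authors' R function; the name `make_prob_array`, the nested-list layout `P[i][j][k]`, and the 0-based indices are assumptions made for this illustration.

```python
import random

def make_prob_array(N, M, rng=random):
    """Return P with P[i][j][k] = probability of moving from state i to
    state j under action k (all indices 0-based)."""
    P = [[[0.0] * M for _ in range(N)] for _ in range(N)]
    for i in range(N):
        for k in range(M):
            # One random distribution over next states per (state, action).
            x = [rng.random() for _ in range(N)]
            total = sum(x)
            for j in range(N):
                P[i][j][k] = x[j] / total
    return P

# Example: 3 states, 2 actions, reproducible draws.
P = make_prob_array(3, 2, random.Random(0))
```

Because every entry is positive with probability one, the goal state is accessible from every state in MDPs generated this way.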
Once the transition probabilities are generated, finding $E[\tau_R]$ is straightforward. Under RAS, the system is an ordinary Markov chain with transition probabilities
$$p_{ij} = \frac{1}{M} \sum_{m=1}^{M} \Pr(i, j, m).$$
Finding the expected hitting time for a specified state in a Markov chain is a standard technique (see, for example, Section 2.11 of Resnick [29]) and so is stated without derivation. Force the goal state $N$ to be absorbing by setting $p_{NN} = 1$ and $p_{Ni} = 0$ for $i = 1, \ldots, N - 1$. Let $Q$ be the matrix containing transition probabilities between transient states, $I$ the identity matrix of the same size as $Q$, and $\mathbf{1}$ a column vector of all ones with the same number of rows as $Q$. Then,
$$(I - Q)^{-1} \mathbf{1} \tag{6}$$
is a vector with entry $k$ containing the expected time until absorption starting from state $k$. Therefore, $E[\tau_R]$ is the first entry of this vector.
Finding $E[\tau_B]$ requires more work, because under BW the choice of action depends on which actions have already been used, so the sequence of states alone is not a Markov chain. The record of used actions is kept in a history matrix $H$, an $N \times M$ matrix of zeros and ones in which no row consists entirely of ones. An entry of one in row $i$ and column $j$ of $H$ indicates that action $j$ has already been used in the current permutation of actions for state $i$, therefore action $j$ is currently unavailable from state $i$. Likewise, an entry of zero in row $i$ and column $j$ means action $j$ is available from state $i$. The restriction that a row cannot consist entirely of ones corresponds to BW assigning a new permutation of actions once the current pass has ended, meaning the row would consist entirely of zeros.
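The RAS calculation described above, averaging over actions and then solving the linear system behind $(I - Q)^{-1} \mathbf{1}$, can be sketched with exact rational arithmetic. This is an illustration, not the authors' R code. The function names, the layout `P[i][j][k]` with 0-based indices (state 0 initial, last state the goal), and the small example MDP are assumptions made here.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def expected_hitting_time_ras(P):
    """E[tau_R] from state 0 to the last state under uniform random actions."""
    N, M = len(P), len(P[0][0])
    # Averaged chain: p[i][j] = (1/M) * sum over actions of P[i][j][k].
    p = [[sum(map(Fraction, P[i][j])) / M for j in range(N)] for i in range(N)]
    # Transient states are 0..N-2; solve (I - Q) t = 1.
    A = [[int(i == j) - p[i][j] for j in range(N - 1)] for i in range(N - 1)]
    return solve(A, [1] * (N - 1))[0]

# Illustrative MDP (constructed for this sketch): N = 3 states, M = 2 actions.
h, q, z, o = Fraction(1, 2), Fraction(1, 4), Fraction(0), Fraction(1)
P = [
    [[h, z], [z, h], [h, h]],          # state 0: action 0 stays or reaches goal; action 1 moves to state 1 or goal
    [[z, z], [1 - q, 1 - q], [q, q]],  # state 1: either action reaches the goal w.p. 1/4, else stays
    [[z, z], [z, z], [o, o]],          # goal state, made absorbing
]
e_ras = expected_hitting_time_ras(P)
```

For this example the averaged chain gives an expected hitting time of 8/3 epochs from the initial state.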
For example, for the following history matrix with $N = 3$ states and $M = 2$ actions,
$$H = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix},$$
if the system is in state 1, then action 2 must be used, at which point the first row reverts to all zeros. If the system is in state 2, action 1 must be used, and the second row resets. If the system is in state 3, actions 1 and 2 are eligible, so each action has a 50% chance of being selected. The entry in the third row for the chosen action would change to 1.
The state space for the induced process is $S' = S \times \mathcal{H}$, where $\mathcal{H}$ denotes the set of all history matrices; that is, for the Markov property to hold, one must know both the state from the original process and information about the actions used from each state. The number of states in $S'$ is much larger than the number of states in $S$. There are $2^M - 1$ possibilities for each row of the history matrix, and there are $N$ rows, so there are $(2^M - 1)^N$ possible history matrices. Finally, a state in $S'$ must also include one of the $N$ states from $S$, so there are $N(2^M - 1)^N$ states total in $S'$.
Now we define transitions between states in the induced process. Let $i, j \in S$ and $H_k, H_l \in \mathcal{H}$, so that $(i, H_k), (j, H_l) \in S'$. First recognize that it will not be possible to transition from $(i, H_k)$ to $(j, H_l)$ in one epoch unless the history matrix for the second state reflects the most recently used action but is otherwise identical to the history matrix from the first state. This motivates the following definition.
Definition 5. Let $(i, H_k), (j, H_l) \in S'$, and $m \in A$. We say that $H_l$ is $m$-compatible with $(i, H_k)$ if $H_k$ has an entry of zero in row $i$ and column $m$, and changing that entry to a one results in $H_l$.
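The count $N(2^M - 1)^N$ grows quickly, which a short calculation makes tangible. The helper name below is hypothetical.

```python
def induced_state_count(N, M):
    """Number of (state, history matrix) pairs: N * (2^M - 1)^N."""
    return N * (2 ** M - 1) ** N

# Sizes for a few (N, M) pairs with N + M = 7.
sizes = {(N, 7 - N): induced_state_count(N, 7 - N) for N in range(2, 6)}
```

With N + M = 7 the induced chain already has between roughly one thousand and ten thousand states, while the original MDP has at most five.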
Intuitively, m-compatibility means that action m is available to be selected from the current state, and that H l is the resulting history matrix when action m is used (remembering the possibility that if this causes row i to consist entirely of ones, it resets to a row entirely of zeros).
Let $n_{(i, H_k)}$ be the number of entries in row $i$ of $H_k$ that are equal to zero. That is, $n_{(i, H_k)}$ is the number of actions available to choose from when in state $i$. Then, writing $\Pr'$ for transition probabilities in the induced process, they are defined by
$$\Pr{}'\big((i, H_k), (j, H_l)\big) = \begin{cases} \dfrac{\Pr(i, j, m)}{n_{(i, H_k)}} & \text{if } H_l \text{ is } m\text{-compatible with } (i, H_k) \text{ for some } m \in A, \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$
Equation (7) can be understood in three parts. First, the requirement that induced states are compatible ensures that the history matrix is properly updated. Second, the denominator accounts for the number of actions available to choose from. Third, the numerator accounts for the transition probabilities from the original MDP. In the Supplementary Materials, the construction of the transition matrix in the induced MDP is implemented in the function makeInducedChain(). Once the transition matrix is obtained, the submatrix corresponding to transitions between transient states can be extracted, and finally $E[\tau_B]$ can be obtained from the calculation in Equation (6). These final steps are implemented in the function expectedHittingTimeBW().
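The induced-chain calculation can be sketched as follows. This is not the authors' makeInducedChain() or expectedHittingTimeBW(); it is an illustration under stated assumptions: indices are 0-based with the last state as the goal, a history matrix is a tuple of rows, only induced states reachable from the initial state with an all-zero history are enumerated, and arithmetic is exact. All function names and the example MDP are hypothetical.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def bw_successors(P, state):
    """Yield (probability, next induced state) for one BW epoch.
    The next state is None when the goal is reached."""
    i, H = state
    N, M = len(P), len(H[i])
    available = [m for m in range(M) if H[i][m] == 0]
    for m in available:
        row = tuple(1 if a == m else used for a, used in enumerate(H[i]))
        if all(row):                      # pass finished: start a new permutation
            row = (0,) * M
        H_next = H[:i] + (row,) + H[i + 1:]
        for j in range(N):
            pr = Fraction(P[i][j][m]) / len(available)
            if pr:
                yield pr, (None if j == N - 1 else (j, H_next))

def expected_hitting_time_bw(P):
    """E[tau_B] from state 0 with an empty history to the last state."""
    N, M = len(P), len(P[0][0])
    start = (0, ((0,) * M,) * N)
    index, stack, edges = {start: 0}, [start], {}
    while stack:                          # enumerate reachable induced states
        s = stack.pop()
        edges[s] = list(bw_successors(P, s))
        for _, nxt in edges[s]:
            if nxt is not None and nxt not in index:
                index[nxt] = len(index)
                stack.append(nxt)
    n = len(index)
    A = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    for s, out in edges.items():          # build I - Q over transient states
        for pr, nxt in out:
            if nxt is not None:
                A[index[s]][index[nxt]] -= pr
    return solve(A, [1] * n)[0]

# Illustrative MDP (constructed for this sketch): N = 3 states, M = 2 actions.
h, q, z, o = Fraction(1, 2), Fraction(1, 4), Fraction(0), Fraction(1)
P = [
    [[h, z], [z, h], [h, h]],          # state 0: action 0 stays or reaches goal; action 1 moves to state 1 or goal
    [[z, z], [1 - q, 1 - q], [q, q]],  # state 1: either action reaches the goal w.p. 1/4, else stays
    [[z, z], [z, z], [o, o]],          # goal state
]
e_bw = expected_hitting_time_bw(P)
```

For this example the result is 11/4, whereas averaging over actions as under RAS gives 8/3 for the same process, so it is a small instance in which BW is slower. The mechanism is the one described in Section 4: the action that keeps the system in the initial state is the one worth repeating, and BW forbids repeating it.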
The results of the numerical investigation are presented in Figure 2. For each combination of values of N and M, Algorithm 1 was used to generate 1000 MDPs. Notice that because the size of the induced state space S grows so rapidly with N and M, results are only given for N + M ≤ 7. For N + M > 7, our hardware could not complete the experiment in a reasonable amount of time.
For each MDP, the difference $E[\tau_B] - E[\tau_R]$ was calculated. The figure displays histograms of the differences for each combination of $N$ and $M$, which share several features. The peak is consistently slightly less than zero, indicating that typically BW is slightly better than RAS. The distribution has a left skew, so there are MDPs for which BW is significantly better. Finally, very little of the distribution is to the right of zero, so while Hypothesis 1 can be false, it is rare and BW is not worse by a large degree.
In addition to the histograms, Figure 2 gives the mean difference, the percentage of differences that are greater than zero (signifying how often Hypothesis 1 is false), and the maximum observed difference, giving a sense of how strongly Hypothesis 1 can be violated. The mean difference is consistently close to −0.085, and deviations are small relative to the sampling error, so there is little evidence that the mean difference changes substantially with N and M. On the other hand, the percentage of positive differences and the maximum difference appear to decline slightly as N, the number of states, increases, and decline sharply as M, the number of actions, increases. Therefore, it seems that the frequency and magnitude of violations of Hypothesis 1 decrease as the MDP becomes more complex.
Until now, we have assumed that no matter what state the system is in, the goal state is accessible. That is, there is always a sequence of states and actions with a positive probability of reaching the goal. In many real life applications, the system may enter a dead end [26], which is a state from which it is impossible to reach the goal. If such states are allowed, then the SSP implicitly becomes a multi-objective optimization problem [31], as the agent must simultaneously try to avoid the dead ends while still minimizing the expected cost of reaching the goal.
In this investigation, we handle dead ends by allowing the system to reset to the initial state, but with a penalty of additional time to do so. Formally, this is implemented by adding a return state, R. If the system enters a dead end, it will transition to the return state with probability one, and then transition to the initial state with probability one, at which point the search for the goal state begins anew. This modification can be observed in the following example. The matrix on the left represents the transition probabilities under an arbitrary action, with state 2 being absorbing. If state 2 is absorbing for all actions, then state 2 is a dead end. On the right, the matrix has been modified so that state 2 leads to the return state R. As mentioned previously, the state space for the induced process under BW grows rapidly, and exact calculation of the expected hitting time rapidly becomes infeasible. This is even more so when dead ends are allowed, as the additional return state increases the size of the induced process. For this reason, we could not replicate the entire experiment shown in Figure 2 allowing dead ends, but we did discover that processes still exist for which BW is worse than RAS. Figure 3 gives an example of such a process, and code for reconstructing it is in the Supplementary Materials.

Conclusions
This paper has compared two strategies for exploring Markov decision processes with the goal of reaching a specified state in the shortest average number of epochs. Though intuition suggests that the strategy using without-replacement sampling of actions would be uniformly superior, we have shown that this is guaranteed only for processes with two states, and can fail for processes with three or more states.
As with most non-intuitive discoveries, more questions are raised than answered, and there are many avenues for possible future research. Some of these questions are addressed here.
How can behavior under BW be efficiently investigated when the number of states and actions is large? The state space of the induced process grows rapidly, and it soon becomes impractical to use the standard method for calculating the expected hitting time which requires inversion of the transition matrix. We have also attempted to estimate the hitting time statistically by repeated simulation, but the sampling variance is large relative to the difference E[τ B ] − E[τ R ], making it difficult to have confidence in any inferences. Further investigation will require the application or development of techniques beyond those used in this paper.
Though it is difficult to analyze large processes exactly, we can make a conjecture. From the numerical experiments, it seems that the difference between RAS and BW (in terms of the percentage of times BW is worse and the maximum degree by which RAS is better) decreases as the number of actions increases. An insight gained from Equation (2) gives reason to believe that this may be true. The term $p_{-m}$, the probability of transitioning when action $m$ is unavailable, is key. However, as the number of actions increases, $p_{-m}$ will necessarily differ little for each value of $m$. This corresponds to the notion that sampling with and without replacement become more similar as the number of elements that can be sampled increases. Therefore, we conjecture that the phenomenon studied in this paper will be less relevant for larger processes.
Is the use of count-based exploration practically harmful in any state-of-the-art applications? This has not yet been investigated. For future research, the authors intend to implement each of RAS and BW as exploration strategies in benchmark problems and see if the behavior learned is significantly different.
Are there other measures beyond expected hitting time for which BW can be proven strictly better? By using the expected hitting time criterion for a single goal state, we are admittedly limiting ourselves to the study of stochastic shortest path MDPs. Perhaps a more complete metric would be the performance of an agent trained on the data collected by each exploration strategy. This type of analysis is deferred for future research.
In Theorem 3, Equation (1) explicitly requires the goal be accessible from all states, so the same proof strategy cannot be extended to processes with dead ends. Computations also become more difficult with dead ends. At this point, we know that Hypothesis 1 can still be violated when dead ends exist, but nothing else is known for certain. Dead ends provide another avenue for future research.
Theorem 3 is essentially an existence proof which finds sufficient conditions for Hypothesis 1 to be false; can it be extended to finding necessary conditions as well, providing a complete characterization? There is still much work to be done before a complete understanding of optimal exploration in Markovian environments is reached.