Scaling Up Q-Learning via Exploiting State–Action Equivalence

Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key in devising viable methods capable of solving complex tasks. We study off-policy learning in discounted Markov decision processes (MDPs) that admit some equivalence structure. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning. These experiments indicate that the empirical performance gain obtained by exploiting the equivalence structure could be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.


Introduction
Reinforcement learning (RL) aims to develop computer systems with the ability to learn how to behave optimally, or nearly so, in an unknown dynamic environment. An RL task typically involves an agent interacting with the environment, which is often modeled as a Markov decision process (MDP), and the agent's goal is to find a policy maximizing some notion of reward. In most settings in RL, the MDP is initially unknown beyond its state and action spaces. Hence, the agent aims to learn a near-optimal policy using the experiences collected from the environment.
A classical setting in RL is off-policy learning [1], where one tries to learn the optimal action-value function (i.e., Q-function) through data collected under some behavior or logging policy. Perhaps the most famous off-policy learning algorithm is the celebrated Q-learning algorithm [2], whose improved variants, combined with deep neural networks as function approximators, played key roles in many recent breakthroughs in RL [3,4]. More precisely, Q-learning and its variants fall under the category of model-free methods, in which one tries to directly estimate the optimal value function from the collected experience, without estimating the true model (MDP); a near-optimal policy can then be straightforwardly derived from this estimate. This approach is in stark contrast to the model-based counterpart, where one first attempts to estimate the unknown model parameters (i.e., MDP parameters that include transition probabilities and rewards) from the collected experience and then finds an optimal policy in the estimated model.
Precisely speaking, these algorithms are guaranteed to return a near-optimal policy, with respect to a prescribed accuracy, with high probability, provided that the amount of collected experience exceeds a certain (algorithm-dependent) function of the MDP parameters (and relevant input parameters). Most of these works study unstructured tabular MDPs, where the advertised sample complexity bounds scale, among other things, with the size of the state-action space. Thus, despite their appealing performance guarantees, most of these algorithms only work reasonably well when the size of the underlying MDP is small. On the other hand, many practical tasks can be modeled by MDPs with huge (or even infinite) state spaces, but these MDPs often exhibit some structural properties. Ignoring such structural properties and directly applying the above algorithms would lead to a prohibitively large sample complexity bound, which may imply a huge learning phase in the worst case. Alternatively, one could leverage the structure in the MDP to speed up exploration. In fact, exploiting the structure allows the agent to use the observations collected from the environment to reduce the uncertainty in the model parameters for many similar state-action pairs at each time slot. As a result, the learning performance would depend on the effective size of the state space (or the effective number of unknown parameters). Various notions of structure have been studied in MDPs, including the Lipschitz continuity of MDP parameters (e.g., rewards and transition functions) [10-13], factorization structure [14-16], and equivalence relations [17-22]. These works reveal that exploiting the underlying structure in the environment in various RL tasks leads to massive empirical performance gains (over structure-oblivious algorithms) and to significantly improved performance bounds. However, exploiting structure often poses additional challenges.
This work is motivated by tabular RL problems, where the (potentially large) state-action space admits a natural partitioning such that within each element of the partition (or class), the state-action pairs have similar transition probabilities. There exist several ways to characterize the similarity between the transition distributions of two state-action pairs. Here, we consider a notion implying that they are (almost) identical up to some permutation. As we shall see in later sections, this notion of structure induces an equivalence relation on the state-action space. This model has been considered in prior work [18,23,24], where model-based algorithms were presented to exploit such a structure in the context of regret minimization in episodic or average-reward MDPs. However, their proposed ideas and techniques are specific to a model-based approach, where the model parameters are directly estimated, and cannot be used to incorporate the knowledge of the equivalence structure into a model-free algorithm. Model-free algorithms have played a pivotal role in the recent success of RL in solving complex tasks arising in real-world applications (e.g., autonomous driving and continuous control [25]). Hence, it is promising to study the gain one could obtain using model-free methods when leveraging the equivalence structure, thereby extending the theoretical analysis in [18,24] beyond model-based methods.
Contributions. We make the following contributions. We study off-policy learning in discounted finite MDPs admitting some equivalence structure in their state-action space. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP when prior knowledge of the structure is provided to the agent. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically. As it turns out, the sample efficiency gain of QL-ES over Q-learning is captured by an MDP-dependent quantity ξ that is defined in terms of the associated cover times in the MDP; see Section 5 for details. Analytically establishing the dependence of the gain ratio ξ on the number S of states in a given MDP seems difficult, although it is possible to compute it numerically. Nonetheless, we present a simple example where ξ = O(S), showcasing that QL-ES in some domains may require far fewer samples (by a factor of S) than Q-learning. Furthermore, we numerically compute ξ for a few families of MDPs built using standard environments (with increasing S), thereby showcasing the theoretical superiority of QL-ES over Q-learning. Through extensive numerical experiments on standard domains, we show that Q-function estimates under QL-ES converge much faster than those obtained from (structure-oblivious) Q-learning. These results demonstrate that the empirical performance gain from exploiting the equivalence structure could be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in MDPs.

Related Work
Similarity and equivalence in MDPs. There is a rich literature on learning and exploiting various notions of structure in MDPs, where the aim is to leverage structure to alleviate the computational cost of finding an optimal policy (in the known MDP setting) or to speed up exploration (in the RL setting). Many such algorithms fall into the category of state abstraction (or aggregation) [26,27]. Approximate homomorphism has been proposed to construct beneficial abstract models in MDPs [28]. In the known MDP setting, Refs. [29,30] appear to be the first to present the notion of equivalence between states based on stochastic bi-simulation. The authors of [31,32] use bi-simulation metrics as quantitative analogues of the equivalence relations to partition the state space by capturing similarities. In the RL setting, Refs. [18-20,33,34] investigate model-based algorithms that rely on grouping similar states (or state-action pairs) to speed up exploration. Ref. [20] is the first to present an average-reward RL algorithm (in the regret setting) where the confidence intervals of similar states are aggregated. Ref. [18] studies regret minimization in average-reward MDPs with equivalence structure and presents the C-UCRL algorithm, which is capable of exploiting the structure. The regret bound for C-UCRL depends on the number of classes in the MDP rather than the size of the state-action space. A similar equivalence structure was studied in [17] in the context of multi-task RL, where similarities of the transition dynamics across tasks were extracted and exploited to speed up learning. Ref. [24] studies the efficiency of hierarchical RL in the regret setting in scenarios where the hierarchical structure is defined with respect to the notion of equivalence; more precisely, it assumes that the underlying MDP can be decomposed into equivalent sub-MDPs, i.e., smaller MDPs with identical reward and transition functions up to some known bijection mappings.
Closest to our work, in terms of the structure definition, is [18]. However, we restrict ourselves to a model-free approach, where the model-based machinery presented in [18] does not apply. Finally, we mention that there is some literature on exploiting equivalence in deep RL (e.g., [21,22]). However, to the best of our knowledge, none of these works study provably efficient learning methods.

Q-learning and its variants. We provide a very brief overview of the works studying the theoretical analysis of Q-learning and its variants. Q-learning [2] has been around for more than three decades as a cheap and popular model-free method for solving finite, unknown discounted MDPs without estimating the model. Its convergence was first investigated in an asymptotic flavor [35,36], and more recently in the non-asymptotic (finite-sample) regime in a series of works, including [9,37-40]. To the best of our knowledge, Ref. [9] reports the sharpest PAC-type sample complexity bound for classical Q-learning. Some of these works present variants of Q-learning with improved sample complexity bounds using a variety of techniques, such as acceleration and variance reduction [9,40,41]. Although the concept of equivalence in MDPs is not new, to our knowledge there is no work reporting PAC-type sample complexity bounds for model-free algorithms combined with equivalence relations.

Problem Formulation
In this section, we present some necessary background and formulate the reinforcement learning problem considered in this paper. We use the following notations throughout. For a set B, ∆(B) denotes the set of all probability distributions over B. For an event E, I{E} denotes the indicator function of E: namely, it equals 1 if E holds, and 0 otherwise.

Discounted Markov Decision Processes
Let M = (S, A, P, R, γ) be an infinite-horizon discounted MDP [42], where S denotes a discrete state space with cardinality S, A denotes a discrete action space with cardinality A, and γ ∈ (0, 1) is a discount factor. P : S × A → ∆(S) represents the transition function such that P(s′|s, a) denotes the probability of transiting to state s′ when action a ∈ A is chosen in state s ∈ S. Further, R : S × A → ∆([0, 1]) denotes the reward function supported on [0, 1] such that R(s, a) denotes the reward distribution when choosing action a ∈ A in state s ∈ S. We denote by R̄(s, a) the mean of R(s, a). A stochastic (or randomized) policy π : S → ∆(A) is a mapping that maps a state to a probability distribution over A. For a policy π, the value function of π is a mapping V^π : S → R defined as

V^π(s) := E[ ∑_{t≥0} γ^t r_t | s_0 = s ],

where for all t ≥ 0, a_t ∼ π(s_t), s_{t+1} ∼ P(·|s_t, a_t), and r_t ∼ R(s_t, a_t), and where the expectation is taken with respect to the randomness in rewards, next states, and actions sampled from π. The action-value function of a policy π, denoted by Q^π : S × A → R, is defined as

Q^π(s, a) := E[ ∑_{t≥0} γ^t r_t | s_0 = s, a_0 = a ].

The optimal value function is denoted by V* and satisfies V*(s) = max_π V^π(s) for all s ∈ S. It is well known that in any finite MDP, there exists a stationary deterministic policy π : S → A such that V^π = V*, which is called an optimal policy [42]. Similarly, the optimal state-action value function is defined as Q*(s, a) := max_π Q^π(s, a) for all (s, a) ∈ S × A. An optimal policy π* satisfies π*(s) ∈ arg max_a Q*(s, a) for all s ∈ S. Furthermore, Q* is the unique solution to the optimal Bellman equation [42]:

Q*(s, a) = R̄(s, a) + γ ∑_{s′∈S} P(s′|s, a) max_{b∈A} Q*(s′, b),   for all (s, a) ∈ S × A.
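To make the Bellman equation concrete, the following minimal sketch computes Q* by fixed-point iteration of the optimal Bellman operator (value iteration). The one-state, two-action MDP at the bottom is a hypothetical illustration, not an environment from this paper:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-10):
    """Compute Q* by iterating the optimal Bellman operator.

    P: array of shape (S, A, S) with transition probabilities.
    R: array of shape (S, A) with mean rewards.
    """
    Q = np.zeros(R.shape)
    while True:
        V = Q.max(axis=1)             # current estimate of V*(s')
        Q_new = R + gamma * (P @ V)   # Bellman update, batched over (s, a)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Hypothetical 1-state, 2-action MDP: both actions self-loop, rewards 0 and 1.
# With gamma = 0.5, the fixed point gives Q*(a1) = 2 and Q*(a0) = 1.
P = np.ones((1, 2, 1))
R = np.array([[0.0, 1.0]])
Q_star = q_value_iteration(P, R, 0.5)
```

Since the Bellman operator is a γ-contraction in the ℓ∞-norm, the iteration converges geometrically from any initialization.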

The Off-Policy Learning Problem and Q-Learning
We consider the off-policy learning problem as follows. The agent is provided with some dataset D collected according to some behavior or logging policy π_b. Precisely speaking, D takes the form of a trajectory {(s_t, a_t, r_t)}_{t≥0}, where s_0 is some initial state, and where for each t ≥ 0, a_t ∼ π_b(s_t), s_{t+1} ∼ P(·|s_t, a_t), and r_t ∼ R(s_t, a_t). The agent is given an accuracy parameter ε > 0 and a failure probability parameter δ ∈ (0, 1), and its goal is to find an ε-optimal policy using as few samples from D as possible.
We need to impose some assumptions on the behavior policy π_b to ensure that it is possible to efficiently learn a near-optimal policy using only D, with PAC-type guarantees. To state the assumptions, we introduce some necessary definitions, which are borrowed from standard textbooks on Markov chains (e.g., [43]) but are also standard in the theoretical analysis of Q-learning (e.g., [9,39]). Let X be a finite set. The total variation distance between two distributions µ and ν defined over X is given by

d_TV(µ, ν) := (1/2) ∑_{x∈X} |µ(x) − ν(x)|.

Now, consider an ergodic Markov chain (X_t)_{t≥1} with state space X and transition function p : X → ∆(X), and let µ be the unique stationary distribution of the chain. The Markov chain is said to be uniformly ergodic if there exist some ρ < 1 and M < ∞ such that for all t > 0,

max_{x∈X} d_TV( p^t(·|x), µ ) ≤ M ρ^t,

where p^t(·|x) is the distribution of X_t given X_0 = x.
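The total variation distance and the geometric decay in the definition above are easy to check numerically. A minimal sketch, using a hypothetical two-state chain whose numbers are purely illustrative:

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance between two finite distributions."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

# Hypothetical 2-state ergodic chain (row x = p(.|x)).
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Its stationary distribution solves mu p = mu, giving mu = (2/3, 1/3).
mu = np.array([2/3, 1/3])

# Uniform ergodicity: max_x d_TV(p^t(.|x), mu) decays geometrically in t.
dists = []
pt = np.eye(2)
for t in range(1, 20):
    pt = pt @ p                                          # t-step transition matrix
    dists.append(max(tv_distance(pt[x], mu) for x in range(2)))
```

For this chain the decay factor is the second eigenvalue of p (here 0.7), so the sequence `dists` shrinks by that factor at every step.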
Similar to [9], we assume that the Markov chain induced by π_b is uniformly ergodic. This property ensures that all states are visited infinitely often and that convergence to the stationary distribution occurs at a geometric rate. This property is needed for the result presented in Section 5.
The Q-learning algorithm. The Q-learning algorithm [35] is perhaps the most famous model-free algorithm for learning an optimal policy in unknown tabular MDPs. As a model-free method, it directly learns the optimal Q-function Q* of the MDP (without estimating P and R), from which a policy can be derived. The algorithm maintains an estimate Q_t of the optimal Q* at each time step t. Specifically, it starts from an arbitrary choice of Q_0 ∈ R^{S×A} and updates Q_t, at each t ≥ 0, as

Q_{t+1}(s_t, a_t) = (1 − α_t) Q_t(s_t, a_t) + α_t ( r_t + γ max_{b∈A} Q_t(s_{t+1}, b) ),   (1)

leaving Q_{t+1}(s, a) = Q_t(s, a) for all (s, a) ≠ (s_t, a_t), where α_t is a suitably chosen learning rate. Precisely speaking, the update Equation (1) corresponds to the asynchronous variant of Q-learning. The classical asymptotic performance analysis of Q-learning (in, for example, [36]) indicates that if (i) π_b is exploratory enough that all state-action pairs are visited infinitely often and (ii) (α_t)_{t≥0} satisfies the following conditions, known as the Robbins-Monro conditions [36,44]:

∑_{t≥0} α_t = ∞   and   ∑_{t≥0} α_t^2 < ∞,

then Q_t → Q* almost surely as t → ∞, for any choice of Q_0 ∈ R^{S×A}. For example, one such choice of learning rate is α_t = 1/(N_t(s_t, a_t) + 1), where N_t(s, a) counts the visits to (s, a) before time t. The pseudo-code of Q-learning is described in Algorithm 1, where the learning rate sequence (α_t)_{t≥0} is considered as input.
Algorithm 1: Q-learning
for t = 0, 1, 2, . . . do
    Sample action a_t ∼ π_b(s_t) and observe r_t ∼ R(s_t, a_t) and s_{t+1} ∼ P(·|s_t, a_t).
    Compute Q_{t+1} using (1).
end for
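A minimal executable sketch of asynchronous Q-learning as in Algorithm 1. The single-state MDP, the uniform behavior policy, and the visit-count learning rate are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def q_learning(P, R, gamma, T, rng):
    """Asynchronous Q-learning, update (1), with alpha_t = 1/(N_t(s,a)+1).

    P: (S, A, S) transition probabilities; R: (S, A) mean rewards
    (rewards are deterministic in this sketch). Behavior policy: uniform.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    N = np.zeros((S, A), dtype=int)
    s = 0
    for _ in range(T):
        a = rng.integers(A)                          # a_t ~ pi_b(s_t), uniform
        s_next = rng.choice(S, p=P[s, a])            # s_{t+1} ~ P(.|s_t, a_t)
        alpha = 1.0 / (N[s, a] + 1)
        target = R[s, a] + gamma * Q[s_next].max()   # r_t + gamma * max_b Q_t(s_{t+1}, b)
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        N[s, a] += 1
        s = s_next
    return Q

# Hypothetical single-state MDP with Q*(a0) = 1, Q*(a1) = 2 (gamma = 0.5).
P = np.ones((1, 2, 1))
R = np.array([[0.0, 1.0]])
Q_hat = q_learning(P, R, 0.5, T=20000, rng=np.random.default_rng(0))
```

With the 1/(N+1) rate, each entry of Q averages its past targets, which is what makes the Robbins-Monro conditions hold along each pair's update subsequence.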
It is worth remarking that some studies consider off-policy learning in the online setting, where data are collected from the environment while executing the algorithm. In such online settings, it is possible to choose a_t according to an adaptive (randomized) policy π_t (usually defined as a function of the current estimate Q_t), instead of sampling it from a fixed behavior policy. In doing so, the aim is to balance exploration and exploitation so as to collect higher rewards while learning the Q-function. A notable example is the ε-greedy policy, where at time t, a_t is chosen greedily with respect to Q_t(s_t, ·) with probability 1 − ε, and chosen uniformly at random (from A) with probability ε. In the theoretical part of this paper, we consider a fixed behavior policy.

Similarity and Equivalence Classes
We now present a definition of the equivalence structure considered in this paper. We start by stating the following definition of similarity in finite MDPs as introduced in [18]. A similar definition is provided in [23].

Definition 1 (ε-similarity [18]). Let ε ≥ 0. Two pairs (s, a), (s′, a′) ∈ S × A are ε-similar if there exist bijective mappings σ_{s,a} : S → S and σ_{s′,a′} : S → S such that

∑_{s″∈S} | P(σ_{s,a}(s″)|s, a) − P(σ_{s′,a′}(s″)|s′, a′) | ≤ ε.

We refer to σ_{s,a} as the profile mapping (or, for short, profile) for (s, a), and denote by σ = (σ_{s,a})_{s,a} the set of profile mappings across S × A.
We stress that σ_{s,a} in Definition 1 may not be unique in general. The case of 0-similarity is of particular interest: it is evident from Definition 1 that if (s, a) and (s′, a′) are 0-similar, then P(·|s, a) and P(·|s′, a′) are identical up to some permutation of S. Furthermore, 0-similarity induces a partition of the state-action space S × A, as formalized below.
Definition 2 (Equivalence structure [18]). 0-similarity is an equivalence relation and induces a canonical partition of S × A. We refer to such a canonical partition as equivalence structure and denote it by C. We further define C := |C|.
We provide an example to help understand Definition 2. Consider the RiverSwim environment [45] with 6 states and A = {L, R} (see Figure 1). The two pairs (s_1, R) and (s_6, R) are 0-similar since P(·|s_1, R) = [0.6, 0.4, 0, 0, 0, 0] and P(·|s_6, R) = [0, 0, 0, 0, 0.6, 0.4], so there exist permutations σ_{s_1,R} and σ_{s_6,R} such that P(σ_{s_1,R}(·)|s_1, R) = P(σ_{s_6,R}(·)|s_6, R). Additionally, all pairs (s_i, L), i = 1, . . . , 6 are 0-similar, and so are (s_i, R), i = 2, . . . , 5. We thus identify an equivalence structure C of S × A as follows:

C = { {(s_i, L) : i = 1, . . . , 6}, {(s_1, R), (s_6, R)}, {(s_i, R) : i = 2, . . . , 5} }.

Note that for any finite MDP, Definition 2 trivially holds with C = S × A. There are many interesting environments that non-trivially admit the notion of equivalence structure in Definition 2. In such MDPs, it is often the case that the size C of the structure is much smaller than the size of the state-action space, i.e., C ≪ SA. For example, in a generic RiverSwim with S states, one has C = 3. Another example admitting an equivalence structure is the classical grid-world MDP, which is detailed in Section 6.2.

Off-policy learning in MDPs with equivalence structures. In this work, we assume that the underlying MDP M admits an equivalence structure C as introduced above. In other words, the transition function P is such that S × A can be partitioned into C := |C| classes, where the pairs in each class c ∈ C are 0-similar. We make the following assumption regarding the agent's prior knowledge about C. Assumption 1. The agent has prior knowledge of C.
Let c(s, a) denote the class that the pair (s, a) belongs to. Assumption 1 implies that the agent knows c(s, a) for any pair (s, a) and the associated profile mapping σ_{s,a}. Note, however, that the agent does not know the actual transition probabilities. Armed with such prior knowledge, we are interested in devising a model-free algorithm that is capable of leveraging the structure in M to improve the learning performance. We expect that the corresponding speed-up in learning the optimal Q-function could be significant in MDPs with C ≪ SA. We also make the following assumption regarding the reward function to ease the presentation. (This assumption has often been made in the literature on theoretical RL (e.g., [6,46]), since the main challenge in RL arises from unknown transition probabilities.) Assumption 2. The agent knows the reward function R.
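Because 0-similarity means two transition vectors over the same S coincide up to a permutation, they are 0-similar exactly when their sorted probability vectors are equal, so the equivalence structure can be recovered by grouping sorted transition rows. A sketch on a 6-state RiverSwim-like MDP: only the probabilities of (s_1, R) and (s_6, R) come from the text above, while the left-action and middle-state probabilities are hypothetical placeholders:

```python
import numpy as np

def equivalence_classes(P):
    """Group (s, a) pairs whose transition vectors are equal up to permutation.

    Two distributions over the same S are permutation-equivalent iff their
    sorted probability vectors coincide. P has shape (S, A, S).
    """
    S, A, _ = P.shape
    classes = {}
    for s in range(S):
        for a in range(A):
            key = tuple(np.round(np.sort(P[s, a]), 10))
            classes.setdefault(key, []).append((s, a))
    return list(classes.values())

# 6-state RiverSwim-like MDP, actions L = 0, R = 1.
S, L, R = 6, 0, 1
P = np.zeros((S, 2, S))
for s in range(S):
    P[s, L, max(s - 1, 0)] = 1.0               # L: move left deterministically (placeholder)
P[0, R, 0], P[0, R, 1] = 0.6, 0.4              # (s1, R), from the text
P[5, R, 4], P[5, R, 5] = 0.6, 0.4              # (s6, R), from the text
for s in range(1, 5):                          # middle states: hypothetical values
    P[s, R, s - 1], P[s, R, s], P[s, R, s + 1] = 0.05, 0.6, 0.35

C = equivalence_classes(P)
```

On this instance the sketch recovers the three classes described above: all L pairs, the two boundary R pairs, and the four middle R pairs.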

The QL-ES Algorithm
In this section, we present a variant of Q-learning that exploits the equivalence structure in the environment to speed up the learning of the optimal Q-function. We call this algorithm QL-ES, which is short for 'Q-learning with equivalence structure'.
QL-ES follows the same machinery as Q-learning but is built on the idea that knowledge of C and the corresponding profile mappings allows the triplet (s_t, a_t, s_{t+1}) collected at each time t to update potentially multiple entries of Q_t. Precisely speaking, the Q-function update for a given pair (s, a) requires a sample from R(s, a) and a sample from P(·|s, a). Since the agent perfectly knows C, it can determine c(s_t, a_t), i.e., the class that (s_t, a_t) belongs to. Hence, it knows all other pairs belonging to the same class as (s_t, a_t). Then, using s_{t+1} (sampled from P(·|s_t, a_t)) at time t, the agent can construct samples for all other pairs in c(s_t, a_t) as follows. If (s, a) ∈ c(s_t, a_t), then, as σ_{s,a} and σ_{s_t,a_t} are known, the agent finds

s^{(sa)}_{t+1} := σ^{−1}_{s,a}( σ_{s_t,a_t}(s_{t+1}) ),

where σ^{−1} denotes the inverse mapping of σ. The state s^{(sa)}_{t+1} is distributed according to P(·|s, a); in other words, the sample s_{t+1} obtained from P(·|s_t, a_t) is equivalent to obtaining a fresh sample s^{(sa)}_{t+1} from P(·|s, a). Thus, s^{(sa)}_{t+1} acts as a counterfactual next state for (s, a) ∈ c(s_t, a_t), thanks to the knowledge of C.
The agent thus uses (s_t, a_t, s_{t+1}) to update Q_t for all (s, a) ∈ c(s_t, a_t). In summary, we update Q_t as follows: for all t ≥ 0 and all (s, a) ∈ c(s_t, a_t),

Q_{t+1}(s, a) = (1 − α_t) Q_t(s, a) + α_t ( R̄(s, a) + γ max_{b∈A} Q_t(s^{(sa)}_{t+1}, b) ),   (2)

where (α_t)_{t≥0} is a sequence of suitably chosen learning rates, as in (1), and the remaining entries of Q_t are left unchanged. The pseudo-code of QL-ES is provided in Algorithm 2.
Algorithm 2: QL-ES
for t = 0, 1, 2, . . . do
    Sample a_t ∼ π_b(s_t) and observe s_{t+1} ∼ P(·|s_t, a_t).
    Find c(s_t, a_t).
    for (s, a) ∈ c(s_t, a_t) do
        s^{(sa)}_{t+1} := σ^{−1}_{s,a}(σ_{s_t,a_t}(s_{t+1}))
        Compute Q_{t+1}(s, a) using (2).
    end for
end for

When the underlying MDP admits some equivalence structure, QL-ES performs multiple Q-function updates at each time slot, in contrast to structure-oblivious Q-learning, which only updates the Q-function at the current state-action pair. Thus, we expect learning the optimal Q-function under QL-ES to be faster than under Q-learning; this is corroborated by the numerical experiments in Section 6. It is also worth mentioning that QL-ES is never worse than Q-learning: for the trivial partition C = S × A, which holds for any finite MDP, QL-ES reduces to Q-learning.

Remark 1. The multiple updates used in QL-ES can be straightforwardly combined with many other variants of Q-learning, such as Speedy Q-learning [46] and UCB-QL [41].
We finally remark that some works in the literature on Q-learning use learning rates of the form α_t = f(N_t(s_t, a_t)), where N_t(s, a) := ∑_{τ=0}^{t−1} I{(s_τ, a_τ) = (s, a)} and where f is some suitable function satisfying the Robbins-Monro conditions, e.g., α_t = 1/(N_t(s_t, a_t) + 1). Such learning rates in the case of QL-ES can be modified to α_t = f(N_t(c(s_t, a_t))), where for any c ∈ C, N_t(c) := ∑_{τ=0}^{t−1} I{(s_τ, a_τ) ∈ c}.
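The pieces above can be put together in a minimal sketch of QL-ES: update (2) applied to every pair in the observed class, profile mappings encoded as permutation arrays, and the class-count learning rate α_t = 1/(N_t(c) + 1). The two-state MDP at the bottom and the σ encoding are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def qles(P, R_mean, gamma, classes, sigma, T, rng):
    """QL-ES sketch: one observed transition updates every pair in its class.

    sigma[(s, a)] maps each state to its "profile index", so that all pairs
    in a class share the same profile. The counterfactual next state for
    (u, b) is inv_sigma[(u, b)][sigma[(s_t, a_t)][s_{t+1}]].
    """
    S, A = R_mean.shape
    inv_sigma = {k: np.argsort(v) for k, v in sigma.items()}  # invert each permutation
    class_of = {pair: i for i, c in enumerate(classes) for pair in c}
    Q = np.zeros((S, A))
    N = np.zeros(len(classes), dtype=int)        # class visit counts N_t(c)
    s = 0
    for _ in range(T):
        a = rng.integers(A)                      # a_t ~ pi_b, uniform here
        s_next = rng.choice(S, p=P[s, a])
        c = class_of[(s, a)]
        alpha = 1.0 / (N[c] + 1)                 # alpha_t = f(N_t(c(s_t, a_t)))
        j = sigma[(s, a)][s_next]                # profile index of the observed sample
        for (u, b) in classes[c]:                # update (2) for every pair in the class
            s_cf = inv_sigma[(u, b)][j]          # counterfactual next state
            Q[u, b] += alpha * (R_mean[u, b] + gamma * Q[s_cf].max() - Q[u, b])
        N[c] += 1
        s = s_next
    return Q

# Hypothetical 2-state, 1-action MDP: deterministic swap, reward 1, gamma = 0.5,
# so Q* = 2 for both pairs, which form a single equivalence class.
P = np.zeros((2, 1, 2)); P[0, 0, 1] = 1.0; P[1, 0, 0] = 1.0
R_mean = np.ones((2, 1))
classes = [[(0, 0), (1, 0)]]
sigma = {(0, 0): np.array([1, 0]), (1, 0): np.array([0, 1])}
Q_hat = qles(P, R_mean, 0.5, classes, sigma, T=5000, rng=np.random.default_rng(0))
```

Note that both entries of the class are refreshed at every step, which is exactly the multi-update mechanism that distinguishes QL-ES from Algorithm 1.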

Theoretical Guarantee for QL-ES
In this section, we investigate the theoretical guarantee of QL-ES in terms of sample complexity in the PAC setting. Specifically, we are interested in characterizing the deviation between the optimal Q-function Q* and its estimate Q_T computed by QL-ES after T time steps. A relevant notion of deviation often studied in the literature (see, e.g., [37,38]) is the ℓ∞-distance between Q_T and Q*:

‖Q* − Q_T‖_∞ := max_{(s,a)∈S×A} |Q*(s, a) − Q_T(s, a)|,

which captures the worst error (with respect to Q*) among the various pairs. One may study the rate at which the error ‖Q* − Q_T‖_∞ decays as a function of T. Alternatively, one may characterize the PAC sample complexity, defined as the number T of steps needed until Q_T satisfies ‖Q* − Q_T‖_∞ ≤ ε with probability at least 1 − δ, for pre-specified ε and δ.
We consider the latter case. Let us first recall the classic definition of the cover time t_cover, which is a standard notion in the literature on Markov chains as well as in works studying theoretical guarantees of Q-learning (and its variants) [9,38,39]. Let t_1 ≥ 0, and let t_2 > t_1 denote the first time step such that all state-action pairs are visited at least once with probability at least 1/2. Then, the cover time t_cover is defined as the maximum value of t_2 − t_1 over all initial pairs (s_{t_1}, a_{t_1}). Note that t_cover depends on both the MDP M and the behavior policy π_b. More precisely, it depends on the mixing properties of the Markov chain induced by π_b on M. Further, we have t_cover ≥ SA.
Next, we introduce a notion of cover time for equivalence classes, which is relevant to the performance analysis of QL-ES. We believe it can be of independent interest. Definition 3. Let M be a finite MDP and C be an equivalence structure in M. Given t_1 ≥ 0, let t_2 > t_1 denote the first time step such that for each c ∈ C, some state-action pair in c is visited at least once with probability at least 1/2. Then, the cover time with respect to the equivalence structure C in M, denoted by t_cover,C, is defined as the maximum value of t_2 − t_1 over all initial choices of c(s_{t_1}, a_{t_1}) (i.e., the class the initial pair (s_{t_1}, a_{t_1}) belongs to).
The following theorem provides a non-asymptotic sample complexity bound for QL-ES. It concerns constant learning rates, i.e., α_t = α for all t ≥ 0, where α may depend on ε and δ, among other things. Theorem 1. There exist some universal constants κ_0, κ_1 such that for any δ ∈ (0, 1) and ε ∈ (0, 1/(1−γ)], we have ‖Q* − Q_T‖_∞ ≤ ε with probability greater than 1 − δ, provided that the number T of steps and learning rate α jointly satisfy

T ≥ κ_0 · ( t_cover,C / ((1 − γ)^5 ε^2) ) · log(SAT/δ) · log( 1/((1 − γ)^2 ε) )   and   α = κ_1 · (1 − γ)^4 ε^2 / log(SAT/δ).

A proof of this theorem is provided in Appendix A. Our proof is an adaptation of the proof of Theorem 2 in [9], which concerns the sample complexity of Q-learning. Comparison with the sample complexity of Q-learning. Theorem 1 tells us that the number of steps needed to have ‖Q* − Q_T‖_∞ ≤ ε with high probability scales with t_cover,C ε^{−2}(1 − γ)^{−5} (up to some logarithmic factors), where t_cover,C, defined in Definition 3, is the cover time with respect to C. Comparing this result against the sample complexity of Q-learning (e.g., Theorem 2 in [9]) reveals that using QL-ES yields an improvement over Q-learning by a factor of const. × ξ, where

ξ := ξ(M, C, π_b) := t_cover / t_cover,C.
This ratio ξ is a problem-dependent constant (depending on both M and (C, σ)). It also depends on the behavior policy π_b in view of the definitions of the cover times. It is evident that ξ ≥ 1 for any choice of M and C. For a given MDP M, the ratio ξ(M, C, π_b) can be numerically computed; we report numerical values of ξ for several domains in Section 6.3. On the other hand, deriving analytical bounds on the ratio ξ(M, C, π_b) for an arbitrary M appears to be complicated and tedious, if possible at all. Nonetheless, it is possible to construct simple problem instances where one can derive analytical bounds on ξ. Figure 2 portrays one such example; this example is a simple Markov chain but can easily be extended to become an MDP. Easy calculations show that t_cover = (S − 1)/δ and t_cover,C = 1, so that ξ = (S − 1)/δ (here, δ denotes a parameter of the chain in Figure 2, not the failure probability). Hence, one has ξ = O(S). This simple example demonstrates that the gain of QL-ES over Q-learning in some domains could be as large as O(S), the size of the state space. Additionally, Theorem 1 reveals that in such domains, the theoretical sample complexity bound of QL-ES does not depend on S but on C, the number of classes in C. We refer the reader to the results in Section 6.3, where we present numerical bounds on ξ in some MDPs that serve as standard domains in the RL literature.

Simulation Results
This section is devoted to reporting numerical experiments conducted to examine the performance of QL-ES against the (structure-oblivious) Q-learning algorithm. First, we present the considered evaluation metrics and environments. Then, we present a numerical assessment of ξ for these environments. Finally, we report the empirical sample complexities of QL-ES and Q-learning in these environments.

Evaluation Metrics
We consider two evaluation metrics in the experiments: (i) ℓ∞ Error, defined as ‖Q* − Q_t‖_∞; and (ii) Total Policy Error, defined as ‖π* − π^greedy_t‖_1 := ∑_{s∈S} I{π^greedy_t(s) ≠ π*(s)}, where π^greedy_t denotes the greedy policy w.r.t. Q_t, i.e., π^greedy_t(s) := arg max_a Q_t(s, a) for all s.
The metric (i), which is in line with the definition of sample complexity studied in Section 5, captures the maximum difference between Q_t and Q* over all state-action pairs and allows us to empirically study the convergence speed of Q_t toward Q*. The second metric captures the quality of the estimate Q_t in terms of the inferred policies. Evidently, the quantity ‖π* − π^greedy_t‖_1 returns the number of states at which π^greedy_t prescribes a sub-optimal action. Hence, the metric (ii) captures how bad the policy derived from Q_t (i.e., π^greedy_t) would be, compared to π*, had we stopped at time step t. Equivalently, we may compute the metric (ii) via

∑_{s∈S} I{ Q*(s, π^greedy_t(s)) < max_a Q*(s, a) }.   (4)

Working with (4) is preferred, as then one need not worry about how ties (in arg max) are broken when either π^greedy_t or π* is not unique.
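The two metrics, with ties handled via the form (4), can be sketched as follows; the toy Q-tables at the bottom are illustrative placeholders:

```python
import numpy as np

def linf_error(Q_star, Q_t):
    """Metric (i): worst-case entrywise error between Q_t and Q*."""
    return np.abs(Q_star - Q_t).max()

def total_policy_error(Q_star, Q_t):
    """Metric (ii) via the tie-robust form (4): count states where the
    greedy action under Q_t is strictly sub-optimal under Q*."""
    greedy = Q_t.argmax(axis=1)
    S = Q_star.shape[0]
    values_of_greedy = Q_star[np.arange(S), greedy]
    return int(np.sum(values_of_greedy < Q_star.max(axis=1) - 1e-12))

# Hypothetical 3-state, 2-action example. In state 1, Q* has a tie, so the
# greedy choice there is never counted as an error.
Q_star = np.array([[1.0, 2.0], [3.0, 3.0], [0.5, 0.0]])
Q_t    = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
```

On this example only state 0 is counted: its greedy action under Q_t has Q*-value 1 while the optimum is 2.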

Environments
We consider two environments: RiverSwim and GridWorld. These are classical MDPs widely used in the RL literature. Both are well suited to demonstrating the numerical performance of QL-ES, since each allows us to define a family of MDPs with progressive difficulty levels.
RiverSwim and variants. A generic RiverSwim MDP with L states is shown in Figure 1, which extends the classical 6-state RiverSwim presented in [45]. This MDP is constructed so that efficient exploration is required to obtain the optimal policy: the larger the number L of states, the more exploration is required. The L-state RiverSwim (with L ≥ 3) admits an equivalence structure with C = 3 regardless of L. We consider RiverSwim instances with various L so as to obtain MDPs with progressive difficulty levels while keeping a fixed number of classes. In some experiments, we consider a slightly modified version of RiverSwim, which we shall call Perturbed RiverSwim. It is identical to RiverSwim (Figure 1) except that in any state s_i, where i < L is even, P(s_i|s_i, R) = 0.65 and P(s_{i+1}|s_i, R) = 0.3. It is clear that there are C = 4 classes in an L-state Perturbed RiverSwim.
GridWorld. We also consider 2-room and 4-room grid-world MDPs with different grid sizes. Figure 3 shows a 7 × 7 2-room grid-world and a 9 × 9 4-room grid-world, respectively. In both environments, the agent starts at the upper-left corner (in red) and is supposed to reach the lower-right corner (in yellow), where it receives a reward of 1 and is then sent back to the initial red state. At each step, the agent has four possible actions (hence, A = 4): going up, left, down, or right. Black squares indicate walls, which the agent cannot pass through. After executing a given action, the agent stays in the same state with probability 0.1, moves in the desired direction with probability 0.7, and moves in the other two possible directions with probabilities 0.06 and 0.14, respectively. If a wall blocks the agent, it stays where it is, and the transition probability of the blocked next state is added to that of the current state. It is clear that the grid-world MDPs above admit some equivalence structure. In the case of the 2-room (respectively, 4-room) environment, the state-action space is of size 84 (respectively, 160), while the number of classes is 8 in both. In Table 1, we also present six examples of grid-world environments with walls defined in the way mentioned above. In the introduced 2-room and 4-room MDPs, the number of state-action pairs grows with the grid size, while the number of classes remains fixed.

Bounds on the Ratio ξ
We recall from Section 5 that the theoretical gain of QL-ES over Q-learning in terms of sample efficiency is captured by the problem-dependent quantity ξ = t_cover / t_cover,C. In this subsection, we compute ξ for the introduced environments with the aim of providing insights into the growth of ξ as the number S of states grows. Specifically, we consider RiverSwim, Perturbed RiverSwim (introduced in Section 6.2), and GridWorld MDPs, each with a growing number of states. In each case, we report empirical values for t_cover and t_cover,C together with the corresponding 95% confidence intervals. The empirical t_cover is computed as the median value (across 100 independent runs for every possible initial state-action pair) of the number of steps it takes to discover all state-action pairs starting from a given initial state-action pair. A similar procedure is used for t_cover,C.
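The estimation procedure above can be sketched as follows for RiverSwim under the behavior policy described below (R with probability 0.8). The transition probabilities and the fixed initial state are illustrative assumptions (the paper medians over all initial state-action pairs), and `rs_step` / `empirical_cover_time` are hypothetical helpers.

```python
import random
import statistics

L = 6  # 6-state RiverSwim

def rs_step(s, a):
    """One RiverSwim transition (illustrative probabilities)."""
    if a == 0:                       # action L: deterministic left
        return max(s - 1, 0)
    u = random.random()              # action R: 0.05 left / 0.6 stay / 0.35 right
    if u < 0.05:
        return max(s - 1, 0)
    if u < 0.65:
        return s
    return min(s + 1, L - 1)

def behavior(s):
    """Behavior policy: action R with probability 0.8, L otherwise."""
    return 1 if random.random() < 0.8 else 0

def empirical_cover_time(key, n_targets, start, n_runs=100):
    """Median number of steps until `key(s, a)` has taken all
    n_targets distinct values; key = identity gives t_cover, and
    key = class label would give t_cover,C."""
    times = []
    for _ in range(n_runs):
        s, seen, t = start, set(), 0
        while len(seen) < n_targets:
            a = behavior(s)
            seen.add(key(s, a))
            s = rs_step(s, a)
            t += 1
        times.append(t)
    return statistics.median(times)

# t_cover: all 2L state-action pairs must be visited
t_cov = empirical_cover_time(lambda s, a: (s, a), 2 * L, start=0, n_runs=20)
```

Passing a class-label function as `key` instead of the identity yields t_cover,C by the same procedure, which is what makes the two quantities directly comparable.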
Tables 2-4 summarize empirical values of t_cover and t_cover,C (together with the associated 95% confidence intervals denoted by CI) for RiverSwim, Perturbed RiverSwim, and 2-room GridWorld, respectively, with a varying number of states in each case. In the case of GridWorld, we ran a uniform agent (sampling each action uniformly), whereas in RiverSwim MDPs, the agent samples R (resp. L) with probability 0.8 (resp. 0.2).
These results reveal that t_cover,C is much smaller than t_cover in all cases. Furthermore, they indicate that while t_cover grows rapidly as S increases (in any family of the MDPs considered), t_cover,C experiences a much smaller growth. As for the ratio ξ, we report ξ_LCB as the lower confidence bound obtained by dividing the lower value in the CI for t_cover by the upper value in the CI for t_cover,C. This is a rather conservative estimate of the true ξ but ensures that ξ ≥ ξ_LCB with probability at least 0.9. The reported values demonstrate that in these environments, ξ grows rapidly as the size of the state space grows. This observation verifies that the theoretical gain of QL-ES over Q-learning can be significant.

Table 2. Empirical values of t_cover, t_cover,C, and ξ_LCB for RiverSwim with S states.

Table 3. Empirical values of t_cover, t_cover,C, and ξ_LCB for Perturbed RiverSwim with S states.

Table 4. Empirical values of t_cover, t_cover,C, and ξ_LCB for 2-room GridWorld with S states.

Experimental Results with Exact Equivalence Structure
We now turn to reporting experimental results for QL-ES and Q-learning in RiverSwim and GridWorld. In the following figures, QL indicates the standard Q-learning algorithm (Algorithm 1). We used a constant learning rate α = 0.05 and an ε-greedy behavior policy (with values of ε to be specified later). Furthermore, all results are averaged over 100 independent runs, and the corresponding 95% confidence intervals are shown. Figure 4 presents the max-norm Q-value error and the total policy error under both QL-ES and Q-learning in a 6-state RiverSwim, where we set γ = 0.9 and ε = 0.5. It is evident that QL-ES significantly outperforms Q-learning. The Q-value error under Q-learning decays at a very slow rate until about 4 × 10^5 steps; after this point, the decay rate increases tangibly. In contrast, the Q-value error under QL-ES decays much faster. Under Q-learning, the total policy error remains above 5 until time step 4 × 10^5, which implies that the optimal action has been learned in only one state. On the contrary, the total policy error under QL-ES drops to the vicinity of 0 very quickly. These results verify that the empirical gain of leveraging the equivalence structures in MDPs, in terms of the number of samples, can be significant.
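The QL-ES update compared here can be sketched as follows: on each observed transition, the standard Q-learning update is replayed for every state-action pair in the same equivalence class, with the next state translated through the class relabeling. The `mapped_next` interface and the dictionary-based class representation are illustrative assumptions, not the paper's exact implementation; with singleton classes and the identity mapping, the update reduces to standard Q-learning.

```python
import numpy as np

def ql_es_update(Q, classes, mapped_next, s, a, r, s_next,
                 alpha=0.05, gamma=0.9):
    """One QL-ES step on the observed transition (s, a, r, s_next).
    `classes[(s, a)]` lists all pairs equivalent to (s, a), and
    `mapped_next(pair, s, a, s_next)` returns the next state as seen
    from `pair` under the class relabeling sigma (a hypothetical
    interface standing in for the paper's sigma maps)."""
    for (s2, a2) in classes[(s, a)]:
        s2_next = mapped_next((s2, a2), s, a, s_next)
        td = r + gamma * Q[s2_next].max() - Q[s2, a2]   # TD error
        Q[s2, a2] += alpha * td                         # replayed update
```

The constants α = 0.05 and γ = 0.9 match the experimental setup described above; the design point is that a single environment sample produces one update per pair in the class, which is where the sample-efficiency gain comes from.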
To demonstrate the scalability of QL-ES, in Figure 5, we present the Q-value error under QL-ES in RiverSwim instances with 6, 20, and 40 states. As the figure shows, although the error in the 6-state RiverSwim starts decaying much earlier than the others, all of them exhibit a similar rate of decay. Moreover, the curves corresponding to the 20-state and 40-state instances are almost indistinguishable. This result showcases MDPs where the sample complexity of QL-ES does not scale with the size of the state-action space and is mostly determined by the number of classes. As in the RiverSwim MDPs, QL-ES significantly outperforms Q-learning in the grid-world environments. For Q-learning, the Q-value error remains considerable, even after 10^6 samples. Although neither QL-ES nor Q-learning fully learns an optimal policy by the end of the run, the total policy error decays much faster under QL-ES. Overall, the results demonstrate that exploiting equivalence structure is beneficial in grid-world MDPs. Comparing Figures 6 and 7, it is evident that QL-ES retains its advantage over Q-learning as the state space grows. Moreover, similar trends are expected when conducting this experiment in larger grid-world MDPs.

The Gain in the Case of θ-Similar Pairs
We now investigate the case where the MDP may not admit an exact equivalence structure but exhibits θ-similarity across its state-action space; see Definition 1. To this end, we introduce Modified RiverSwim and Modified GridWorld, obtained from the 6-state RiverSwim (Figure 1) and a grid-world instance by slightly perturbing their transition probabilities, so that the corresponding state-action pairs are only θ-similar. Figure 8 shows the results in Modified RiverSwim (with γ = 0.95) and Modified GridWorld (γ = 0.85). It is evident that in both cases, QL-ES still achieves a smaller Q-value error than Q-learning.

The Impact of Partially Using the Structure

For MDPs with huge state-action spaces, it is necessary to consider the feasibility of using only a few equivalent pairs. This naturally raises the question of whether using only a few equivalent pairs still yields a reasonable performance gain. We therefore investigate the convergence speed when, at each time step, only a subset of the equivalent state-action pairs is updated, rather than all state-action pairs in the same class.
As shown in Figure 9, we can still obtain reasonable performance when considering only a few equivalent pairs. The numbers in parentheses indicate how many equivalent pairs are used. In RiverSwim with 6 states (SA = 12), the performance of QLES(3) and QLES(4) is comparable to QL-ES. Interestingly, we already observe a significant improvement over Q-learning with QLES(1), i.e., when using only one additional observation in the Q-learning update. Meanwhile, the total policy error is less than or close to one. This shows that the optimal policy is correctly learned in almost all states.
In 20-state RiverSwim (SA = 40), algorithms with few equivalent pairs can still achieve excellent performance, albeit not as good as QL-ES. Additionally, as Figure 10 shows, using more equivalent pairs leads to better sample efficiency in MDPs with a large state space. Among QLES(3)-QLES(12), the Q-value error shrinks as more equivalent pairs are used.
From the policy perspective, even QLES(3) and QLES(6) manage to find optimal actions significantly faster than Q-learning. In addition, QLES(10) is far superior to QLES(6) owing to its considerably smaller Q-value error and total policy error. The results show that algorithms using a suitable number of equivalent pairs suffice to learn an optimal policy in most states reasonably fast.
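A plausible reading of QLES(k) is that, at each step, only k of the equivalent pairs are updated alongside the observed one. The uniform sampling rule below is an assumption, as the text does not pin down how the subset is chosen, and `qles_k_pairs` is a hypothetical helper.

```python
import random

def qles_k_pairs(classes, s, a, k):
    """Pick at most k equivalent pairs to update alongside (s, a),
    sampled uniformly without replacement from the class of (s, a).
    `classes[(s, a)]` lists all pairs in that class."""
    peers = [p for p in classes[(s, a)] if p != (s, a)]
    chosen = random.sample(peers, min(k, len(peers)))
    return [(s, a)] + chosen          # observed pair always included
```

Feeding the returned subset (instead of the full class) into the QL-ES update interpolates between standard Q-learning (k = 0) and full QL-ES (k covering the whole class).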

Conclusions
We studied off-policy learning in discounted Markov decision processes, where some equivalence structure exists in the state-action space. We presented a model-free algorithm, QL-ES, which is a natural extension of classical asynchronous Q-learning capable of exploiting the equivalence structure. We established a high-probability sample complexity bound for QL-ES and discussed how it improves on that of Q-learning. As demonstrated, there exist problem instances on which the improvement over Q-learning is a multiplicative factor of S, the size of the state space. Through extensive numerical experiments in standard domains, we demonstrated that QL-ES significantly improves over (structure-oblivious) Q-learning. These results revealed that exploiting state-action equivalence speeds up convergence of both the Q-function and the learned policy in large MDPs. A limitation of our approach is the need for prior knowledge of the structure. To the best of our knowledge, existing methods for learning the equivalence structure are all model-based. Hence, an interesting question is whether the equivalence structure can be exploited, without prior knowledge of it, using only model-free algorithms. Devising such model-free algorithms (or otherwise establishing an impossibility result) is an interesting yet challenging topic for future work. Another interesting direction is to investigate ways to combine knowledge of the equivalence structure with function approximation methods. Finally, another avenue for future work is to study model-free algorithms for the regret minimization setting in average-reward MDPs (e.g., [47,48]).
Let us introduce the shorthand c_t := c(s_t, a_t) for any t ≥ 0. To analyze the QL-ES update in matrix form, we use Q_t for the Q-value matrix (of size S × A) and P for the transition matrix (of size SA × S). Note that Λ_t ∈ [0, 1)^{SA×SA} is diagonal and that P_t ∈ [0, 1]^{SA×S}. Here V_t ∈ R^S denotes the vector of value estimates at time t, whose s-th element is given by max_a Q_t(s, a). Finally, we are interested in bounding Δ_t = Q_t − Q*.
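Under these definitions, and viewing Q_t as a vector in R^{SA}, the update follows the standard matrix-form recursion of asynchronous Q-learning in [9]; the display below is a sketch of that standard form (the paper's exact equation may differ):

```latex
Q_{t+1} = (I - \Lambda_t)\, Q_t + \Lambda_t \bigl( r + \gamma P_t V_t \bigr),
\qquad
V_t(s) = \max_{a} Q_t(s, a),
```

where Λ_t places the learning rate α on the diagonal entries corresponding to the pairs updated at step t (in QL-ES, all pairs in the class c_t) and 0 elsewhere.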
The following lemma showcases the role of the cover time; its proof is given in Appendix B.1.

Lemma A1. Define the events K_l = {∃ c ∈ C s.t. c is not visited within iterations (l t_cover,all, (l + 1) t_cover,all]} and set L = ⌈T / t_cover,all⌉. Then P(∪_{l=0}^{L} K_l) ≤ δ.
Appendix A.2. Proof of Theorem 1

As established in the proof of Theorem 2 in [9] (see Equation (39) there), we have a one-step recursion on Δ_t which, using an inductive argument, yields the claimed bound. The proof of Lemma A5 relies once again on computational arguments and an induction, as showcased in Appendix B.4.
Finally, the following theorem is the last milestone of the proof, obtained from the two previous lemmas.
An easy recursive derivation gives P(H_1 ∩ · · · ∩ H_L) ≤ 2^{−L}. Finally,

P(∃ c ∈ C not visited within (0, t_cover,all]) ≤ P(H_1 ∩ · · · ∩ H_{⌈log₂(C/δ)⌉}),

from which we easily deduce the result by using a straightforward union bound.

Appendix B.2. Proof of Lemma A2
The proof is an adaptation of the proof of Lemma 1 in [9] to the Q-value updates in QL-ES. As in that proof, we begin by looking at the (s, a)-th element of β_{1,t}:

β_{1,t}(s, a) = Σ_{k=1}^{K_t(s,a)} (1 − α)^{K_t(s,a)−k} α (P_{t_k(s,a)+1}(s, a) − P(s, a)) V,

where t_k(s, a) denotes the time step when c(s, a) is visited for the k-th time, and K_t(s, a) denotes the number of times (s, a) is updated during the first t time steps: K_t(s, a) = max{k : t_k(s, a) ≤ t}. Below, we simplify the notation by writing t_k for t_k(s, a). Now, we claim that for all (s, a) ∈ S × A,

|β_{1,t}(s, a)| ≤ γ √(α log(CT/δ)) ‖V‖_∞.
Firstly, the vectors P_{t_k+1}(s, a), k = 1, …, K, are independent and identically distributed for any (s, a) ∈ S × A and any K ∈ N. Indeed, for any i_1, …, i_K ∈ S,

P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K)
= P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K − 1, and σ^{-1}_{s,a}(σ_{s_{t_K},a_{t_K}}(s_{t_K+1})) = i_K)
= Σ_{m∈N} P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K − 1, and t_K = m, and σ^{-1}_{s,a}(σ_{s_m,a_m}(s_{m+1})) = i_K)
(i)= Σ_{m∈N} P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K − 1, and t_K = m) × P(σ^{-1}_{s,a}(σ_{s_m,a_m}(s_{m+1})) = i_K, (s, a) ∈ c_m)
(ii)= P_{s,a}(i_K) Σ_{m∈N} P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K − 1, and t_K = m)
= P_{s,a}(i_K) P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K − 1),

where (i) uses the Markov property together with (s_{t_K}, a_{t_K}) ∈ c(s, a) (which follows from the definition of t_K = t_K(s, a)), (ii) uses the equivalence between (s, a) and (s_m, a_m), and P_{s,a} denotes the transition probability from (s, a). By induction, one obtains

P(σ^{-1}_{s,a}(σ_{s_{t_k},a_{t_k}}(s_{t_k+1})) = i_k, ∀ 1 ≤ k ≤ K) = ∏_{j=1}^{K} P_{s,a}(i_j),

which proves the sought independence. Then, following identical lines as in the proof of Lemma 1 in [9], we can use Hoeffding's inequality to bound β_{1,t}, which yields

|Σ_{k=1}^{K} (1 − α)^{K−k} α (P_{t_k+1}(s, a) − P(s, a)) V| ≤ √(α log(CT/δ)) ‖V‖_∞.

The proof is concluded by taking the union bound over all classes c ∈ C and all 1 ≤ K ≤ T.
Using (A15), we obtain a recursion on v_t; with the definition u_t = ‖v_t‖_∞, we arrive at the desired result.

Appendix B.4. Proof for Lemma A5
This proof follows the same steps as the proof of Lemma 4 in [9], except that the quantities N_i^n(s, a) used there should be defined as the number of visits to the equivalence class of the state-action pair (s, a) between iteration i and iteration n (including both i and n).