Entropic Regularization of Markov Decision Processes

Belousov, Boris; Peters, Jan

doi:10.3390/e21070674

by Boris Belousov ¹,* and Jan Peters ¹,²

¹ Department of Computer Science, Technische Universität Darmstadt, 64289 Darmstadt, Germany

² Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany

* Author to whom correspondence should be addressed.

Entropy 2019, 21(7), 674; https://doi.org/10.3390/e21070674

Submission received: 14 June 2019 / Revised: 6 July 2019 / Accepted: 8 July 2019 / Published: 10 July 2019

(This article belongs to the Special Issue Entropy Based Inference and Optimization in Machine Learning)

Abstract

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss measured by the Kullback–Leibler (KL) divergence at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely $α$-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. Such an entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson $χ^{2}$-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function f. On a concrete instantiation of our framework with the $α$-divergence, we carry out asymptotic analysis of the solutions for different values of $α$ and demonstrate the effects of the divergence function choice on standard reinforcement learning problems.

Keywords: maximum entropy reinforcement learning; actor-critic methods; f-divergence; KL control

1. Introduction

Sequential decision-making problems under uncertainty are described by the mathematical framework of Markov decision processes (MDPs) [1]. The core problem in MDPs is to find an optimal policy—a mapping from states to actions which maximizes the expected cumulative reward collected by an agent over its lifetime. In reinforcement learning (RL), the agent is additionally assumed to have no prior knowledge about the environment dynamics and the reward function [2]. Therefore, direct policy optimization in the RL setting can be seen as a form of stochastic black-box optimization: the agent proposes a query point in the form of a policy, the environment evaluates this point by computing the expected return, after that the agent updates the proposal and the process repeats [3]. There are two conceptual steps in this scheme known as policy evaluation and policy improvement [4]. Both steps require function approximation in high-dimensional and continuous state-action spaces due to the curse of dimensionality [4]. Therefore, statistical learning approaches are employed to approximate the value function of a policy and to perform policy improvement based on the data collected from the environment.

In contrast to traditional supervised learning, in reinforcement learning, the data distribution changes with every policy update. State-of-the-art generalized policy iteration algorithms [5,6,7,8] are mindful of this covariate shift problem [9], taking active measures to account for it. To smoothen the learning dynamics, these algorithms limit the information loss between successive policy updates as measured by the KL divergence or approximations thereof [10]. In the optimization literature, such approaches are categorized as proximal (or trust region) algorithms [11].

The choice of the divergence function determines the geometry of the information manifold [12]. Recently, in particular in the area of implicit generative modeling [13], the choice of the divergence function was shown to have a dramatic effect both on the optimization performance [14] and the perceptual quality of the generated data when various f-divergences were employed [15]. In this paper, we carry over the idea of using generalized entropic proximal mappings [16] given by an f-divergence to reinforcement learning. We show that relative entropy policy search [6], framed as an instance of stochastic mirror descent [17,18] as suggested by [10], can be extended to use any divergence measure from the family of f-divergences. The resulting algorithm provides insights into the compatibility of policy and value function update rules in actor-critic architectures, which we exemplify on several instantiations of the generic f-divergence with representatives from the parametric family of $α$-divergences [19,20,21].

2. Background

This section provides the necessary background on policy gradients [3] and entropic penalties [16] for later derivations and analysis. Standard RL notation [22] is used throughout.

2.1. Policy Gradient Methods

Policy search algorithms [3] commonly use the gradient estimator of the following form [23]

\hat{g} = {\hat{E}}_{t} [\nabla_{θ} log π_{θ} {\hat{A}}_{t}^{w}]

(1)

where

π_{θ} (a | s)

is a stochastic policy and

{\hat{A}}_{t}^{w} (s_{t}, a_{t})

is an estimator of the advantage function at timestep t. Expectation

{\hat{E}}_{t} [\dots]

indicates an empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization. The advantage estimate

{\hat{A}}_{t}^{w}

in (1) can be obtained from an estimate of the value function [24,25], which in its turn is found by least-squares estimation. Specifically, if

V^{w} (s)

denotes a parametric value function, and if

{\hat{V}}_{t} = \sum_{k = 0}^{\infty} γ^{k} R_{t + k}

is taken as its rollout-based estimate, then the parameters w can be found as

w = arg min_{\tilde{w}} {\hat{E}}_{t} [∥ V^{\tilde{w}} (s_{t}) - {\hat{V}}_{t} ∥^{2}] .

(2)

The advantage estimate

{\hat{A}}_{t}^{w} = \sum_{k = 0}^{\infty} γ^{k} δ_{t + k}^{w}

is then obtained by summing the temporal difference errors

δ_{t}^{w} = R_{t} + γ V^{w} (s_{t + 1}) - V^{w} (s_{t})

, also known as the Bellman residuals. Treating

{\hat{A}}_{t}^{w}

as fixed for the purpose of policy improvement, we can view (1) as the gradient of an advantage-weighted log-likelihood; therefore, the policy parameters

θ

can be found as

θ = arg max_{\tilde{θ}} {\hat{E}}_{t} [log π_{\tilde{θ}} {\hat{A}}_{t}^{w}] .

(3)

Thus, actor-critic algorithms that use the gradient estimator (1) to update the policy can be viewed as instances of the generalized policy iteration scheme, alternating between policy evaluation (2) and policy improvement (3). In the following, we will see that the actor-critic pair (2) and (3), that combines least-squares value function fitting with linear-in-the-advantage-weighted maximum likelihood policy improvement, is just one representative from a family of such actor-critic pairs arising for different choices of the f-divergence penalty within our entropic proximal policy optimization framework.
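As a minimal sketch of how the advantage estimate is assembled from temporal difference errors, the following NumPy snippet computes the residuals $δ_{t}^{w}$ and their discounted sums on a single finite rollout. The function names, the truncation of the infinite sum at the end of the rollout, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """delta_t = R_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0), ..., V(s_T), i.e. one more entry than `rewards`.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def discounted_sums(x, gamma):
    """out[t] = sum_k gamma^k * x[t + k], truncated at the end of the rollout."""
    out = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        out[t] = running
    return out

def advantage_estimates(rewards, values, gamma):
    """A_t = sum_k gamma^k * delta_{t+k} on a finite rollout (assumed truncation)."""
    return discounted_sums(td_residuals(rewards, values, gamma), gamma)

# Toy rollout with three transitions (hypothetical numbers).
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 1.0, -1.0, 0.0])
print(advantage_estimates(rewards, values, gamma=0.9))
```

Because the residuals telescope, each entry equals the discounted return from that step plus the discounted bootstrap value at the end of the rollout, minus the value at the current state.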

2.2. Entropic Penalties

The term entropic penalties [16] refers to both f-divergences and Bregman divergences. In this paper, we will focus on f-divergences, leaving generalization to Bregman divergences for future work. The f-divergence [26] between two distributions P and Q with densities p and q is defined as

D_{f} (p ∥ q) = E_{q} [f (\frac{p}{q})]

where f is a convex function on $(0, \infty)$ with $f (1) = 0$ and P is assumed to be absolutely continuous with respect to Q. For example, the KL divergence corresponds to $f_{1} (x) = x log x - (x - 1)$, with the formula also applicable to unnormalized distributions [27]. Many common divergences lie on the curve of $α$-divergences [19,20] defined by a special choice of the generator function [21]

f_{α} (x) = \frac{(x^{α} - 1) - α (x - 1)}{α (α - 1)}, α \in R .

(4)

The $α$-divergence $D_{α} = D_{f_{α}}$ will be used as the primary example of the f-divergence throughout the paper. For more details on the $α$-divergence and its properties, see Appendix A. Noteworthy is the symmetry of the $α$-divergence with respect to $α = 0.5$, which relates reverse divergences as $D_{0.5 + β} (p ∥ q) = D_{0.5 - β} (q ∥ p)$.
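The generator (4) and the symmetry about $α = 0.5$ can be checked numerically on discrete distributions. The snippet below is a small sketch; the function names are illustrative, and the value used at $α = 0$ is the standard limit of (4), $f_{0} (x) = - log x + (x - 1)$, which is not spelled out in the text above.

```python
import numpy as np

def f_alpha(x, alpha):
    """Generator of the alpha-divergence, Equation (4), with its limits at 0 and 1."""
    x = np.asarray(x, dtype=float)
    if alpha == 1:
        return x * np.log(x) - (x - 1.0)      # KL generator f_1
    if alpha == 0:
        return -np.log(x) + (x - 1.0)         # limit alpha -> 0 (assumed here)
    return ((x ** alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) = E_q[f_alpha(p / q)] for strictly positive discrete p, q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f_alpha(p / q, alpha)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# Symmetry D_{0.5 + beta}(p || q) = D_{0.5 - beta}(q || p), here with beta = 1.5.
print(alpha_divergence(p, q, 2.0), alpha_divergence(q, p, -1.0))
```

For $α = 2$ the generator reduces to $(x - 1)^{2} / 2$, so the first printed number is half the Pearson $χ^{2}$ statistic of p against q.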

3. Entropic Proximal Policy Optimization

Consider the average-reward RL setting [2], where the dynamics of an ergodic MDP are given by the transition density

p (s^{'} | s, a)

. An intelligent agent can modulate the system dynamics by sampling actions a from a stochastic policy

π (a | s)

at every time step of the evolution of the dynamical system. The resulting modulated Markov chain with transition kernel

p_{π} (s^{'} | s) = \int_{A} p (s^{'} | s, a) π (a | s) d a

converges to a stationary state distribution

μ_{π} (s)

as time goes to infinity. This stationary state distribution induces a state-action distribution

ρ_{π} (s, a) = μ_{π} (s) π (a | s)

, which corresponds to visitation frequencies of state-action pairs [1]. The goal of the agent is to steer the system dynamics to desirable states. Such objective is commonly encoded by the expectation of a random variable

R : S \times A \to R

called reward in this context. Thus, the agent seeks a policy that maximizes the expected reward

J (π) = E_{ρ_{π} (s, a)} [R (s, a)]

.

In reinforcement learning, neither the reward function R nor the system dynamics

p (s^{'} | s, a)

are assumed to be known. Therefore, to maximize (or even evaluate) the objective

J (π)

, the agent must sample a batch of experiences in the form of tuples

(s, a, r, s^{'})

from the dynamics and use an empirical estimate

\hat{J} = {\hat{E}}_{t} [R (s_{t}, a_{t})]

as a surrogate for the original objective. Since the gradient of the expected reward with respect to the policy parameters can be written as [28]

\nabla_{θ} J (π_{θ}) = E_{ρ_{π_{θ}} (s, a)} [\nabla_{θ} log π_{θ} (a | s) R (s, a)]

with a corresponding sample-based counterpart

\nabla_{θ} \hat{J} = {\hat{E}}_{t} [\nabla_{θ} log π_{θ} (a_{t} | s_{t}) R (s_{t}, a_{t})],

one may be tempted to optimize a sample-based objective

{\hat{E}}_{t} [log π_{θ} (a_{t} | s_{t}) R (s_{t}, a_{t})]

on a fixed batch of data

{{(s, a, r, s^{'})}_{t}}_{t = 1}^{N}

till convergence. However, such an approach ignores the fact that sampling distribution

ρ_{π_{θ}} (s, a)

itself depends on the policy parameters

θ

; therefore, such greedy optimization aims at a wrong objective [6]. To have the correct objective, the dataset must be sampled anew after every parameter update—doing otherwise will lead to overfitting and divergence. This problem is known in statistics as the covariate shift problem [9].
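To make the score-function form of the gradient concrete, here is a small sketch for a stateless problem with a softmax policy over a finite set of actions. The stateless simplification, the softmax parameterization, and all names are assumptions for illustration; the text itself treats general state-action spaces.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def exact_gradient(theta, rewards):
    """Gradient of J(theta) = sum_a pi_theta(a) R(a) for a softmax policy."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def score_function_gradient(theta, rewards, actions):
    """Sample average of grad log pi_theta(a_t) * R(a_t) over a batch of actions."""
    pi = softmax(theta)
    # For a softmax policy, grad_theta log pi(a) = e_a - pi.
    scores = -np.tile(pi, (len(actions), 1))
    scores[np.arange(len(actions)), actions] += 1.0
    return (scores * rewards[actions][:, None]).mean(axis=0)

theta = np.log(np.array([0.5, 0.25, 0.25]))
rewards = np.array([1.0, 2.0, 4.0])
# A batch whose empirical action frequencies match pi_theta exactly.
actions = np.array([0, 0, 1, 2])
print(exact_gradient(theta, rewards))
print(score_function_gradient(theta, rewards, actions))
```

The two gradients agree only because the batch was drawn from the same policy whose gradient is being evaluated. After a parameter update the old batch no longer matches the new policy, which is the covariate shift issue described above.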

3.1. Fighting Covariate Shift via Trust Regions

A principled way to account for the change in the sampling distribution at every policy update step is to construct an auxiliary local objective function that can be safely optimized till convergence. Relative entropy policy search (REPS) algorithm [6] proposes a candidate for such an objective

J_{η} (π) = E_{ρ_{π}} [R] - η D_{1} (ρ_{π} ∥ ρ_{π_{0}})

(5)

with

π_{0}

being the current policy under which the data samples were collected, policy

π

being the improvement policy that needs to be found, and

η > 0

being a ‘temperature’ parameter that determines how much the next policy can deviate from the current one. The original formulation employs a relative entropy trust region constraint

D_{1}

with radius

ε

instead of a penalty, which allows for finding the optimal temperature

η

as a function of the trust region radius

ε

.

Importantly, the objective function (5) can be optimized in closed form for policy

π

(i.e., treating the policy itself as a variable and not its parameters, in contrast to standard policy gradients). To that end, several constraints on

ρ_{π}

are added to ensure stationarity with respect to the given MDP [6]. In a similar vein, we can solve Problem (5) with respect to

π

for any f-divergence with a twice differentiable generator function f.

3.2. Policy Optimization with Entropic Penalties

Following the intuition of REPS, we introduce an f-divergence penalized optimization problem that the learning agent must solve at every policy iteration step

\begin{matrix} \underset{π}{maximize} & J_{η} (π) = E_{ρ_{π}} [R] - η D_{f} (ρ_{π} ∥ ρ_{π_{0}}) \\ subject to & \int_{A} ρ_{π} (s^{'}, a^{'}) d a^{'} = \int_{S \times A} ρ_{π} (s, a) p (s^{'} | s, a) d s d a, \forall s^{'} \in S, \\ \int_{S \times A} ρ_{π} (s, a) d s d a = 1, \\ ρ_{π} (s, a) \geq 0, \forall (s, a) \in S \times A . \end{matrix}

(6)

The agent seeks a policy that maximizes the expected reward and does not deviate from the current policy too much. The first constraint in (6) ensures that the policy is compatible with the system dynamics, and the latter two constraints ensure that

π

is a proper probability distribution. Please note that

π

enters Problem (6) indirectly through

ρ_{π}

. Since the objective has the form of free energy [29] in

ρ_{π}

with an f-divergence playing the role of the usual KL, the solution can be expressed through the derivative of the convex conjugate function

f_{*}^{'}

, as shown for general nonlinear problems in [16],

ρ_{π} (s, a) = ρ_{π_{0}} (s, a) f_{*}^{'} (\frac{R (s, a) + \int_{S} V (s^{'}) p (s^{'} | s, a) d s^{'} - V (s) - λ + κ (s, a)}{η}) .

(7)

Here,

{V (s), λ, κ (s, a)}

are the Lagrange dual variables corresponding to the three constraints in (6), respectively. Although we get a closed-form solution for

ρ_{π}

, we still need to solve the dual optimization problem to get the optimal dual variables

\begin{matrix} \underset{V, λ, κ}{minimize} & g (V, λ, κ) = η E_{ρ_{π_{0}}} [f_{*} (\frac{A^{V} (s, a) - λ + κ (s, a)}{η})] + λ \\ subject to & κ (s, a) \geq 0, \forall (s, a) \in S \times A, \\ arg f_{*} \in {range}_{x \geq 0} f^{'} (x), \forall (s, a) \in S \times A . \end{matrix}

(8)

Remarkably, the advantage function $A^{V} (s, a) = R (s, a) + \int_{S} V (s^{'}) p (s^{'} | s, a) d s^{'} - V (s)$ emerges automatically in the dual objective. The advantage function also appears in the penalty-free linear programming formulation of policy improvement [1], which corresponds to the zero-temperature limit $η \to 0$ of our formulation. Thanks to the fact that the dual objective in (8) is given as an expectation with respect to $ρ_{π_{0}}$, it can be straightforwardly estimated from rollouts. The last constraint in (8) on the argument of $f_{*}$ is easy to evaluate for common $α$-divergences. Indeed, the convex conjugate $f_{α}^{*}$ of the generator function (4) is given by

f_{α}^{*} (y) = \frac{1}{α} {(1 + (α - 1) y)}^{\frac{α}{α - 1}} - \frac{1}{α}, for y (1 - α) < 1 .

(9)

Thus, the constraint on $arg f_{*}$ in (8) is just a linear inequality $y (1 - α) < 1$ for any $α$-divergence.
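The conjugate (9) can be sanity-checked numerically: its derivative should invert $f_{α}^{'}$, and the Fenchel equality $f_{α}^{*} (f_{α}^{'} (x)) = x f_{α}^{'} (x) - f_{α} (x)$ should hold. The sketch below does this for a few values of $α$ outside the limiting cases 0 and 1; the function names are illustrative.

```python
import numpy as np

def f(x, alpha):
    """Generator (4), for alpha not in {0, 1}."""
    return ((x ** alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def f_prime(x, alpha):
    """Derivative of the generator (4)."""
    return (x ** (alpha - 1.0) - 1.0) / (alpha - 1.0)

def in_domain(y, alpha):
    """Linear constraint on the argument of the conjugate: y (1 - alpha) < 1."""
    return y * (1.0 - alpha) < 1.0

def conj(y, alpha):
    """Convex conjugate (9), for alpha not in {0, 1}."""
    return ((1.0 + (alpha - 1.0) * y) ** (alpha / (alpha - 1.0)) - 1.0) / alpha

def conj_prime(y, alpha):
    """Derivative of (9); it should be the inverse function of f_prime."""
    return (1.0 + (alpha - 1.0) * y) ** (1.0 / (alpha - 1.0))

for alpha in (-1.0, 0.5, 2.0, 3.0):
    for x in (0.5, 1.0, 2.0):
        y = f_prime(x, alpha)
        print(alpha, x, in_domain(y, alpha), conj_prime(y, alpha),
              conj(y, alpha) - (x * y - f(x, alpha)))
```

For $α = 2$ the derivative of the conjugate is $1 + y$, and in the limit $α \to 1$ it tends to $exp (y)$, which are the linear and exponential cases singled out later in the text.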

3.3. Value Function Approximation

For small grid-world problems, one can solve Problem (8) exactly for

V (s)

. However, for larger problems or if the state space is continuous, one must resort to function approximation. Assume we plug an expressive function approximator

V^{w} (s)

in (8), then vector w becomes a new vector of parameters in the dual objective. Later, it will be shown that minimizing the dual when

η \to \infty

is closely related to minimizing the mean squared Bellman error.

3.4. Sample-Based Algorithm for Dual Optimization

To solve Problem (8) in practice, we gather a batch of samples from policy

π_{0}

and replace the expectation in the objective with a sample average. Please note that in principle one also needs to estimate the expectation of the future rewards

\int_{S} V (s^{'}) p (s^{'} | s, a) d s^{'}

. However, since the probability of visiting the same state-action pair in continuous space is zero, one commonly estimates this integral from a single sample [3], which is equivalent to assuming deterministic system dynamics. Inequality constraints in (8) are linear and they must be imposed for every

(s, a)

pair in the dataset.
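A minimal sketch of the sample-based dual objective is given below for a tabular value function and an $α$-divergence penalty. It uses the single-sample estimate of the next-state value described above and, as a simplifying assumption, sets the multipliers $κ (s, a)$ to zero; the function names and toy transitions are hypothetical.

```python
import numpy as np

def conjugate(y, alpha):
    """Convex conjugate (9) of the alpha-divergence generator, alpha not in {0, 1}."""
    return ((1.0 + (alpha - 1.0) * y) ** (alpha / (alpha - 1.0)) - 1.0) / alpha

def empirical_dual(v, lam, states, rewards, next_states, eta, alpha):
    """Sample average of the dual objective in (8) with kappa = 0 (assumption).

    v           : array of tabular values V(s)
    states      : visited state indices s_t
    next_states : observed successor indices s'_t (single-sample estimate)
    """
    adv = rewards + v[next_states] - v[states]   # average-reward advantage, no discount
    y = (adv - lam) / eta
    if np.any(y * (1.0 - alpha) >= 1.0):         # linear domain constraint violated
        return np.inf
    return eta * np.mean(conjugate(y, alpha)) + lam

v = np.array([0.0, 1.0, -0.5])
states = np.array([0, 1, 2, 0])
next_states = np.array([1, 2, 0, 2])
rewards = np.array([1.0, 0.0, 2.0, -1.0])

for lam in (0.3, 0.375, 0.45):
    print(lam, empirical_dual(v, lam, states, rewards, next_states, eta=5.0, alpha=2.0))
```

With $α = 2$ the objective reduces to the mean advantage plus the mean squared deviation of the advantage from $λ$ divided by $2 η$, so over $λ$ it is minimized at the mean advantage (0.375 for these toy numbers).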

3.5. Parametric Policy Fitting

Assume Problem (8) is solved on a current batch of data sampled from

π_{0}

and thus the optimal dual variables

{V (s), λ, κ (s, a)}

are given. Equation (7) allows one to evaluate the new density

ρ_{π} (s, a)

on any pair

(s, a)

from the dataset. However, it does not yield the new policy

π

directly because representation (7) is variational. A common approach [3] is to assume that the policy is represented by a parameterized conditional density

π_{θ} (a | s)

and fit this density to the data using maximum likelihood.

To fit a parametric density

π_{θ} (a | s)

to the true solution

π (a | s)

given by (7), we minimize the KL divergence

D_{1} (ρ_{π} ∥ ρ_{π_{θ}})

. Minimization of this KL is equivalent to maximization of the weighted maximum likelihood

\hat{E} [f_{*}^{'} (\dots) log ρ_{π_{θ}}]

. Unfortunately, distribution

ρ_{π_{θ}} (s, a) = μ_{π_{θ}} (s) π_{θ} (a | s)

is in general not known because

μ_{π_{θ}} (s)

does not only depend on the policy but also on the system dynamics. Assuming the effect of policy parameters on the stationary state distribution is small [3], we arrive at the following optimization problem for fitting the policy parameters

θ = arg max_{\tilde{θ}} {\hat{E}}_{t} [log π_{\tilde{θ}} (a_{t} | s_{t}) f_{*}^{'} (\frac{{\hat{A}}^{w} (s_{t}, a_{t}) - λ + κ (s_{t}, a_{t})}{η})] .

(10)

Compare our policy improvement step (10) to the commonly used advantage-weighted maximum likelihood (ML) objective (3). They look surprisingly similar (especially if

f_{*}^{'} (y) = y

is a linear function), which is not a coincidence and will be systematically explained in the next sections.
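For a tabular policy, the weighted maximum likelihood problem (10) has a simple closed-form solution: each $π (a | s)$ is the share of the total weight observed in state s that falls on action a. The sketch below assumes the weights $f_{*}^{'} (\dots)$ have already been computed and are non-negative; the tabular parameterization, the uniform fallback for unvisited states, and the names are illustrative assumptions.

```python
import numpy as np

def weighted_ml_tabular_policy(states, actions, weights, n_states, n_actions):
    """Maximize sum_t weights[t] * log pi(actions[t] | states[t]) over tabular pi."""
    totals = np.zeros((n_states, n_actions))
    # Accumulate the weight of every (state, action) pair in the batch.
    np.add.at(totals, (states, actions), weights)
    row_sums = totals.sum(axis=1, keepdims=True)
    uniform = np.full((n_states, n_actions), 1.0 / n_actions)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    # States without any weight keep a uniform policy (assumed fallback).
    return np.where(row_sums > 0, totals / safe, uniform)

states = np.array([0, 0, 0, 1])
actions = np.array([0, 1, 1, 0])
weights = np.array([1.0, 2.0, 1.0, 3.0])   # stand-ins for f_*'((A - lam + kappa) / eta)
print(weighted_ml_tabular_policy(states, actions, weights, n_states=3, n_actions=2))
```

With parametric policies such as neural networks, the same objective would typically be maximized by gradient ascent rather than in closed form.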

3.6. Temperature Scheduling

The ‘temperature’ parameter $η$ trades off reward vs divergence, as can be seen in the objective function in Problem (6). In practice, devising a schedule for $η$ may be hard because $η$ is sensitive to reward scaling and policy parameterization. A more intuitive way to impose the f-divergence proximity condition is by adding it as a constraint $D_{f} (ρ_{π} ∥ ρ_{π_{0}}) \leq ε$ with a fixed $ε$ and then treating the temperature $η \geq 0$ as an optimization variable. Such a formulation is easy to incorporate into the dual (8) by adding a term $η ε$ to the objective and a constraint $η \geq 0$ to the list of constraints. A constraint-based formulation was successfully used before with a KL divergence constraint [6] and with its quadratic approximation [5,7].
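The link between the trust region radius $ε$ and the temperature $η$ can be illustrated in the simplest setting: a stateless problem with a KL penalty, where the improved policy is proportional to $π_{0} (a) exp (Q (a) / η)$. Instead of minimizing the dual with the extra $η ε$ term, the sketch below uses a plainly different technique, bisection on $η$, relying on the fact that the KL divergence between the improved and the current policy decreases monotonically as $η$ grows. All names, the search interval, and the toy values are assumptions.

```python
import numpy as np

def kl_policy_update(pi0, q, eta):
    """Improved policy for the KL penalty: pi(a) proportional to pi0(a) exp(q(a) / eta)."""
    logits = np.log(pi0) + (q - q.max()) / eta
    w = np.exp(logits - logits.max())
    return w / w.sum()

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def temperature_for_radius(pi0, q, eps, lo=1e-3, hi=1e3, iters=200):
    """Find eta such that KL(pi_eta || pi0) is approximately eps (bisection in log scale)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl_divergence(kl_policy_update(pi0, q, mid), pi0) > eps:
            lo = mid    # step too large: raise the temperature
        else:
            hi = mid    # step within the trust region: lower the temperature
    return hi

pi0 = np.full(4, 0.25)
q = np.array([0.1, 0.5, -0.3, 1.0])
eta = temperature_for_radius(pi0, q, eps=0.1)
print(eta, kl_divergence(kl_policy_update(pi0, q, eta), pi0))
```

A fixed $ε$ has the same meaning regardless of how the rewards are scaled, whereas the matching $η$ changes with the reward scale, which is one reason the constrained form is easier to tune.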

3.7. Practical Algorithm for Continuous State-Action Spaces

Our proposed approach for entropic proximal policy optimization is summarized in Algorithm 1. Following the generalized policy iteration scheme, we (i) collect data under a given policy, (ii) evaluate the policy by solving (8), and (iii) improve the policy by solving (10). In the following section, several instantiations of Algorithm 1 with different choices of function f will be presented and studied.

Algorithm 1: Primal-dual entropic proximal policy optimization with function approximation

4. High- and Low-Temperature Limits; $α$-Divergences; Analytic Solutions and Asymptotics

How does the f-divergence penalty influence policy optimization? How should one choose the generator function f? What role does the step size play in optimization? This section will try to answer these and related questions. First, two special choices of the penalty function f are presented, which reveal that the common practice of using mean squared Bellman error minimization coupled with advantage reweighted policy update is equivalent to imposing a Pearson

χ^{2}

-divergence penalty. Second, high- and low-temperature limits are studied, on one hand revealing the special role the Pearson

χ^{2}

-divergence plays, being the high-temperature limit of all smooth f-divergences, and on the other hand establishing a link to the linear programming formulation of policy search as the low-temperature limit of our entropic penalty-based framework.

4.1. KL Divergence ($α = 1$) and Pearson $χ^{2}$-Divergence ($α = 2$)

As can be deduced from the form of (10), great simplifications occur when $f_{*}^{'} (y)$ is a linear function ($α = 2$, see (9)) or an exponential function ($α = 1$). The fundamental reason for such simplifications lies in the fact that linear and exponential functions are homomorphisms with respect to addition. This allows, in particular, finding a closed-form solution for the dual variable $λ$ and thus eliminating it from the optimization. Moreover, in these two special cases, the dual variables $κ (s, a)$ can also be eliminated. They are responsible for non-negativity of probabilities: when $α = 1$ (KL), $κ (s, a) = 0$ uniformly for all $η \geq 0$; when $α = 2$ (Pearson), $κ (s, a) = 0$ for sufficiently big $η$. Table 1 gives the corresponding empirical actor-critic optimization objective pairs. A generic primal-dual actor-critic algorithm with an $α$-divergence penalty performs two steps

\begin{matrix} (step 1 : policy evaluation) & \underset{w}{minimize} & {\hat{g}}_{α} (w) \\ (step 2 : policy improvement) & \underset{θ}{maximize} & {\hat{L}}_{α} (θ) \end{matrix}

inside a policy iteration loop. It is worth comparing the explicit formulas in Table 1 to the customarily used objectives (2) and (3). To make the comparison fair, notice that (2) and (3) correspond to the discounted infinite horizon formulation with discount factor $γ \in (0, 1)$ whereas the formulas in Table 1 are derived for the average-reward setting. In general, the difference between these two settings can be ascribed to an additional baseline that must be subtracted in the average-reward setting [2]. In our derivations, the baseline corresponds to the dual variable $λ$, as in the classical linear programming formulation of policy iteration [1], and it automatically gets subtracted from the advantage (see (8)).

Mean Squared Error Minimization with Advantage Reweighting is Equivalent to Pearson Penalty

The baseline for $α = 2$ is given by the average advantage $λ_{2} = {\hat{E}}_{t} [{\hat{A}}^{w} (s_{t}, a_{t})]$, which also equals the average return in our setting [1,2]. Therefore, to translate the formulas from Table 1 to the discounted infinite horizon form (2) and (3), we need to remove the baseline and add discounting to the advantage; that is, set $A^{w} (s, a) = R (s, a) + γ \int_{S} V^{w} (s^{'}) p (s^{'} | s, a) d s^{'} - V^{w} (s)$. Then the dual objective

{\hat{g}}_{2} (w) \propto {\hat{E}}_{t} [{({\hat{A}}^{w} (s_{t}, a_{t}))}^{2}]

(11)

is proportional to the average squared advantage. Naive optimization of (11) leads to the family of residual gradient algorithms [30,31]. However, if the same Monte Carlo estimate of the value function is used as in (2), then (11) and (2) are exactly equivalent. The same holds for the Pearson actor

{\hat{L}}_{2} (θ) \propto {\hat{E}}_{t} [log π_{θ} (a_{t} | s_{t}) {\hat{A}}^{w} (s_{t}, a_{t})]

(12)

and the standard policy improvement (3) provided that $η = {\hat{E}}_{t} [{\hat{A}}^{w} (s_{t}, a_{t})]$. That means (12) is equivalent to (3) if the weight of the divergence penalty is equal to the expected return.

4.2. High- and Low-Temperature Limits

In the previous subsection, we established a direct correspondence between the least-squares value function fitting coupled with the advantage-weighted maximum likelihood policy parameters estimation (2) and (3) and the dual-primal pair of optimization problems (11) and (12) arising from our Algorithm 1 for the special choice of the Pearson $χ^{2}$-divergence penalty. In this subsection, we will show that this is not a coincidence but a manifestation of the fundamental fact that the Pearson $χ^{2}$-divergence is the quadratic approximation of any smooth f-divergence about unity.

4.2.1. High Temperatures: All Smooth f-Divergences Tend Towards Pearson $χ^{2}$-Divergence

There are two ways to show the independence of the primal-dual solution (8)–(10) from the choice of the divergence penalty: either exactly solve an approximate problem or approximate the exact solution of the original problem. In the first case, the penalty is replaced with its Taylor expansion at $η \to \infty$, which turns out to be the Pearson $χ^{2}$-divergence, and then the derivation becomes equivalent to the natural policy gradient derivation [5]. In the second case, the exact solution (8)–(10) is expanded in a Taylor series: for big $η$, dual variables $κ (s, a)$ can be dropped if $ρ_{π_{0}} (s, a) > 0$, which yields

f_{*} (\frac{A^{w} (s, a) - λ}{η}) = f_{*} (0) + \frac{A^{w} (s, a) - λ}{η} f_{*}^{'} (0) + \frac{1}{2} {(\frac{A^{w} (s, a) - λ}{η})}^{2} f_{*}^{″} (0) + o (\frac{1}{η^{2}}) .

(13)

By definition of the f-divergence, the generator function f satisfies the condition $f (1) = 0$. Without loss of generality [32], one can impose an additional constraint $f^{'} (1) = 0$ for convenience. Such a constraint ensures that the graph of the function $f (x)$ lies entirely in the upper half-plane, touching the x-axis at a single point $x = 1$. From the definition of the convex conjugate $f_{*}^{'} = {(f^{'})}^{- 1}$, we can deduce that $f_{*}^{'} (0) = 1$ and $f_{*} (0) = 0$; by rescaling, it is moreover possible to set $f^{″} (1) = f_{*}^{″} (0) = 1$. These properties are automatically satisfied by the $α$-divergence, which can be verified by a direct computation. With this in mind, it is straightforward to see that substitution of (13) into (8) yields precisely the quadratic objective ${\hat{g}}_{2} (w)$ from Table 1, the difference being of the second order in $1 / η$.

To obtain the asymptotic policy update objective, one can expand (10) in the high-temperature limit $η \to \infty$ and observe that it equals ${\hat{L}}_{2} (θ)$ from Table 1 with the difference being of the second order in $1 / η$. Therefore, it is established that the choice of the divergence function plays a minor role for big temperatures (small policy update steps). Since this is the mode in which the majority of iterative algorithms operate, our entropic proximal policy optimization point of view provides a rigorous justification for the common practice of using the mean squared Bellman error objective for value function fitting and the advantage-weighted maximum likelihood objective for policy improvement.
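The claim that every $α$-divergence behaves like the Pearson case at high temperature can be checked directly on the conjugate (9): for small arguments $y = (A - λ) / η$, $f_{α}^{*} (y)$ should agree with $y + y^{2} / 2$ up to third-order terms. The sketch below evaluates the gap for several values of $α$; the closed forms used at $α = 0$ and $α = 1$ are the standard limits of (9), and the names are illustrative.

```python
import numpy as np

def conjugate(y, alpha):
    """Convex conjugate (9) of the alpha-divergence generator, with limits at 0 and 1."""
    if alpha == 0:
        return -np.log1p(-y)          # limit alpha -> 0
    if alpha == 1:
        return np.expm1(y)            # limit alpha -> 1 (KL): exp(y) - 1
    return ((1.0 + (alpha - 1.0) * y) ** (alpha / (alpha - 1.0)) - 1.0) / alpha

def pearson_quadratic(y):
    """Second-order expansion shared by all alpha: f_*(0) = 0, f_*'(0) = f_*''(0) = 1."""
    return y + 0.5 * y ** 2

for alpha in (-2.0, 0.0, 0.5, 1.0, 2.0, 3.0):
    gaps = [abs(conjugate(y, alpha) - pearson_quadratic(y)) for y in (1e-1, 1e-2, 1e-3)]
    print(alpha, gaps)
```

The gap shrinks roughly a thousandfold each time y is reduced tenfold, consistent with a cubic remainder, and it vanishes identically for $α = 2$, where the quadratic is exact.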

4.2.2. Low Temperatures: Linear Programming Formulation Emerges in the Limit

Setting $η$ to a small number is equivalent to allowing large policy update steps because $η$ is the weight of the divergence penalty in the objective function (6). Such a regime is rather undesirable in reinforcement learning because of the covariate shift problem mentioned in the introduction. Problem (6) for $η \to 0$ turns into a well-studied linear programming formulation [1,10] that can be readily applied if the model $\{p (s^{'} | s, a), R (s, a)\}$ is known.

It is not straightforward to derive the asymptotics of policy evaluation (8) and policy improvement (10) for a general smooth f-divergence in the low-temperature limit $η \to 0$ because the dual variables $κ (s, a)$ do not disappear, in contrast to the high-temperature limit (13). However, for the KL divergence penalty (see Table 1), one can show that the policy evaluation objective $g_{1} (w)$ tends towards the supremum of the advantage, $g_{1} (w) \to {sup}_{s, a} A^{w} (s, a)$; the optimal policy is deterministic, $π (a | s) \to δ (a - arg {sup}_{b} A^{w} (s, b))$, therefore $L (θ) \to log π_{θ} (\bar{a} | \bar{s})$ with $(\bar{s}, \bar{a}) = arg {sup}_{s^{'}, a^{'}} A^{w} (s^{'}, a^{'})$.

5. Empirical Evaluations

To develop an intuition regarding the influence of the entropic penalties on policy improvement, we first consider a simplified version of the reinforcement learning problem—namely the stochastic multi-armed bandit problem [33]. In this setting, our algorithm is closely related to the family of Exp3 algorithms [34], originally motivated by the adversarial bandit problem. Subsequently, we evaluate our approach in the standard reinforcement learning setting.

5.1. Illustrative Experiments on Stochastic Multi-Armed Bandit Problems

In the stochastic multi-armed bandit problem [33], at every time step

t \in {1, \dots, T}

, an agent chooses among K actions

a \in A

. After every choice

a_{t} = a

, it receives a noisy reward

R_{t} = R (a_{t})

drawn from a distribution with mean

Q (a)

. The goal of the agent is to maximize the expected total reward

J = E [\sum_{t = 1}^{T} R_{t}]

. Given the true values

Q (a)

, the optimal strategy is to always choose the best action,

a_{t}^{*} = {arg max}_{a} Q (a)

. However, due to the lack of knowledge, the agent faces the exploration-exploitation dilemma. A generic way to encode the exploration-exploitation trade-off is by introducing a policy

π_{t}

, i.e., a distribution from which the agent draws actions

a_{t} \sim π_{t}

. Thus, the question becomes: given the current policy

π_{t}

and the current estimate of action values

{\hat{Q}}_{t}

, what should the policy

π_{t + 1}

at the next time step be? Unlike the choice of the best action under perfect information, such sampling policies are hard to derive from first principles [35].

We apply our generic Algorithm 1 to the stochastic multi-armed bandit problem to illustrate the effects of the divergence choice. The value function disappears because there is no state and no system dynamics in this problem. Therefore, the estimate

{\hat{Q}}_{t}

plays the role of the advantage, and the dual optimization (8) is performed only with respect to the remaining Lagrange multipliers.
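A stateless version of the policy update can be sketched as follows for the $α$-divergence: the new policy is $π_{0} (a) f_{*}^{'} ((\hat{Q} (a) - λ + κ (a)) / η)$, with $λ$ chosen so that the probabilities sum to one. The code below is an illustration under stated assumptions rather than the authors' implementation: it finds $λ$ by bisection, and it models the effect of the multipliers $κ$ by clipping the base of the power at zero, so that actions whose weight would become negative receive zero probability. It also assumes a strictly positive current policy.

```python
import numpy as np

def conj_grad(y, alpha):
    """Derivative of the conjugate (9), with the base clipped at zero (assumed role of kappa)."""
    if alpha == 1:
        return np.exp(y)
    base = np.maximum(1.0 + (alpha - 1.0) * y, 0.0)
    with np.errstate(divide="ignore"):
        # For alpha < 1 a zero base yields inf, which the bisection treats as "lambda too small".
        return base ** (1.0 / (alpha - 1.0))

def bandit_policy_update(pi0, q, eta, alpha, iters=200):
    """One proximal policy update for a bandit with arm-value estimates q."""
    pi0 = np.asarray(pi0, dtype=float)
    q = np.asarray(q, dtype=float)
    # Since f_*' is increasing with f_*'(0) = 1, the normalizing lambda lies in [min q, max q].
    lo, hi = q.min(), q.max()
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        total = np.sum(pi0 * conj_grad((q - lam) / eta, alpha))
        if total > 1.0:
            lo = lam
        else:
            hi = lam
    pi = pi0 * conj_grad((q - hi) / eta, alpha)
    return pi / pi.sum()   # remove residual bisection error

pi0 = np.full(3, 1.0 / 3.0)
q = np.array([0.0, 1.0, 5.0])
for alpha in (0.5, 1.0, 2.0):
    print(alpha, bandit_policy_update(pi0, q, eta=0.5, alpha=alpha))
```

In this toy run the $α = 2$ update removes the two weaker arms entirely, while $α = 0.5$ and $α = 1$ keep every arm at a positive probability.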

5.1.1. Effects of $α$ on Policy Improvement

Figure 1 shows the effects of the

α

-divergence choice on policy updates. We consider a 10-armed bandit problem with arm values

Q (a) \sim N (0, 1)

and keep the temperature fixed at

η = 2

for all values of

α

. Several iterations starting from an initial uniform policy are shown in the figure for comparison. Extremely large positive and negative values of

α

result in

ε

-elimination and

ε

-greedy policies, respectively. Small values of

α

, in contrast, weigh actions according to their values. Policies for

α < 1

are peaked and heavy-tailed, eventually turning into

ε

-greedy policies when

α \to - \infty

. Policies for

α \geq 1

are more uniform, but they put zero mass on bad actions, eventually turning into

ε

-elimination policies when

α \to \infty

. For

α \geq 1

, policy iteration may spend a lot of time in the end deciding between two best actions, whereas for

α < 1

the final convergence is faster.

5.1.2. Effects of $α$ on Regret

The average regret $C_{n} = n Q_{\max} - \mathbb{E}[\sum_{t = 0}^{n - 1} R_{t}]$ is shown in Figure 2 for different values of $α$ as a function of the time step $n$ with 95% confidence error bars. The performance of the UCB algorithm [33] is also shown for comparison. The presented results are obtained in a 20-armed bandit environment where rewards have Gaussian distribution $R(a) \sim \mathcal{N}(Q(a), 0.5)$. Arm values are estimated from observed rewards and the policy is updated every 20 time steps. The temperature parameter $η$ is decreased starting from $η = 1$ after every policy update according to the schedule $η^{+} = β η$ with $β = 0.8$. Results are averaged over 400 runs. In general, extreme $α$’s accumulate more regret. However, they eventually focus on a single action and flatten out. Small $α$’s accumulate less regret, but they may keep exploring sub-optimal actions longer. Values of $α \in [0, 2]$ perform comparably with UCB after around 400 steps, once reliable estimates of values have been obtained.
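The two quantities that define this protocol, the regret curve and the temperature schedule, are simple to compute. The sketch below is illustrative only; the function names are hypothetical. It computes the single-run counterpart of $C_{n}$ from a reward sequence, and the expectation in the definition would then be approximated by averaging such curves over independent runs (400 in the text).

```python
def cumulative_regret(arm_values, rewards):
    """Single-run counterpart of C_n = n * Q_max - sum_{t < n} R_t, for n = 1..len(rewards)."""
    q_max = max(arm_values)
    regret, total = [], 0.0
    for n, reward in enumerate(rewards, start=1):
        total += reward
        regret.append(n * q_max - total)
    return regret

def temperature_schedule(eta0, beta, num_updates):
    """Temperatures for successive policy updates under the rule eta+ = beta * eta."""
    etas, eta = [], eta0
    for _ in range(num_updates):
        etas.append(eta)
        eta *= beta
    return etas

# With the settings quoted in the text (eta = 1 initially, beta = 0.8):
etas = temperature_schedule(1.0, 0.8, num_updates=5)
```

Because the policy is updated every 20 time steps, the $k$-th entry of `etas` would apply to time steps $20k$ through $20k + 19$ under this reading.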

Figure 3 shows the average regret after a given number of time steps as a function of the divergence type $α$. As can be seen from the figure, smaller values of $α$ result in lower regret. Large negative $α$’s correspond to $ε$-greedy policies, which oftentimes prematurely converge to a sub-optimal action, failing to discover the optimal action for a long time if the exploration probability $ε$ is small. Large positive $α$’s correspond to $ε$-elimination policies, which may by mistake completely eliminate the best action or spend a lot of time deciding between two options at the end of learning, accumulating more regret. The optimal value of the parameter $α$ depends on the time horizon for which the policy is being optimized. As the horizon grows, the minimum of the curves shifts from slightly negative $α$’s towards the range $α \in [0, 2]$.

5.2. Empirical Evaluations on Ergodic MDPs

We evaluate our policy iteration algorithm with f-divergence on standard grid-world reinforcement learning problems from OpenAI Gym [36]. The environments that terminate or have absorbing states are restarted during data collection to ensure ergodicity. Figure 4 demonstrates the learning dynamics on different environments for various choices of the divergence function. Parameter settings and other implementation details can be found in Appendix B. In summary, one can either promote risk-averse behavior by choosing $α < 0$, which may, however, result in sub-optimal exploration, or one can promote risk-seeking behavior with $α > 1$, which may lead to overly aggressive elimination of options. Our experiments suggest that the optimal balance should be found in the range $α \in [0, 1]$. It should be noted that the effect of the $α$-divergence on policy iteration is not linear and not symmetric with respect to $α = 0.5$, contrary to what one could have expected given the symmetry of the $α$-divergence as a function of $α$. For example, switching from $α = -3$ to $α = -2$ may have little effect on policy iteration, whereas switching from $α = 3$ to $α = 4$ may have a much more pronounced influence on the learning dynamics.

6. Related Work

Apart from computational advantages, information-theoretic approaches provide a solid framework for describing and studying aspects of intelligent behavior [37], from autonomy [38] and curiosity [39] to bounded rationality [40] and game theory [41].

Entropic proximal mappings were introduced in [16] as a general framework for constructing approximation and smoothing schemes for optimization problems. Problem formulation (6) presented here can be considered an application of this general theory to policy optimization in Markov decision processes. Following the recent work [10], which establishes links between the KL-divergence-regularized policy iteration algorithms popular in reinforcement learning [6,7] and the stochastic mirror descent algorithm well known in optimization [17,18], one can view our Algorithm 1 as an analog of mirror descent with an f-divergence penalty.

Concurrent works [42,43] consider similar regularized formulations, although in the policy space instead of the state-action distribution space and in the infinite horizon discounted setting instead of the average-reward setting. The $α$-divergence in its entropic form, i.e., when the base measure is a uniform distribution, was used in several papers under the name Tsallis entropy [44,45,46,47], where its sparsifying effect was exploited in large discrete action spaces.

An alternative proximal reinforcement learning scheme was introduced in [48] based on the extragradient method for solving variational inequalities and leveraging operator splitting techniques. Although the idea of exploiting proximal maps and updates in the primal and dual spaces is similar to ours, regularization in [48] is applied in the value function space to smoothen generalized TD learning algorithms, whereas we study regularization in the primal space.

7. Conclusions

We presented a framework for deriving actor-critic algorithms as pairs of primal-dual optimization problems resulting from regularization of the standard expected return objective with so-called entropic penalties in the form of an f-divergence. Several examples with $α$-divergence penalties have been worked out in detail. In the limit of small policy update steps, all f-divergences with a twice differentiable generator function f are approximated by the Pearson $χ^{2}$-divergence, which was shown to yield the pair of actor-critic updates most commonly used in reinforcement learning. Thus, our framework provides a sound justification for the common practice of minimizing mean squared Bellman error in the policy evaluation step and fitting policy parameters by advantage-weighted maximum likelihood in the policy improvement step.

In future work, incorporating non-differentiable generator functions, such as the absolute value that corresponds to the total variation distance, may provide a principled explanation for the empirical success of algorithms not accounted for by our current smooth f-divergence framework, such as the proximal policy optimization algorithm [8]. Establishing a tighter connection between online convex optimization that employs Bregman divergences and reinforcement learning will likely both yield a deeper understanding of the optimization dynamics in RL and allow for improved practical algorithms building on the firm foundation of optimization theory.

Author Contributions

Conceptualization, B.B. and J.P.; investigation, B.B. and J.P.; software, B.B.; supervision, J.P.; writing, B.B. and J.P.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 640554.

Acknowledgments

We thank Hany Abdulsamad for many insightful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

This section provides the background on the f-divergence, the $α$-divergence, and the convex conjugate function, highlighting the key properties required for our derivations.

The f-divergence [26,49,50] generalizes many similarity measures between probability distributions [32]. For two distributions $π$ and $q$ on a finite set $\mathcal{A}$, the f-divergence is defined as

$$D_{f}(π \| q) = \sum_{a \in \mathcal{A}} q(a) f\left(\frac{π(a)}{q(a)}\right),$$

where $f$ is a convex function on $(0, \infty)$ such that $f(1) = 0$. For example, the KL divergence corresponds to $f_{KL}(x) = x \log x$. Please note that $π$ must be absolutely continuous with respect to $q$ to avoid division by zero, i.e., $q(a) = 0$ implies $π(a) = 0$ for all $a \in \mathcal{A}$. We additionally assume $f$ to be continuously differentiable, which includes all cases of interest for us. The f-divergence can be generalized to unnormalized distributions. For example, the generalized KL divergence [27] corresponds to $f_{1}(x) = x \log x - (x - 1)$. The derivations in this paper benefit from employing unnormalized distributions and subsequently imposing the normalization condition as a constraint.
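The definition above translates directly into code. The following sketch is illustrative and its function names are hypothetical. It evaluates $D_{f}(π \| q)$ on a finite set for an arbitrary generator and uses the generalized KL generator $f_{1}$ as an example. Setting $f_{1}(0) = 1$ is an assumption made here by continuity, since $f$ is defined only on $(0, \infty)$.

```python
import math

def f_divergence(pi, q, f):
    """D_f(pi || q) = sum_a q(a) * f(pi(a) / q(a)) for distributions on a finite set."""
    total = 0.0
    for p_a, q_a in zip(pi, q):
        if q_a == 0.0:
            if p_a != 0.0:
                raise ValueError("pi must be absolutely continuous with respect to q")
            continue  # terms with q(a) = pi(a) = 0 contribute nothing
        total += q_a * f(p_a / q_a)
    return total

def f_kl(x):
    """Generalized KL generator f_1(x) = x log x - (x - 1); f_1(0) = 1 by continuity."""
    return 1.0 if x == 0.0 else x * math.log(x) - (x - 1.0)

pi = [0.5, 0.5]
q = [0.25, 0.75]
kl = f_divergence(pi, q, f_kl)
```

For normalized `pi` and `q`, the extra term $-(x - 1)$ sums to zero, so `kl` coincides with the usual $\sum_{a} π(a) \log (π(a) / q(a))$.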

The $α$-divergence [19,20] is a one-parameter family of f-divergences generated by the $α$-function $f_{α}(x)$ with $α \in \mathbb{R}$. The particular choice of the family of functions $f_{α}$ is motivated by generalization of the natural logarithm [21]. The $α$-logarithm $\log_{α}(x) = (x^{α - 1} - 1) / (α - 1)$ is a power function for $α \neq 1$ that turns into the natural logarithm for $α \to 1$. Replacing the natural logarithm in the derivative of the KL divergence $f_{1}'(x) = \log x$ by the $α$-logarithm and integrating $f_{α}'$ under the condition that $f_{α}(1) = 0$ yields the $α$-function

$$f_{α}(x) = \frac{(x^{α} - 1) - α (x - 1)}{α (α - 1)}. \tag{A1}$$
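For readers who want to check the integration step, it can be written out as follows for $α \notin \{0, 1\}$. Integrating from $1$ enforces the condition $f_{α}(1) = 0$; the integration variable $t$ is introduced here only for this verification.

```latex
\begin{aligned}
f_{α}'(x) &= \log_{α}(x) = \frac{x^{α - 1} - 1}{α - 1}, \\
f_{α}(x)  &= \int_{1}^{x} \frac{t^{α - 1} - 1}{α - 1} \, dt
           = \frac{1}{α - 1} \left[ \frac{x^{α} - 1}{α} - (x - 1) \right]
           = \frac{(x^{α} - 1) - α (x - 1)}{α (α - 1)} .
\end{aligned}
```

The cases $α = 1$ and $α = 0$ follow as limits and give the KL and reverse KL generators listed in Table A1.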

The $α$-divergence generalizes the KL divergence, reverse KL divergence, Hellinger distance, Pearson $χ^{2}$-divergence, and Neyman (reverse Pearson) $χ^{2}$-divergence. Figure A1 displays well-known $α$-divergences as points on the parabola $y = α (α - 1)$. For every divergence, there is a reverse divergence symmetric with respect to the point $α = 0.5$, corresponding to the Hellinger distance.
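Both statements, the special cases and the symmetry about $α = 0.5$, can be checked numerically. The sketch below is an illustration with hypothetical function names. It compares (A1) with the closed forms from Table A1 and tests the identity $f_{α}(x) = x f_{1 - α}(1 / x)$. That identity is equivalent to $D_{α}(π \| q) = D_{1 - α}(q \| π)$, which is how "reverse" is read here.

```python
import math

def f_alpha(x, alpha):
    """Generator f_alpha from Eq. (A1); alpha = 1 and alpha = 0 are taken as limits."""
    if alpha == 1.0:
        return x * math.log(x) - (x - 1.0)
    if alpha == 0.0:
        return -math.log(x) + (x - 1.0)
    return ((x ** alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

# Closed forms from Table A1: Pearson, Neyman, and Hellinger generators.
closed_forms = {
    2.0: lambda x: 0.5 * (x - 1.0) ** 2,
    -1.0: lambda x: (x - 1.0) ** 2 / (2.0 * x),
    0.5: lambda x: 2.0 * (math.sqrt(x) - 1.0) ** 2,
}

def max_gap_to_closed_forms(points=(0.3, 1.0, 2.5)):
    """Largest deviation between Eq. (A1) and the tabulated closed forms."""
    return max(abs(f_alpha(x, a) - g(x))
               for a, g in closed_forms.items() for x in points)

def reverse_gap(x, alpha):
    """|f_alpha(x) - x * f_{1 - alpha}(1 / x)|, which vanishes if 1 - alpha is the reverse of alpha."""
    return abs(f_alpha(x, alpha) - x * f_alpha(1.0 / x, 1.0 - alpha))
```

Calling `reverse_gap(2.5, 2.0)` pairs the Pearson generator with the Neyman one, and `reverse_gap(2.5, 0.5)` pairs the Hellinger generator with itself.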

Figure A1. The $α$-divergence smoothly connects several prominent divergences.

The convex conjugate of $f(x)$ is defined as $f^{*}(y) = \sup_{x \in \operatorname{dom} f} \{\langle y, x \rangle - f(x)\}$, where the angle brackets $\langle y, x \rangle$ denote the dot product [51]. The key property $(f^{*})' = (f')^{-1}$ relating the derivatives of $f^{*}$ and $f$ yields Table A1, which lists common functions $f_{α}$ together with their convex conjugates and derivatives. In the general case (A1), the convex conjugate and its derivative are given by

$$\begin{aligned} f_{α}^{*}(y) &= \frac{1}{α} \left(1 + (α - 1) y\right)^{\frac{α}{α - 1}} - \frac{1}{α}, \\ (f_{α}^{*})'(y) &= \sqrt[α - 1]{1 + (α - 1) y}, \quad \text{for } y (1 - α) < 1. \end{aligned} \tag{A2}$$

Function $f_{α}$ is convex, non-negative, and attains minimum at $x = 1$ with $f_{α}(1) = 0$. Function $(f_{α}^{*})'$ is positive on its domain with $(f_{α}^{*})'(0) = 1$. Function $f_{α}^{*}$ has the property $f_{α}^{*}(0) = 0$. The linear inequality constraint (A2) on the $\operatorname{dom} f_{α}^{*}$ follows from the requirement $\operatorname{dom} f_{α} = (0, \infty)$. Another result from convex analysis crucial to our derivations is Fenchel’s equality

$$f^{*}(y) + f(x^{\star}(y)) = \langle y, x^{\star}(y) \rangle, \tag{A3}$$

where $x^{\star}(y) = \arg\sup_{x \in \operatorname{dom} f} \{\langle y, x \rangle - f(x)\}$. We will occasionally put the conjugation symbol at the bottom, especially for the derivative of the conjugate function $f_{*}' = (f^{*})'$.
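The inverse-derivative property and Fenchel’s equality (A3) can be verified numerically for the $α$-function. The sketch below is illustrative and covers only $α \notin \{0, 1\}$ with $y$ inside the domain $y (1 - α) < 1$. The function names are hypothetical, and in the scalar case the maximizer $x^{\star}(y)$ is taken to be $(f_{α}^{*})'(y)$.

```python
def f_alpha(x, alpha):
    """Generator from Eq. (A1), for alpha not in {0, 1}."""
    return ((x ** alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def f_alpha_prime(x, alpha):
    """Derivative of f_alpha, i.e., the alpha-logarithm."""
    return (x ** (alpha - 1.0) - 1.0) / (alpha - 1.0)

def conj(y, alpha):
    """Convex conjugate f_alpha^* from Eq. (A2), valid for y * (1 - alpha) < 1."""
    return (1.0 + (alpha - 1.0) * y) ** (alpha / (alpha - 1.0)) / alpha - 1.0 / alpha

def conj_prime(y, alpha):
    """Derivative of the conjugate from Eq. (A2), valid for y * (1 - alpha) < 1."""
    return (1.0 + (alpha - 1.0) * y) ** (1.0 / (alpha - 1.0))

def check(y, alpha):
    """Return the residuals of (f^*)' = (f')^{-1} and of Fenchel's equality at y."""
    x_star = conj_prime(y, alpha)  # maximizer of <y, x> - f(x) in the scalar case
    inverse_residual = f_alpha_prime(x_star, alpha) - y
    fenchel_residual = conj(y, alpha) + f_alpha(x_star, alpha) - y * x_star
    return inverse_residual, fenchel_residual

residuals = [check(y, a) for y, a in [(0.5, 2.0), (1.0, 0.5), (0.25, -1.0)]]
```

Both residuals are zero up to floating-point rounding for each pair. For instance, $α = 0.5$ and $y = 1$ give $x^{\star} = 4$, $f = 2$, and $f^{*} = 2$, so that $f^{*} + f = y x^{\star} = 4$.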

Table A1. Function $f_{α}$, its convex conjugate $f_{α}^{*}$, and their derivatives for some values of $α$.

Divergence	$α$	$f (x)$	$f^{'} (x)$	${(f^{*})}^{'} (y)$	$f^{*} (y)$	$dom f^{*}$
KL	1	$x log x - (x - 1)$	$log x$	$e^{y}$	$e^{y} - 1$	$R$
Reverse KL	0	$- log x + (x - 1)$	$- \frac{1}{x} + 1$	$\frac{1}{1 - y}$	$- log (1 - y)$	$y < 1$
Pearson $χ^{2}$	2	$\frac{1}{2} {(x - 1)}^{2}$	$x - 1$	$y + 1$	$\frac{1}{2} {(y + 1)}^{2} - \frac{1}{2}$	$y > - 1$
Neyman $χ^{2}$	$- 1$	$\frac{{(x - 1)}^{2}}{2 x}$	$- \frac{1}{2 x^{2}} + \frac{1}{2}$	$\frac{1}{\sqrt{1 - 2 y}}$	$- \sqrt{1 - 2 y} + 1$	$y < \frac{1}{2}$
Hellinger	$\frac{1}{2}$	$2 {(\sqrt{x} - 1)}^{2}$	$2 - \frac{2}{\sqrt{x}}$	$\frac{4}{{(2 - y)}^{2}}$	$\frac{2 y}{2 - y}$	$y < 2$

Appendix B

In all experiments, the temperature parameter $η$ is exponentially decayed, $η_{i + 1} = η_{0} a^{i}$, in each iteration $i = 0, 1, \dots$. The choice of $η_{0}$ and $a$ depends on the scale of the rewards and the number of samples collected per policy update. Tables A2, A3 and A4 list these parameters for each environment, along with the number of samples per policy update, the number of policy iteration steps, and the number of runs for averaging the results. Where applicable, environment-specific settings are also listed.

Table A2. Chain environment.

Parameter	Value
Number of states	8
Action success probability	0.9
Small and large rewards	(2.0, 10.0)
Number of runs	10
Number of iterations	30
Number of samples	800
Temperature parameters $(η_{0}, a)$	(15.0, 0.9)

Table A3. CliffWalking environment.

Parameter	Value
Punishment for falling from the cliff	$- 10.0$
Reward for reaching the goal	100
Number of runs	10
Number of iterations	40
Number of samples	1500
Temperature parameters $(η_{0}, a)$	(50.0, 0.9)

Table A4. FrozenLake environment.

Parameter	Value
Action success probability	0.8
Number of runs	10
Number of iterations	50
Number of samples	2000
Temperature parameters $(η_{0}, a)$	(1.0, 0.8)

References

1. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 1994.
2. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
3. Deisenroth, M.P.; Neumann, G.; Peters, J. A survey on policy search for robotics. Found. Trends® Robot. 2013, 2, 1–142.
4. Bellman, R. Dynamic Programming. Science 1957, 70, 342.
5. Kakade, S.M. A Natural Policy Gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, BC, Canada, 3–8 December 2001; pp. 1531–1538.
6. Peters, J.; Mülling, K.; Altun, Y. Relative Entropy Policy Search. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 1607–1612.
7. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015.
8. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
9. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plann. Inference. 2000, 227–244.
10. Neu, G.; Jonsson, A.; Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv 2017, arXiv:1705.07798.
11. Parikh, N. Proximal Algorithms. Found. Trends® Optim. 2014, 1, 127–239.
12. Nielsen, F. An elementary introduction to information geometry. arXiv 2018, arXiv:1808.08271.
13. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
14. Bottou, L.; Arjovsky, M.; Lopez-Paz, D.; Oquab, M. Geometrical Insights for Implicit Generative Modeling. Braverman Read. Mach. Learn. 2018, 11100, 229–268.
15. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 271–279.
16. Teboulle, M. Entropic Proximal Mappings with Applications to Nonlinear Programming. Math. Operations Res. 1992, 17, 670–690.
17. Nemirovski, A.; Yudin, D. Problem complexity and method efficiency in optimization. J. Operational Res. Soc. 1984, 35, 455.
18. Beck, A.; Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Res. Lett. 2003, 31, 167–175.
19. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507.
20. Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985.
21. Cichocki, A.; Amari, S. Families of alpha- beta- and gamma- divergences: Flexible and robust measures of Similarities. Entropy 2010, 12, 1532–1568.
22. Thomas, P.S.; Okal, B. A notation for Markov decision processes. arXiv 2015, arXiv:1512.09075.
23. Sutton, R.S.; Mcallester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; pp. 1057–1063.
24. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190.
25. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.I.; Abbeel, P. High Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2015, arXiv:1506.02438.
26. Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. 1963, 8, 85–108.
27. Zhu, H.; Rohwer, R. Information Geometric Measurements of Generalisation; Technical Report; Aston University: Birmingham, UK, 1995.
28. Williams, R.J. Simple statistical gradient-following methods for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
29. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2007, 1, 1–305.
30. Baird, L. Residual Algorithms: Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; pp. 30–37.
31. Dann, C.; Neumann, G.; Peters, J. Policy Evaluation with Temporal Differences: A Survey and Comparison. J. Mach. Learn. Res. 2014, 15, 809–883.
32. Sason, I.; Verdu, S. F-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
33. Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Found. Trends Mach. Learn. 2012, 5, 1–122.
34. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R. The Non-Stochastic Multi-Armed Bandit Problem. SIAM J. Comput. 2003, 32, 48–77.
35. Ghavamzadeh, M.; Mannor, S.; Pineau, J.; Tamar, A. Bayesian Reinforcement Learning: A Survey. Found. Trends Mach. Learn. 2015, 8, 359–483.
36. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540.
37. Tishby, N.; Polani, D. Information theory of decisions and actions. In Perception-Action Cycle; Cutsuridis, V., Hussain, A., Taylor, J., Eds.; Springer: New York, NY, USA, 2011; pp. 601–636.
38. Bertschinger, N.; Olbrich, E.; Ay, N.; Jost, J. Autonomy: An information theoretic perspective. Biosystems 2008, 91, 331–345.
39. Still, S.; Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory Biosci. 2012, 131, 139–148.
40. Genewein, T.; Leibfried, F.; Grau-Moya, J.; Braun, D.A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Front. Rob. AI 2015, 2, 27.
41. Wolpert, D.H. Information theory—The bridge connecting bounded rational game theory and statistical physics. In Complex Engineered Systems; Braha, D., Minai, A., Bar-Yam, Y., Eds.; Springer: Berlin, Germany, 2006; pp. 262–290.
42. Geist, M.; Scherrer, B.; Pietquin, O. A Theory of Regularized Markov Decision Processes. arXiv 2019, arXiv:1901.11275.
43. Li, X.; Yang, W.; Zhang, Z. A Unified Framework for Regularized Reinforcement Learning. arXiv 2019, arXiv:1903.00725.
44. Nachum, O.; Chow, Y.; Ghavamzadeh, M. Path consistency learning in Tsallis entropy regularized MDPs. arXiv 2018, arXiv:1802.03501.
45. Lee, K.; Kim, S.; Lim, S.; Choi, S.; Oh, S. Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning. arXiv 2019, arXiv:1902.00137.
46. Lee, K.; Choi, S.; Oh, S. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Rob. Autom. Lett. 2018, 3, 1466–1473.
47. Lee, K.; Choi, S.; Oh, S. Maximum Causal Tsallis Entropy Imitation Learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 4408–4418.
48. Mahadevan, S.; Liu, B.; Thomas, P.; Dabney, W.; Giguere, S.; Jacek, N.; Gemp, I.; Liu, J. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv 2014, arXiv:1405.6757.
49. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.
50. Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142.
51. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; 487p.

Figure 1. Effects of $α$ on policy improvement. Each row corresponds to a fixed $α$. First four iterations of policy improvement together with a later iteration are shown in each row. Large positive $α$’s eliminate bad actions one by one, keeping the exploration level equal among the rest. Small $α$’s weigh actions according to their values; actions with low value get zero probability for $α > 1$, but remain possible with small probability for $α \leq 1$. Large negative $α$’s focus on the best action, exploring the remaining actions with equal probability.

Figure 2. Average regret for various values of $α$.

Figure 3. Regret after a fixed time as a function of $α$.

Figure 4. Effects of $α$-divergence on policy iteration. Each row corresponds to a given environment. Results for different values of $α$ are split into three subplots within each row, from the more extreme $α$’s on the left to the more refined values on the right. In all cases, more negative values $α < 0$ initially show faster improvement because they immediately jump to the mode and keep the exploration level low; however, after a certain number of iterations they get overtaken by moderate values $α \in [0, 1]$ that weigh advantage estimates more evenly. Positive $α > 1$ demonstrate high variance in the learning dynamics because they clamp the probability of good actions to zero if the advantage estimates are overly pessimistic, never being able to recover from such a mistake. Large positive $α$’s may even fail to reach the optimum altogether, as exemplified by $α = 10$ in the plots. The most stable and reliable $α$-divergences lie between the reverse KL ($α = 0$) and the KL ($α = 1$), with the Hellinger distance ($α = 0.5$) outperforming both on the FrozenLake environment.

Table 1. Empirical policy evaluation and policy improvement objectives for $α \in \{1, 2\}$.

KL Divergence ($α = 1$)	Pearson $χ^{2}$-Divergence ($α = 2$)
$\hat{g}_{1}(w) = η \log \left( \hat{E}_{t} \left[ \exp \left( \frac{\hat{A}^{w}(s_{t}, a_{t})}{η} \right) \right] \right)$	$\hat{g}_{2}(w) = \frac{1}{2 η} \hat{E}_{t} \left[ \left( \hat{A}^{w}(s_{t}, a_{t}) - \hat{E}_{t}[\hat{A}^{w}] \right)^{2} \right]$
$\hat{L}_{1}(θ) = \hat{E}_{t} \left[ \log π_{θ}(a_{t} \mid s_{t}) \exp \left( \frac{\hat{A}^{w}(s_{t}, a_{t}) - \hat{g}_{1}(w)}{η} \right) \right]$	$\hat{L}_{2}(θ) = \frac{1}{η} \hat{E}_{t} \left[ \log π_{θ}(a_{t} \mid s_{t}) \left( \hat{A}^{w}(s_{t}, a_{t}) - \hat{E}_{t}[\hat{A}^{w}] + η \right) \right]$

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Belousov, B.; Peters, J. Entropic Regularization of Markov Decision Processes. Entropy 2019, 21, 674. https://doi.org/10.3390/e21070674
