On Entropy Regularized Path Integral Control for Trajectory Optimization

In this article, we present a generalized view on Path Integral Control (PIC) methods. PIC refers to a particular class of policy search methods that are closely tied to the setting of Linearly Solvable Optimal Control (LSOC), a restricted subclass of nonlinear Stochastic Optimal Control (SOC) problems. This class is unique in the sense that it can be solved explicitly, yielding a formal optimal state trajectory distribution. In this contribution, we first review the PIC theory and discuss related algorithms tailored to policy search in general. We are able to identify a generic design strategy that relies on the existence of an optimal state trajectory distribution and finds a parametric policy by minimizing the cross-entropy between the optimal and a state trajectory distribution parametrized by a parametric stochastic policy. Inspired by this observation, we then aim to formulate an SOC problem that shares traits with the LSOC setting yet covers a less restrictive class of problem formulations. We refer to this SOC problem as Entropy Regularized Trajectory Optimization. The problem is closely related to the Entropy Regularized Stochastic Optimal Control setting which has lately received much attention from the Reinforcement Learning (RL) community. We analyze the convergence behavior of the associated theoretical state trajectory distribution sequence and draw connections with stochastic search methods tailored to classic optimization problems. Finally, we derive explicit updates and compare the implied Entropy Regularized PIC with earlier work in the context of both PIC and RL for derivative-free trajectory optimization.


Introduction
Finding controllers for systems that are high dimensional and continuous in space is still one of the most challenging problems faced by the robotics and control community. The goal is to find a feedback policy that stabilizes the system and that possibly also encodes rich and complex dynamic behavior such as locomotion [1]. A feedback policy is a state- and time-dependent function u(t, x), with u representing the input applied to the system that is in a state x at time t. A well-known paradigm to design such policies is Stochastic Optimal Control (SOC) [2]. The policy is determined such that, when applied to the system, it is expected to accumulate a minimized cost over a specified finite time horizon. Finding such an optimal feedback policy is an elegant and appealing theoretical idea but meets significant difficulties in practice. Explicit expressions for the optimal policy u*(t, x) rarely exist and one often has to resort to local solutions instead. Such local solutions can be found by so-called trajectory optimization algorithms (in this work we focus on discrete-time systems and model-based strategies, in the sense that a simulator is available that closely approximates the actual dynamics) [3,4]. These yield open-loop solutions of the form u*(t) = u*(t, x*(t)), where the signals u*(t) and x*(t) are optimal in the sense that they minimize the accumulated cost starting from a given fixed initial state.
In hierarchical control approaches such trajectory optimizers are used to discover rich and complex dynamic behaviors "offline" [5]. The system is then stabilized "online" using a linearized feedback controller around the optimal trajectory. The locally linear feedback controller is usually designed such that u*_fb(t, x) = k(t) + K(t)x(t) = u*(t) + K(t)(x(t) − x*(t)), where K(t) ≈ ∇_x u(t, x*(t)) and k(t) = u*(t) − K(t)x*(t). The gradient K(t) is approximated by defining a Linear Quadratic Regulator (LQR) problem around the optimal trajectory using a second order Taylor expansion. In fact, this LQR problem then also provides a correction to the current trajectory iterate {x(t), u(t)} where the linearization point may not be optimal. As such, a new trajectory {u(t), x(t)} is obtained around which a new LQR problem can be defined. Iterating this approach will incrementally improve the trajectory. This is exactly how gradient-based trajectory optimization algorithms such as Differential Dynamic Programming (DDP) [3,6] and iterative LQR (iLQR) [7] work. It is implied that gradient-based trajectory optimizers require a differentiable model and cost function. Furthermore, they are notoriously ill-suited to handle state constraints. A second application is that of Model Predictive Control (MPC). The idea is to compute an optimal trajectory initialized with the current state measurement and apply the optimal open-loop controller at every sampling instant. This approach provides a "just-in-time" service and circumvents computation of the explicit policy altogether. MPC is feasible provided that the trajectory optimization problem can be solved within a single sampling period [8]. This poses hard real-time requirements on the trajectory optimizer that are rarely met in practice, specifically for nonlinear dynamics with state and input constraints.

Stochastic Search Methods
In this article, we are interested in solving the trajectory optimization problem above using a sample based optimization method, or a so-called stochastic search method. Stochastic search methods rely on randomness to probe the optimization space and maintain mechanisms that eventually guide that randomness towards prosperity. Broadly speaking, a stochastic search method maintains a prior population and generates a posterior population based on the prosperity of the individuals.
Evolutionary Strategies (ESs) refer to a particular subclass of stochastic search algorithms tailored to static optimization that, as opposed to population based algorithms [9], engage a parametric (θ) search distribution model, π(x|θ) : X → R_{≥0}. Every main generation, g, a sample population, X_g = {x_g^k}, is generated from a search distribution π(x|θ_g) and the search distribution parameters, θ_{g+1} ← θ_g, are updated based on the relative success of the individual samples. The objective function value f(x^k) is used as a discriminator between prosperous and poor behavior of each individual, x^k [10]. When iterated, this concept spawns a sequence of distributions, {π_g}. The update procedure is devised so that the distribution sequence migrates gradually through the optimization space and eventually concentrates around the optimal solution. Although limiting the sequence to a parametric distribution family may compromise the inherent expressiveness or elaborateness of the associated search, it also elevates the determination of update rules, from what are basically heuristics, to a more rigorous and theoretical body [11,12]. Well-known members are the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [11,13] and the Natural Evolutionary Strategy (NES) [14]. Most ESs consider a multivariate Gaussian parametric distribution, π_g(x) = N(x|θ_g) = N(x|µ_g, Σ_g), and provide appropriate update procedures for the mean, µ_g, and covariance, Σ_g. Examples of applications to medium and large scale non-differentiable design optimization problems are [15–21]. In the present work, we aim to address trajectory optimization problems, exploiting stochasticity as a natural means of exploration, in a similar fashion as how ESs address static optimization problems. The trajectory optimization problem is however fundamentally different in the sense that we can no longer probe the optimization space directly. We can only do so by applying stochastic policies to the system, then observe how the system evolves and infer an updated search distribution from these system rollouts.
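Before turning to the dynamic setting, the following minimal sketch illustrates the generic ES loop described above on a static problem. It is a simplified stand-in, not CMA-ES or NES: the population is weighted by a plain exponential transformation of the costs, and the objective, population size and temperature are placeholder choices.

```python
import numpy as np

def es_minimize(f, mu0, sigma0, generations=100, pop=64, temperature=1.0):
    """Generic Gaussian evolutionary strategy: sample a population from
    N(mu_g, Sigma_g), weight the individuals by their cost and refit the
    search distribution to the weighted population."""
    mu = np.array(mu0, dtype=float)
    cov = np.diag(np.full(len(mu), sigma0 ** 2))
    for g in range(generations):
        X = np.random.multivariate_normal(mu, cov, size=pop)        # population X_g
        costs = np.apply_along_axis(f, 1, X)                         # f(x_g^k)
        w = np.exp(-(costs - costs.min()) / temperature)             # prosperity weights
        w /= w.sum()
        mu = w @ X                                                   # weighted mean update
        diff = X - mu
        cov = (w[:, None] * diff).T @ diff + 1e-9 * np.eye(len(mu))  # weighted covariance
    return mu

# Usage: minimize a shifted sphere function; the estimate approaches (1, 1).
print(es_minimize(lambda x: np.sum((x - 1.0) ** 2), mu0=[0.0, 0.0], sigma0=1.0))
```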
Specifically, we are interested in locally linear Gaussian feedback policies of the following form, which we can apply to the system to artificially inject the required stochasticity, π_{g,t}(u|x) = N(u|x; θ_{g,t}) = N(u|k_{g,t} + K_{g,t}x, Σ_{g,t}). Clearly this policy resembles the locally linear feedback controller described in the previous paragraph, and the idea can in that sense be understood as a sample based implementation of the iLQR or DDP algorithm. The covariance determines the explorativeness of the policy and therefore the diverseness of the samples. Our goal is to actively shape it to stimulate exploration.

Path Integral Control
In the previous decade, a novel class of stochastic search algorithms was discovered that partially answers our question. This class is known as Path Integral Control (PIC). PIC is closely related to the Linearly Solvable Optimal Control (LSOC) problem. LSOC is a restrictive subset of the Stochastic Optimal Control (SOC) framework (see Section 2) and is characterized by a number of remarkable properties (see Section 3.1) [22–24]. It was already pointed out that PIC and ES exhibit structural similarities [25,26]. This led to the development of a PIC-CMA algorithm. This method deviates from the theory of LSOC and adapts the policy covariance in analogy with CMA-ES. This modification improved the convergence properties significantly yet ignores the underlying theory of LSOC. Other attempted generalizations include [27,28].
There has been keen interest in such algorithms as stochastic search algorithms may exhibit several advantages over gradient based algorithms [25,26,28]. So far PIC has been used in guided policy search to generate a set of prior optimal trajectories that were then used to fit a global policy [29], and it is one of the two algorithms promoted by the Lyceum robot learning environment [30]. The use of PIC algorithms for real-time control applications was only recently considered, as their execution relies on Monte Carlo (MC) sampling and they were therefore thought to be too computationally demanding to run in real time. However, with the rise of affordable GPUs and the ease of parallelization of MC based methods, it may become feasible in the near future to iterate dynamic stochastic search algorithms in real-time [31,32]. Nevertheless, practitioners of such methods have raised issues concerning the update of the covariance matrix [25,29,31,32]. It seems a mechanism is inherent to the existing framework that makes the covariance matrix vanish, compromising exploration. In other words, the search distribution collapses prematurely. Many authors have suggested that the issue is only partially understood and that it seems unlikely that it can be resolved with the theory at hand. Furthermore, it is clear that ESs and PIC methods are closely related and that PIC methods may therefore benefit from the rich body of work concerning ESs. However, as PIC methods are derived solely from the theory of LSOC, we argue that they are only partially understood and that the similarity has remained circumstantial. Recently, Williams et al. provided a novel derivation of the PI2 method from an information-theoretic background [32]. Other authors have explored the relation between LSOC and information theory [32–34]; however, these studies aimed for a physical connection and understanding.
In this paper, we venture on a different strategy and aim to identify a generalized set of PIC methods. We aim to describe an overarching optimization principle that leads to ESs in the context of static optimization and to PIC methods in the context of dynamic optimization. Therefore, we approach the problem from an algorithmic point of view, rather than searching for a deeper physical interpretation or understanding. To establish the overarching framework, we identify the principle of entropic inference as a suitable setting to synthesize stochastic search algorithms and derive an entropic optimization framework from it. This will also allow us to derive a generalized set of PIC methods which are no longer limited to the LSOC setting and therefore do not inherit any of its inherent limitations. Furthermore, the mutual theoretical background paves the way for a knowledge transfer from ES to PIC. Finally, this viewpoint provides us with a unique opportunity to relate PIC to existing Entropy Regularization paradigms in Reinforcement Learning [35–38].

Contributions
As our efforts span a rather extensive body of earlier work and aim to provide a holistic view of related subjects, we feel inclined to list our original contributions.
• We provide a comprehensive overview of existing Path Integral Control methods tailored to policy search. Here we point out a mutual underlying design principle which allows for a formal comparison and classification.
• We propose an original and intuitive argument for the introduction of entropy regularization terms in the context of optimization that is based on the principle of entropic inference.
• We untie the derivation of Path Integral Control methods from their historical roots in Linearly Solvable Optimal Control and illustrate that a similar set of algorithms can be derived directly from the more timely framework of Entropy Regularized Optimal Control. To that end, we introduce the framework of Entropy Regularized Trajectory Optimization and derive the Entropy Regularized Path Integral Control (ERPIC) method. We consider this to be our primary contribution. Furthermore, this work elevates the structural similarity between Evolutionary Strategies, such as CMA-ES [13] and NES [14], and PIC methods, originally pointed out and exploited in [25], to a formal equivalence.
• We give a formal comparison of preceding PIC methods and ERPIC tailored to derivative-free trajectory optimization with control affine dynamics and locally linear Gaussian policies.

Preliminaries and Notation
Let us first introduce a number of relevant concepts and establish notation.

General Notation
We denote the set of continuous probability density functions over any closed set X ⊆ R^n as P(X) = {π : X → R_{≥0} | ∫_X π(x) dx = 1}. We will not always specify the argument as it is implied by the distribution definition. The statement π ∝ p with π ∈ P denotes that π is proportional to p up to a normalization constant. The expectation of f over a probability density function π is denoted as E_π[f] = ∫ f(x) π(x) dx. We will also need a number of information-theoretic measures.
The differential entropy is defined as H[π] = −∫ π(x) log π(x) dx = E_π[−log π]. This is widely interpreted as a measure of uncertainty. However, it is more precise to say that it is a measure of the amount of information left to be specified about some epistemologically uncertain variable X for which we have represented our belief with a probability density function π [39].
The Kullback-Leibler divergence or relative entropy between two probability density functions π and ρ is defined as D[π‖ρ] = ∫ π(x) log (π(x)/ρ(x)) dx. The relative entropy is always non-negative but not symmetric, i.e., in general D[π‖ρ] ≠ D[ρ‖π]. An interpretation is given in Section 4.
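As a small numerical illustration, the snippet below evaluates both measures for univariate Gaussians in closed form; the particular means and standard deviations are arbitrary placeholders, and the asymmetry of the divergence is visible in the two printed values.

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

def gaussian_kl(mu1, s1, mu2, s2):
    """Relative entropy D[N(mu1, s1^2) || N(mu2, s2^2)] for univariate Gaussians."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

print(gaussian_entropy(1.0))                    # entropy grows with the spread
print(gaussian_kl(0.0, 1.0, 1.0, 2.0))          # D[pi || rho] ...
print(gaussian_kl(1.0, 2.0, 0.0, 1.0))          # ... differs from D[rho || pi]
```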

Dynamic System Models
Further, we will consider controlled stationary deterministic or stochastic discrete-time systems.
We use x ∈ X to represent the system state and u ∈ U to denote the input or control effort. Deterministic systems are modeled with a state-space function x_{t+1} = f(x_t, u_t), f : X × U → X. Stochastic systems are modeled by a state transition distribution function x_{t+1} ∼ p(x_{t+1}|x_t, u_t), p : X × X × U → R_{≥0}. We also define two classes of non-stationary feedback policies. Deterministic policies are denoted as u_t = u(t, x_t), u : Z × X → U, and stochastic policies are denoted as u_t ∼ π(u_t|t, x_t), π : Z × U × X → R_{≥0}. Note that any deterministic dynamic system can also be represented as a stochastic system using the Dirac delta, p(x′|x, u) = δ(x′ − f(x, u)). Analogously, any deterministic policy can also be represented as a stochastic policy, π(u_t|t, x_t) = δ(u_t − u(t, x_t)).
A realization of a system under policy u or π over a horizon T results in a trajectory τ = (x_0, u_0, x_1, u_1, . . . , x_T) ∈ T, with associated state trajectory τ_x = (x_0, x_1, . . . , x_T) ∈ T_X. In any case, we will assume that trajectories τ ∈ T agree with the underlying system dynamics and are induced by a stochastic (or deterministic) policy. Considering that we work within a stochastic setting, we can associate a probability to each trajectory. The trajectory distribution or path distribution induced by a policy π is denoted as p(τ|π). The state trajectory distribution induced by a policy π is denoted as p_π(τ_x). These distributions can be factorized into a product of transition probabilities, p(τ|π) = ∏_{t=0}^{T−1} π(u_t|t, x_t) p(x_{t+1}|x_t, u_t) and p_π(τ_x) = ∏_{t=0}^{T−1} p_π(x_{t+1}|t, x_t), where p_π(x′|t, x) = ∫ p(x′|x, u) π(u|t, x) du.
Here, we defined the controlled state transition distribution p_π(x′|t, x). Note that the control effort is now entirely implicit.
This definition also implies two state trajectory distributions of particular interest: the distribution p_u induced by a deterministic policy u, and the uncontrolled or free trajectory distribution, which we denote as p_0.
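To make these definitions concrete, the sketch below rolls out a locally linear Gaussian policy on a placeholder stochastic system and evaluates the policy factor of the factorized path distribution; the double-integrator dynamics, horizon and noise levels are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rollout(step, policy, x0, T, rng):
    """Sample a trajectory tau = (x_0, u_0, ..., x_T) under a stochastic policy."""
    xs, us = [np.asarray(x0, dtype=float)], []
    for t in range(T):
        k, K, S = policy[t]                               # theta_{g,t} = (k_t, K_t, Sigma_t)
        u = rng.multivariate_normal(k + K @ xs[-1], S)    # u_t ~ pi(u | t, x_t)
        us.append(u)
        xs.append(step(xs[-1], u, rng))                   # x_{t+1} ~ p(x' | x_t, u_t)
    return np.array(xs), np.array(us)

def log_policy_factor(xs, us, policy):
    """Policy part of log p(tau | pi); the transition factors p(x'|x, u) would be
    added on top to obtain the full (log) path probability."""
    return sum(multivariate_normal.logpdf(us[t], policy[t][0] + policy[t][1] @ xs[t], policy[t][2])
               for t in range(len(us)))

# Placeholder double integrator with additive process noise.
step = lambda x, u, rng: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u[0]]) + 0.01 * rng.standard_normal(2)
policy = [(np.zeros(1), np.zeros((1, 2)), 0.1 * np.eye(1)) for _ in range(50)]
xs, us = rollout(step, policy, [1.0, 0.0], 50, np.random.default_rng(0))
print(xs.shape, us.shape, log_policy_factor(xs, us, policy))
```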
We can associate a cost to a trajectory τ ∈ T. In a general setting, we consider the cost R : T → R that accumulates at a rate r : X × U → R, i.e., R(τ) = r_T(x_T) + ∑_{t=0}^{T−1} r_t(x_t, u_t).
Analogously, we can associate a cost C : T_X → R to state trajectories τ_x ∈ T_X, i.e., C(τ_x) = c_T(x_T) + ∑_{t=0}^{T−1} c_t(x_t, x_{t+1}). Here, the control effort is accounted for implicitly through the accumulation rate c : X × X → R that penalizes transitions x → x′ (note that we will often switch between the use of time subscripts and the prime notation to indicate an increment in the time dimension). If there exists an inverse dynamic function f^{−1} so that u = f^{−1}(x, x′) when x′ = f(x, u), then C and R are equivalent in the sense that r_t(x, f^{−1}(x, x′)) = c_t(x, x′) and r_T(x) = c_T(x). We emphasize here that the existence of such an inverse model is a mere formal assumption. In the practical setting to be presented, we will never actually have to invert the system dynamics given that we will have access to u.
Finally, we are also interested in the exotic cost function L : T_X → R. The total cost can be decomposed as the superposition of a cost L̄ that accumulates with a solely state dependent rate l : X → R and a term that accounts for the control effort. Specifically, we are interested in the situation where the control efforts are penalized indirectly by introducing the logarithm of the ratio between the controlled and free state transition probabilities, L(τ_x) = L̄(τ_x) + ∑_{t=0}^{T−1} log (p_u(x_{t+1}|x_t)/p_0(x_{t+1}|x_t)) with L̄(τ_x) = l_T(x_T) + ∑_{t=0}^{T−1} l(x_t). This specific choice will have a remarkable technical consequence later on. Further, note that L is therefore of the same type as C and operates on T_X. It follows that if the transition x → x′ is likelier when the system is controlled, i.e., p_u(x′|x) > p_0(x′|x), the ratio is larger than 1 and so its logarithm is positive, effectively inducing a cost. Alternatively, when the transition becomes less probable under the controlled dynamics, the contribution is negative; in expectation, however, the penalty equals the Kullback-Leibler divergence D[p_u‖p_0] and is therefore always non-negative. As a result it will depend on L̄ whether the use of p_u over p_0 is beneficial or not.

Stochastic Optimal Control
We are interested in finding policies that minimize the expected induced cost. Hence we address the Stochastic Optimal Control (SOC) problem,
min_π E_{p(τ|π)}[R(τ)].   (1)
Such policies are known to be stabilizing and are capable of encoding very rich and complex dynamic behavior even for considerably simple cost formulations.
Solving (1) poses a so-called dynamic optimization problem. The prefix dynamic emphasizes that the optimization variables are constrained by a causal structure, which allows the problem to be broken apart into several subproblems that can be solved recursively. This problem property is also known as optimal substructure and can be exploited by dedicated solution algorithms.
Correspondingly, the problem above can also be represented with the recursive Bellman equation
V(t, x) = min_π E_{π(u|t,x)} E_{p(x′|x,u)} [ r_t(x, u) + V(t + 1, x′) ],  V(T, x) = r_T(x).   (2)
Here, V_t : X → R represents the value function or optimal cost-to-go, i.e., the cost that is accumulated if we initialize the system in state x at time t and control it optimally until the horizon T.
We emphasize that we presume the policy to be stochastic. However, unless there is the incentive to maintain a stochastic policy for the purpose of exploration, the expression minimizes for a deterministic policy so that the expectation over the policy can be omitted [40]. This redundant form is however appealing for later comparison. Further, note that if also p(x′|x, u) = δ(x′ − f(x, u)), this problem reduces to deterministic optimal control.
Throughout this manuscript we will further assume that the initial state is fixed and known exactly, i.e., x(0) = x_0. We could stress this formally with our notation and exemplify that any function associated to problem (2), be it u(t, x), π(u|t, x) or V(t, x), or any of the implied functions p_π(x′|t, x) and p_π(τ_x), is conditional on x_0. In order not to overload the notation we simply assume this to be true throughout the paper. This assumption will allow us to concentrate on local solutions or so-called trajectory optimizers.

Local Parametric Policies
The algorithms we present here do not aim to retrieve an exact solution to the problem above, but instead operate on a restricted class of parametrized policies π(u|t, x, θ) where θ refers to the policy parameter. Considering that we are interested in trajectory optimizers, we restrict our focus to locally linear Gaussian policies of the form
π(u|t, x, θ) = N(u|k_t + K_t x, Σ_t),  θ = {k_t, K_t, Σ_t}.
Here, K_t represents a feedback gain matrix and Σ_t the covariance matrix which controls the stochasticity and therewith the explorative incentive of the policy. This parametrization allows us to approximate the exact solution in the neighborhood of an optimal open-loop trajectory initialized at, for example, x(0) = x_0.

Path Integral Control
In this section we give a brief introduction to the theory of LSOC and give an overview of existing PIC methods rooting directly from it. In almost all preceding contributions (apart from those following [22]), LSOC is addressed in a continuous time setting and a discretization is only performed afterwards in order to derive practical methods. Here we will directly address the LSOC problem in a discrete time setting as it better suits our requirements and as such also avoid the tedious technicalities involved with the discretization.
Second, we give a comprehensive overview of the class of PIC methods. As far as we are aware, this is the first time such a formal comparison has been made, revealing an overarching design principle shared by all algorithmic formulations. The identification of this design principle makes it easier to pinpoint the core prerequisite of any PIC method and will allow us to introduce a generalized set of PIC methods in Section 4.3.

Discrete Time Linearly Solvable Optimal Control
Formally, the discrete time LSOC framework refers to the following SOC problem,
min_u E_{p_u(τ_x)}[L(τ_x)] = min_u E_{p_u(τ_x)} [ L̄(τ_x) + ∑_{t=0}^{T−1} log (p_u(x_{t+1}|x_t)/p_0(x_{t+1}|x_t)) ].   (3)
We emphasize that the control effort is penalized implicitly through the Kullback-Leibler divergence between the controlled and free state transition probabilities. For a formal motivation for this particular penalty formulation, we refer the reader to the work in [22,33] and recall the intuitive justification that was already given in Section 2. Here, we are mostly interested in the technical implications that are associated with this problem formulation.
Problem (3) implies the existence of an optimal policy u*_LSOC(t, x). In general the problem cannot be solved exactly for u*_LSOC(t, x). However, the following theorem establishes a relation between the optimal policy u*_LSOC(t, x), the optimal state transition distribution p_{u*_LSOC}(x′|t, x) and the optimal state trajectory distribution p_{u*_LSOC}(τ_x), and summarizes the most profound results in the context of LSOC.

Theorem 1. Given problem (3), the following problems are equivalent,
u*_LSOC(t, x) = arg min_u E_{p_u(τ_x)}[L(τ_x)],   (4)
p_{u*_LSOC}(x′|t, x) = arg min_{p_u(·|t,x)} E_{p_u(x′|t,x)} [ l(x) + log (p_u(x′|t, x)/p_0(x′|x)) + V(t + 1, x′) ],   (5)
p_{u*_LSOC}(τ_x) = arg min_{p_u(τ_x)} E_{p_u(τ_x)}[L(τ_x)].   (6)
The latter can be solved explicitly,
p_{u*_LSOC}(τ_x) ∝ p_0(τ_x) · e^{−L̄(τ_x)},
where V(t, x) is governed by the recursion
e^{−V(t,x)} = e^{−l(x)} ∫ p_0(x′|x) e^{−V(t+1,x′)} dx′,  V(T, x) = l_T(x),
and where u*_LSOC(t, x) and p_{u*_LSOC}(x′|t, x) are related as
p_{u*_LSOC}(x′|t, x) ∝ p_0(x′|x) · e^{−V(t+1,x′)}.
The proof relies on the calculus of variations. These peculiar results follow directly from the fact that the problem no longer depends on u explicitly. For details we refer the reader to earlier references.
Most remarkable is that the equivalence of problems (4) and (5) implies that we can lift the optimization problem from the control space to the state transition distribution space. Second, although we cannot solve for u*_LSOC(t, x) directly, we can solve for the induced optimal state transition distribution, p_{u*_LSOC}(x′|t, x). Note that we should therefore still identify V(t, x). The equivalence of (5) and (6) follows from the recursive definition of V(t, x) that can be evaluated exactly, and thus implies the existence of an explicit optimal state trajectory distribution, p_{u*_LSOC}(τ_x), also. This is a unique trait of the LSOC setting and lies at the very root of every PIC method. The reason why there exists an explicit solution is a direct consequence of the control effort penalization which makes it possible to lift the optimization problem. As an unfortunate side effect, the system uncertainty and the control penalization are now somehow entangled. This we identify as a fundamental limitation of the LSOC setting and related methods.
A final remark can be made with respect to control affine system dynamics, x′ = a(x) + B(x)u, disturbed by unbiased Gaussian noise ξ ∼ N(ξ|0, Σ) in the control subspace. In this case, it is possible to derive an explicit expression for the optimal control policy u*_LSOC(t, x). The controlled transition distribution is then given by p_u(x′|t, x) = N(x′|a(x) + B(x)u(t, x), B(x)ΣB(x)^⊤), and the control policy u*_LSOC(t, x) can be extracted by comparing the expected value of x′ based on either distribution,
B(x)u*_LSOC(t, x) = E_{p_{u*_LSOC}(x′|t,x)}[x′] − a(x).   (7)
This is the discrete time version of the property given in, for example, [41], Theorem 1, and clearly illustrates the path integral terminology. Moreover, it follows that the optimal policy can be estimated with Monte Carlo sampling from the free system dynamics, making it appealing to use directly in an MPC setting [31,32,42]. However, as noted in the introduction, we are mostly interested in its application as a trajectory optimization method.
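The sketch below illustrates how such a Monte Carlo estimate could look: free rollouts of a placeholder control-affine system are weighted by the exponentiated negative state cost and the injected noise is averaged accordingly, in the spirit of Equation (7). The dynamics, cost, horizon and noise covariance are all assumptions made for illustration, not the specific formulation of any of the cited methods.

```python
import numpy as np

def pic_open_loop(a, B, state_cost, x0, T, Sigma, K=256, rng=None):
    """Estimate an LSOC-style open-loop control sequence from free rollouts.

    Rolls out x' = a(x) + B(x) @ xi with free noise xi ~ N(0, Sigma), weights each
    rollout by exp(-accumulated state cost) and returns the weighted average of the
    injected noise per time step as the control estimate."""
    rng = rng or np.random.default_rng()
    m = Sigma.shape[0]
    noise = rng.multivariate_normal(np.zeros(m), Sigma, size=(K, T))
    costs = np.zeros(K)
    for k in range(K):
        x = np.asarray(x0, dtype=float)
        for t in range(T):
            x = a(x) + B(x) @ noise[k, t]
            costs[k] += state_cost(x)
    w = np.exp(-(costs - costs.min()))
    w /= w.sum()
    return np.einsum('k,ktm->tm', w, noise)      # weighted average of the injected noise

# Placeholder scalar system with a quadratic state cost.
u_hat = pic_open_loop(a=lambda x: 0.9 * x, B=lambda x: np.eye(1),
                      state_cost=lambda x: float(x @ x), x0=np.ones(1),
                      T=20, Sigma=0.5 * np.eye(1))
print(u_hat.shape, u_hat[:3].ravel())
```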
In the following section, we detail a number of policy search methods that effectively exploit this peculiar setting. All of them would rely on the stochastic system dynamics to explore the solution space.

Path Integral Control (PIC) Methods
The unique traits of the LSOC framework have been exploited to derive a class of so-called PIC methods. The interested reader is referred to earlier references [26–28,30,32,41,43–47]. An overview of applications was already given in the introduction. Originally, the optimal policy was estimated directly from (7) using Monte Carlo samples generated with the free system, i.e., p_0. Other methods evolved beyond that idea but the basic principle remained the same. In any case the goal is to find the optimal policy u*_LSOC(t, x) relying on the inherent stochasticity of the system. To be able to give a concise overview, we distilled a generic design principle that allows to derive different PIC methods. Note that we identified this principle ourselves; the derivations included in the references may not state it explicitly.
The principle is based on two distributions. We assume the existence of a formal but explicit goal state trajectory distribution, say p u , and, use a parametrized path distribution, p u(θ) , induced by some parametric policy, u(θ).
It is then possible to infer the optimal policy by projecting the parametrized distribution p u(θ) onto the optimal distribution p u , and, determine the corresponding optimal policy parameter by minimizing the projection distance. Based on the cross-entropy argument originally made by [41], the Kullback-Leibler divergence is used as a projection operator.
The following optimization problem is addressed,
θ* = arg min_θ D[p_u‖p_{u(θ)}] = arg max_θ E_{p_u(τ_x)} [ log p_{u(θ)}(τ_x) ] =: arg max_θ J(θ|p_u).   (8)
Provided that we can sample from the distribution p_u, this expectation can be approximated using a Monte Carlo estimator. Assuming that we can evaluate but not sample from p_u, but dispose of a distribution p_{u_g} that we can both evaluate and sample from, another estimate can be obtained through importance sampling,
J_g(θ|p_u) = E_{p_{u_g}(τ_x)} [ (p_u(τ_x)/p_{u_g}(τ_x)) log p_{u(θ)}(τ_x) ].
Such a distribution is easily constructed by applying some policy u_g to the system. Particularly interesting is the setting where u_g is somehow informed about the desired distribution p_u. We denote this objective with subscript g to emphasize the use of the guiding policy u_g.
We introduce the Monte Carlo estimator Ĵ_g(θ|p_u) for the formal objective defined above,
Ĵ_g(θ|p_u) = ∑_{k=1}^K w_g^k log p_{u(θ)}(τ_x^k),  w_g^k ∝ p_u(τ_x^k)/p_{u_g}(τ_x^k),
where the trajectories τ_x^k are sampled with the guiding policy. As a guiding policy we substitute u(θ_g) for u_g. This framework admits an iterative strategy where θ_{g+1} is found by optimizing the objective Ĵ_g(θ|p_u). As the existence of an explicit path distribution was so far a unique trait of the LSOC setting, the derivation of PIC methods has always been tied to that framework.
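A minimal sketch of this design principle is given below, assuming the LSOC form p_u(τ_x) ∝ p_0(τ_x) e^{−L̄(τ_x)} for the goal distribution; the per-rollout statistics are placeholders and the sketch only evaluates the weights and the Monte Carlo objective, leaving the maximization over θ to whatever policy fit is appropriate.

```python
import numpy as np

def pic_weights(path_costs, log_p_free, log_p_guide):
    """Importance weights w_k propto p_u(tau_k) / p_ug(tau_k) for rollouts drawn from
    the guiding distribution, with the goal distribution p_u propto p_0 * exp(-cost)."""
    logw = -path_costs + log_p_free - log_p_guide
    logw -= logw.max()                    # stabilize the exponentiation
    w = np.exp(logw)
    return w / w.sum()

def mc_objective(weights, log_p_theta):
    """Monte Carlo estimate of J_g(theta | p_u): a weighted policy log-likelihood."""
    return float(np.dot(weights, log_p_theta))

# Placeholder statistics for K = 4 sampled trajectories.
w = pic_weights(path_costs=np.array([3.0, 1.0, 2.0, 0.5]),
                log_p_free=np.zeros(4), log_p_guide=np.zeros(4))
print(w, mc_objective(w, log_p_theta=np.array([-1.2, -0.8, -1.0, -0.7])))
```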
Depending on the optimal distribution substituted for p u , the parametrization of the policy u(θ) and the strategy used to solve the optimization problem, different PIC methods are obtained that can be useful to tackle different problems. As already stated in the introduction we are interested in its application in the context of trajectory optimization, i.e., finding local optimal control solutions for a fixed given initial state. Therefore, we will be interested in local policy parametrizations.
Regarding the optimization strategy we can make a distinction between two approaches, denoted as exact methods and gradient descent methods.

Exact Methods
For a particular subset of policy representations in combination with a control affine system model, the objective in (8) can be maximized exactly. If these conditions are met, the exact solution θ* can be substituted for the next guiding policy parameter θ_{g+1}. We refer to these approaches as exact methods. Such methods are particularly interesting for finding local policies. We refer to Section 5 for details.
Regarding the choice of the goal distribution p_u we can make an additional distinction between the Sample Efficient Path Integral Control (SEPIC) and Path Integral Relative Entropy Policy Search (PIREPS) methods.

Sample Efficient Path Integral Control Method
Most convenient is to substitute the desired optimal path distribution p_{u*_LSOC} for p_u, i.e., p_u = p_{u*_LSOC}. This strategy boils down to the one proposed by Williams et al. [30,32,44] and partially to the one proposed in [48] (in [48], the particular estimation of the system dynamics allows for a more general solution in terms of Equation (7); however, this is out of scope here as we are interested in local policy parametrizations). The terminology stems from the fact that the sampling is done using the informed distribution p_{u_g} as a guiding post. Furthermore, if the parametric policy is sufficiently expressive to cover the optimal policy u*_LSOC, this method converges to the actual solution of the LSOC framework. The approximate objective is evaluated as
Ĵ_g(θ) = ∑_{k=1}^K w_g^k log p_{u(θ)}(τ_x^k),  w_g^k ∝ p_{u*_LSOC}(τ_x^k)/p_{u(θ_g)}(τ_x^k).   (9)
SEPIC aims to solve for p_{u*_LSOC} directly. The problem is that, if the original guiding policy is too far removed from the optimal policy, the method will not converge, because the interesting regions of the solution space are simply not sampled and the optimal policy is never discovered.

Path Integral Relative Entropy Policy Search
To remedy this issue [47] proposed PIREPS. Retrospectively, this is an information-theoretic trust-region strategy which introduces a regularization term penalizing the Kullback-Leibler divergence between the new and old path distribution to promote more conservative policy updates. In fact, this is a form of entropy regularization which we will come back to in Section 4. This idea was introduced independently from [47] in [29] to generate local policies in the context of guided policy search.
The approach generates a sequence of intermediate pseudo-optimal path probabilities that aim to improve the convergence properties. This problem can be solved explicitly and the intermediate probabilities are substituted for p u consequently. The idea is that the distribution p u g+1 will be closer to the solution space sampled by p u(θ g ) and that as a direct result the policy updates will be more robust. Note that for g → ∞ the sequence of pseudo-optimal path probabilities collapses on p u * LSOC .
The approximate objective is evaluated as
Ĵ_g(θ) = ∑_{k=1}^K w_g^k log p_{u(θ)}(τ_x^k),  w_g^k ∝ ( p_{u*_LSOC}(τ_x^k)/p_{u(θ_g)}(τ_x^k) )^{1/(1+λ)}.   (10)

Gradient Ascent Methods
Alternatively, one can try to maximize the objective estimator Ĵ_g(θ|p_u) using a gradient ascent method. Such a strategy is suited for general policy parametrizations that aim to find a global approximation of u*_LSOC(t, x).
Again, depending on the choice of the goal distribution p_u, either the Path Integral Cross Entropy (PICE) or the Adaptive Smoothing Path Integral Control (ASPIC) method is obtained.
Path Integral Cross Entropy Method
PICE can be considered the gradient ascent version of SEPIC and was proposed in [41].
Adaptive Smoothing Path Integral Control Method
ASPIC can be considered the gradient ascent version of PIREPS and was proposed more recently in [49].

Other Noteworthy PIC Methods
An important subset of PIC methods, or methods that are associated with the framework of LSOC at least, are Path Integral Policy Improvement algorithms or PI2. These methods are hard to classify as they are somewhere in between Evolutionary Search methods and policy optimization methods and rely on a heuristic temporal averaging strategy to resolve the conflicting policy parameter update schemes. The most important members of this class are PI2 [45], PI2 with Covariance Matrix Adaptation [25] and PI2 with Population Adaptation [50]. The authors of PI2-CMA were the first to pursue the structural equivalence between Evolutionary Strategies and PIC methods. Based on this equivalence they proposed to adapt the policy covariance in correspondence with the CMA-ES algorithm, improving the convergence but destroying the underlying assumption of LSOC. This idea was successfully repeated in the context of trajectory optimization by [29].

Other Remarks
An issue of widespread concern in RL is how to shape a deliberately stochastic policy to obtain an explorative incentive without compromising safety measures or becoming risk seeking by unfortunate coincidence. PIC-based methods were among the first strategies that somewhat address this issue through the implied connection between uncertainty and cost, especially in the case of deterministic systems where a deliberate stochastic policy can be introduced. Unfortunately, the underlying theory of LSOC enforces an inverse proportionality between the control noise magnitude and the control penalty and moreover requires the control to be penalized quadratically (see later). Although justifiable from a control engineering perspective, this also poses severe practical limitations. Nowadays, entropy regularization seems to be a fruitful resolution to address the issue of stochastic policy shaping in a less restrictive or predetermined fashion. For now, it is not really clear how PIC methods relate to the framework of entropy regularized RL and whether they can benefit from recent advances made by this community. We shall address this question in the following section.

Entropy Regularized Path Integral Control
The main purpose of this section is to identify a novel SOC problem that can be solved explicitly, yielding a formal optimal state trajectory distribution. The problem formulation is in that sense similar to the LSOC setting, yet it addresses a less restricted set of optimal control problems. Based on the general design principle for PIC methods discussed in Section 3.2, this optimal distribution can be substituted for the goal distribution p_u, hinting at a generalized set of PIC methods which will be investigated structurally for the purpose of model-free trajectory optimization in Section 5.
Our problem formulation relies heavily on the concept of entropy regularization. Entropy regularization is a setting of widespread use in the context of RL nowadays, yet it is less studied in the context of (static) stochastic optimization or stochastic search, at least in the way that we will treat it.
As we will show, our SOC framework shares properties with static stochastic optimization, and therefore the results that we derive in the latter setting also hold in the context that enjoys our interest. Furthermore, we note that our results related to static stochastic optimization may be of interest to the Evolutionary Search community. Vice versa, the methods in Section 5 may benefit from prevailing ideas in the stochastic optimization community in order to deal with the problem of sample efficiency, which is still considered to be a major challenge by the RL community. An example of such a strategy that is potentially interesting is importance mixing [51,52].

Entropy Regularized Optimization
There exists a large body of work that addresses the relation between inference and control. A lesser amount of work investigates the relation between inference and optimization. In this brief section, we provide an original and convincing argument to introduce an entropic regularization term into formal stochastic optimization problems that does not rely on an information-theoretic but on a strictly inference related argument. Although the resulting framework and associated distributions are known, our justification is original and in our opinion more intuitive. The concept results in a distribution sequence which exhibits a number of interesting properties and allows us to make formal statements about convergence rates of derived practical search methods. As it will turn out, these properties also directly apply to the entropy regularized optimal control problem that we will introduce in Section 4.3. To appreciate the argument, we must give an introduction to entropic inference.

Entropic Inference
Inductive inference refers to the problem of how a rational agent should update its state of knowledge, or so-called belief, about some quantity when new information about that quantity becomes available. Beliefs about the quantity x ∈ X ⊂ R n are modeled as distributions. An inference procedure refers to the computational framework that establishes how to integrate new information with information held by any prior belief, say ρ, to determine an informed posterior, say π, consistently.
A well-known framework is that of Bayesian inference which allows to process new information in the form of data. A lesser known inference framework deals with the setting where information is available in the form of a constraint on the family of acceptable posteriors [53], specifically constraints in the form of the expected value of some function f : X → R^n, i.e., E[f] = µ. Consequently, we can focus our attention on the subset of distributions that agree with it, C_f(µ) := {π ∈ P : E_π[f] = µ}.
Any π ∈ C f (µ) that satisfies the information constraint qualifies as a potential posterior belief. The challenge thus reduces to identifying a unique posterior from among all those that could give rise to the constraint. The solution is to establish a ranking on the set C f (µ) by determining a functional F that associates a value to any posterior π relative to a given prior ρ. The inference procedure is then effectively cast into an optimization problem.
The problem remains to find a suitable and meaningful functional F. The measure should promote a distribution that is maximally unbiased (i.e., maximally ignorant) about all the information we do not possess. This principle is referred to as minimal updating. This problem setting dates back at least to the maximum entropy principle first introduced by Jaynes [54,55]. Many authors have provided compelling theoretical arguments for the relative entropy as the only consistent measure [39,56–60]; here it is important to emphasize that no interpretation of the measure is given. It simply is the only measure that agrees with the axioms of minimal updating.
The following variational problem determines the framework of entropic inference,
π* = arg min_{π ∈ C_f(µ)} D[π‖ρ].
We now possess a rational argument as to why an entropy regularization term is added to a problem and what its effect will be on the solution.

Optimization as an Inference Problem
Let us now argue how the principle of entropic inference can be practiced to serve the purpose of static or classic optimization. For convenience we assume that the objective f : X ⊂ R n → R has a unique global minimum.
x* = arg min_{x∈X} f(x)
Assume we can model any beliefs we have about the solution x* by some prior distribution, say ρ ∈ P. Second, instead of supposing information in the form of the expected value of the objective f, here we only require that the expected value with respect to the posterior, π, is some amount ∆ > 0 smaller than the expected value with respect to the prior. In this fashion, we change our prior belief about the optimal solution, but only to the minimal extent required to decrease the expectation taken over f by some arbitrary value ∆. Put differently, we obtain a posterior that makes least claim to being informed about the optimal solution beyond the stated lower limit on the expectation. This idea can be formalized accordingly,
π = arg min_{π∈P} D[π‖ρ]  subject to  E_π[f] ≤ E_ρ[f] − ∆.   (12)
As the relative entropy minimizes for π = ρ, it follows that the inequality tightens into an equality. Nonetheless, we should be careful to pick a value for ∆ > 0 that respects the bound ∆ ≤ E_ρ[f] − f*. This is a practical concern that does not interfere with what we wish to accomplish, which is to construct a minimal update procedure that we can practice to serve the purpose of optimization. It suffices to solve the problem above for π using variational calculus. This yields the following entropic update rule,
π ∝ ρ · e^{−λ f},   (13)
where λ > 0 denotes the Lagrangian multiplier associated with the inequality constraint. This update principle is well known and can be subjected to an interesting interpretation (the posterior distribution, π, is equal to the prior distribution, ρ, multiplied with a cost driven probability shift, e^{−λ f}, that makes rewarding regions more probable, resembling the concept of Bayesian inference; the transformation p(f) ∝ e^{−λ f} maps costs to probabilities, and indeed one may recognize the inverse of the log-likelihood transformation from probability to cost as it is often used in the context of Bayesian inference). However, as far as we are aware, it has never been derived from the theory of entropic inference or thus minimal updating, which makes it possible to appreciate it in a more general light.
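The following snippet illustrates the entropic update rule (13) on a discretized one-dimensional domain; the objective, grid and value of λ are placeholders. One update shifts a uniform prior towards the minimizer and lowers the expected cost, exactly as the constraint in (12) demands.

```python
import numpy as np

def entropic_update(prior, f_values, lam):
    """Posterior pi propto rho * exp(-lambda * f) on a discretized domain."""
    post = prior * np.exp(-lam * (f_values - f_values.min()))
    return post / post.sum()

xs = np.linspace(-3.0, 3.0, 601)          # placeholder 1-D optimization domain
f = (xs - 1.0) ** 2                        # placeholder objective with minimum at x* = 1
rho = np.full_like(xs, 1.0 / len(xs))      # uniform prior belief about x*
pi = entropic_update(rho, f, lam=2.0)
print(xs[np.argmax(pi)])                   # posterior mode moves towards x*
print(np.dot(rho, f), np.dot(pi, f))       # E_rho[f] > E_pi[f]
```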
The exact value of λ can be determined by solving the dual optimization problem. Alternatively, we could also pick any λ > 0 without the risk of overshooting the constraint ∆ ≤ E_ρ[f] − f*. We will show that when λ → ∞, the expectation E_π[f] collapses on the exact solution f*. Therefore, 1/λ reduces to a temperature-like quantity that determines the amount of information that is added to ρ, where in the limit exactly so much information is admitted as to precisely determine f* and thus x*.
The same strategy is obtained by considering the entropy regularized stochastic optimization problem below,
π_{g+1} = arg min_{π∈P} E_π[f]  subject to  D[π‖π_g] ≤ ε.
Here an information-theoretic trust-region is introduced to promote conservative distribution updates but, apart from the fact that the problem can be solved exactly, there is no proper motivation, at least not in the sense of minimal updating. This strategy has been adopted by previous authors from the optimization community [11,61].
Again it is easier to choose a suitable value for the implied Lagrangian multiplier λ > 0 than it is for ∆ > 0. The problem above is therefore often relaxed using a penalty function.

Theoretical Search Distribution Sequences
The entropic updating procedure suggested in (13) can be solved exactly and implies a theoretical distribution sequence, substituting the previous posterior for the next prior,
π_{g+1} ∝ π_g · e^{−λ f},   (14)
so that
π_g ∝ π_0 · e^{−gλ f}.   (15)
The update mechanism in (14) can be used as a theoretical model to shape practical search distribution sequences that can be used to solve the underlying optimization problem. In practice one seeks algorithms that estimate the posterior distribution π g+1 from samples taken with the prior π g . The theory now implies that the sequence will get gradually more informed about the optimum.
As far as we are aware, the properties of the theoretical distribution sequence in (15) have not been studied before. In the following theorem we summarize some of its properties. For the proof we refer to Appendix A.
Theorem 2. Assume that, without loss of generality, the objective f attains a unique global minimum at the origin. Further define the sequence of search distributions {π_g} so that π_g ∝ π_0 · e^{−g f}. Then, it holds that
• the distribution sequence {π_g} collapses in the limit on the Dirac delta distribution, in the sense that lim_{g→∞} π_g(x) = δ(x),   (16)
• the function E_{π_g}[f] of g is monotonically decreasing regardless of π_0,   (17)
• the function H[π_g] of g is monotonically decreasing if π_0 is chosen uniform on X.   (18)
The property (17) is suggested by the problem definition in (12). It follows that the sequence {E_{π_g}[f]} converges monotonically to f*, implying that {π_g} should converge to x* (16). If we thus construct a stochastic optimization algorithm that maintains a sequence of search distributions governed by (14), the algorithm should converge monotonically to the optimal solution.
This sequence of search distributions is increasingly more informed about the solution and as a result its entropy content deteriorates as the sequence converges to the minimum (see Appendix A).
Properties (16) and (18) thus imply that the entropy slowly evaporates. This is an important observation, indicating that the explorative incentive of the search distribution deteriorates for increasing g. Therefore, when we use this sequence to shape a practical search algorithm, this property suggests that the entropy of the sample population will slowly diminish. As a result of the finiteness of the population, it will become increasingly more likely that the correct distribution parameters cannot be inferred from the population, so that the sequence collapses prematurely.
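This behavior is easy to reproduce numerically. The snippet below iterates the sequence π_g ∝ π_0 · e^{−g f} of Theorem 2 on a discretized placeholder objective and prints the expected cost and the (discrete) entropy of the grid weights, both of which decrease monotonically.

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 601)
f = xs ** 2                                     # unique global minimum at the origin
pi = np.full_like(xs, 1.0 / len(xs))            # uniform pi_0
for g in range(1, 6):
    pi = pi * np.exp(-f)                        # one entropic update per generation
    pi /= pi.sum()
    expectation = np.dot(pi, f)                 # E_{pi_g}[f], monotonically decreasing
    entropy = -np.dot(pi, np.log(pi + 1e-300))  # entropy of the grid weights, also decreasing
    print(g, round(expectation, 4), round(entropy, 4))
```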
In order to stimulate the sequence to preserve an explorative incentive, i.e., to manage the entropic content, problem (13) can be relaxed using an entropy term (γ > 0), which will force the distribution to maintain a finite entropy,
π_{g+1} = arg min_{π∈P} E_π[f] + λ D[π‖π_g] − γ H[π].   (19)
The entropy regularized problem above implies the following distribution sequence,
π_{g+1} ∝ π_g^{λ/(λ+γ)} · e^{−(1/(λ+γ)) f}.   (21)
This is easily verified relying on the calculus of variations.
As opposed to (15), sequence (21) does not collapse on the minimum, but instead converges to π_∞ ∝ e^{−(1/γ) f}. The latter is equal to the exact solution of the entropy relaxed optimization problem defined below,
π_∞ = arg min_{π∈P} E_π[f] − γ H[π].
The derivatives of the implied functions E_{π_g}[f] and H[π_g] with respect to g both carry a factor α(g) = (λ/(λ+γ))^g > 0, so that lim_{g→∞} α(g) = 0. The convergence rate of the implied function E_{π_g}[f] now also depends on the interaction between f and π_0. If the initial distribution π_0 is taken to be uniform on X, so that cov_{π_g}[π_0, f] = 0, the expected value will decrease monotonically. As opposed to (18), the rate will stall for large g. Derivations are provided in Appendix B.
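For completeness, the snippet below iterates the relaxed sequence as reconstructed in (21) on the same discretized placeholder problem and checks that it approaches e^{−f/γ} rather than collapsing on the minimizer; the values of λ and γ are arbitrary.

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 601)
f = xs ** 2
lam, gam = 1.0, 0.5
pi = np.full_like(xs, 1.0 / len(xs))                              # uniform pi_0
for g in range(50):
    pi = pi ** (lam / (lam + gam)) * np.exp(-f / (lam + gam))     # relaxed update (21)
    pi /= pi.sum()
target = np.exp(-f / gam)
target /= target.sum()
print(np.abs(pi - target).max())   # the sequence settles on exp(-f / gamma), not on a Dirac
```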
In Section 4.3, we introduce a stochastic optimal control problem that is characterized by a similar distribution sequence. Therefore it inherits the properties described above. The usefulness of these theoretical distributions is illustrated in Appendix C, where we derive a stochastic search method.
Next, let us return to our original problem and briefly review entropy regularized SOC and its usefulness to RL as a stepping stone to Section 4.3.

Entropy Regularized Optimal Control
As we already explained in the introduction, there is no reason to maintain a stochastic policy in the generic SOC setting (2). On the other hand, a stochastic policy can be of interest to derive policy updating algorithms that depend on stochasticity for improved exploration rather than to rely on the stochastic system dynamics only.
As was justified rigorously in the previous section, this can be achieved by introducing an entropy relaxation term in the optimization objective. Such a term actively stimulates a stochastic policy. Nowadays, this is a setting of widespread use in the RL community and was explored extensively by Levine et al., particularly in the context of Actor-Critic RL algorithms [35,62]. In recent literature, this problem has been regularized successfully using a relative entropy relaxation term in order to limit policy oscillations between updates [36–38,63,64]. These references address the following entropy regularized SOC problem,
V_{g+1}(t, x) = min_π E_{π(u|t,x)} E_{p(x′|x,u)} [ r_t(x, u) + V_{g+1}(t + 1, x′) ] + λ D[π(·|t, x)‖π_g(·|t, x)] − γ H[π(·|t, x)].   (24)
Relying on the calculus of variations, one can verify that the solution is given by [38,65]
π_{g+1}(u|t, x) ∝ π_g(u|t, x)^{λ/(λ+γ)} · exp ( −(1/(λ+γ)) E_{p(x′|x,u)} [ r_t(x, u) + V_{g+1}(t + 1, x′) ] ).
The authors in the respective references use this problem formulation as a starting point to derive policy search methods. Because of the expectation in the exponent, it is impossible to practice the recursion and find an explicit expression for the corresponding optimal path distribution, which would circumvent the dependency on the value function. Therefore all of them require estimating the value function, V_g(t, x), or state-action value function, Q_g(t, x, u), either using a general function approximator or a local approximation. As a consequence, the entropy regularized optimal control framework can also not be used to derive a PIC method, at least not in the sense discussed in Section 3.2.
In this article, we specifically limit our focus to the class of PIC methods described in Section 3.2, which, as stated, assume the existence of an explicit goal state trajectory distribution. In the following section, we introduce an adjusted entropy regularized stochastic optimal control problem for which there does exist a formal yet explicit optimal path distribution, and that consequently can be treated with the PIC machinery put in place.

Entropy Regularized Trajectory Optimization
As discussed in the previous section, there exists no explicit expression for the optimal path distribution sequence in the setting of entropy regularized SOC. Consequently, the framework can not be leveraged by the PIC design principle described in Section 3.2. Here, we aim to identify an entropy regularized SOC problem that can be solved for an explicit trajectory distribution. As the elimination of the explicit policy was identified to be an essential condition in the setting of LSOC, we must again get rid of the stochastic policy, π. We therefore suggest to absorb it into the implied state transition distribution, p_π. Analogously to LSOC, this will elevate the problem from the policy to the state transition distribution optimization space, provided that we can also get rid of the dependency of the cost rate r_t(x, u) on the control effort u.
Let us therefore assume there exists an inverse dynamic function f^{−1} so that u = f^{−1}(t, x, x′) when x′ = f(t, x, u). This implies a trajectory cost rate function c_t(x, x′) = r_t(x, f^{−1}(t, x, x′)) that accumulates into a trajectory cost C(τ_x) = c_T(x_T) + ∑_{t=0}^{T−1} c_t(x_t, x_{t+1}) (we emphasize here that the existence of such an inverse model is a mere formal assumption; in the practical setting to be presented in Section 5, we will never actually have to invert the system dynamics given that we have direct access to the control values that have been applied to the system).
These definitions allow us to define a stochastic state trajectory optimization problem. We can rewrite the stochastic Bellman Equation (2) and substitute the expressions above, which yields the similar problem
V(t, x) = min_π E_{p_π(x′|t,x)} [ c_t(x, x′) + V(t + 1, x′) ].
As there is a one-to-one correspondence between π and p_π, one can replace the minimization with respect to the policy by a minimization with respect to the implied state transition distribution p_π,
V(t, x) = min_{p_π} E_{p_π(x′|t,x)} [ c_t(x, x′) + V(t + 1, x′) ].
Finally, we can regularize and relax the problem according to the general entropic principles justified in Section 4.1, which are thus also common in the RL community.
We end up with what we refer to as the Entropy Regularized Trajectory Optimization (ERTO) problem,
V_{g+1}(t, x) = min_{p_π} E_{p_π(x′|t,x)} [ c_t(x, x′) + V_{g+1}(t + 1, x′) ] + λ D[p_π(·|t, x)‖p_{π_g}(·|t, x)] − γ H[p_π(·|t, x)].   (26)
Remark that this problem is fundamentally different from that given in (24), considering that the penalties, D[π‖π_g] and D[p_π‖p_{π_g}], and, H[π] and H[p_π], do not express the same restrictions.
Analogously to the LSOC setting, we summarize the most profound properties of problem (26) in the following theorem. The theorem also establishes a relation between the optimal stochastic policy π_g(u|t, x), the optimal state transition distribution p_{π_g}(x′|t, x) and the optimal state trajectory distribution p_{π_g}(τ_x). Again, the proof relies mostly on the calculus of variations. We direct the reader to Appendix D for details.

Theorem 3.
With V_{g+1}(t, x) defined as in (26), the following problems are equivalent,
p_{π_{g+1}}(x′|t, x) = arg min_{p_π(·|t,x)} E_{p_π(x′|t,x)} [ c_t(x, x′) + V_{g+1}(t + 1, x′) ] + λ D[p_π(·|t, x)‖p_{π_g}(·|t, x)] − γ H[p_π(·|t, x)],
p_{π_{g+1}}(τ_x) = arg min_{p_π(τ_x)} E_{p_π(τ_x)}[C(τ_x)] + λ D[p_π(τ_x)‖p_{π_g}(τ_x)] − γ H[p_π(τ_x)].   (30)
The latter can be solved explicitly,
p_{π_{g+1}}(τ_x) ∝ p_{π_g}(τ_x)^{λ/(λ+γ)} · e^{−(1/(λ+γ)) C(τ_x)},   (31)
where V_{g+1}(t, x) is governed by the recursion
e^{−(1/(λ+γ)) V_{g+1}(t,x)} = ∫ p_{π_g}(x′|t, x)^{λ/(λ+γ)} e^{−(1/(λ+γ)) (c_t(x,x′)+V_{g+1}(t+1,x′))} dx′,
and where π_g(u|t, x) and p_{π_g}(x′|t, x) are related as
p_{π_g}(x′|t, x) = ∫ p(x′|x, u) π_g(u|t, x) du.
We list the most important implications of Theorem 3:
• As a result of the entropy regularization and the state trajectory lifted optimization space, it is now possible to obtain another formal yet explicit optimal state trajectory distribution. Note that this is not the same optimal distribution as we derived in the LSOC setting, given that here the control is not penalized through a Kullback-Leibler divergence term but is penalized implicitly through the cost C. When we evaluate C and have access to the state-action trajectory τ, we can simply replace C by R. Here, we emphasize that p_{π_g} still represents the state trajectory distribution, which is now also a function of the actions.
• Second, the theorem implies that we can readily apply the PIC design strategy described in Section 3.2, substituting the optimal path distribution sequence (31) for p π , and, the parametrized path distribution p π(θ) induced by some stochastic parametric policy π(θ) for p u(θ) , to derive a generalized class of PIC methods. We refer to this class as Entropy Regularized Path Integral Control or ERPIC.
• As there is no longer a formal difference between problem (19) and (30), it follows that the properties derived for the sequence (21) also hold for the optimal state trajectory distribution sequence {p_{π_g}} and imply monotonic convergence to p_{π_∞}(τ_x) ∝ exp(−(1/γ) C(τ_x)).
We conclude this section with a final remark related to control affine deterministic system dynamics, x′ = a(x) + B(x)u, in this setting controlled by a stochastic policy of the form π_g(u|t, x) = N(u|u_g(t, x), Σ_g(t, x)). Again, it is possible to derive an explicit expression for the optimal policy. The controlled transition distribution is given by p_{π_g}(x′|t, x) = N(x′|a(x) + B(x)u_g(t, x), B(x)Σ_g(t, x)B(x)^⊤). By matching the moments of the two distributions N(x′|a(x) + B(x)u_{g+1}(t, x), B(x)Σ_{g+1}(t, x)B(x)^⊤) and p_{π_{g+1}}(x′|t, x), we find expressions for u_{g+1}(t, x) and Σ_{g+1}(t, x).
It is easily verified that the resulting expressions for u_{g+1}(t, x) and Σ_{g+1}(t, x) can be written as expectations over the policy noise ξ ∼ N(0, Σ_g(t, x)) (Equation (33)). We point out the similarities with (7). Further note that for λ = 0 the information-theoretic trust-region is lifted, in which case (33) can be applied in the same MPC sense as (7).

Formal Comparison of Path Integral Control Methods
In this final section, we give a formal comparison of three different PIC methods based on the general design principle identified in Section 3.2 that are tailored to trajectory optimization. The goal distribution p_u can now either root from the LSOC or the ERTO setting, respectively introduced in Sections 3.1 and 4.3. The methods we discuss are tailored to find local solutions, that is, policies that control the system when the initial state is equal to a given state x_0. Sometimes one refers to such strategies as trajectory-based policy optimization methods. All of the methods are model-free and sample-based. Currently, such methods are used in the literature either to deal with complex simulation environments where traditional gradient based trajectory optimizers fall short [28,42] or to derive reinforcement learning algorithms [29,41,49].
The methods discussed here are in that sense most closely related to [38]. However, that algorithm derives from the generic entropy regularized SOC problem (24) which differs significantly from the problem formulation proposed in Section 4.3. We come back to this later.
Remark that, depending on the theoretical setting, either LSOC or ERTO, we have to use a different parametrized state trajectory distribution, which has an effect on the PIC objectives. In the case of LSOC based PIC we address the objective
Ĵ_g(θ) = ∑_{k=1}^K w_g^k log p_{u(θ)}(τ_x^k),   (34)
while in the case of ERTO based PIC we address
Ĵ_g(θ) = ∑_{k=1}^K w_g^k log p_{π(θ)}(τ_x^k).   (35)
The values w_g^k differ for each method and will be discussed in the following section.

Control Affine Systems
We concentrate on systems governed by deterministic control affine dynamics. In either case this is the sole condition for which an explicit solution exists for the optimal policies, u*_LSOC(t, x) or π_g(t, x), given in (7) or (33), respectively. Furthermore, it is also the sole condition for which w_g^k can be evaluated exactly. This assumption introduces little practical limitation as many systems comply with this system model.
Further, let us assume Gaussian policies of the form π_{g,t}(u|x) = N(u|u_{g,t}(x), Σ_{g,t}(x)). In the case that we discuss methods derived from the LSOC setting, this means that we deliberately introduce a stochastic policy to mimic stochastic system dynamics. We emphasize that in this case Σ_t(x) can not be chosen freely as it is directly related to the control effort penalization. From here on forth, we can thus consider stochastic policies in the LSOC setting as long as we silently acknowledge that the covariance can not be chosen freely. The state transition distribution is then given by p_{π_g}(x′|t, x) = N(x′|a(x) + B(x)u_{g,t}(x), B(x)Σ_{g,t}(x)B(x)^⊤). We can now substitute these expressions into the different PIC objectives defined throughout the rest of the paper to find expressions for w_g^k. Therefore, we will require a number of intermediate results that follow from the fact that the sampled trajectories satisfy the deterministic dynamics, i.e., x_{t+1}^k = a(x_t^k) + B(x_t^k)(u_{g,t}(x_t^k) + ξ_t^k) with ξ_t^k ∼ N(0, Σ_{g,t}(x_t^k)). Finally, we can write the weights w_g^k as e^{−P(τ_x^k)}, where the function P(·) depends on the specific objective. We will address the two LSOC based objectives associated with SEPIC (or PICE) (9) and PIREPS (or ASPIC) (10), and the ERTO based objective associated with ERPIC (32). When we substitute the appropriate expressions in the corresponding objectives, we obtain, respectively:
• SEPIC (or PICE): the function P reads as the cost accumulated over the trajectory τ^k where the states and control efforts are penalized separately. The trailing term is included to compensate for the full noise penalization [41]. To penalize the states any nonlinear function can be used; the control is penalized using a quadratic penalty term which depends on the noise added to the system. A crucial limitation is clearly that the exploration noise and the control cost are therefore coupled.
• PIREPS (or ASPIC): Here $P_{\mathrm{PIREPS}} = \frac{1}{1+\lambda} P_{\mathrm{SEPIC}}$. It turns out that in this setting the weights $w_g^k$ are simply a smoothed version of those associated with SEPIC. For $\lambda \gg 1$ (i.e., strong regularization) the weights all have approximately the same value and therefore $u_{g+1}(t, x) \approx u_g(t, x)$. For $1 \gg \lambda > 0$ (i.e., weak regularization), the method reduces to SEPIC.
• ERPIC: One easily verifies that in the ERPIC setting the function P represents the cost accumulated over the trajectory $\tau^k$ where the states and control efforts are no longer penalized separately.
Here, a discount term is included that promotes uncertain trajectories, making sure that the entropy of the search distribution does not vanish prematurely. Second, we wish to point out the obvious similarities with the stochastic search method in Appendix C.
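The sketch below illustrates how the weighting schemes above could be evaluated numerically once the per-trajectory costs P have been accumulated; the function and its arguments are placeholders for the exact expressions above, not a definitive implementation.

import numpy as np

def pic_weights(P, variant="SEPIC", lam=1.0):
    """Turn per-trajectory costs P[k] into normalized PIC weights w[k] proportional to exp(-P[k])."""
    if variant == "PIREPS":
        P = P / (1.0 + lam)          # smoothed version of the SEPIC costs
    P = P - P.min()                  # common shift for numerical stability; cancels after normalization
    w = np.exp(-P)
    return w / w.sum()               # self-normalized likelihood weights

w = pic_weights(np.array([3.2, 1.1, 4.7, 0.9]), variant="PIREPS", lam=10.0)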

Locally Linear Gaussian Policies
Finally, we solve the optimization problem defined in (32) exactly. To that end, we approximate the Gaussian policy $\mathcal{N}(u \mid u_t(x), \Sigma_t(x))$ with a locally linear Gaussian feedback policy of the form $\mathcal{N}(u \mid u_{g,t} + K_{g,t} x, \Sigma_{g,t})$.
In the ERPIC setting, the proportionality between the control penalization and the injected noise is lifted and we therefore gain access to the full parametrization of the policies, specifically $\theta_{g,t} = \{a_{g,t}, K_{g,t}, \Sigma_{g,t}\}$. Starting from (35b), this procedure yields the following updates (we refer to Appendix E for a proper derivation). The notation $\overline{f}$ is shorthand for the likelihood-weighted average $\sum_{k=1}^{K} w_g^k f^k / \sum_{k=1}^{K} w_g^k$, while $\widehat{f}$ is shorthand for the empirical mean $\frac{1}{K}\sum_{k=1}^{K} f^k$. The feedback gain and covariance updates take the form
$$K_{g+1,t} = \Sigma_{ux,g,t}\,\Sigma_{xx,g,t}^{-1}, \qquad \Sigma_{g+1,t} = \Sigma_{uu,g,t} - \Sigma_{ux,g,t}\,\Sigma_{xx,g,t}^{-1}\,\Sigma_{xu,g,t},$$
with $\Delta\mu_{u,g,t} = \overline{\Delta u_t^k}$ the likelihood-weighted average of the sampled control deviations, and where $\Sigma_{ux,g,t}$, $\Sigma_{xx,g,t}$ and $\Sigma_{uu,g,t}$ denote the corresponding likelihood-weighted second moments of the sampled state and control deviations.
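Purely as an illustration, a likelihood-weighted moment-matching step for a single time instant t could be sketched as follows; the variable names (xs, us, w) are assumptions made for the sake of the example, and the exact expressions are those of Appendix E.

import numpy as np

def fit_locally_linear_gaussian(xs, us, w):
    """Fit N(u | a + K x, S) at one time step from samples xs (K,n), us (K,m) and weights w (K,)."""
    w = w / w.sum()
    mu_x = w @ xs                                   # likelihood-weighted state mean
    mu_u = w @ us                                   # likelihood-weighted control mean
    dx, du = xs - mu_x, us - mu_u
    S_xx = (w[:, None] * dx).T @ dx                 # weighted second moments of the deviations
    S_ux = (w[:, None] * du).T @ dx
    S_uu = (w[:, None] * du).T @ du
    K = S_ux @ np.linalg.inv(S_xx)                  # feedback gain, cf. the update for K_{g+1,t}
    S = S_uu - K @ S_ux.T                           # conditional covariance, cf. Sigma_{g+1,t}
    a = mu_u - K @ mu_x                             # feedforward term
    return a, K, S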

Discussion
As was made clear in the introduction of the paper, it is not our intention to provide a numerical analysis or study. We are merely concerned with the relation between all the topics that we touched upon. We therefore conclude by discussing a number of observations that are of interest to fully grasp the relation between Path Integral Control methods, Stochastic Search methods and Reinforcement Learning, and the implied limitations.

Remarks Related to Stochastic Search Methods and Variance
In this section, we comment on Stochastic Search Methods from two perspectives: The first perspective is related to the entropy regularized optimization framework and the theoretical search distribution sequences introduced in Section 4.1. The second perspective is related to the apparent connections between Entropy Regularized Optimization tailored to standard optimization and Entropy Regularized Trajectory Optimization which is tailored to optimal control problems.
As far as we are aware, we provide an original argument to address the standard optimization problem as a probabilistic inference problem. The idea of treating the belief about the solution as a distribution function and aiming to reduce the expected cost value with each belief update is, in our opinion, very intuitive. It seems worth pursuing whether existing stochastic search methods fit this abstract framework and whether other practical methods can be modeled after the theoretical distribution sequences. However, the mathematical framework can also be rewritten as a variational optimization problem with respect to a search distribution regularized by information-theoretic or information-geometric trust-regions. From this perspective our entropic inference argument sheds new light on earlier work in the context of stochastic search methods [11,12,14,36,61,64].
By conducting an entropy regularization in the context of Stochastic Optimal Control that draws direct inspiration from the LSOC setting, we were able to formulate ETRO. This problem shares with the LSOC problem the crucial trait that made the latter a topic of interest in learning-based control. As a result, the problem loses the so-called optimal substructure property which is characteristic of Optimal Control, and can therefore be treated as a standard entropy regularized optimization problem. This has two consequences: on the one hand, the theoretical analysis becomes easier to handle; on the other hand, the framework infers the temporal policy parameters as if they influence the entire trajectory instead of only the part from time instant t onward.
Following the latter observation, it is interesting to compare the updates in (37) with those in Appendix C. In retrospect, these are almost identical apart from one crucial difference. The same weights $w_g^k$ are used to update the temporal parameters $\{u_{g,t}, K_{g,t}, \Sigma_{g,t}\}$ regardless of t. This relates directly to the comment above and has a dreadful practical implication. The effect of a random policy variation $\delta u_t^k$ at time t is accounted for by the complete associated trajectory. The trajectory itself is, however, also influenced by random policy variations at the other time instants $t' \in \{0, \dots, t-1, t+1, \dots, T-1\}$. We remark that the approach is therefore expected to suffer from high variance, especially for $T \gg 1$, which will ultimately destabilize the convergence. This observation exposes a fundamental flaw of the framework described in Section 3.2 and therefore holds for any PIC method, including the existing ones. The high variance can be partially alleviated by acknowledging that the parameter update at time instant t is independent of the stochastic variables $\{x_{t'}^k, u_{t'}^k\}_{t'=0}^{t-1}$, so that the optimization problem in (36) can be revised using temporal weights $w_{g,t}^k = w_{g,t}(\tau^k) = \exp(-P_{t,\mathrm{ERPIC}}(\tau^k))$, where $P_{t,\mathrm{ERPIC}}$ accumulates the cost only from time instant t onward. The update is then more in line with the explicit expression for the optimal stochastic policy given in (33).
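The contrast between global and temporal weights is easy to illustrate numerically, under the assumption that the per-time-step costs of every sampled trajectory are stored in an array costs of shape (K, T); the simplified sums below are a sketch and do not reproduce the exact form of $P_{t,\mathrm{ERPIC}}$.

import numpy as np

def temporal_weights(costs):
    """Per-time-step weights from the cost-to-go: w[k, t] proportional to exp(-sum_{t'>=t} costs[k, t'])."""
    cost_to_go = np.cumsum(costs[:, ::-1], axis=1)[:, ::-1]   # accumulate cost from t to T-1
    P = cost_to_go - cost_to_go.min(axis=0, keepdims=True)    # stabilize per time step
    w = np.exp(-P)
    return w / w.sum(axis=0, keepdims=True)                   # normalize over the K samples

def global_weights(costs):
    """Full-trajectory weights reused at every time step; variations at other instants leak into t."""
    P = costs.sum(axis=1)
    P = P - P.min()
    w = np.exp(-P)
    return np.repeat((w / w.sum())[:, None], costs.shape[1], axis=1)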

Positioning of PIC Methods within the Field of RL
Formerly it was unclear how the framework of Linearly Solvable Optimal Control related exactly to other RL methods. By identifying the framework of Entropy Regularized Trajectory Optimization and comparing it with the framework of Entropy Regularized Stochastic Optimal Control this ambiguity is lifted. Second, the model-free (local) trajectory optimization algorithms presented in this section are most closely related to the method presented in [38]. However, that algorithm is based on problem formulation (24) and therefore requires estimating the Q_g-function, which is not required in the strictly PIC-based setting. Related to the final comment above, this circumvents the problem of temporal variance induced by random policy variations at different time instants and is therefore expected to perform better in practice.

Conclusions
In this paper, we addressed PIC methods, a class of policy search methods which have been closely tied to the theoretical setting of LSOC. Our work was motivated primarily by an unsatisfactory understanding and generality of Path Integral Control methods in relation to the setting of LSOC. Nevertheless, the referred class of policy search methods has enjoyed interest in the RL community, with applications ranging from robust trajectory optimization and guided policy search to robust model-based predictive control.
The LSOC setting was considered to be unique in the sense that an explicit expression exists for the optimal state trajectory distribution, a property we identified to be fundamental to the class of PIC methods. We illustrated that the existence of such a solution is not a unique trait of the LSOC setting and argued that a similar solution can be derived from within the setting of Entropy Regularized Stochastic Optimal Control, a framework of widespread use nowadays in the RL community. The setting of Entropy Regularized Stochastic Optimal Control allows us to treat a more general class of optimal control problems than the setting of LSOC, rendering LSOC obsolete.
In either case, the result follows from lifting the stochastic optimal control problem from the policy to the state transition distribution space, making the control implicit. As a result of this characteristic, the properties of the implied state trajectory distribution sequence can be analyzed. We showed that the sequence converges monotonically to the solution of the underlying deterministic optimal control problem.
To treat this sequence more formally, we give an original and compelling argument for the use of information-geometric measures in the context of stochastic search algorithms based on the principle of entropic inference. The main idea is to maintain a belief function over the solution space that is least committed to any assumption about the solution, apart from the requirement that the expectation over the objective should decrease monotonically between updates. The resulting Entropy Regularized Optimization framework may serve as an overarching paradigm to analyze and derive stochastic search algorithms.
In conclusion, the value of this article is that it identifies PIC methods for what they really are: a class of model-free and approximation-free policy search algorithms that are theoretically founded on a specific subclass of Entropy Regularized Stochastic Optimal Control. Therewith, the underlying machinery associated with the preceding derivations is untied from the peculiar setting of LSOC and the relation between PIC methods and state-of-the-art Reinforcement Learning is finally demystified. Nevertheless, our investigation suggests that PIC methods are in fact structurally more closely related to Stochastic Search Methods tailored to classic optimization problems than they are to state-of-the-art Reinforcement Learning methods tailored to optimal control and decision-making problems, and that the associated policy parameter updates will therefore be prone to higher variance.

Acknowledgments:
The authors wish to express their gratitude towards the anonymous reviewers for providing valuable suggestions and for pointing out crucial literature references.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
• Consider any $x^\bullet \in \mathcal{X}$. These definitions allow us to derive an upper bound for the value of $\pi_g(x^\bullet)$, specifically
$$\pi_g(x^\bullet) = \frac{\pi_0(x^\bullet)\, e^{-\frac{g}{\lambda} f^\bullet}}{\int \pi_0\, e^{-\frac{g}{\lambda} f}\,\mathrm{d}x} = \frac{\pi_0(x^\bullet)}{\int \pi_0\, e^{\frac{g}{\lambda}(f^\bullet - f)}\,\mathrm{d}x}.$$
Now, as $e^{\frac{g}{\lambda}(f^\bullet - f)} > 1$ for all $x \in \mathcal{X}$ where $f^\bullet > f$, it follows that $\pi_g^\bullet$ tends to 0 for $g \to \infty$. On the other hand, if we choose $x^\bullet = x^*$, one can easily verify that the denominator tends to 0 and thus $\pi_g^\bullet$ tends to $\infty$ as $g \to \infty$. This limit behavior agrees with that of the Dirac delta and the statement follows.
• To prove that $\mathbb{E}_{\pi_g}[f]$ is a monotonically decreasing function of g, we simply have to verify that its derivative with respect to g is strictly negative. Therefore, let us first express the expectation explicitly, introducing the normalizer,
$$\mathbb{E}_{\pi_g}[f] = \frac{\int f\, a_g\,\mathrm{d}x}{\int a_g\,\mathrm{d}x}, \qquad a_g = \pi_0\, e^{-g\frac{1}{\lambda} f}, \qquad \tfrac{\mathrm{d}}{\mathrm{d}g} a_g = a_g' = -\tfrac{1}{\lambda} f\, a_g.$$
Taking the derivative with respect to g yields
$$\frac{\mathrm{d}}{\mathrm{d}g}\mathbb{E}_{\pi_g}[f] = \frac{\int f\, a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x} - \frac{\int f\, a_g\,\mathrm{d}x \int a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x \int a_g\,\mathrm{d}x} = -\tfrac{1}{\lambda}\,\mathrm{Var}_{\pi_g}[f].$$
As the variance is a strictly positive operator, except for $\pi_\infty$, the statement follows.
• The entropy of the distribution $\pi_g$ is equal to
$$\mathcal{H}[\pi_g] = \log \int a_g\,\mathrm{d}x - \frac{\int \log(a_g)\, a_g\,\mathrm{d}x}{\int a_g\,\mathrm{d}x}.$$
We will also need the logarithm of $a_g$, which is $\log a_g = \log \pi_0 - g\frac{1}{\lambda} f$. Taking the derivative of $\mathcal{H}[\pi_g]$ with respect to g yields
$$\frac{\mathrm{d}}{\mathrm{d}g}\mathcal{H}[\pi_g] = -\frac{\int \log(a_g)\, a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x} + \frac{\int \log(a_g)\, a_g\,\mathrm{d}x \int a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x \int a_g\,\mathrm{d}x} = -\frac{g}{\lambda^2}\,\mathrm{Var}_{\pi_g}[f] + \frac{1}{\lambda}\,\mathrm{Cov}_{\pi_g}[\log \pi_0, f].$$
In case we choose $\pi_0$ uniform on $\mathcal{X}$, the entropy decreases monotonically with g. Otherwise, the entropy might temporarily increase, especially when $\pi_0$ and $\pi_\infty$ are far apart.
Further note that the rate of convergence increases with g.
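The monotonicity argument above is easy to verify numerically on a one-dimensional toy problem; the grid-based sketch below uses an arbitrary objective f, temperature lam and uniform prior purely for illustration.

import numpy as np

# Grid approximation of pi_g proportional to pi_0 * exp(-g * f / lam) on a 1-D domain.
x = np.linspace(-3.0, 3.0, 2001)
f = (x - 1.0)**2 * (1.0 + 0.3 * np.sin(5.0 * x))    # arbitrary nonnegative objective
pi0 = np.ones_like(x) / x.size                       # uniform prior belief
lam = 1.0

prev = np.inf
for g in range(20):
    a_g = pi0 * np.exp(-(g / lam) * (f - f.min()))   # shift by f.min() for numerical stability
    pi_g = a_g / a_g.sum()
    exp_f = float(pi_g @ f)
    assert exp_f <= prev + 1e-12                     # E_{pi_g}[f] decreases monotonically with g
    prev = exp_f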

Appendix B. Derivation of Derivatives in (22) and (23)
Consider the function of g implied by taking the expectation of f,
$$\mathbb{E}_{\pi_g}[f] = \frac{\int f\, a_g\,\mathrm{d}x}{\int a_g\,\mathrm{d}x}.$$
Here, the function $a_g$ is defined as $a_g = \pi_0^{\alpha} \cdot \pi_\infty^{1-\alpha}$, where $\alpha = \left(\tfrac{\lambda}{\lambda+\gamma}\right)^g$ and $\pi_\infty \propto \exp(-\tfrac{1}{\gamma} f)$. We will also need the derivatives
$$\alpha' = -\alpha \log\tfrac{\lambda+\gamma}{\lambda}, \qquad a_g' = -\alpha' \log\tfrac{\pi_\infty}{\pi_0}\, a_g = \alpha \log\tfrac{\lambda+\gamma}{\lambda} \log\tfrac{\pi_\infty}{\pi_0}\, a_g.$$
Taking the derivative of $\mathbb{E}_{\pi_g}[f]$ with respect to g yields
$$\frac{\mathrm{d}}{\mathrm{d}g}\mathbb{E}_{\pi_g}[f] = -\gamma \frac{\mathrm{d}}{\mathrm{d}g}\mathbb{E}_{\pi_g}[\log \pi_\infty] = -\gamma\left(\frac{\int \log \pi_\infty\, a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x} - \frac{\int \log \pi_\infty\, a_g\,\mathrm{d}x \int a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x \int a_g\,\mathrm{d}x}\right) = -\gamma \alpha \log\tfrac{\lambda+\gamma}{\lambda}\, \mathrm{Cov}_{\pi_g}\!\left[\log \pi_\infty, \log\tfrac{\pi_\infty}{\pi_0}\right].$$
When $\pi_0$ is chosen uniform on $\mathcal{X}$, this is always smaller than 0 due to the positive definiteness of the covariance operator.
In a similar fashion we can address the entropy
$$\mathcal{H}[\pi_g] = \log \int a_g\,\mathrm{d}x - \frac{\int \log(a_g)\, a_g\,\mathrm{d}x}{\int a_g\,\mathrm{d}x}.$$
Here we will also need the logarithm of $a_g$, which is given by $\log a_g = \log \pi_\infty - \alpha\left(\log \pi_\infty - \log \pi_0\right)$. Taking the derivative of $\mathcal{H}[\pi_g]$ with respect to g yields
$$\frac{\mathrm{d}}{\mathrm{d}g}\mathcal{H}[\pi_g] = -\frac{\int \log(a_g)\, a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x} + \frac{\int \log(a_g)\, a_g\,\mathrm{d}x \int a_g'\,\mathrm{d}x}{\int a_g\,\mathrm{d}x \int a_g\,\mathrm{d}x} = -\alpha \log\tfrac{\lambda+\gamma}{\lambda}\, \mathrm{Cov}_{\pi_g}\!\left[\log a_g, \log \pi_\infty - \log \pi_0\right]$$
$$= -\alpha \log\tfrac{\lambda+\gamma}{\lambda}\left((1-\alpha)\,\mathrm{Cov}_{\pi_g}\!\left[\log \pi_\infty, \log\tfrac{\pi_\infty}{\pi_0}\right] + \alpha\,\mathrm{Cov}_{\pi_g}\!\left[\log \pi_0, \log\tfrac{\pi_\infty}{\pi_0}\right]\right).$$
Again, when $\pi_0$ is uniform on $\mathcal{X}$, the sequence decreases monotonically.

Appendix C. Entropy Regularized Evolutionary Strategy
In order to illustrate the practical use of the search distribution sequence in (20)-(21), let us demonstrate how to cast the theoretical distribution into a practical search algorithm. To that end, we project π g onto a parametric distribution family π(θ) and manipulate the resulting expression into an expectation over the prior belief π g .
Note that this strategy is analogous to the one described in Section 3.2. As such we establish a calculable update procedure that infers parameters from an estimated expectation using samples taken from the prior. In particular, we are interested in the Gaussian family, i.e., $\pi(x|\theta) = \mathcal{N}(x|\mu, \Sigma)$, which is commonly used in the context of evolutionary strategies [11,13,14]. As a projection operator the relative entropy is again used, relying on the same argument as in [41],
$$\theta_{g+1} = \arg\min_\theta\; D\!\left[\pi_{g+1}\,\|\,\pi(\theta)\right] = \arg\min_\theta\; -\mathbb{E}_{\pi(\theta_g)}\!\left[\frac{\pi_{g+1}}{\pi(\theta_g)} \log \pi(\theta)\right] + \text{constant}.$$
This problem can be solved explicitly. We propose solving it for each parameter independently using a coordinate descent strategy, substituting the previous value of the respective other parameter in the objective. This strategy renders each independent problem concave [11].
The derivatives of $J_g$ with respect to $\mu$ and $\Sigma$ equal
$$\nabla_\mu J_g \big|_{\mu,\Sigma_g} = -\Sigma_g^{-1}\, \mathbb{E}_{\pi_g}\!\left[w_g(x)\left(\delta x_g - \mu + \mu_g\right)\right],$$
$$\nabla_\Sigma J_g \big|_{\mu_g,\Sigma} = \mathbb{E}_{\pi_g}\!\left[w_g(x)\left(\Sigma^{-1} \delta x_g \delta x_g^\top \Sigma^{-1} - \Sigma^{-1}\right)\right],$$
where $w_g(x) = \pi(x|\theta_g)^{-\frac{\gamma}{\lambda+\gamma}} e^{-\frac{1}{\lambda+\gamma} f(x)}$ and $\delta x_g = x - \mu_g$. We can solve these equations to provide independent update procedures for the distribution parameters,
$$\mu_{g+1} = \mu_g + \mathbb{E}_{\pi(\theta_g)}\!\left[w_g\, \delta x_g\right], \qquad \Sigma_{g+1} = \mathbb{E}_{\pi(\theta_g)}\!\left[w_g\, \delta x_g \delta x_g^\top\right].$$
Finally, we substitute Monte Carlo estimates for the expectations. We obtain updates which readers familiar with the class of Evolutionary Strategies will recognize to be similar to those they are accustomed to (note that, were it not for the diffusion factor $\frac{\gamma}{\lambda+\gamma}$, the weights would simply equal the exponentially transformed objective; we emphasize that f does not need to represent the physical objective but could be any rank preserving mapping, implying that these updates correspond with those of any other ES provided that the mapping is known), where $w_g^k = w_g(x^k)$ and $\delta x_g^k = x^k - \mu_g$. Finally, we note that the weights can also be expressed in an alternative, equivalent form.
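Purely as a sketch of how these updates could be implemented, and assuming self-normalized Monte Carlo weights (the normalization is left implicit in the expressions above), one iteration might read as follows; the function name and defaults are illustrative.

import numpy as np

def er_es_step(f, mu, Sigma, lam=1.0, gamma=0.1, K=64, rng=np.random.default_rng(0)):
    """One entropy regularized ES iteration: sample, weight and moment-match mean/covariance."""
    xs = rng.multivariate_normal(mu, Sigma, size=K)                 # samples from pi(theta_g)
    logp = -0.5 * np.einsum('ki,ij,kj->k', xs - mu, np.linalg.inv(Sigma), xs - mu)
    fs = np.array([f(x) for x in xs])
    logw = -(gamma / (lam + gamma)) * logp - fs / (lam + gamma)     # w_g(x) up to common constants
    w = np.exp(logw - logw.max())
    w /= w.sum()                                                    # self-normalized weights
    dx = xs - mu
    mu_new = mu + w @ dx                                            # update for mu_{g+1}
    Sigma_new = (w[:, None] * dx).T @ dx                            # update for Sigma_{g+1}
    return mu_new, Sigma_new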
It is noted that this strategy is closely related to the work in [61].

Appendix D. Proof of Theorem 3
Proof. The equivalence of (28) and (29) follows directly from the definition of $p_\pi$. The equation in (29) determines a variational problem in $p_\pi$, which we can solve accordingly,
$$p_{\pi_g}(x'|t,x) \propto p_{\pi_g}^{\frac{\lambda}{\lambda+\gamma}}\, e^{-\frac{1}{\lambda+\gamma} c_t(x,x') - \frac{1}{\lambda+\gamma} V_{g+1}(t+1,x')},$$
which we can substitute back into (26) to yield a recursive equation for $V_{g+1}$. Either by iterating this recursion or by solving problem (30) directly, the second statement is then also easily verified.