Article

Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control

by
Takehiro Tottori
1,* and
Tetsuya J. Kobayashi
1,2,3,4
1
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan
2
Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
3
Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8654, Japan
4
Universal Biology Institute, The University of Tokyo, Tokyo 113-8654, Japan
*
Author to whom correspondence should be addressed.
Entropy 2023, 25(2), 208; https://doi.org/10.3390/e25020208
Submission received: 5 November 2022 / Revised: 9 January 2023 / Accepted: 16 January 2023 / Published: 21 January 2023
(This article belongs to the Special Issue Information Theory in Control Systems)

Abstract
Memory-limited partially observable stochastic control (ML-POSC) is the stochastic optimal control problem under incomplete information and memory limitation. To obtain the optimal control function of ML-POSC, a system of the forward Fokker–Planck (FP) equation and the backward Hamilton–Jacobi–Bellman (HJB) equation needs to be solved. In this work, we first show that the system of HJB-FP equations can be interpreted via Pontryagin’s minimum principle on the probability density function space. Based on this interpretation, we then propose the forward-backward sweep method (FBSM) for ML-POSC. FBSM is one of the most basic algorithms for Pontryagin’s minimum principle, which alternately computes the forward FP equation and the backward HJB equation in ML-POSC. Although the convergence of FBSM is generally not guaranteed in deterministic control and mean-field stochastic control, it is guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited to the optimal control function in ML-POSC.

1. Introduction

In many practical applications of the stochastic optimal control theory, several constraints need to be considered. In the cases of small devices [1,2] and biological systems [3,4,5,6,7,8], for example, incomplete information and memory limitation become predominant because their sensors are extremely noisy and their memory resources are severely limited. To take into account one of these constraints, incomplete information, partially observable stochastic control (POSC) has been extensively studied in the stochastic optimal control theory [9,10,11,12,13]. However, because POSC cannot take into account the other constraint, memory limitation, it is not practical enough for designing memory-limited controllers for small devices and biological systems. To resolve this problem, memory-limited POSC (ML-POSC) has recently been proposed [14]. Because ML-POSC formulates noisy observation and limited memory explicitly, ML-POSC can take into account both incomplete information and memory limitation in the stochastic optimal control problem.
However, ML-POSC cannot be solved in a similar way as completely observable stochastic control (COSC), which is the most basic stochastic optimal control problem [15,16,17,18]. In COSC, the optimal control function depends only on the Hamilton–Jacobi–Bellman (HJB) equation, which is a time-backward partial differential equation given a terminal condition (Figure 1a) [15,16,17,18]. Therefore, the optimal control function of COSC can be obtained by solving the HJB equation backward in time from the terminal condition, which is called the value iteration method [19,20,21]. In contrast, the optimal control function of ML-POSC depends not only on the HJB equation but also on the Fokker–Planck (FP) equation, which is a time-forward partial differential equation given an initial condition (Figure 1b) [14]. Because the HJB equation and the FP equation interact with each other through the optimal control function in ML-POSC, the optimal control function of ML-POSC cannot be obtained by the value iteration method.
To propose an algorithm to solve ML-POSC, we first show that the system of HJB-FP equations can be interpreted via Pontryagin’s minimum principle on the probability density function space. Pontryagin’s minimum principle is one of the most representative approaches to the deterministic optimal control problem, which converts it into the two-point boundary value problem of the forward state equation and the backward adjoint equation [22,23,24,25]. We formally show that the system of HJB-FP equations is an extension of the system of adjoint and state equations from the deterministic optimal control problem to the stochastic optimal control problem.
The system of HJB-FP equations also appears in the mean-field stochastic control (MFSC) [26,27,28]. Although the relationship between the system of HJB-FP equations and Pontryagin’s minimum principle has been briefly mentioned in MFSC [29,30,31], its details have not yet been investigated. In this work, we investigate it in more detail by deriving the system of HJB-FP equations in a similar way to Pontryagin’s minimum principle. We note that our derivations are formal, not analytical, and more mathematically rigorous proofs remain future challenges. However, our results are consistent with many conventional results and also provide a useful perspective in proposing an algorithm.
We then propose the forward-backward sweep method (FBSM) for ML-POSC. FBSM is an algorithm to compute the forward FP equation and the backward HJB equation alternately, which can be interpreted as an extension of the value iteration method. FBSM has been proposed in Pontryagin’s minimum principle of the deterministic optimal control problem, which computes the forward state equation and the backward adjoint equation alternately [32,33,34]. Because FBSM is easy to implement, it has been used in many applications [35,36]. However, the convergence of FBSM is not guaranteed in deterministic control except for special cases [37,38] because the coupling of adjoint and state equations is not limited to the optimal control function (Figure 1c). In contrast, we show that the convergence of FBSM is generally guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited only to the optimal control function (Figure 1b).
FBSM is called the fixed-point iteration method in MFSC [39,40,41,42]. Although the fixed-point iteration method is the most basic algorithm to solve MFSC, its convergence is not guaranteed for the same reason as deterministic control (Figure 1d). Therefore, ML-POSC is a special and nice class of optimal control problems where FBSM or the fixed-point iteration method is guaranteed to converge.
This paper is organized as follows: In Section 2, we formulate ML-POSC. In Section 3, we derive the system of HJB-FP equations of ML-POSC from the viewpoint of Pontryagin’s minimum principle. In Section 4, we propose FBSM for ML-POSC and prove its convergence. In Section 5, we apply FBSM to the linear-quadratic-Gaussian (LQG) problem. In Section 6, we verify the convergence of FBSM by numerical experiments. In Section 7, we discuss our work. In Appendix A, we briefly review Pontryagin’s minimum principle of deterministic control. In Appendix B, we derive the system of HJB-FP equations of MFSC from the viewpoint of Pontryagin’s minimum principle. In Appendix C, we show the detailed derivations of our results.

2. Memory-Limited Partially Observable Stochastic Control

In this section, we briefly review the formulation of ML-POSC [14], which is the stochastic optimal control problem under incomplete information and memory limitation.

2.1. Problem Formulation

This subsection outlines the formulation of ML-POSC [14]. The state of the system $x_t \in \mathbb{R}^{d_x}$ at time $t \in [0, T]$ evolves by the following stochastic differential equation (SDE):

$$dx_t = b(t, x_t, u_t)\,dt + \sigma(t, x_t, u_t)\,d\omega_t, \tag{1}$$

where $x_0$ obeys $p_0(x_0)$, $u_t \in \mathbb{R}^{d_u}$ is the control, and $\omega_t \in \mathbb{R}^{d_\omega}$ is the standard Wiener process. In COSC [15,16,17,18], because the controller can completely observe the state $x_t$, it determines the control $u_t$ based on the state $x_t$ as $u_t = u(t, x_t)$. By contrast, in POSC [9,10,11,12,13] and ML-POSC [14], the controller cannot directly observe the state $x_t$ and instead obtains the observation $y_t \in \mathbb{R}^{d_y}$, which evolves by the following SDE:

$$dy_t = h(t, x_t)\,dt + \gamma(t)\,d\nu_t, \tag{2}$$

where $y_0$ obeys $p_0(y_0)$, and $\nu_t \in \mathbb{R}^{d_\nu}$ is the standard Wiener process. In POSC [9,10,11,12,13], because the controller can completely memorize the observation history $y_{0:t} := \{y_\tau \mid \tau \in [0, t]\}$, it determines the control $u_t$ based on the observation history $y_{0:t}$ as $u_t = u(t, y_{0:t})$. In ML-POSC [14], by contrast, because the controller cannot completely memorize the observation history $y_{0:t}$, it compresses the observation history $y_{0:t}$ into the finite-dimensional memory $z_t \in \mathbb{R}^{d_z}$, which evolves by the following SDE:

$$dz_t = c(t, z_t, v_t)\,dt + \kappa(t, z_t, v_t)\,dy_t + \eta(t, z_t, v_t)\,d\xi_t, \tag{3}$$

where $z_0$ obeys $p_0(z_0)$, $v_t \in \mathbb{R}^{d_v}$ is the control, and $\xi_t \in \mathbb{R}^{d_\xi}$ is the standard Wiener process. The memory dimension $d_z$ is determined by the available memory size of the controller. In addition, the memory noise $\xi_t$ represents the intrinsic stochasticity of the memory to be used. Therefore, unlike the conventional POSC, ML-POSC can explicitly take into account the memory size and noise of the controller. Furthermore, because the memory dynamics (3) depends on the memory control $v_t$, it can be optimized through the memory control $v_t$, which is expected to realize the optimal compression of the observation history $y_{0:t}$ into the limited memory $z_t$. In ML-POSC [14], the controller determines the state control $u_t$ and the memory control $v_t$ based on the memory $z_t$ as follows:

$$u_t = u(t, z_t), \quad v_t = v(t, z_t). \tag{4}$$
The objective function of ML-POSC is given by the following expected cumulative cost function:

$$J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[\int_0^T f(t, x_t, u_t, v_t)\,dt + g(x_T)\right], \tag{5}$$

where $f$ is the cost function, $g$ is the terminal cost function, $p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)$ is the probability of $x_{0:T}$, $y_{0:T}$, and $z_{0:T}$ given $u$ and $v$ as parameters, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to the probability $p$. Because the cost function $f$ depends on the memory control $v_t$, ML-POSC can explicitly take into account the memory control cost, which is also impossible with the conventional POSC.

ML-POSC is the problem of finding the optimal state control function $u^*$ and the optimal memory control function $v^*$ that minimize the expected cumulative cost function $J[u, v]$:

$$(u^*, v^*) := \arg\min_{u, v} J[u, v]. \tag{6}$$

ML-POSC first formulates the finite-dimensional and stochastic memory dynamics explicitly, then optimizes the memory control by considering the memory control cost. As a result, unlike the conventional POSC, ML-POSC is a practical framework for memory-limited controllers where the memory size, noise, and cost are imposed and non-negligible.

The previous work [14] has shown the validity and effectiveness of ML-POSC. In the LQG problem of conventional POSC, the observation history $y_{0:T}$ can be compressed into the Kalman filter without a loss of performance [10,18,43]. Because the Kalman filter is finite-dimensional, it can be interpreted as the finite-dimensional memory $z_t$ and discussed in terms of ML-POSC. The previous work [14] has proven that the optimal memory dynamics of ML-POSC become the Kalman filter in this problem, which indicates that ML-POSC is consistent with the conventional POSC. Furthermore, the previous work [14] has demonstrated the effectiveness of ML-POSC in the LQG problem with memory limitation and in the non-LQG problem by numerical experiments.
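As an illustration of the dynamics (1)–(3), the following sketch simulates one trajectory of the state, observation, and memory SDEs with the Euler–Maruyama scheme. It is our own minimal example, not code from the paper: the coefficient functions and the controller functions `u(t, z)` and `v(t, z)` are user-supplied placeholders, and scalar dynamics are assumed for simplicity.

```python
import numpy as np

def simulate_ml_posc(b, sigma, h, gamma, c, kappa, eta, u, v,
                     x0, z0, T=1.0, dt=1e-3, rng=None):
    """Euler-Maruyama simulation of the state, observation, and memory SDEs.

    b, sigma, h, gamma, c, kappa, eta are user-supplied coefficient functions
    (scalar case); u(t, z) and v(t, z) depend only on the memory, as in ML-POSC.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(T / dt))
    x, z = float(x0), float(z0)
    xs, zs = [x], [z]
    for k in range(n):
        t = k * dt
        ut, vt = u(t, z), v(t, z)                 # memory-based controls
        dw, dnu, dxi = rng.normal(0.0, np.sqrt(dt), 3)
        dy = h(t, x) * dt + gamma(t) * dnu        # observation increment
        x += b(t, x, ut) * dt + sigma(t, x, ut) * dw                        # state SDE
        z += c(t, z, vt) * dt + kappa(t, z, vt) * dy + eta(t, z, vt) * dxi  # memory SDE
        xs.append(x)
        zs.append(z)
    return np.array(xs), np.array(zs)
```

Note that the observation increment $dy_t$ is computed from the current state before the state is advanced, so the memory update uses the same $dy_t$ that appears in Equation (3).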

2.2. Problem Reformulation

Although the formulation of ML-POSC in the previous subsection is intuitive, it is inconvenient for further mathematical investigations. To address this problem, we reformulate ML-POSC in this subsection. The formulation in this subsection is simpler and more general than that in the previous subsection.
First, we define an extended state $s_t$ as follows:

$$s_t := \begin{pmatrix} x_t \\ z_t \end{pmatrix} \in \mathbb{R}^{d_s}, \tag{7}$$

where $d_s = d_x + d_z$. The extended state $s_t$ evolves by the following SDE:

$$ds_t = \tilde{b}(t, s_t, \tilde{u}_t)\,dt + \tilde{\sigma}(t, s_t, \tilde{u}_t)\,d\tilde{\omega}_t, \tag{8}$$

where $s_0$ obeys $p_0(s_0)$, $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ is the control, and $\tilde{\omega}_t \in \mathbb{R}^{d_{\tilde{\omega}}}$ is the standard Wiener process. ML-POSC determines the control $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ based on the memory $z_t$ as follows:

$$\tilde{u}_t = \tilde{u}(t, z_t). \tag{9}$$

The extended state SDE (8) includes the previous SDEs (1)–(3) as a special case because they can be represented as follows:

$$ds_t = \begin{pmatrix} b(t, x_t, u_t) \\ c(t, z_t, v_t) + \kappa(t, z_t, v_t) h(t, x_t) \end{pmatrix} dt + \begin{pmatrix} \sigma(t, x_t, u_t) & O & O \\ O & \kappa(t, z_t, v_t)\gamma(t) & \eta(t, z_t, v_t) \end{pmatrix} \begin{pmatrix} d\omega_t \\ d\nu_t \\ d\xi_t \end{pmatrix}, \tag{10}$$

where $p_0(s_0) = p_0(x_0) p_0(z_0)$.

The objective function of ML-POSC is given by the following expected cumulative cost function:

$$J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[\int_0^T \tilde{f}(t, s_t, \tilde{u}_t)\,dt + \tilde{g}(s_T)\right], \tag{11}$$

where $\tilde{f}$ is the cost function and $\tilde{g}$ is the terminal cost function. It is obvious that this objective function (11) is more general than that in the previous subsection (5).

ML-POSC is the problem of finding the optimal control function $\tilde{u}^*$ that minimizes the expected cumulative cost function $J[\tilde{u}]$:

$$\tilde{u}^* := \arg\min_{\tilde{u}} J[\tilde{u}]. \tag{12}$$

In the following sections, we mainly consider the formulation of this subsection because it is simpler and more general than that in the previous subsection. Moreover, we omit $\tilde{\cdot}$ for simplicity of notation.

3. Pontryagin’s Minimum Principle

If the control $u_t$ is determined based on the extended state $s_t$ as $u_t = u(t, s_t)$, ML-POSC is the same problem as COSC of the extended state, and its optimality conditions can be obtained in the conventional way [15,16,17,18]. In reality, however, because ML-POSC determines the control $u_t$ based only on the memory $z_t$ as $u_t = u(t, z_t)$, its optimality conditions cannot be obtained in the same way as COSC. In the previous work [14], the optimality conditions of ML-POSC were obtained by employing a mathematical technique of MFSC [30,31].
In this section, we obtain the optimality conditions of ML-POSC by employing Pontryagin’s minimum principle [22,23,24,25] on the probability density function space (Figure 2 (bottom right)). The conventional approach in ML-POSC [14] and MFSC [30,31] can be interpreted as a conversion from Bellman’s dynamic programming principle (Figure 2 (top right)) to Pontryagin’s minimum principle (Figure 2 (bottom right)) on the probability density function space.
In Appendix A, we briefly review Pontryagin’s minimum principle in deterministic control (Figure 2 (left)). In this section, we obtain the optimality conditions of ML-POSC in a similar way as Appendix A (Figure 2 (right)). Furthermore, in Appendix B, we obtain the optimality conditions of MFSC in a similar way as Appendix A (Figure 2 (right)). MFSC is more general than ML-POSC except for the partial observability. In particular, the expected Hamiltonian is non-linear with respect to the probability density function in MFSC, while it is linear in ML-POSC.
Although our derivations are formal, not analytical, and more mathematically rigorous proofs remain future challenges, our results are consistent with the conventional results of COSC [15,16,17,18], ML-POSC [14], and MFSC [26,27,28,30,31], and also provide a useful perspective in proposing an algorithm.

3.1. Preliminary

In this subsection, we show a useful result for obtaining Pontryagin's minimum principle. Given arbitrary control functions $u$ and $u'$, $J[u] - J[u']$ can be calculated as follows:

$$J[u] - J[u'] = \int_0^T \left( \mathbb{E}_{p(t,s)}\left[H(t, s, u, w')\right] - \mathbb{E}_{p(t,s)}\left[H(t, s, u', w')\right] \right) dt, \tag{13}$$

where $H$ is the Hamiltonian, which is defined as follows:

$$H(t, s, u, w) := f(t, s, u) + \mathcal{L}^u w(t, s). \tag{14}$$

$\mathcal{L}^u$ is the backward diffusion operator, which is defined as follows:

$$\mathcal{L}^u w(t, s) := \sum_{i=1}^{d_s} b_i(t, s, u) \frac{\partial w(t, s)}{\partial s_i} + \frac{1}{2} \sum_{i,j=1}^{d_s} D_{ij}(t, s, u) \frac{\partial^2 w(t, s)}{\partial s_i \partial s_j}, \tag{15}$$

where $D(t, s, u) := \sigma(t, s, u)\sigma^\top(t, s, u)$. $w'(t, s)$ is the solution of the following Hamilton–Jacobi–Bellman (HJB) equation driven by $u'$:

$$-\frac{\partial w'(t, s)}{\partial t} = H(t, s, u', w'), \tag{16}$$

where $w'(T, s) = g(s)$. $p(t, s)$ is the solution of the following Fokker–Planck (FP) equation driven by $u$:

$$\frac{\partial p(t, s)}{\partial t} = \mathcal{L}^{u\dagger} p(t, s), \tag{17}$$

where $p(0, s) = p_0(s)$. $\mathcal{L}^{u\dagger}$ is the forward diffusion operator, which is defined as follows:

$$\mathcal{L}^{u\dagger} p(t, s) := -\sum_{i=1}^{d_s} \frac{\partial \left(b_i(t, s, u) p(t, s)\right)}{\partial s_i} + \frac{1}{2} \sum_{i,j=1}^{d_s} \frac{\partial^2 \left(D_{ij}(t, s, u) p(t, s)\right)}{\partial s_i \partial s_j}. \tag{18}$$

$\mathcal{L}^{u\dagger}$ is the conjugate of $\mathcal{L}^u$:

$$\int w(t, s)\, \mathcal{L}^{u\dagger} p(t, s)\,ds = \int p(t, s)\, \mathcal{L}^u w(t, s)\,ds. \tag{19}$$

We derive Equation (13) in Appendix C.1.
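The conjugacy (19) has a concrete finite-dimensional counterpart. The following sketch (ours, for a 1-D extended state on a periodic grid, which is a simplifying assumption) builds finite-difference matrices for the backward generator and the forward Fokker–Planck operator; with central differences, the forward matrix is exactly the transpose of the backward one.

```python
import numpy as np

def diffusion_operators(b, D, ds):
    """Finite-difference matrices for the backward generator L^u and the
    forward (Fokker-Planck) operator L^{u,dagger} on a periodic 1-D grid.

    b, D: arrays of drift and diffusion coefficients at the grid points;
    ds: grid spacing. Returns (L_backward, L_forward).
    """
    ns = len(b)
    eye = np.eye(ns)
    # central first and second differences with periodic boundaries
    Dx = (np.roll(eye, 1, axis=1) - np.roll(eye, -1, axis=1)) / (2 * ds)
    Dxx = (np.roll(eye, 1, axis=1) - 2 * eye + np.roll(eye, -1, axis=1)) / ds**2
    L_b = np.diag(b) @ Dx + 0.5 * np.diag(D) @ Dxx    # b dw/ds + (D/2) d2w/ds2
    L_f = -Dx @ np.diag(b) + 0.5 * Dxx @ np.diag(D)   # -d(bp)/ds + (1/2) d2(Dp)/ds2
    return L_b, L_f
```

Because the central first-difference matrix is antisymmetric and the second-difference matrix is symmetric on this grid, $L_f = L_b^\top$ holds exactly, which is the discrete analogue of Equation (19).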

3.2. Necessary Condition

In this subsection, we show the necessary condition of the optimal control function of ML-POSC. It corresponds to Pontryagin's minimum principle on the probability density function space (Figure 2 (bottom right)). If $u^*$ is the optimal control function of ML-POSC (12), then the following equation is satisfied:

$$u^*(t, z) = \arg\min_u \mathbb{E}_{p_t^*(x|z)}\left[H(t, s, u, w^*)\right], \quad \text{a.s. } t \in [0, T],\ z \in \mathbb{R}^{d_z}, \tag{20}$$

where $w^*(t, s)$ is the solution of the following HJB equation driven by $u^*$:

$$-\frac{\partial w^*(t, s)}{\partial t} = H(t, s, u^*, w^*), \tag{21}$$

where $w^*(T, s) = g(s)$. $p_t^*(x|z) := p^*(t, s) / \int p^*(t, s)\,dx$ is the conditional probability density function of state $x$ given memory $z$, and $p^*(t, s)$ is the solution of the following FP equation driven by $u^*$:

$$\frac{\partial p^*(t, s)}{\partial t} = \mathcal{L}^{u^*\dagger} p^*(t, s), \tag{22}$$

where $p^*(0, s) = p_0(s)$. We derive this result in Appendix C.2.

In deterministic control, Pontryagin's minimum principle can be expressed by the derivatives of the Hamiltonian (Figure 2 (bottom left)). Similarly, the system of HJB-FP Equations (21) and (22) can be expressed by the variations of the expected Hamiltonian

$$\bar{H}(t, p, u, w) := \mathbb{E}_{p(s)}\left[H(t, s, u, w)\right] \tag{23}$$

as follows:

$$\frac{\partial p^*(t, s)}{\partial t} = \frac{\delta \bar{H}(t, p^*, u^*, w^*)}{\delta w(s)}, \tag{24}$$

$$-\frac{\partial w^*(t, s)}{\partial t} = \frac{\delta \bar{H}(t, p^*, u^*, w^*)}{\delta p(s)}, \tag{25}$$

where $p^*(0, s) = p_0(s)$ and $w^*(T, s) = g(s)$ (Figure 2 (bottom right)). Therefore, the system of HJB-FP equations can be interpreted via Pontryagin's minimum principle on the probability density function space.
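Because the expected Hamiltonian of ML-POSC is linear in both $p$ and $w$, the variational form can be checked directly. The following short derivation is ours, using only the definitions of $H$, $\mathcal{L}^u$, and the conjugacy relation (19):

```latex
% Linearity of the expected Hamiltonian in p (ML-POSC case):
\bar{H}(t,p,u,w) = \int p(s)\,\bigl[f(t,s,u) + \mathcal{L}^{u} w(s)\bigr]\,ds
\;\Longrightarrow\;
\frac{\delta \bar{H}}{\delta p(s)} = f(t,s,u) + \mathcal{L}^{u} w(s) = H(t,s,u,w).
% Rewriting the drift term with the conjugacy of L^u and L^{u,dagger}:
\bar{H}(t,p,u,w) = \int p(s) f(t,s,u)\,ds + \int w(s)\,\mathcal{L}^{u\dagger} p(s)\,ds
\;\Longrightarrow\;
\frac{\delta \bar{H}}{\delta w(s)} = \mathcal{L}^{u\dagger} p(s).
```

Substituting these two variations recovers the FP and HJB equations above. In MFSC, by contrast, $\bar{H}$ is non-linear in $p$, so the first identity no longer collapses to the Hamiltonian $H$ alone.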

3.3. Sufficient Condition

Pontryagin’s minimum principle (20) is only a necessary condition and generally not a sufficient condition. Pontryagin’s minimum principle (20) becomes a necessary and sufficient condition if the expected Hamiltonian H ¯ ( t , p , u , w ) is convex with respect to p and u. We obtain this result in Appendix C.3.

3.4. Relationship with Bellman’s Dynamic Programming Principle

From Bellman's dynamic programming principle on the probability density function space (Figure 2 (top right)) [14], the optimal control function of ML-POSC is given by the following equation:

$$u^*(t, z, p) = \arg\min_u \mathbb{E}_{p(x|z)}\left[H\left(t, s, u, \frac{\delta V^*(t, p)}{\delta p(s)}\right)\right], \tag{26}$$

where $V^*(t, p)$ is the value function on the probability density function space, which is the solution of the following Bellman equation:

$$-\frac{\partial V^*(t, p)}{\partial t} = \mathbb{E}_{p(s)}\left[H\left(t, s, u^*, \frac{\delta V^*(t, p)}{\delta p(s)}\right)\right], \tag{27}$$

where $V^*(T, p) = \mathbb{E}_{p(s)}[g(s)]$. More specifically, the optimal control function of ML-POSC is given by $u^*(t, z) = u^*(t, z, p^*)$, where $p^*$ is the solution of the FP Equation (22).

Because the Bellman Equation (27) is a functional differential equation, it cannot be solved even numerically. To resolve this problem, the previous work [14] converted the Bellman Equation (27) into the HJB Equation (21) by defining

$$w^*(t, s) := \frac{\delta V^*(t, p^*)}{\delta p(s)}, \tag{28}$$

where $p^*$ is the solution of the FP Equation (22). This approach can be interpreted as the conversion from Bellman's dynamic programming principle (Figure 2 (top right)) to Pontryagin's minimum principle (Figure 2 (bottom right)) on the probability density function space.

3.5. Relationship with Completely Observable Stochastic Control

In the COSC of the extended state, the control $u_t$ is determined based on the extended state $s_t$ as $u_t = u(t, s_t)$. Therefore, in the COSC of the extended state, Pontryagin's minimum principle on the probability density function space is given by the following equation:

$$u^*(t, s) = \arg\min_u H(t, s, u, w^*), \quad \text{a.s. } t \in [0, T],\ s \in \mathbb{R}^{d_s}, \tag{29}$$

where $w^*(t, s)$ is the solution of the HJB Equation (21). Because this proof is almost identical to that of Section 3.2, it is omitted in this paper.
While the optimal control function of ML-POSC (20) depends on the FP equation and the HJB equation, the optimal control function of COSC (29) depends only on the HJB equation. From this nice property of COSC, Equation (29) is not only a necessary condition but also a sufficient condition without assuming the convexity of the expected Hamiltonian. We derive this result in Appendix C.4.
This result is consistent with the conventional result of COSC [15,16,17,18]. Unlike ML-POSC and MFSC, COSC can be solved by Bellman's dynamic programming principle on the state space. In COSC, Pontryagin's minimum principle on the probability density function space is equivalent to Bellman's dynamic programming principle on the state space. Because Bellman's dynamic programming principle on the state space is a necessary and sufficient condition, Pontryagin's minimum principle on the probability density function space also becomes a necessary and sufficient condition in COSC.

4. Forward-Backward Sweep Method

In this section, we propose FBSM for ML-POSC and then prove its convergence by employing the interpretation of the system of HJB-FP equations by Pontryagin’s minimum principle introduced in the previous section.

4.1. Forward-Backward Sweep Method

In this subsection, we propose FBSM for ML-POSC, which is summarized in Algorithm 1. FBSM is an algorithm to compute the forward FP equation and the backward HJB equation alternately. More specifically, in the initial step of FBSM, we initialize the control function $u^0_{0:T-dt}$ and obtain $p^0_{0:T}$ by computing the FP equation forward in time from the initial condition. In the backward step, we obtain $w^1_{0:T}$ by computing the HJB equation backward in time from the terminal condition and simultaneously update the control function from $u^0_{0:T-dt}$ to $u^1_{0:T-dt}$ by minimizing the conditional expected Hamiltonian. In the forward step, we obtain $p^2_{0:T}$ by computing the FP equation forward in time from the initial condition and simultaneously update the control function from $u^1_{0:T-dt}$ to $u^2_{0:T-dt}$ by minimizing the conditional expected Hamiltonian. By iterating the backward and forward steps, the objective function of ML-POSC $J[u^k_{0:T-dt}]$ monotonically decreases and finally converges to a local minimum at which the control function $u^k_{0:T-dt}$ satisfies Pontryagin's minimum principle.

Pontryagin's minimum principle is only a necessary condition of the optimal control function, not a sufficient condition. Therefore, the control function obtained by FBSM is not necessarily the global optimum except in the case where the expected Hamiltonian is convex. Nevertheless, the control function obtained by FBSM is expected to be superior to most control functions because it is locally optimal.

FBSM has been used in deterministic control [32,34,35,38] and MFSC [39,40,41,42]. However, the convergence of FBSM for these problems is not guaranteed because the backward dynamics depend on the forward dynamics even without the optimal control function (Figure 1c,d). In contrast, the convergence of FBSM is guaranteed in ML-POSC because the backward HJB equation does not depend on the forward FP equation without the optimal control function (Figure 1b). More specifically, in FBSM for ML-POSC, the objective function $J[u^k_{0:T-dt}]$ monotonically decreases and finally converges to Pontryagin's minimum principle. In the following subsections, we prove this nice property of FBSM for ML-POSC.
Algorithm 1: Forward-Backward Sweep Method (FBSM)
  • //— Initial step —//
  • $k \leftarrow 0$
  • $p_0^k(s) \leftarrow p_0(s)$
  • for $t = 0$ to $T - dt$ do
  •    Initialize $u_t^k(z)$
  •    $p_{t+dt}^k(s) \leftarrow p_t^k(s) + \mathcal{L}^{u_t^k \dagger} p_t^k(s)\,dt$
  • end for
  • while $J[u_{0:T-dt}^k]$ does not converge do
  •    if $k$ is even then
  •      //— Backward step —//
  •      $w_T^{k+1}(s) \leftarrow g(s)$
  •      for $t = T - dt$ to $0$ do
  •         $u_t^{k+1}(z) \leftarrow \arg\min_u \mathbb{E}_{p_t^k(x|z)}\left[H(t, s, u, w_{t+dt}^{k+1})\right]$
  •         $w_t^{k+1}(s) \leftarrow w_{t+dt}^{k+1}(s) + H(t, s, u_t^{k+1}, w_{t+dt}^{k+1})\,dt$
  •      end for
  •    else
  •      //— Forward step —//
  •      $p_0^{k+1}(s) \leftarrow p_0(s)$
  •      for $t = 0$ to $T - dt$ do
  •         $u_t^{k+1}(z) \leftarrow \arg\min_u \mathbb{E}_{p_t^{k+1}(x|z)}\left[H(t, s, u, w_{t+dt}^k)\right]$
  •         $p_{t+dt}^{k+1}(s) \leftarrow p_t^{k+1}(s) + \mathcal{L}^{u_t^{k+1} \dagger} p_t^{k+1}(s)\,dt$
  •      end for
  •    end if
  •    $k \leftarrow k + 1$
  • end while
  • return $u_{0:T-dt}^k$
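Algorithm 1 is stated for the FP and HJB partial differential equations. As a self-contained illustration of the same sweep structure, the following sketch transplants it to a finite extended-state space of our own construction (not the paper's setting): master equations with transition-rate matrices stand in for the FP and HJB equations, `groups[i]` gives the memory value $z$ of state $i$, and the control is an action index per $z$, so the monotone decrease of $J$ can be observed directly.

```python
import numpy as np

def fbsm_finite(Qs, fs, g, groups, p0, T=1.0, dt=0.05, n_sweeps=10):
    """Forward-backward sweep on a finite extended-state space (toy analogue
    of Algorithm 1).

    Qs[a]: transition-rate matrix under action a (rows sum to zero)
    fs[a]: running-cost vector under action a; g: terminal-cost vector
    groups[i]: memory value z of state i (controls depend only on z)
    """
    n = int(round(T / dt))
    na, nz = len(Qs), int(groups.max()) + 1

    def effective(u_t):
        # row i of the rate matrix / cost uses the action chosen for groups[i]
        Qe = np.stack([Qs[u_t[groups[i]]][i] for i in range(len(p0))])
        fe = np.array([fs[u_t[groups[i]]][i] for i in range(len(p0))])
        return Qe, fe

    def best_action(p, w):
        # minimize the conditional expected Hamiltonian E_{p(x|z)}[H] per z
        H = np.stack([fs[a] + Qs[a] @ w for a in range(na)])
        return np.array([int(np.argmin(H[:, groups == z] @ p[groups == z]))
                         for z in range(nz)])

    def cost(u):
        p, J = p0.copy(), 0.0
        for t in range(n):
            Qe, fe = effective(u[t])
            J += dt * fe @ p
            p = p + dt * Qe.T @ p
        return J + g @ p

    u = np.zeros((n, nz), dtype=int)          # initial control function u(t, z)
    ps = [p0.copy()]
    for t in range(n):                        # initial forward step
        Qe, _ = effective(u[t])
        ps.append(ps[-1] + dt * Qe.T @ ps[-1])
    Js = [cost(u)]
    for k in range(n_sweeps):
        if k % 2 == 0:                        # backward step: sweep w, update u
            w, ws = g.copy(), [g.copy()]
            for t in range(n - 1, -1, -1):
                u[t] = best_action(ps[t], w)
                Qe, fe = effective(u[t])
                w = w + dt * (fe + Qe @ w)
                ws.append(w.copy())
            ws = ws[::-1]
        else:                                 # forward step: sweep p, update u
            ps = [p0.copy()]
            for t in range(n):
                u[t] = best_action(ps[t], ws[t + 1])
                Qe, _ = effective(u[t])
                ps.append(ps[-1] + dt * Qe.T @ ps[-1])
        Js.append(cost(u))
    return u, np.array(Js)
```

In this discrete-time analogue, the backward value recursion under fixed future controls is exact, so each per-time update of the control cannot increase the objective, mirroring the monotonicity argument of Section 4.3.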

4.2. Preliminary

In this subsection, we show an important result for proving the convergence of FBSM for ML-POSC. We suppose that $u_{0:t-dt,\, t+dt:T-dt} := \{u_0, \dots, u_{t-dt}, u_{t+dt}, \dots, u_{T-dt}\}$ is given and only $u_t$ is optimized as follows:

$$u_t^* := \arg\min_{u_t} J[u_{0:T-dt}]. \tag{30}$$

In ML-POSC, $u_t^*$ can be calculated as follows:

$$u_t^*(z) = \arg\min_{u_t} \mathbb{E}_{p_t(x|z)}\left[H(t, s, u_t, w_{t+dt})\right], \quad \text{a.s. } z \in \mathbb{R}^{d_z}, \tag{31}$$

where $w_{t+dt}(s)$ is the solution of the following time-discretized HJB equation driven by $u_{t+dt:T-dt}$:

$$w_\tau(s) = w_{\tau+dt}(s) + H(\tau, s, u_\tau, w_{\tau+dt})\,dt, \quad \tau \in \{t+dt, \dots, T-dt\}, \tag{32}$$

where $w_T(s) = g(s)$. $p_t(x|z) := p_t(s) / \int p_t(s)\,dx$ is the conditional probability density function of state $x$ given memory $z$, and $p_t(s)$ is the solution of the following time-discretized FP equation driven by $u_{0:t-dt}$:

$$p_{\tau+dt}(s) = p_\tau(s) + \mathcal{L}^{u_\tau \dagger} p_\tau(s)\,dt, \quad \tau \in \{0, \dots, t-dt\}, \tag{33}$$

where $p_0(s)$ is given by the initial condition. Equation (31) is obtained in a similar way to Pontryagin's minimum principle in Appendix C.5 and also by the time-discretization method in Appendix C.6.

Importantly, $w_{t+dt}$ does not depend on $u_t$ in ML-POSC (Figure 3a), while $\lambda_{t+dt}$ and $w_{t+dt}$ depend on $u_t$ in deterministic control (Figure 3b) and MFSC (Figure 3c), respectively. Therefore, $u_t^*$ can be obtained without modifying $w_{t+dt}$ in ML-POSC, which is essentially different from deterministic control and MFSC. From this nice property, the convergence of FBSM is guaranteed in ML-POSC.

4.3. Monotonicity

In FBSM for ML-POSC, the objective function is monotonically non-increasing with respect to the update of the control function at each time step. More specifically,

$$J[u_{0:t-dt}^k, u_{t:T-dt}^{k+1}] \le J[u_{0:t}^k, u_{t+dt:T-dt}^{k+1}] \tag{34}$$

is satisfied in the backward step, and

$$J[u_{0:t-dt}^{k+1}, u_{t:T-dt}^k] \ge J[u_{0:t}^{k+1}, u_{t+dt:T-dt}^k] \tag{35}$$

is satisfied in the forward step. We prove this result in Appendix C.7. Furthermore, in FBSM for ML-POSC, the objective function is monotonically non-increasing with respect to the update of the control function at each iteration step as follows:

$$J[u_{0:T-dt}^{k+1}] \le J[u_{0:T-dt}^k]. \tag{36}$$

Equation (36) is obviously satisfied from Equations (34) and (35).

4.4. Convergence to Pontryagin’s Minimum Principle

We assume that $J[u_{0:T-dt}]$ has a lower bound. From Equation (36), FBSM for ML-POSC is guaranteed to converge to a local minimum. Furthermore, we assume that if the minimizer candidates for $u_t^{k+1}$ include $u_t^k$, then $u_t^{k+1}$ is set to $u_t^k$. Under these assumptions, FBSM for ML-POSC converges to Pontryagin's minimum principle (20). More specifically, if $J[u_{0:T-dt}^{k+1}] = J[u_{0:T-dt}^k]$ holds, then $u_{0:T-dt}^{k+1}$ satisfies Pontryagin's minimum principle (20). We prove this result in Appendix C.8.

Therefore, unlike in deterministic control and MFSC, in FBSM for ML-POSC, the objective function $J[u_{0:T-dt}^k]$ monotonically decreases and finally converges to a local minimum at which the control function $u_{0:T-dt}^k$ satisfies Pontryagin's minimum principle (20).

5. Linear-Quadratic-Gaussian Problem

In this section, we apply FBSM to the LQG problem of ML-POSC [14]. In the LQG problem of ML-POSC, the system of HJB-FP equations is reduced from partial differential equations to ordinary differential equations.

5.1. Problem Formulation

In the LQG problem of ML-POSC, the extended state SDE (8) is given as follows [14]:

$$ds_t = \left(A(t) s_t + B(t) u_t\right) dt + \sigma(t)\,d\omega_t, \tag{37}$$

where $s_0$ obeys the Gaussian distribution $p_0(s_0) := \mathcal{N}(s_0 \mid \mu_0, \Lambda_0)$, where $\mu_0$ is the mean vector and $\Lambda_0$ is the precision matrix. The objective function (11) is given as follows:

$$J[u] := \mathbb{E}_{p(s_{0:T}; u)}\left[\int_0^T \left(s_t^\top Q(t) s_t + u_t^\top R(t) u_t\right) dt + s_T^\top P s_T\right], \tag{38}$$

where $Q(t) \succeq O$, $R(t) \succ O$, and $P \succeq O$. The LQG problem of ML-POSC is the problem of finding the optimal control function $u^*$ that minimizes the objective function $J[u]$:

$$u^* := \arg\min_u J[u]. \tag{39}$$

5.2. Pontryagin’s Minimum Principle

In the LQG problem of ML-POSC, Pontryagin's minimum principle (20) can be calculated as follows [14]:

$$u^*(t, z) = -R^{-1} B^\top \left(\Pi K(\Lambda)(s - \mu) + \Psi \mu\right), \quad \text{a.s. } t \in [0, T],\ z \in \mathbb{R}^{d_z}, \tag{40}$$

where $K(\Lambda)$ is defined as follows:

$$K(\Lambda) := \begin{pmatrix} O & -\Lambda_{xx}^{-1} \Lambda_{xz} \\ O & I \end{pmatrix}, \tag{41}$$

and $\mu(t)$ and $\Lambda(t)$ are the mean vector and the precision matrix of the extended state, respectively, which correspond to the solution of the FP Equation (22). We note that $\mathbb{E}_{p_t(x|z)}[s] = K(\Lambda)(s - \mu) + \mu$ is satisfied, so the right-hand side of (40) depends only on $z$. $\mu(t)$ and $\Lambda(t)$ are the solutions of the following ordinary differential equations (ODEs):

$$\dot{\mu} = \left(A - B R^{-1} B^\top \Psi\right) \mu, \tag{42}$$

$$\dot{\Lambda} = -\left(A - B R^{-1} B^\top \Pi K(\Lambda)\right)^\top \Lambda - \Lambda \left(A - B R^{-1} B^\top \Pi K(\Lambda)\right) - \Lambda \sigma \sigma^\top \Lambda, \tag{43}$$

where $\mu(0) = \mu_0$ and $\Lambda(0) = \Lambda_0$. $\Psi(t)$ and $\Pi(t)$ are the control gain matrices of the deterministic and stochastic extended state, respectively, which correspond to the solution of the HJB Equation (21). $\Psi(t)$ and $\Pi(t)$ are the solutions of the following ODEs:

$$-\dot{\Psi} = Q + A^\top \Psi + \Psi A - \Psi B R^{-1} B^\top \Psi, \tag{44}$$

$$-\dot{\Pi} = Q + A^\top \Pi + \Pi A - \Pi B R^{-1} B^\top \Pi + \left(I - K(\Lambda)\right)^\top \Pi B R^{-1} B^\top \Pi \left(I - K(\Lambda)\right), \tag{45}$$

where $\Psi(T) = \Pi(T) = P$. The ODE of $\Psi$ (44) is the Riccati equation [16,17,18], which also appears in the LQG problem of COSC. In contrast, the ODE of $\Pi$ (45) is the partially observable Riccati equation [14], which appears only in the LQG problem of ML-POSC. The above result is obtained in [14].

The ODE of $\Psi$ (44) can be solved backward in time from the terminal condition. Using $\Psi$, the ODE of $\mu$ (42) can be solved forward in time from the initial condition. In contrast, the ODEs of $\Pi$ (45) and $\Lambda$ (43) cannot be solved in a similar way because they interact with each other, which is a problem similar to that of the system of HJB-FP equations.
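The decoupled pair described above can be sketched directly. The following code (ours, assuming constant coefficient matrices for simplicity) integrates the Riccati ODE of $\Psi$ backward from its terminal condition with the Euler method, and then the mean ODE of $\mu$ forward from its initial condition.

```python
import numpy as np

def solve_psi_mu(A, B, Q, R, P, mu0, T=5.0, dt=1e-3):
    """Euler integration of the decoupled pair: the Riccati ODE of Psi,
    backward from Psi(T) = P, then the mean ODE of mu, forward from mu(0)."""
    n = int(round(T / dt))
    S = B @ np.linalg.inv(R) @ B.T
    Psi = np.empty((n + 1,) + P.shape)
    Psi[n] = P
    for t in range(n - 1, -1, -1):
        # -dPsi/dt = Q + A^T Psi + Psi A - Psi B R^{-1} B^T Psi
        Pn = Psi[t + 1]
        Psi[t] = Pn + dt * (Q + A.T @ Pn + Pn @ A - Pn @ S @ Pn)
    mu = np.empty((n + 1, len(mu0)))
    mu[0] = mu0
    for t in range(n):
        # dmu/dt = (A - B R^{-1} B^T Psi) mu
        mu[t + 1] = mu[t] + dt * (A - S @ Psi[t]) @ mu[t]
    return Psi, mu
```

For a scalar system with $A = -1$, $B = Q = R = 1$, and $P = 0$, $\Psi(0)$ approaches the stationary Riccati solution $\sqrt{2} - 1$ when $T$ is large, and $\mu$ decays toward zero under the resulting feedback.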

5.3. Forward-Backward Sweep Method

In the LQG problem of ML-POSC, FBSM is reduced from Algorithm 1 to Algorithm 2. $F(\Lambda, \Pi)$ and $G(\Lambda, \Pi)$ are defined by the right-hand sides of the ODEs of $\Lambda$ (43) and $\Pi$ (45), respectively, as follows:

$$F(\Lambda, \Pi) := -\left(A - B R^{-1} B^\top \Pi K(\Lambda)\right)^\top \Lambda - \Lambda \left(A - B R^{-1} B^\top \Pi K(\Lambda)\right) - \Lambda \sigma \sigma^\top \Lambda, \tag{46}$$

$$G(\Lambda, \Pi) := Q + A^\top \Pi + \Pi A - \Pi B R^{-1} B^\top \Pi + \left(I - K(\Lambda)\right)^\top \Pi B R^{-1} B^\top \Pi \left(I - K(\Lambda)\right).$$

This result is obtained in Appendix C.9. Importantly, in the LQG problem of ML-POSC, FBSM computes the ODEs of $\Lambda$ (43) and $\Pi$ (45) instead of the FP Equation (22) and the HJB Equation (21).
Algorithm 2: Forward-Backward Sweep Method (FBSM) in the LQG problem
  • //— Initial step —//
  • $k \leftarrow 0$
  • $\Lambda_{0}^{k} \leftarrow \Lambda_{0}$
  • for $t = 0$ to $T - dt$ do
  •    Initialize $\Pi_{t+dt}^{k}$
  •    $\Lambda_{t+dt}^{k} \leftarrow \Lambda_{t}^{k} + F(\Lambda_{t}^{k}, \Pi_{t+dt}^{k})\, dt$
  • end for
  • while $J[u_{0:T-dt}^{k}]$ does not converge do
  •    if $k$ is even then
  •      //— Backward step —//
  •      $\Pi_{T}^{k+1} \leftarrow P$
  •      for $t = T - dt$ to $0$ do
  •         $\Pi_{t}^{k+1} \leftarrow \Pi_{t+dt}^{k+1} + G(\Lambda_{t}^{k}, \Pi_{t+dt}^{k+1})\, dt$
  •      end for
  •    else
  •      //— Forward step —//
  •      $\Lambda_{0}^{k+1} \leftarrow \Lambda_{0}$
  •      for $t = 0$ to $T - dt$ do
  •         $\Lambda_{t+dt}^{k+1} \leftarrow \Lambda_{t}^{k+1} + F(\Lambda_{t}^{k+1}, \Pi_{t+dt}^{k})\, dt$
  •      end for
  •    end if
  •    $k \leftarrow k + 1$
  • end while
  • return $u_{0:T-dt}^{k}$
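The sweep structure of Algorithm 2 can be sketched in a few lines of code. The following Python example is our own minimal illustration for a two-dimensional extended state $s = (x, z)$ with scalar $x$ and $z$: each loop iteration pairs one backward step (updating $\Pi$ via $G$) with one forward step (updating $\Lambda$ via $F$), which corresponds to two iterations of Algorithm 2, and it uses the forward Euler scheme. All coefficient matrices are hypothetical examples, not taken from the paper.

```python
# Minimal sketch of Algorithm 2 (FBSM for the LQG problem of ML-POSC).
# Extended state s = (x, z) with scalar x and z; all matrices are 2x2.
# The coefficient matrices below are hypothetical examples.

def mmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def madd(*Ms):
    return [[sum(M[i][j] for M in Ms) for j in range(2)] for i in range(2)]

def msca(c, M):
    return [[c * M[i][j] for j in range(2)] for i in range(2)]

def mT(M):
    return [[M[j][i] for j in range(2)] for i in range(2)]

A    = [[1.0, 0.0], [1.0, 0.0]]   # extended-state drift (hypothetical)
B    = [[1.0, 0.0], [0.0, 1.0]]   # control matrix
Q    = [[1.0, 0.0], [0.0, 0.0]]   # state cost
Rinv = [[1.0, 0.0], [0.0, 1.0]]   # R^{-1} (R = I)
P    = [[0.0, 0.0], [0.0, 0.0]]   # terminal cost
SS   = [[1.0, 0.0], [0.0, 1.0]]   # sigma sigma^T
I2   = [[1.0, 0.0], [0.0, 1.0]]

def K(L):
    # K(Lambda) = [[O, -Lambda_xx^{-1} Lambda_xz], [O, I]] (x, z scalar here)
    return [[0.0, -L[0][1] / L[0][0]], [0.0, 1.0]]

def F(L, Pi):
    # Right-hand side of the Lambda ODE (43)
    At = madd(A, msca(-1.0, mmul(B, mmul(Rinv, mmul(mT(B), mmul(Pi, K(L)))))))
    return madd(msca(-1.0, mmul(mT(At), L)), msca(-1.0, mmul(L, At)),
                msca(-1.0, mmul(L, mmul(SS, L))))

def G(L, Pi):
    # Right-hand side of the backward Pi ODE (45), as used in Algorithm 2
    IK = madd(I2, msca(-1.0, K(L)))
    PBRB = mmul(Pi, mmul(B, mmul(Rinv, mT(B))))
    return madd(Q, mmul(mT(A), Pi), mmul(Pi, A), msca(-1.0, mmul(PBRB, Pi)),
                mmul(mT(IK), mmul(PBRB, mmul(Pi, IK))))

T, dt = 1.0, 1e-3
N = int(T / dt)
Lam = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(N + 1)]  # Lambda(0) = I
Pi  = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(N + 1)]  # Pi^0 initialized to O

# Initial step: integrate Lambda^0 forward with the initialized Pi^0
for n in range(N):
    Lam[n + 1] = madd(Lam[n], msca(dt, F(Lam[n], Pi[n + 1])))

gap = float("inf")
for sweep in range(40):
    prev = [row[:] for row in Pi[0]]
    # Backward step: update Pi given the current Lambda
    Pi[N] = [row[:] for row in P]
    for n in range(N - 1, -1, -1):
        Pi[n] = madd(Pi[n + 1], msca(dt, G(Lam[n], Pi[n + 1])))
    # Forward step: update Lambda given the current Pi
    for n in range(N):
        Lam[n + 1] = madd(Lam[n], msca(dt, F(Lam[n], Pi[n + 1])))
    gap = max(abs(Pi[0][i][j] - prev[i][j]) for i in range(2) for j in range(2))
```

Because the coupling between the two ODEs is limited to $K(\Lambda)$, the gap between successive sweeps shrinks rapidly, mirroring the convergence property discussed in Section 4.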

6. Numerical Experiments

In this section, we verify the convergence of FBSM in ML-POSC by performing numerical experiments on the LQG and non-LQG problems. The setting of the numerical experiments is the same as in the previous work [14].

6.1. LQG Problem

In this subsection, we verify the convergence of FBSM for ML-POSC by conducting a numerical experiment on the LQG problem. We consider the state $x_t \in \mathbb{R}$, the observation $y_t \in \mathbb{R}$, and the memory $z_t \in \mathbb{R}$, which evolve by the following SDEs:
$$ dx_t = \left( x_t + u_t \right) dt + d\omega_t, \tag{46} $$
$$ dy_t = x_t\, dt + d\nu_t, \tag{47} $$
$$ dz_t = v_t\, dt + dy_t, \tag{48} $$
where $x_0$ and $z_0$ obey the standard Gaussian distributions, $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t, z_t) \in \mathbb{R}$ and $v_t = v(t, z_t) \in \mathbb{R}$ are the controls. The objective function to be minimized is given as follows:
$$ J[u, v] := \mathbb{E}_{p(x_{0:10}, y_{0:10}, z_{0:10}; u, v)}\left[ \int_0^{10} \left( x_t^2 + u_t^2 + v_t^2 \right) dt \right]. \tag{49} $$
Therefore, the objective of this problem is to minimize the state variance with small state and memory controls.
This problem corresponds to the LQG problem, which is defined by (37) and (38). By defining $s_t := (x_t, z_t)^{\top} \in \mathbb{R}^2$, $\tilde{u}_t := (u_t, v_t)^{\top} \in \mathbb{R}^2$, and $\tilde{\omega}_t := (\omega_t, \nu_t)^{\top} \in \mathbb{R}^2$, the SDEs (46)–(48) can be rewritten as follows:
$$ ds_t = \left( \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix} s_t + \tilde{u}_t \right) dt + d\tilde{\omega}_t, \tag{50} $$
which corresponds to (37). Furthermore, the objective function (49) can be rewritten as follows:
$$ J[\tilde{u}] := \mathbb{E}_{p(s_{0:10}; \tilde{u})}\left[ \int_0^{10} \left( s_t^{\top} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} s_t + \tilde{u}_t^{\top}\tilde{u}_t \right) dt \right], \tag{51} $$
which corresponds to (38).
We apply the FBSM of the LQG problem (Algorithm 2) to this problem. $\Pi^0(t)$ is initialized by $\Pi^0(t) = O$. To solve the ODEs of $\Pi^k(t)$ and $\Lambda^k(t)$, we use the fourth-order Runge–Kutta method. Figure 4 shows the control gain matrix $\Pi^k(t) \in \mathbb{R}^{2 \times 2}$ and the precision matrix $\Lambda^k(t) \in \mathbb{R}^{2 \times 2}$ obtained by FBSM. The color of each curve represents the iteration $k$. The darkest curve corresponds to the first iteration $k = 0$, and the brightest curve corresponds to the last iteration $k = 50$. Importantly, $\Pi^k(t)$ and $\Lambda^k(t)$ converge with respect to the iteration $k$.
Figure 5a shows the objective function $J[u^k]$ with respect to the iteration $k$. The objective function $J[u^k]$ decreases monotonically with respect to the iteration $k$, which is consistent with Section 4.3. This monotonicity of FBSM is a nice property of ML-POSC that is not guaranteed in deterministic control and MFSC. The objective function $J[u^k]$ finally converges, and $u^k$ satisfies Pontryagin's minimum principle from Section 4.4.
Figure 5b–d compare the performance of the control function u k at the first iteration k = 0 and the last iteration k = 50 by performing a stochastic simulation. At the first iteration k = 0 , the distributions of state and memory are unstable, and the cumulative cost diverges. In contrast, at the last iteration k = 50 , the distributions of state and memory are stabilized and the cumulative cost is smaller. This result indicates that FBSM improves the performance in ML-POSC.
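The instability of the uncontrolled state dynamics can be made concrete with a direct simulation. The following Python sketch is our own illustration (not the paper's experiment): it applies the Euler-Maruyama method to the state SDE (46) with the control switched off ($u_t = 0$) over a shortened horizon $t \in [0, 1]$. For this linear SDE with $x_0 \sim \mathcal{N}(0, 1)$, the exact variance is $\mathrm{Var}[x_t] = e^{2t} + (e^{2t} - 1)/2$, so the state variance grows roughly tenfold within one time unit.

```python
# Euler-Maruyama simulation of the uncontrolled state SDE (46):
#   dx_t = x_t dt + d omega_t,  x_0 ~ N(0, 1),  u_t = 0.
# Our own illustration of the instability of the uncontrolled state;
# the horizon (t = 1), step size, and sample size are chosen arbitrarily.
import math
import random

random.seed(0)
n_paths, dt, T = 20000, 0.01, 1.0
sqrt_dt = dt ** 0.5

xs = [random.gauss(0.0, 1.0) for _ in range(n_paths)]
for _ in range(int(T / dt)):
    # One Euler-Maruyama step for every sample path
    xs = [x + x * dt + sqrt_dt * random.gauss(0.0, 1.0) for x in xs]

mean = sum(xs) / n_paths
var = sum((x - mean) ** 2 for x in xs) / (n_paths - 1)

# Exact variance of the linear SDE: Var[x_t] = e^{2t} + (e^{2t} - 1) / 2
exact = math.exp(2.0) + (math.exp(2.0) - 1.0) / 2.0
print(var, exact)  # the sample variance is close to the exact value (~10.58)
```

The rapid variance growth is why the first-iteration control in Figure 5b–d, which does not yet feed back the state fluctuations, leads to a diverging cumulative cost.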
Although Figure 5b–d look similar to Figure 2d–f in the previous work [14], they are comparing different things. While Figure 5b–d demonstrate the performance improvement by the FBSM iteration, the previous work [14] compares the performance of the partially observable Riccati Equation (45) with that of the conventional Riccati Equation (44).

6.2. Non-LQG Problem

In this subsection, we verify the convergence of FBSM in ML-POSC by conducting a numerical experiment on the non-LQG problem. We consider the state $x_t \in \mathbb{R}$, the observation $y_t \in \mathbb{R}$, and the memory $z_t \in \mathbb{R}$, which evolve by the following SDEs:
$$ dx_t = u_t\, dt + d\omega_t, \tag{52} $$
$$ dy_t = x_t\, dt + d\nu_t, \tag{53} $$
$$ dz_t = dy_t, \tag{54} $$
where $x_0$ and $z_0$ obey the Gaussian distributions $p_0(x_0) = \mathcal{N}(x_0 | 0, 0.01)$ and $p_0(z_0) = \mathcal{N}(z_0 | 0, 0.01)$, respectively. $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t, z_t) \in \mathbb{R}$ is the control. For the sake of simplicity, memory control is not considered. The objective function to be minimized is given as follows:
$$ J[u] := \mathbb{E}_{p(x_{0:1}, y_{0:1}, z_{0:1}; u)}\left[ \int_0^1 \left( Q(t, x_t) + u_t^2 \right) dt + 10 x_1^2 \right], \tag{55} $$
where
$$ Q(t, x) := \begin{cases} 1000 & (0.3 \le t \le 0.6,\ 0.1 \le |x| \le 2.0), \\ 0 & (\mathrm{otherwise}). \end{cases} \tag{56} $$
The cost function is high in $0.3 \le t \le 0.6$ and $0.1 \le |x| \le 2.0$, which represents the obstacles. In addition, the terminal cost function is the lowest at $x = 0$, which represents the desirable goal. Therefore, the system should avoid the obstacles and reach the goal with a small control. Because the cost function is non-quadratic, it is a non-LQG problem.
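For reference, the piecewise cost functions above can be transcribed directly. The following Python snippet is a literal transcription of the stage cost $Q(t, x)$ (56) and the terminal cost; the function names are ours.

```python
# Literal transcription of the non-LQG costs (function names are ours):
# the stage cost is 1000 inside the space-time obstacle region
# 0.3 <= t <= 0.6 and 0.1 <= |x| <= 2.0, and 0 otherwise; the terminal
# cost 10 * x^2 is lowest at the goal x = 0.

def stage_cost(t, x):
    return 1000.0 if (0.3 <= t <= 0.6 and 0.1 <= abs(x) <= 2.0) else 0.0

def terminal_cost(x):
    return 10.0 * x ** 2

print(stage_cost(0.4, 1.0), stage_cost(0.1, 1.0), terminal_cost(0.0))
```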
We apply the FBSM (Algorithm 1) to this problem. $u^0(t, z)$ is initialized by $u^0(t, z) = 0$. To solve the HJB equation and the FP equation, we use the finite-difference method. Figure 6 shows $w^k(t, s)$ and $p^k(t, s)$ obtained by FBSM at the first iteration $k = 0$ and at the last iteration $k = 50$. From Appendix C.6, $w^k(t, s)$ is given as follows:
$$ w^k(t, s) = \mathbb{E}_{p(s_{t+dt:1} | s_t = s; u^k)}\left[ \int_t^1 \left( Q(\tau, x_\tau) + (u_\tau^k)^2 \right) d\tau + 10 x_1^2 \right]. \tag{57} $$
Because $u^0(t, z) = 0$, $w^0(t, s)$ reflects the cost function corresponding to the obstacles and the goal (Figure 6a–e). In contrast, because $u^{50}(t, z) \neq 0$, $w^{50}(t, s)$ becomes more complex (Figure 6f–j). In particular, while $w^0(t, s)$ does not depend on the memory $z$, $w^{50}(t, s)$ depends on the memory $z$, which indicates that the control function $u^{50}(t, z)$ is adjusted by the memory $z$. We note that $w^0(1, s)$ (Figure 6e) and $w^{50}(1, s)$ (Figure 6j) are the same because they are given by the terminal cost function as $w^0(1, s) = w^{50}(1, s) = 10 x^2$. Furthermore, while $p^0(t, s)$ is a unimodal distribution (Figure 6k–o), $p^{50}(t, s)$ is a bimodal distribution (Figure 6p–t), which can avoid the obstacles.
Figure 7a shows the objective function $J[u^k]$ with respect to the iteration $k$. The objective function $J[u^k]$ decreases monotonically with respect to the iteration $k$, which is consistent with Section 4.3. This monotonicity of FBSM is a nice property of ML-POSC that is not guaranteed in deterministic control and MFSC. The objective function $J[u^k]$ finally converges, and $u^k$ satisfies Pontryagin's minimum principle from Section 4.4.
Figure 7b,c compare the performance of the control function u k at the first iteration k = 0 and the last iteration k = 50 by conducting the stochastic simulation. At the first iteration k = 0 , the obstacles cannot be avoided, which results in a higher objective function. In contrast, at the last iteration k = 50 , the obstacles can be avoided, which results in a lower objective function. This result indicates that FBSM improves the performance in ML-POSC.
Although Figure 7b,c look similar to Figure 3a,b in the previous work [14], they are comparing different things. While Figure 7b,c demonstrate the performance improvement by the FBSM iteration, the previous work [14] compares the performance of ML-POSC with the local LQG approximation of the conventional POSC.

7. Discussion

In this work, we first showed that the system of HJB-FP equations corresponds to Pontryagin's minimum principle on the probability density function space. Although the relationship between the system of HJB-FP equations and Pontryagin's minimum principle has been briefly mentioned in MFSC [29,30,31], its details have not yet been investigated. We addressed this problem by deriving the system of HJB-FP equations in a similar way to Pontryagin's minimum principle. We then proposed FBSM for ML-POSC. Although the convergence of FBSM is generally not guaranteed in deterministic control [32,34,35,38] and MFSC [39,40,41,42], we proved its convergence in ML-POSC by noting that the update of the current control function does not affect the future HJB equation in ML-POSC. Therefore, ML-POSC is a special and nice class of problems in which FBSM is guaranteed to converge.
Our derivation of Pontryagin’s minimum principle on the probability density function space is formal, not analytical. Therefore, more mathematically rigorous proofs should be pursued in future work. Nevertheless, because our results are consistent with the conventional results of COSC [15,16,17,18], ML-POSC [14], and MFSC [26,27,28,30,31], they would be reliable except for special cases. Furthermore, our results provide a unified perspective on FBSM in deterministic control [32,34,35,38] and the fixed-point iteration method in MFSC [39,40,41,42], which have been studied independently. It clarifies the different properties of ML-POSC from deterministic control and MFSC, which ensures the convergence of FBSM.
The regularized FBSM has recently been proposed in deterministic control, and it is guaranteed to converge even for general deterministic control problems [44,45]. Our work gives an intuitive reason why the regularized FBSM is guaranteed to converge. In the regularized FBSM, the Hamiltonian is regularized, which makes the update of the control function smaller. When the regularization is sufficiently strong, the effect of the current control function on the future backward dynamics becomes negligible. Therefore, the regularized FBSM of deterministic control would be guaranteed to converge for a reason similar to that of the FBSM of ML-POSC. However, the convergence of the regularized FBSM is much slower because the stronger regularization makes the update of the control function smaller. The FBSM of ML-POSC does not suffer from this problem because the future backward dynamics do not depend on the current control function even without regularization.
Our work gives a hint about a modification of the fixed-point iteration method to ensure convergence in MFSC. Although the fixed-point iteration method is the most basic algorithm in MFSC, its convergence is not guaranteed [39,40,41,42]. Our work showed that the fixed-point iteration method is equivalent to the FBSM on the probability density function space. Therefore, the idea of regularized FBSM may also be applied to the fixed-point iteration method. More specifically, the fixed-point iteration method may be guaranteed to converge by regularizing the expected Hamiltonian.
In FBSM, we solve the HJB equation and the FP equation using the finite-difference method. However, because the finite-difference method is prone to the curse of dimensionality, it is difficult to solve high-dimensional ML-POSC. To resolve this problem, two directions can be considered. One direction is the policy iteration method [21,46,47]. Although the policy iteration method is almost the same as FBSM, only the update of the control function is different. While FBSM updates the system of HJB-FP equations and the control function simultaneously, the policy iteration method updates them separately. In the policy iteration method, the system of HJB-FP equations becomes linear, which can be solved by the sampling method [48,49,50]. Because the sampling method is more tractable than the finite-difference method, the policy iteration method may allow high-dimensional ML-POSC to be solved. Furthermore, the policy iteration method has recently been studied in MFSC [51,52,53]. However, its convergence is not guaranteed except for special cases in MFSC. In a similar way to FBSM, the convergence of the policy iteration method may be guaranteed in ML-POSC.
The other direction is machine learning. Neural network-based algorithms have recently been proposed in MFSC, which can solve high-dimensional problems efficiently [54,55]. By extending these algorithms, high-dimensional ML-POSC may be solved efficiently. Furthermore, unlike MFSC, the coupling of the HJB-FP equations is limited only to the optimal control function in ML-POSC. By exploiting this nice property, more efficient algorithms may be devised for ML-POSC.

Author Contributions

Conceptualization, Formal analysis, Funding acquisition, Writing—original draft, T.T. and T.J.K.; Software, Visualization, T.T. All authors have read and agreed to the published version of the manuscript.

Funding

The first author received a JSPS Research Fellowship (Grant No. 21J20436). This work was supported by JSPS KAKENHI (Grant No. 19H05799) and JST CREST (Grant No. JPMJCR2011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
COSC: Completely Observable Stochastic Control
POSC: Partially Observable Stochastic Control
ML-POSC: Memory-Limited Partially Observable Stochastic Control
MFSC: Mean-Field Stochastic Control
FBSM: Forward-Backward Sweep Method
HJB: Hamilton-Jacobi-Bellman
FP: Fokker-Planck
SDE: Stochastic Differential Equation
ODE: Ordinary Differential Equation
LQG: Linear-Quadratic-Gaussian

Appendix A. Deterministic Control

In this section, we briefly review Pontryagin’s minimum principle in deterministic control [22,23,24,25].

Appendix A.1. Problem Formulation

In this subsection, we formulate deterministic control [22,23,24,25]. The state of the system $s_t \in \mathbb{R}^{d_s}$ at time $t \in [0, T]$ evolves according to the following ordinary differential equation (ODE):
$$ \frac{ds_t}{dt} = b(t, s_t, u_t), \tag{A1} $$
where the initial state is $s_0$, and the control is $u_t = u(t) \in \mathbb{R}^{d_u}$. The objective function is given by the following cumulative cost function:
$$ J[u] := \int_0^T f(t, s_t, u_t)\, dt + g(s_T), \tag{A2} $$
where $f$ is the cost function and $g$ is the terminal cost function. Deterministic control is the problem of finding the optimal control function $u^{*}$ that minimizes the cumulative cost function $J[u]$ as follows:
$$ u^{*} := \mathop{\arg\min}_{u} J[u]. \tag{A3} $$

Appendix A.2. Preliminary

In this subsection, we show a useful result for deriving Pontryagin's minimum principle. Given arbitrary control functions $u$ and $u'$, $J[u] - J[u']$ can be calculated as follows [16]:
$$ J[u] - J[u'] = \int_0^T \left( H(t, s_t, u_t, \lambda'_t) - H(t, s'_t, u'_t, \lambda'_t) - \left( \frac{\partial H(t, s'_t, u'_t, \lambda'_t)}{\partial s} \right)^{\top} (s_t - s'_t) \right) dt + g(s_T) - g(s'_T) - \left( \frac{\partial g(s'_T)}{\partial s} \right)^{\top} (s_T - s'_T), \tag{A4} $$
where $H$ is the Hamiltonian, which is defined as follows:
$$ H(t, s, u, \lambda) := f(t, s, u) + \lambda^{\top} b(t, s, u). \tag{A5} $$
$\lambda'_t$ is the solution of the following adjoint equation driven by $u'$:
$$ \frac{d\lambda'_t}{dt} = -\frac{\partial H(t, s'_t, u'_t, \lambda'_t)}{\partial s}, \tag{A6} $$
where $\lambda'_T = \partial g(s'_T)/\partial s$. $s_t$ and $s'_t$ are the solutions of the state Equation (A1) driven by $u$ and $u'$, respectively.
In the following, we derive Equation (A4). $J[u] - J[u']$ can be calculated as follows:
$$ \begin{aligned} J[u] - J[u'] &= \left( \int_0^T f(t, s_t, u_t)\, dt + g(s_T) \right) - \left( \int_0^T f(t, s'_t, u'_t)\, dt + g(s'_T) \right) \\ &= \left( \int_0^T \left( H(t, s_t, u_t, \lambda'_t) - (\lambda'_t)^{\top} b(t, s_t, u_t) \right) dt + g(s_T) \right) - \left( \int_0^T \left( H(t, s'_t, u'_t, \lambda'_t) - (\lambda'_t)^{\top} b(t, s'_t, u'_t) \right) dt + g(s'_T) \right) \\ &= \int_0^T \left( H(t, s_t, u_t, \lambda'_t) - H(t, s'_t, u'_t, \lambda'_t) \right) dt - \int_0^T (\lambda'_t)^{\top} \left( b(t, s_t, u_t) - b(t, s'_t, u'_t) \right) dt + g(s_T) - g(s'_T). \end{aligned} \tag{A7} $$
From the state Equation (A1),
$$ J[u] - J[u'] = \int_0^T \left( H(t, s_t, u_t, \lambda'_t) - H(t, s'_t, u'_t, \lambda'_t) \right) dt - \int_0^T (\lambda'_t)^{\top} \frac{d(s_t - s'_t)}{dt}\, dt + g(s_T) - g(s'_T). \tag{A8} $$
From the integration by parts and $s_0 - s'_0 = 0$,
$$ J[u] - J[u'] = \int_0^T \left( H(t, s_t, u_t, \lambda'_t) - H(t, s'_t, u'_t, \lambda'_t) \right) dt + \int_0^T \left( \frac{d\lambda'_t}{dt} \right)^{\top} (s_t - s'_t)\, dt + g(s_T) - g(s'_T) - (\lambda'_T)^{\top} (s_T - s'_T). \tag{A9} $$
From the adjoint Equation (A6), Equation (A4) is obtained.

Appendix A.3. Necessary Condition

In this subsection, we show the necessary condition of the optimal control function of deterministic control. It corresponds to Pontryagin’s minimum principle on the state space (Figure 2 (bottom left)). If u * is the optimal control function of deterministic control (A3), then the following equation is satisfied [16]:
$$ u^{*}(t) = \mathop{\arg\min}_{u} H(t, s^{*}_t, u, \lambda^{*}_t), \quad t \in [0, T], \tag{A10} $$
where $\lambda^{*}_t$ is the solution of the following adjoint equation driven by $u^{*}$:
$$ \frac{d\lambda^{*}_t}{dt} = -\frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s}, \tag{A11} $$
where $\lambda^{*}_T = \partial g(s^{*}_T)/\partial s$. $s^{*}_t$ is the solution of the following state equation driven by $u^{*}$:
$$ \frac{ds^{*}_t}{dt} = \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial \lambda}, \tag{A12} $$
where $s^{*}_0 = s_0$. Because $\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)/\partial \lambda = b(t, s^{*}_t, u^{*}_t)$, Equation (A12) is consistent with Equation (A1).
In the following, we show that Equation (A10) is the necessary condition of the optimal control function of deterministic control. We define the control function
$$ u^{\varepsilon}(t) := \begin{cases} u^{*}(t) & t \in [0, T] \setminus E_{\varepsilon}, \\ u(t) & t \in E_{\varepsilon}, \end{cases} \tag{A13} $$
where $E_{\varepsilon} := [t, t + \varepsilon] \subseteq [0, T]$ and $u: [0, T] \to \mathbb{R}^{d_u}$. From Equation (A4), $J[u^{\varepsilon}] - J[u^{*}]$ can be calculated as follows:
$$ \begin{aligned} J[u^{\varepsilon}] - J[u^{*}] &= \int_0^T \left( H(t, s^{\varepsilon}_t, u^{\varepsilon}_t, \lambda^{*}_t) - H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) - \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} \right)^{\top} (s^{\varepsilon}_t - s^{*}_t) \right) dt + g(s^{\varepsilon}_T) - g(s^{*}_T) - \left( \frac{\partial g(s^{*}_T)}{\partial s} \right)^{\top} (s^{\varepsilon}_T - s^{*}_T) \\ &= \int_0^T \left( H(t, s^{\varepsilon}_t, u^{*}_t, \lambda^{*}_t) - H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) - \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} \right)^{\top} (s^{\varepsilon}_t - s^{*}_t) \right) dt + g(s^{\varepsilon}_T) - g(s^{*}_T) - \left( \frac{\partial g(s^{*}_T)}{\partial s} \right)^{\top} (s^{\varepsilon}_T - s^{*}_T) \\ &\quad + \int_{E_{\varepsilon}} \left( H(t, s^{\varepsilon}_t, u_t, \lambda^{*}_t) - H(t, s^{\varepsilon}_t, u^{*}_t, \lambda^{*}_t) \right) dt. \end{aligned} \tag{A14} $$
Letting $\varepsilon \to 0$,
$$ \begin{aligned} J[u^{\varepsilon}] - J[u^{*}] &= \int_0^T \left( \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} \right)^{\top} (s^{\varepsilon}_t - s^{*}_t) - \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} \right)^{\top} (s^{\varepsilon}_t - s^{*}_t) \right) dt \\ &\quad + \left( \frac{\partial g(s^{*}_T)}{\partial s} \right)^{\top} (s^{\varepsilon}_T - s^{*}_T) - \left( \frac{\partial g(s^{*}_T)}{\partial s} \right)^{\top} (s^{\varepsilon}_T - s^{*}_T) + \left( H(t, s^{*}_t, u_t, \lambda^{*}_t) - H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) \right) dt \\ &= \left( H(t, s^{*}_t, u_t, \lambda^{*}_t) - H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) \right) dt. \end{aligned} \tag{A15} $$
Because $u^{*}$ is the optimal control function, the following inequality is satisfied:
$$ 0 \le J[u^{\varepsilon}] - J[u^{*}] = \left( H(t, s^{*}_t, u_t, \lambda^{*}_t) - H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) \right) dt. \tag{A16} $$
Therefore, Equation (A10) is the necessary condition of the optimal control function of deterministic control.
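The Hamiltonian system (A10)-(A12) can be solved numerically by alternating forward and backward integrations, which is exactly the FBSM discussed in the main text. The following Python sketch is our own illustration of a scalar example (not from the paper): $ds/dt = s + u$, $f = s^2 + u^2$, $g = 0$, for which $H = s^2 + u^2 + \lambda(s + u)$, the minimizer of $H$ is $u = -\lambda/2$, and the adjoint equation is $d\lambda/dt = -(2s + \lambda)$. A damping factor is added because the convergence of plain FBSM is not guaranteed in general deterministic control. As a consistency check in the spirit of Appendix A.5, the converged costate satisfies $\lambda_t = \partial w^{*}(t, s^{*}_t)/\partial s = 2P(t)s^{*}_t$, where $P$ solves the Riccati equation $\dot{P} = P^2 - 2P - 1$, $P(T) = 0$.

```python
# Damped FBSM for the Hamiltonian system (A10)-(A12), scalar example (ours):
#   ds/dt = s + u,  f = s^2 + u^2,  g = 0, so
#   H = s^2 + u^2 + lam * (s + u),  u = -lam/2,  dlam/dt = -(2*s + lam).
T, dt = 1.0, 1e-3
N = int(T / dt)
s0 = 1.0
u = [0.0] * (N + 1)   # initial guess u^0 = 0
alpha = 0.3           # damping; plain FBSM need not converge in general

for _ in range(300):
    # Forward: integrate the state equation (A12) with the current control
    s = [s0] + [0.0] * N
    for n in range(N):
        s[n + 1] = s[n] + (s[n] + u[n]) * dt
    # Backward: integrate the adjoint equation (A11)
    lam = [0.0] * (N + 1)  # lam_T = dg/ds = 0
    for n in range(N - 1, -1, -1):
        lam[n] = lam[n + 1] + (2.0 * s[n + 1] + lam[n + 1]) * dt
    # Damped minimization of the Hamiltonian: u <- u + alpha * (-lam/2 - u)
    u = [un + alpha * (-ln / 2.0 - un) for un, ln in zip(u, lam)]

# Consistency check: solve the Riccati ODE P' = P^2 - 2P - 1, P(T) = 0,
# backward with RK4; the costate should satisfy lam_0 = 2 * P(0) * s0.
P = 0.0
for _ in range(N):
    f = lambda p: p * p - 2.0 * p - 1.0
    k1 = f(P); k2 = f(P - 0.5 * dt * k1); k3 = f(P - 0.5 * dt * k2); k4 = f(P - dt * k3)
    P -= dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

residual = max(abs(2.0 * un + ln) for un, ln in zip(u, lam))
print(lam[0], 2.0 * P * s0, residual)
```

At the fixed point of the damped iteration, the stationarity condition $\partial H/\partial u = 2u + \lambda = 0$ holds along the whole trajectory, and the costate agrees with the value-function gradient from the Riccati solution up to discretization error.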

Appendix A.4. Sufficient Condition

Pontryagin’s minimum principle (A10) is a necessary condition and generally not a sufficient condition. Pontryagin’s minimum principle (A10) becomes a necessary and sufficient condition if the Hamiltonian H ( t , s , u , λ ) is convex with respect to s and u and the terminal cost function g ( s ) is convex with respect to s.
In the following, we show this result. We define an arbitrary control function $u: [0, T] \to \mathbb{R}^{d_u}$. From Equation (A4), $J[u] - J[u^{*}]$ is given by the following equation:
$$ J[u] - J[u^{*}] = \int_0^T \left( H(t, s_t, u_t, \lambda^{*}_t) - H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) - \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} \right)^{\top} (s_t - s^{*}_t) \right) dt + g(s_T) - g(s^{*}_T) - \left( \frac{\partial g(s^{*}_T)}{\partial s} \right)^{\top} (s_T - s^{*}_T). \tag{A17} $$
Since $H(t, s, u, \lambda)$ is convex with respect to $s$ and $u$ and $g(s)$ is convex with respect to $s$, the following inequalities are satisfied:
$$ H(t, s_t, u_t, \lambda^{*}_t) \ge H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t) + \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} \right)^{\top} (s_t - s^{*}_t) + \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial u} \right)^{\top} (u_t - u^{*}_t), \tag{A18} $$
$$ g(s_T) \ge g(s^{*}_T) + \left( \frac{\partial g(s^{*}_T)}{\partial s} \right)^{\top} (s_T - s^{*}_T). \tag{A19} $$
Hence, the following inequality is satisfied:
$$ J[u] - J[u^{*}] \ge \int_0^T \left( \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial u} \right)^{\top} (u_t - u^{*}_t)\, dt. \tag{A20} $$
Because u * satisfies (A10), the following stationary condition is satisfied:
$$ \frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial u} = 0. \tag{A21} $$
Hence, the following inequality is satisfied:
$$ J[u] - J[u^{*}] \ge 0. \tag{A22} $$
Therefore, Equation (A10) is the sufficient condition of the optimal control function of deterministic control if H ( t , s , u , λ ) is convex with respect to s and u and g ( s ) is convex with respect to s.

Appendix A.5. Relationship with Bellman’s Dynamic Programming Principle

From Bellman’s dynamic programming principle on the state space (Figure 2 (top left)) [16], the optimal control function of deterministic control is given by the following equation:
$$ u^{*}(t, s) = \mathop{\arg\min}_{u} H\left( t, s, u, \frac{\partial w^{*}(t, s)}{\partial s} \right), \tag{A23} $$
where $w^{*}(t, s)$ is the value function on the state space, which is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation:
$$ -\frac{\partial w^{*}(t, s)}{\partial t} = H\left( t, s, u^{*}, \frac{\partial w^{*}(t, s)}{\partial s} \right), \tag{A24} $$
where $w^{*}(T, s) = g(s)$. More specifically, the optimal control function of deterministic control is given by $u^{*}(t) = u^{*}(t, s^{*}_t)$, where $s^{*}_t$ is the solution of the state Equation (A12).
The HJB Equation (A24) can be converted into the adjoint Equation (A11) by defining
$$ \lambda^{*}_t := \frac{\partial w^{*}(t, s^{*}_t)}{\partial s}, \tag{A25} $$
where s t * is the solution of the state Equation (A12). This approach can be interpreted as the conversion from Bellman’s dynamic programming principle (Figure 2 (top left)) to Pontryagin’s minimum principle (Figure 2 (bottom left)) on the state space.
In the following, we obtain this result. First, we define
$$ \Lambda^{*}(t, s) := \frac{\partial w^{*}(t, s)}{\partial s}. \tag{A26} $$
By differentiating the HJB Equation (A24) with respect to s, the following equation is obtained:
$$ -\frac{\partial \Lambda^{*}(t, s)}{\partial t} = \frac{\partial H(t, s, u^{*}, \Lambda^{*})}{\partial s} + \frac{\partial \Lambda^{*}(t, s)}{\partial s}\, b(t, s, u^{*}), \tag{A27} $$
where $\Lambda^{*}(T, s) = \partial g(s)/\partial s$. Then the derivative of $\lambda^{*}_t = \Lambda^{*}(t, s^{*}_t)$ with respect to $t$ can be calculated as follows:
$$ \frac{d\lambda^{*}_t}{dt} = \frac{\partial \Lambda^{*}(t, s^{*}_t)}{\partial t} + \frac{\partial \Lambda^{*}(t, s^{*}_t)}{\partial s} \frac{ds^{*}_t}{dt}. \tag{A28} $$
By substituting Equation (A27) into Equation (A28), the following equation is obtained:
$$ \frac{d\lambda^{*}_t}{dt} = -\frac{\partial H(t, s^{*}_t, u^{*}_t, \lambda^{*}_t)}{\partial s} + \frac{\partial \Lambda^{*}(t, s^{*}_t)}{\partial s} \underbrace{\left( \frac{ds^{*}_t}{dt} - b(t, s^{*}_t, u^{*}_t) \right)}_{(*)}. \tag{A29} $$
From the state Equation (A12), $(*) = 0$ is satisfied. Therefore, $\lambda^{*}_t$ satisfies the adjoint Equation (A11).

Appendix B. Mean-Field Stochastic Control

In this section, we show that the system of HJB-FP equations in MFSC corresponds to Pontryagin's minimum principle on the probability density function space. Although the relationship between the system of HJB-FP equations and Pontryagin's minimum principle has been mentioned briefly in MFSC [29,30,31], its details have not yet been investigated. In this section, we address this problem by deriving the system of HJB-FP equations in a similar way to Appendix A. Although our derivations are formal, not analytical, our results are consistent with the conventional results of MFSC [26,27,28,30,31].

Appendix B.1. Problem Formulation

In this subsection, we formulate MFSC [26,27,28]. The state of the system $s_t \in \mathbb{R}^{d_s}$ at time $t \in [0, T]$ evolves by the following stochastic differential equation (SDE):
$$ ds_t = b(t, s_t, p_t, u_t)\, dt + \sigma(t, s_t, p_t, u_t)\, d\omega_t, \tag{A30} $$
where $s_0$ obeys $p_0(s_0)$, $p_t(s) := p(t, s)$ is the probability density function of the state $s$, $u_t(s) := u(t, s) \in \mathbb{R}^{d_u}$ is the control, and $\omega_t \in \mathbb{R}^{d_\omega}$ is the standard Wiener process. The objective function is given by the following expected cumulative cost function:
$$ J[u] := \mathbb{E}_{p(s_{0:T}; u)}\left[ \int_0^T f(t, s_t, p_t, u_t)\, dt + g(s_T, p_T) \right], \tag{A31} $$
where $f$ is the cost function, $g$ is the terminal cost function, $p(s_{0:T}; u)$ is the probability of $s_{0:t} := \{ s_\tau \mid \tau \in [0, t] \}$ given $u$ as a parameter, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to the probability $p$. MFSC is the problem of finding the optimal control function $u^{*}$ that minimizes the expected cumulative cost function $J[u]$ as follows:
$$ u^{*} := \mathop{\arg\min}_{u} J[u]. \tag{A32} $$

Appendix B.2. Preliminary

In this subsection, we show a useful result for deriving Pontryagin's minimum principle. Given arbitrary control functions $u$ and $u'$, $J[u] - J[u']$ can be calculated as follows:
$$ J[u] - J[u'] = \int_0^T \left( \bar{H}(t, p, u, w') - \bar{H}(t, p', u', w') - \int \frac{\delta \bar{H}(t, p', u', w')}{\delta p}(s) \left( p(t, s) - p'(t, s) \right) ds \right) dt + \bar{g}(p) - \bar{g}(p') - \int \frac{\delta \bar{g}(p')}{\delta p}(s) \left( p(T, s) - p'(T, s) \right) ds, \tag{A33} $$
where $\bar{H}$ and $\bar{g}$ are the expected Hamiltonian and the expected terminal cost function, respectively, which are defined as follows:
$$ \bar{H}(t, p, u, w) := \mathbb{E}_{p(s)}\left[ H(t, s, p, u, w) \right], \tag{A34} $$
$$ \bar{g}(p) := \mathbb{E}_{p(s)}\left[ g(s, p) \right]. \tag{A35} $$
$H$ is the Hamiltonian, which is defined as follows:
$$ H(t, s, p, u, w) := f(t, s, p, u) + \mathcal{L}_u w(t, s). \tag{A36} $$
$\mathcal{L}_u$ is the backward diffusion operator, which is defined as follows:
$$ \mathcal{L}_u w(t, s) := \sum_{i=1}^{d_s} b_i(t, s, p, u) \frac{\partial w(t, s)}{\partial s_i} + \frac{1}{2} \sum_{i,j=1}^{d_s} D_{ij}(t, s, p, u) \frac{\partial^2 w(t, s)}{\partial s_i \partial s_j}, \tag{A37} $$
where $D(t, s, p, u) := \sigma(t, s, p, u)\sigma^{\top}(t, s, p, u)$. $w'$ is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation driven by $u'$:
$$ -\frac{\partial w'(t, s)}{\partial t} = \frac{\delta \bar{H}(t, p', u', w')}{\delta p}(s), \tag{A38} $$
where $w'(T, s) = (\delta \bar{g}(p')/\delta p)(s)$. $p$ is the solution of the following Fokker-Planck (FP) equation driven by $u$:
$$ \frac{\partial p(t, s)}{\partial t} = \mathcal{L}^{\dagger}_u p(t, s), \tag{A39} $$
where $p(0, s) = p_0(s)$. $p'$ is the solution of the FP Equation (A39) driven by $u'$. $\mathcal{L}^{\dagger}_u$ is the forward diffusion operator, which is defined as follows:
$$ \mathcal{L}^{\dagger}_u p(t, s) := -\sum_{i=1}^{d_s} \frac{\partial \left( b_i(t, s, p, u)\, p(t, s) \right)}{\partial s_i} + \frac{1}{2} \sum_{i,j=1}^{d_s} \frac{\partial^2 \left( D_{ij}(t, s, p, u)\, p(t, s) \right)}{\partial s_i \partial s_j}. \tag{A40} $$
$\mathcal{L}^{\dagger}_u$ is the conjugate of $\mathcal{L}_u$ as follows:
$$ \int w(t, s)\, \mathcal{L}^{\dagger}_u p(t, s)\, ds = \int p(t, s)\, \mathcal{L}_u w(t, s)\, ds. \tag{A41} $$
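The duality (A41) can also be checked numerically. The following Python sketch is our own illustration: it discretizes both operators with central differences on a periodic one-dimensional grid for constant $b$ and $D$ (an assumption we make for simplicity), in which case the discrete sums agree to rounding error; for state-dependent coefficients, the two discretizations must be chosen consistently for exact discrete duality.

```python
# Discrete check of the duality (A41) between the backward operator
#   L w  =  b * dw/ds + (D/2) * d^2 w/ds^2
# and the forward operator
#   L^dagger p = -b * dp/ds + (D/2) * d^2 p/ds^2
# on a periodic grid with constant b and D (our simplifying assumption).
import math

n, ds = 64, 0.1
b, D = 0.7, 0.5
w = [math.sin(2 * math.pi * i / n) + 0.3 * math.cos(4 * math.pi * i / n) for i in range(n)]
p = [math.exp(math.cos(2 * math.pi * i / n)) for i in range(n)]

def d1(f, i):  # central first difference, periodic boundary
    return (f[(i + 1) % n] - f[(i - 1) % n]) / (2 * ds)

def d2(f, i):  # central second difference, periodic boundary
    return (f[(i + 1) % n] - 2 * f[i] + f[(i - 1) % n]) / ds**2

Lw  = [b * d1(w, i) + 0.5 * D * d2(w, i) for i in range(n)]    # backward operator
Ldp = [-b * d1(p, i) + 0.5 * D * d2(p, i) for i in range(n)]   # forward (adjoint) operator

lhs = sum(w[i] * Ldp[i] for i in range(n)) * ds   # int w L^dagger p ds
rhs = sum(p[i] * Lw[i] for i in range(n)) * ds    # int p L w ds
print(lhs, rhs)
```

The equality is exact here because the central first difference is antisymmetric and the central second difference is symmetric on a periodic grid.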
In the following, we derive Equation (A33). $J[u] - J[u']$ can be calculated as follows:
$$ \begin{aligned} J[u] - J[u'] &= \mathbb{E}_{p(s_{0:T})}\left[ \int_0^T f(t, s_t, p_t, u_t)\, dt + g(s_T, p_T) \right] - \mathbb{E}_{p'(s_{0:T})}\left[ \int_0^T f(t, s_t, p'_t, u'_t)\, dt + g(s_T, p'_T) \right] \\ &= \mathbb{E}_{p(s_{0:T})}\left[ \int_0^T \left( H(t, s_t, p_t, u_t, w') - \mathcal{L}_{u_t} w'(t, s_t) \right) dt + g(s_T, p_T) \right] - \mathbb{E}_{p'(s_{0:T})}\left[ \int_0^T \left( H(t, s_t, p'_t, u'_t, w') - \mathcal{L}_{u'_t} w'(t, s_t) \right) dt + g(s_T, p'_T) \right] \\ &= \int_0^T \left( \bar{H}(t, p, u, w') - \bar{H}(t, p', u', w') \right) dt - \int_0^T \left( \mathbb{E}_{p(t,s)}\left[ \mathcal{L}_{u} w'(t, s) \right] - \mathbb{E}_{p'(t,s)}\left[ \mathcal{L}_{u'} w'(t, s) \right] \right) dt + \bar{g}(p) - \bar{g}(p'). \end{aligned} \tag{A42} $$
Because $\mathcal{L}^{\dagger}_{u_t}$ and $\mathcal{L}^{\dagger}_{u'_t}$ are the conjugates of $\mathcal{L}_{u_t}$ and $\mathcal{L}_{u'_t}$, respectively,
$$ J[u] - J[u'] = \int_0^T \left( \bar{H}(t, p, u, w') - \bar{H}(t, p', u', w') \right) dt - \int_0^T \int \left( \mathcal{L}^{\dagger}_{u} p(t, s) - \mathcal{L}^{\dagger}_{u'} p'(t, s) \right) w'(t, s)\, ds\, dt + \bar{g}(p) - \bar{g}(p'). \tag{A43} $$
From the FP Equation (A39),
$$ J[u] - J[u'] = \int_0^T \left( \bar{H}(t, p, u, w') - \bar{H}(t, p', u', w') \right) dt - \int_0^T \int \frac{\partial \left( p(t, s) - p'(t, s) \right)}{\partial t}\, w'(t, s)\, ds\, dt + \bar{g}(p) - \bar{g}(p'). \tag{A44} $$
From the integration by parts and $p(0, s) - p'(0, s) = p_0(s) - p_0(s) = 0$,
$$ J[u] - J[u'] = \int_0^T \left( \bar{H}(t, p, u, w') - \bar{H}(t, p', u', w') \right) dt + \int_0^T \int \left( p(t, s) - p'(t, s) \right) \frac{\partial w'(t, s)}{\partial t}\, ds\, dt + \bar{g}(p) - \bar{g}(p') - \int \left( p(T, s) - p'(T, s) \right) w'(T, s)\, ds. \tag{A45} $$
From the HJB Equation (A38), Equation (A33) is obtained.

Appendix B.3. Necessary Condition

In this subsection, we show the necessary condition of the optimal control function of MFSC. It corresponds to Pontryagin’s minimum principle on the probability density function space (Figure 2 (bottom right)). If u * is the optimal control function of MFSC (A32), then the following equation is satisfied:
$$ u^{*}(t, s) = \mathop{\arg\min}_{u} H(t, s, p^{*}, u, w^{*}), \quad a.s.\ t \in [0, T],\ s \in \mathbb{R}^{d_s}, \tag{A46} $$
where $w^{*}$ is the solution of the following HJB equation driven by $u^{*}$:
$$ -\frac{\partial w^{*}(t, s)}{\partial t} = \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s), \tag{A47} $$
where $w^{*}(T, s) = (\delta \bar{g}(p^{*})/\delta p)(s)$. $p^{*}$ is the solution of the following FP equation driven by $u^{*}$:
$$ \frac{\partial p^{*}(t, s)}{\partial t} = \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta w}(s), \tag{A48} $$
where $p^{*}(0, s) = p_0(s)$.
In the following, we show that Equation (A46) is the necessary condition of the optimal control function of MFSC. We define the control function
$$ u^{\varepsilon}(t, s) := \begin{cases} u^{*}(t, s) & (t, s) \in ([0, T] \times \mathbb{R}^{d_s}) \setminus (E_{\varepsilon_1} \times F_{\varepsilon_2}), \\ u(t, s) & (t, s) \in E_{\varepsilon_1} \times F_{\varepsilon_2}, \end{cases} \tag{A49} $$
where $E_{\varepsilon_1} := [t, t + \varepsilon_1] \subseteq [0, T]$, $F_{\varepsilon_2} := [s, s + \varepsilon_2] \subseteq \mathbb{R}^{d_s}$, and $u: [0, T] \times \mathbb{R}^{d_s} \to \mathbb{R}^{d_u}$. From Equation (A33), $J[u^{\varepsilon}] - J[u^{*}]$ can be calculated as follows:
$$ \begin{aligned} J[u^{\varepsilon}] - J[u^{*}] &= \int_0^T \left( \bar{H}(t, p^{\varepsilon}, u^{\varepsilon}, w^{*}) - \bar{H}(t, p^{*}, u^{*}, w^{*}) - \int \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s) \left( p^{\varepsilon}(t, s) - p^{*}(t, s) \right) ds \right) dt + \bar{g}(p^{\varepsilon}) - \bar{g}(p^{*}) - \int \frac{\delta \bar{g}(p^{*})}{\delta p}(s) \left( p^{\varepsilon}(T, s) - p^{*}(T, s) \right) ds \\ &= \int_0^T \left( \bar{H}(t, p^{\varepsilon}, u^{*}, w^{*}) - \bar{H}(t, p^{*}, u^{*}, w^{*}) - \int \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s) \left( p^{\varepsilon}(t, s) - p^{*}(t, s) \right) ds \right) dt + \bar{g}(p^{\varepsilon}) - \bar{g}(p^{*}) - \int \frac{\delta \bar{g}(p^{*})}{\delta p}(s) \left( p^{\varepsilon}(T, s) - p^{*}(T, s) \right) ds \\ &\quad + \int_{E_{\varepsilon_1}} \int_{F_{\varepsilon_2}} \left( H(t, s, p^{\varepsilon}, u, w^{*}) - H(t, s, p^{\varepsilon}, u^{*}, w^{*}) \right) p^{\varepsilon}(t, s)\, ds\, dt. \end{aligned} \tag{A50} $$
Letting $\varepsilon_1 \to 0$ and $\varepsilon_2 \to 0$,
$$ \begin{aligned} J[u^{\varepsilon}] - J[u^{*}] &= \int_0^T \left( \int \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s) \left( p^{\varepsilon}(t, s) - p^{*}(t, s) \right) ds - \int \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s) \left( p^{\varepsilon}(t, s) - p^{*}(t, s) \right) ds \right) dt \\ &\quad + \int \frac{\delta \bar{g}(p^{*})}{\delta p}(s) \left( p^{\varepsilon}(T, s) - p^{*}(T, s) \right) ds - \int \frac{\delta \bar{g}(p^{*})}{\delta p}(s) \left( p^{\varepsilon}(T, s) - p^{*}(T, s) \right) ds + \left( H(t, s, p^{*}, u, w^{*}) - H(t, s, p^{*}, u^{*}, w^{*}) \right) p^{*}(t, s)\, ds\, dt \\ &= \left( H(t, s, p^{*}, u, w^{*}) - H(t, s, p^{*}, u^{*}, w^{*}) \right) p^{*}(t, s)\, ds\, dt. \end{aligned} \tag{A51} $$
Because $u^{*}$ is the optimal control function, the following inequality is satisfied:
$$ 0 \le J[u^{\varepsilon}] - J[u^{*}] = \left( H(t, s, p^{*}, u, w^{*}) - H(t, s, p^{*}, u^{*}, w^{*}) \right) p^{*}(t, s)\, ds\, dt. \tag{A52} $$
Therefore, Equation (A46) is the necessary condition of the optimal control function of MFSC.

Appendix B.4. Sufficient Condition

Pontryagin’s minimum principle (A46) is a necessary condition and generally not a sufficient condition. Pontryagin’s minimum principle (A46) becomes a necessary and sufficient condition if the expected Hamiltonian H ¯ ( t , p , u , w ) is convex with respect to p and u and the expected terminal cost function g ¯ ( p ) is convex with respect to p.
In the following, we show this result. We define an arbitrary control function $u: [0, T] \times \mathbb{R}^{d_s} \to \mathbb{R}^{d_u}$. From Equation (A33), $J[u] - J[u^{*}]$ is given by the following equation:
$$ J[u] - J[u^{*}] = \int_0^T \left( \bar{H}(t, p, u, w^{*}) - \bar{H}(t, p^{*}, u^{*}, w^{*}) - \int \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s) \left( p(t, s) - p^{*}(t, s) \right) ds \right) dt + \bar{g}(p) - \bar{g}(p^{*}) - \int \frac{\delta \bar{g}(p^{*})}{\delta p}(s) \left( p(T, s) - p^{*}(T, s) \right) ds. \tag{A53} $$
Because $\bar{H}(t, p, u, w)$ is convex with respect to $p$ and $u$ and $\bar{g}(p)$ is convex with respect to $p$, the following inequalities are satisfied:
$$ \bar{H}(t, p, u, w^{*}) \ge \bar{H}(t, p^{*}, u^{*}, w^{*}) + \int \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta p}(s) \left( p(t, s) - p^{*}(t, s) \right) ds + \int \left( \frac{\delta \bar{H}(t, p^{*}, u^{*}, w^{*})}{\delta u}(s) \right)^{\top} \left( u(t, s) - u^{*}(t, s) \right) ds, \tag{A54} $$
$$ \bar{g}(p) \ge \bar{g}(p^{*}) + \int \frac{\delta \bar{g}(p^{*})}{\delta p}(s) \left( p(T, s) - p^{*}(T, s) \right) ds. \tag{A55} $$
Hence, the following inequality is satisfied:
$$ J[u] - J[u^{*}] \ge \int_0^T \mathbb{E}_{p^{*}(t,s)}\left[ \left( \frac{\partial H(t, s, p^{*}, u^{*}, w^{*})}{\partial u} \right)^{\top} \left( u(t, s) - u^{*}(t, s) \right) \right] dt. \tag{A56} $$
Because u * satisfies Equation (A46), the following stationary condition is satisfied:
$$ \frac{\partial H(t, s, p^{*}, u^{*}, w^{*})}{\partial u} = 0. \tag{A57} $$
Hence, the following inequality is satisfied:
$$ J[u] - J[u^{*}] \ge 0. \tag{A58} $$
Therefore, Equation (A46) is the sufficient condition of the optimal control function of MFSC if the expected Hamiltonian H ¯ ( t , p , u , w ) is convex with respect to p and u and the expected terminal cost function g ¯ ( p ) is convex with respect to p.

Appendix B.5. Relationship with Bellman’s Dynamic Programming Principle

From Bellman’s dynamic programming principle on the probability density function space (Figure 2 (top right)) [56,57,58], the optimal control function of MFSC is given by the following equation:
$$ u^{*}(t, s, p) = \mathop{\arg\min}_{u} H\left( t, s, p, u, \frac{\delta V^{*}(t, p)}{\delta p}(s) \right), \tag{A59} $$
where V * ( t , p ) is the value function on the probability density function space, which is the solution of the following Bellman equation:
$$ -\frac{\partial V^{*}(t, p)}{\partial t} = \mathbb{E}_{p(s)}\left[ H\left( t, s, p, u^{*}, \frac{\delta V^{*}(t, p)}{\delta p}(s) \right) \right], \tag{A60} $$
where $V^{*}(T, p) = \mathbb{E}_{p(s)}\left[ g(s, p) \right]$. More specifically, the optimal control function of MFSC is given by $u^{*}(t, s) = u^{*}(t, s, p^{*})$, where $p^{*}$ is the solution of the FP Equation (A48).
Because the Bellman Equation (A60) is a functional differential equation, it cannot be solved even numerically. To resolve this problem, the previous works [30,31] converted the Bellman Equation (A60) into the HJB Equation (A47) by defining
w * ( t , s ) : = δ V * ( t , p * ) δ p ( s ) ,
where p * is the solution of FP Equation (A48). This approach can be interpreted as the conversion from Bellman’s dynamic programming principle (Figure 2 (top right)) to Pontryagin’s minimum principle (Figure 2 (bottom right)) on the probability density function space.

Appendix C. Derivation of Main Results

Appendix C.1. Derivation of Result in Section 3.1

In this subsection, we derive Equation (13). $J[u]-J[u']$ can be calculated as follows:
$$\begin{aligned}J[u]-J[u']&=\mathbb{E}_{p(s_{0:T})}\left[\int_0^T f(t,s_t,u_t)\,dt+g(s_T)\right]-\mathbb{E}_{p'(s_{0:T})}\left[\int_0^T f(t,s_t,u'_t)\,dt+g(s_T)\right]\\&=\mathbb{E}_{p(s_{0:T})}\left[\int_0^T\left(H(t,s_t,u_t,w')-\mathcal{L}_{u_t}w'(t,s_t)\right)dt+g(s_T)\right]-\mathbb{E}_{p'(s_{0:T})}\left[\int_0^T\left(H(t,s_t,u'_t,w')-\mathcal{L}_{u'_t}w'(t,s_t)\right)dt+g(s_T)\right]\\&=\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right]-\mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt-\int_0^T\left(\mathbb{E}_{p(t,s)}\left[\mathcal{L}_{u}w'(t,s)\right]-\mathbb{E}_{p'(t,s)}\left[\mathcal{L}_{u'}w'(t,s)\right]\right)dt+\mathbb{E}_{p(T,s)}\left[g(s)\right]-\mathbb{E}_{p'(T,s)}\left[g(s)\right].\end{aligned}$$
Because $\mathcal{L}^{\dagger}_{u_t}$ and $\mathcal{L}^{\dagger}_{u'_t}$ are the conjugates of $\mathcal{L}_{u_t}$ and $\mathcal{L}_{u'_t}$, respectively,
$$J[u]-J[u']=\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right]-\mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt-\int_0^T\int\left(\mathcal{L}^{\dagger}_{u}p(t,s)-\mathcal{L}^{\dagger}_{u'}p'(t,s)\right)w'(t,s)\,ds\,dt+\mathbb{E}_{p(T,s)}\left[g(s)\right]-\mathbb{E}_{p'(T,s)}\left[g(s)\right].$$
From the FP Equation (17),
$$J[u]-J[u']=\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right]-\mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt-\int_0^T\int\frac{\partial\left(p(t,s)-p'(t,s)\right)}{\partial t}\,w'(t,s)\,ds\,dt+\mathbb{E}_{p(T,s)}\left[g(s)\right]-\mathbb{E}_{p'(T,s)}\left[g(s)\right].$$
From integration by parts and $p(0,s)-p'(0,s)=p_0(s)-p_0(s)=0$,
$$J[u]-J[u']=\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right]-\mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt+\int_0^T\int\left(p(t,s)-p'(t,s)\right)\frac{\partial w'(t,s)}{\partial t}\,ds\,dt+\mathbb{E}_{p(T,s)}\left[g(s)\right]-\mathbb{E}_{p'(T,s)}\left[g(s)\right]-\int\left(p(T,s)-p'(T,s)\right)w'(T,s)\,ds.$$
From the HJB Equation (16), Equation (13) is obtained.
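This cancellation can be checked numerically in a fully discrete toy setting. The following Python sketch (our own illustration, not the paper's setup) replaces the state space by a finite set, so that an invented generator matrix $Q$ plays the role of $\mathcal{L}_u$ and its transpose plays the adjoint $\mathcal{L}^{\dagger}_u$; the discrete analogue of Equation (13) then holds exactly under Euler discretization:

```python
import numpy as np

# Toy check of the cost-difference identity behind Equation (13) on a
# finite-state Markov chain: the generator Q stands in for L_u, its
# transpose for the adjoint, and w is the backward cost-to-go.
rng = np.random.default_rng(0)
n, T, dt = 4, 1.0, 1e-3
steps = round(T / dt)

def random_generator():
    Q = rng.uniform(0.0, 1.0, (n, n))
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows sum to zero
    return Q

Q_u, Q_v = random_generator(), random_generator()   # two controls u, u'
f_u, f_v = rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)
g = rng.uniform(0.0, 1.0, n)
p0 = np.full(n, 1.0 / n)

def total_cost(Q, f):
    """J = int E_p[f] dt + E_p(T)[g], with p solving the forward equation."""
    p, J = p0.copy(), 0.0
    for _ in range(steps):
        J += p @ f * dt
        p = p + dt * (Q.T @ p)            # forward (FP-like) Euler step
    return J + p @ g

# Backward (HJB-like) sweep for control u: w_t = w_{t+dt} + H(u, w_{t+dt}) dt
w = np.zeros((steps + 1, n))
w[-1] = g
for k in range(steps - 1, -1, -1):
    w[k] = w[k + 1] + dt * (f_u + Q_u @ w[k + 1])

# Forward sweep for control u', accumulating the Hamiltonian difference
p, rhs = p0.copy(), 0.0
for k in range(steps):
    H_v = f_v + Q_v @ w[k + 1]            # H(s, u', w)
    H_u = f_u + Q_u @ w[k + 1]            # H(s, u, w)
    rhs += p @ (H_v - H_u) * dt
    p = p + dt * (Q_v.T @ p)

lhs = total_cost(Q_v, f_v) - total_cost(Q_u, f_u)   # J[u'] - J[u]
assert abs(lhs - rhs) < 1e-9
```

The identity holds to machine precision here because the same Euler grid is used for the forward and backward sweeps, mirroring the integration by parts in the derivation.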

Appendix C.2. Derivation of Result in Section 3.2

In this subsection, we show that Equation (20) is a necessary condition for the optimal control function of ML-POSC. It corresponds to Pontryagin's minimum principle on the probability density function space. We define the control function
$$u^{\varepsilon}(t,z):=\begin{cases}u^*(t,z), & (t,z)\in([0,T]\times\mathbb{R}^{d_z})\setminus(E_{\varepsilon_1}\times F_{\varepsilon_2}),\\ u(t,z), & (t,z)\in E_{\varepsilon_1}\times F_{\varepsilon_2},\end{cases}$$
where $E_{\varepsilon_1}:=[t,t+\varepsilon_1]\cap[0,T]$, $F_{\varepsilon_2}:=[z,z+\varepsilon_2]\cap\mathbb{R}^{d_z}$, and $u:[0,T]\times\mathbb{R}^{d_z}\to\mathbb{R}^{d_u}$. From Equation (13), $J[u^{\varepsilon}]-J[u^*]$ can be calculated as follows:
$$J[u^{\varepsilon}]-J[u^*]=\int_0^T\left(\mathbb{E}_{p^{\varepsilon}(t,s)}\left[H(t,s,u^{\varepsilon},w^*)\right]-\mathbb{E}_{p^{\varepsilon}(t,s)}\left[H(t,s,u^*,w^*)\right]\right)dt=\int_{E_{\varepsilon_1}}\int_{F_{\varepsilon_2}}\left(\mathbb{E}_{p_t^{\varepsilon}(x|z)}\left[H(t,s,u,w^*)\right]-\mathbb{E}_{p_t^{\varepsilon}(x|z)}\left[H(t,s,u^*,w^*)\right]\right)p_t^{\varepsilon}(z)\,dz\,dt.$$
Letting $\varepsilon_1\to 0$ and $\varepsilon_2\to 0$,
$$J[u^{\varepsilon}]-J[u^*]=\int_{E_{\varepsilon_1}}\int_{F_{\varepsilon_2}}\left(\mathbb{E}_{p_t^*(x|z)}\left[H(t,s,u,w^*)\right]-\mathbb{E}_{p_t^*(x|z)}\left[H(t,s,u^*,w^*)\right]\right)p_t^*(z)\,dz\,dt.$$
Because u * is the optimal control function, the following inequality is satisfied:
$$0\le J[u^{\varepsilon}]-J[u^*]=\int_{E_{\varepsilon_1}}\int_{F_{\varepsilon_2}}\left(\mathbb{E}_{p_t^*(x|z)}\left[H(t,s,u,w^*)\right]-\mathbb{E}_{p_t^*(x|z)}\left[H(t,s,u^*,w^*)\right]\right)p_t^*(z)\,dz\,dt.$$
Because this holds for arbitrary $u$, $t$, and $z$, Equation (20) is a necessary condition for the optimal control function of ML-POSC.

Appendix C.3. Derivation of Result in Section 3.3

In this subsection, we show that Equation (20) is a sufficient condition for the optimal control function of ML-POSC if the expected Hamiltonian $\bar{H}(t,p,u,w)$ is convex with respect to $p$ and $u$. We consider an arbitrary control function $u:[0,T]\times\mathbb{R}^{d_z}\to\mathbb{R}^{d_u}$. From Equation (13), $J[u]-J[u^*]$ is given by the following equation:
$$J[u]-J[u^*]=\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right]-\mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right]\right)dt.$$
Because H ¯ ( t , p , u , w ) is convex with respect to p and u, the following inequality is satisfied:
$$\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right]=\bar{H}(t,p,u,w^*)\ge\bar{H}(t,p^*,u^*,w^*)+\int\frac{\delta\bar{H}(t,p^*,u^*,w^*)}{\delta p(s)}\left(p(t,s)-p^*(t,s)\right)ds+\int\frac{\delta\bar{H}(t,p^*,u^*,w^*)}{\delta u(z)}\left(u(t,z)-u^*(t,z)\right)dz.$$
Because
$$\frac{\delta\bar{H}(t,p^*,u^*,w^*)}{\delta p(s)}=\left.\frac{\delta}{\delta p}\left[\int p(s)H(t,s,u^*,w^*)\,ds\right]\right|_{p=p^*}=H(t,s,u^*,w^*),$$
$$\frac{\delta\bar{H}(t,p^*,u^*,w^*)}{\delta u(z)}=\left.\frac{\delta}{\delta u}\left[\int p_t^*(z)\,\mathbb{E}_{p_t^*(x|z)}\left[H(t,s,u,w^*)\right]dz\right]\right|_{u=u^*}=p_t^*(z)\,\mathbb{E}_{p_t^*(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right],$$
the above inequality can be calculated as follows:
$$\begin{aligned}\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right]&\ge\int p^*(t,s)H(t,s,u^*,w^*)\,ds+\int H(t,s,u^*,w^*)\left(p(t,s)-p^*(t,s)\right)ds+\int p_t^*(z)\,\mathbb{E}_{p_t^*(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]\left(u(t,z)-u^*(t,z)\right)dz\\&=\mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right]+\mathbb{E}_{p_t^*(z)}\left[\mathbb{E}_{p_t^*(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]\left(u(t,z)-u^*(t,z)\right)\right].\end{aligned}$$
Hence, the following inequality is satisfied:
$$J[u]-J[u^*]\ge\int_0^T\mathbb{E}_{p_t^*(z)}\left[\mathbb{E}_{p_t^*(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]\left(u(t,z)-u^*(t,z)\right)\right]dt.$$
Because u * satisfies Equation (20), the following stationary condition is satisfied:
$$\mathbb{E}_{p_t^*(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]=0.$$
Hence, the following inequality is satisfied:
$$J[u]-J[u^*]\ge 0.$$
Therefore, Equation (20) is a sufficient condition for the optimal control function of ML-POSC if $\bar{H}(t,p,u,w)$ is convex with respect to $p$ and $u$.

Appendix C.4. Derivation of Result in Section 3.5

In this subsection, we show that Equation (29) is a sufficient condition for the optimal control function of COSC without assuming the convexity of the expected Hamiltonian. We consider an arbitrary control function $u:[0,T]\times\mathbb{R}^{d_s}\to\mathbb{R}^{d_u}$. From Equation (13), $J[u]-J[u^*]$ is given by the following equation:
$$J[u]-J[u^*]=\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right]-\mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right]\right)dt.$$
From (29), the following inequality is satisfied:
$$J[u]-J[u^*]\ge\int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right]-\mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right]\right)dt=0.$$
Therefore, Equation (29) is a sufficient condition for the optimal control function of COSC.
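This argument can be illustrated numerically: when the control minimizes the Hamiltonian pointwise, as in Equation (29), no other policy achieves a lower cost. The sketch below uses an invented finite-state, finite-control toy (the generators `Qs` and costs `fs` are placeholders, not from the paper):

```python
import numpy as np

# Finite-state, finite-control toy: a policy that minimizes the
# Hamiltonian pointwise in the backward sweep is never beaten by
# any other policy (the COSC value-iteration structure).
rng = np.random.default_rng(1)
n, m, T, dt = 4, 3, 1.0, 1e-2
steps = round(T / dt)

Qs, fs = [], []
for _ in range(m):
    Q = rng.uniform(0.0, 1.0, (n, n))
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # proper generator: rows sum to zero
    Qs.append(Q)
    fs.append(rng.uniform(0.0, 1.0, n))
g = rng.uniform(0.0, 1.0, n)
p0 = np.full(n, 1.0 / n)

# Backward sweep with pointwise minimization over the control set
w = g.copy()
policy = np.zeros((steps, n), dtype=int)
for k in range(steps - 1, -1, -1):
    cand = np.stack([w + dt * (fs[a] + Qs[a] @ w) for a in range(m)])
    policy[k] = cand.argmin(axis=0)
    w = cand.min(axis=0)

def cost(pol):
    """Forward sweep: accumulate the running cost under a given policy."""
    p, J = p0.copy(), 0.0
    for k in range(steps):
        f_k = np.array([fs[pol[k, s]][s] for s in range(n)])
        Q_k = np.stack([Qs[pol[k, s]][s] for s in range(n)])
        J += p @ f_k * dt
        p = p + dt * (Q_k.T @ p)
    return J + p @ g

J_star = cost(policy)
assert abs(J_star - p0 @ w) < 1e-9        # optimal cost equals E_{p_0}[w_0]
for _ in range(5):                         # other policies are never cheaper
    rand_pol = rng.integers(0, m, (steps, n))
    assert cost(rand_pol) >= J_star - 1e-9
```

This is exactly why COSC needs no convexity assumption: the pointwise minimum of the Hamiltonian dominates every other choice state by state.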

Appendix C.5. Derivation of Result in Section 4.2 in a Similar Way to Pontryagin's Minimum Principle

In this subsection, we derive Equation (31) from Equation (30) in a similar way to Pontryagin's minimum principle. From Equation (13), the following equality is satisfied:
$$\begin{aligned}J[u_{0:t-dt},u_t,u_{t+dt:T-dt}]-J[u_{0:t-dt},u_t^*,u_{t+dt:T-dt}]&=\mathbb{E}_{p_t(s)}\left[H(t,s,u_t,w_{t+dt})-H(t,s,u_t^*,w_{t+dt})\right]dt\\&=\mathbb{E}_{p_t(z)}\left[\mathbb{E}_{p_t(x|z)}\left[H(t,s,u_t,w_{t+dt})\right]-\mathbb{E}_{p_t(x|z)}\left[H(t,s,u_t^*,w_{t+dt})\right]\right]dt.\end{aligned}$$
Therefore, Equation (31) is equivalent to Equation (30).

Appendix C.6. Derivation of Result in Section 4.2 by the Time-Discretization Method

In this subsection, we derive Equation (31) from Equation (30) by time discretization. Equation (30) can be calculated as follows:
$$\begin{aligned}u_t^*&=\arg\min_{u_t}J[u_{0:T-dt}]\\&=\arg\min_{u_t}\mathbb{E}_{p(s_{0:T};u_{0:T-dt})}\left[\int_0^T f(\tau,s_\tau,u_\tau)\,d\tau+g(s_T)\right]\\&=\arg\min_{u_t}\mathbb{E}_{p(s_{t:T};u_{0:T-dt})}\left[\int_t^T f(\tau,s_\tau,u_\tau)\,d\tau+g(s_T)\right]\\&=\arg\min_{u_t}\mathbb{E}_{p(s_{t:T};u_{0:T-dt})}\left[f(t,s_t,u_t)\,dt+\int_{t+dt}^T f(\tau,s_\tau,u_\tau)\,d\tau+g(s_T)\right]\\&=\arg\min_{u_t}\mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt+\mathbb{E}_{p(s_{t+dt:T}|s_t;u_{t:T-dt})}\left[\int_{t+dt}^T f(\tau,s_\tau,u_\tau)\,d\tau+g(s_T)\right]\right]\\&=\arg\min_{u_t}\mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt+\mathbb{E}_{p(s_{t+dt}|s_t;u_t)}\left[w_{t+dt}(s_{t+dt})\right]\right],\end{aligned}$$
where $p_t(s)$ is the solution of the FP Equation (33) driven by $u_{0:t-dt}$, and $w_{t+dt}(s)$ is defined as follows:
$$w_{t+dt}(s):=\mathbb{E}_{p(s_{t+2dt:T}|s_{t+dt}=s;\,u_{t+dt:T-dt})}\left[\int_{t+dt}^T f(\tau,s_\tau,u_\tau)\,d\tau+g(s_T)\right].$$
From Itô's lemma,
$$\begin{aligned}u_t^*&=\arg\min_{u_t}\mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt+w_{t+dt}(s_t)+\mathcal{L}_{u_t}w_{t+dt}(s_t)\,dt\right]\\&=\arg\min_{u_t}\mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt+\mathcal{L}_{u_t}w_{t+dt}(s_t)\,dt\right]\\&=\arg\min_{u_t}\mathbb{E}_{p_t(s)}\left[H(t,s,u_t,w_{t+dt})\right].\end{aligned}$$
Because control u t is a function of memory z in ML-POSC, the minimization by u t can be exchanged with the expectation by p t ( z ) as follows:
$$u_t^*(z)=\arg\min_{u_t}\mathbb{E}_{p_t(x|z)}\left[H(t,s,u_t,w_{t+dt})\right].$$
Therefore, Equation (31) is derived from Equation (30). Finally, we prove that $w_t(s)$ is the solution of the HJB Equation (32) driven by $u_{t+dt:T-dt}$. $w_t(s)$ can be calculated as follows:
$$\begin{aligned}w_t(s)&=\mathbb{E}_{p(s_{t+dt:T}|s_t=s;\,u_{t:T-dt})}\left[\int_t^T f(\tau,s_\tau,u_\tau)\,d\tau+g(s_T)\right]\\&=f(t,s,u_t)\,dt+\mathbb{E}_{p(s_{t+dt}|s_t=s;\,u_t)}\left[w_{t+dt}(s_{t+dt})\right]\\&=f(t,s,u_t)\,dt+w_{t+dt}(s)+\mathcal{L}_{u_t}w_{t+dt}(s)\,dt\\&=w_{t+dt}(s)+H(t,s,u_t,w_{t+dt})\,dt,\end{aligned}$$
where $w_T(s)=g(s)$. Therefore, $w_t(s)$ defined by Equation (A79) is the solution of the HJB Equation (32) driven by $u_{t+dt:T-dt}$.
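The exchange of minimization and expectation in the last step is easy to illustrate numerically. In the toy below (the joint density $p$, the function $b(x)$, and the quadratic Hamiltonian are all invented for illustration), the $z$-measurable minimizer is the conditional expectation $u^*(z)=-\mathbb{E}[b(x)|z]$, and it satisfies the conditional stationary condition:

```python
import numpy as np

# Memory-limited minimization: because u_t may depend only on z, the
# pointwise minimum is taken after conditioning on z (the shape of
# Equation (31)). Toy Hamiltonian: H(s, u) = u^2/2 + u*b(x).
rng = np.random.default_rng(2)
nx, nz = 5, 3
p = rng.uniform(0.1, 1.0, (nx, nz))
p /= p.sum()                        # joint p_t(x, z)
b = rng.uniform(-1.0, 1.0, nx)

p_z = p.sum(axis=0)                 # marginal p_t(z)
p_x_given_z = p / p_z               # conditional p_t(x | z), columnwise

u_star = -(p_x_given_z * b[:, None]).sum(axis=0)   # u*(z) = -E[b(x)|z]

def expected_H(u):
    """E_{p_t(z)} E_{p_t(x|z)} [H(s, u(z))], with u indexed by z."""
    H = 0.5 * u[None, :] ** 2 + u[None, :] * b[:, None]
    return (H * p).sum()

# Stationarity: E_{p(x|z)}[dH/du] = u*(z) + E[b(x)|z] = 0 for every z
grad = u_star + (p_x_given_z * b[:, None]).sum(axis=0)
assert np.allclose(grad, 0.0)

# Any other z-measurable control does no better
for _ in range(5):
    u = u_star + rng.normal(0.0, 0.5, nz)
    assert expected_H(u) >= expected_H(u_star) - 1e-12
```

Replacing $b(x)$ by the cost-to-go terms of the Hamiltonian recovers the structure of the stationary condition $\mathbb{E}_{p_t(x|z)}[\partial H/\partial u]=0$ used in Appendix C.3.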

Appendix C.7. Derivation of Result in Section 4.3

In this subsection, we mainly derive the inequality of the forward step (35). The inequality of the backward step (34) can be derived in a similar way. In the forward step, $u_{0:t-dt}^{k+1}$ and $u_{t+dt:T-dt}^{k}$ are given, and $u_t^{k+1}$ is defined by
$$u_t^{k+1}(z):=\arg\min_{u_t}\mathbb{E}_{p_t^{k+1}(x|z)}\left[H(t,s,u_t,w_{t+dt}^{k})\right].$$
From the equivalence of Equations (30) and (31), the following equation is satisfied:
$$u_t^{k+1}=\arg\min_{u_t}J[u_{0:t-dt}^{k+1},u_t,u_{t+dt:T-dt}^{k}].$$
Therefore, the inequality of the forward step (35) is satisfied.

Appendix C.8. Derivation of Result in Section 4.4

In this subsection, we show that FBSM for ML-POSC converges to Pontryagin's minimum principle (20). More specifically, we prove that if $J[u_{0:T-dt}^{k+1}]=J[u_{0:T-dt}^{k}]$ holds, then $u_{0:T-dt}^{k+1}$ satisfies Pontryagin's minimum principle (20). We mainly consider the forward step; a similar argument applies to the backward step. If $J[u_{0:T-dt}^{k+1}]=J[u_{0:T-dt}^{k}]$ holds, then $J[u_{0:t}^{k+1},u_{t+dt:T-dt}^{k}]=J[u_{0:t-dt}^{k+1},u_{t:T-dt}^{k}]$ holds from Equation (35). Because $J[u_{0}^{k+1},u_{dt:T-dt}^{k}]=J[u_{0:T-dt}^{k}]$ holds, $u_{0}^{k+1}=u_{0}^{k}$ holds. Then, because $J[u_{0}^{k},u_{dt}^{k+1},u_{2dt:T-dt}^{k}]=J[u_{0:T-dt}^{k}]$ holds, $u_{dt}^{k+1}=u_{dt}^{k}$ holds. Iterating this procedure from $t=0$ to $t=T-dt$, $u_{0:T-dt}^{k+1}=u_{0:T-dt}^{k}$ holds. Therefore, because the HJB equation and the FP equation depend on the same control function $u_{0:T-dt}^{k+1}=u_{0:T-dt}^{k}$, $u_{0:T-dt}^{k+1}$ satisfies Pontryagin's minimum principle (20).
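The inequality of the forward step can be read as exact block-coordinate descent on the time-discretized objective, which is why the objective is non-increasing and a fixed point is stationary. The following sketch demonstrates this abstract mechanism on an invented strictly convex quadratic (not an ML-POSC instance):

```python
import numpy as np

# Abstract view of the forward sweep: updating one coordinate u_t at a
# time with everything else fixed is exact block-coordinate descent, so
# the objective never increases and the limit is a stationary point.
rng = np.random.default_rng(3)
N = 8
M = rng.uniform(-1.0, 1.0, (N, N))
A = M @ M.T + N * np.eye(N)       # symmetric positive definite coupling
b = rng.uniform(-1.0, 1.0, N)
J = lambda u: 0.5 * u @ A @ u - b @ u

u = np.zeros(N)
values = [J(u)]
for sweep in range(60):
    for t in range(N):            # "forward in time": minimize over u_t alone
        u[t] = (b[t] - A[t] @ u + A[t, t] * u[t]) / A[t, t]
        values.append(J(u))

# Monotone decrease at every single update, as in inequality (35)
assert all(values[i + 1] <= values[i] + 1e-12 for i in range(len(values) - 1))
# The fixed point satisfies the stationary condition grad J = 0
assert np.allclose(A @ u, b, atol=1e-8)
```

In FBSM, the "coordinates" are the controls $u_t$ at each time step, minimized exactly via Equation (31); the same monotonicity argument gives inequalities (34) and (35).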

Appendix C.9. Derivation of Result in Section 5.3

In this subsection, we show that FBSM is reduced from Algorithm 1 to Algorithm 2 in the LQG problem of ML-POSC.
We first consider the initial step. We assume that the control function is initialized by
$$u^0(t,z)=-R^{-1}B^{\top}\left(\Pi^0 K(\Lambda^0)(s-\mu)+\Psi\mu\right),$$
where $\Pi^0$ is arbitrary and $\Lambda^0$ is the solution of $\dot{\Lambda}^0=F(\Lambda^0,\Pi^0)$ given $\Lambda^0(0)=\Lambda_0$. When the control function is initialized by (A85), the solution of the FP equation is given by the Gaussian distribution $p_t^0(s):=\mathcal{N}(s|\mu,\Lambda^0)$, where $\mu$ is the solution of (42) and $\Lambda^0$ is the solution of $\dot{\Lambda}^0=F(\Lambda^0,\Pi^0)$ given $\Lambda^0(0)=\Lambda_0$.
We then consider the backward step. When the solution of the FP equation is given by the Gaussian distribution $p_t^k(s):=\mathcal{N}(s|\mu,\Lambda^k)$, the solution of the HJB equation is given by the quadratic function $w_t^{k+1}(s)=s^{\top}\Pi^{k+1}s+(\alpha^{k+1})^{\top}s+\beta^{k+1}$, where $\Pi^{k+1}$, $\alpha^{k+1}$, and $\beta^{k+1}$ are the solutions of the following ODEs:
$$\dot{\Pi}^{k+1}=G(\Lambda^k,\Pi^{k+1}),$$
$$\dot{\alpha}^{k+1}=-\left(A-BR^{-1}B^{\top}\Pi^{k+1}\right)^{\top}\alpha^{k+1}-2\left(I-K(\Lambda^k)\right)^{\top}\Pi^{k+1}BR^{-1}B^{\top}\Pi^{k+1}\left(I-K(\Lambda^k)\right)\mu,$$
$$\dot{\beta}^{k+1}=-\mathrm{tr}\left(\Pi^{k+1}\sigma\sigma^{\top}\right)-\frac{1}{4}\left(\alpha^{k+1}\right)^{\top}BR^{-1}B^{\top}\alpha^{k+1}+\mu^{\top}\left(I-K(\Lambda^k)\right)^{\top}\Pi^{k+1}BR^{-1}B^{\top}\Pi^{k+1}\left(I-K(\Lambda^k)\right)\mu,$$
where $\Pi^{k+1}(T)=P$, $\alpha^{k+1}(T)=0$, and $\beta^{k+1}(T)=0$.
We finally consider the forward step. When the solution of the HJB equation is given by the quadratic function $w_t^k(s)=s^{\top}\Pi^k s+(\alpha^k)^{\top}s+\beta^k$, the solution of the FP equation is given by the Gaussian distribution $p_t^{k+1}(s):=\mathcal{N}(s|\mu,\Lambda^{k+1})$, where $\mu$ is the solution of (42) and $\Lambda^{k+1}$ is the solution of $\dot{\Lambda}^{k+1}=F(\Lambda^{k+1},\Pi^k)$ given $\Lambda^{k+1}(0)=\Lambda_0$. Therefore, FBSM is reduced from Algorithm 1 to Algorithm 2 in the LQG problem of ML-POSC. The details of these calculations are almost the same as those in [14].
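The structure of Algorithm 2 can be sketched in a scalar caricature: alternate a backward sweep for $\Pi$ with $\Lambda$ frozen and a forward sweep for $\Lambda$ with $\Pi$ frozen, until the pair is self-consistent. The functions `F` and `G` below are invented scalar stand-ins for the paper's matrix-valued $F$ and $G$, chosen only to exhibit the coupling through $\Lambda$:

```python
import numpy as np

# Scalar caricature of Algorithm 2 (a sketch under invented dynamics, not
# the paper's matrix equations): alternate backward and forward ODE sweeps.
A, B, R, Qc, P, sig = -0.5, 1.0, 1.0, 1.0, 1.0, 0.5
Lam0, T, dt = 1.0, 1.0, 1e-3
steps = round(T / dt)

def G(lam, pi):
    # backward increment: -dPi/dt = G(Lam, Pi); Lambda-dependent Riccati-like rhs
    return 2.0 * A * pi - (B * pi) ** 2 / R + Qc * (1.0 + 1.0 / (1.0 + lam))

def F(lam, pi):
    # forward increment: dLam/dt = F(Lam, Pi); closed-loop covariance-like rhs
    return 2.0 * (A - B * B * pi / R) * lam + sig ** 2

Lam = np.full(steps + 1, Lam0)        # initial guess Lambda^0
Pi_prev = np.zeros(steps + 1)
gap = np.inf
for k in range(30):
    Pi = np.empty(steps + 1)
    Pi[-1] = P                        # terminal condition Pi(T) = P
    for i in range(steps - 1, -1, -1):            # backward step (Lambda frozen)
        Pi[i] = Pi[i + 1] + dt * G(Lam[i + 1], Pi[i + 1])
    Lam = np.empty(steps + 1)
    Lam[0] = Lam0                     # initial condition Lambda(0) = Lambda_0
    for i in range(steps):                         # forward step (Pi frozen)
        Lam[i + 1] = Lam[i] + dt * F(Lam[i], Pi[i])
    gap = float(np.max(np.abs(Pi - Pi_prev)))
    Pi_prev = Pi.copy()

assert gap < 1e-8                     # the sweeps reach a self-consistent pair
```

Because each sweep only reads the other variable's trajectory through the control function, the iteration settles on a fixed point, mirroring the convergence of the $(\Pi^k,\Lambda^k)$ iterates shown in Figure 4.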

References

  1. Fox, R.; Tishby, N. Minimum-information LQG control Part II: Retentive controllers. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 5603–5609. [Google Scholar] [CrossRef] [Green Version]
  2. Fox, R.; Tishby, N. Minimum-information LQG control part I: Memoryless controllers. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 5610–5616. [Google Scholar] [CrossRef] [Green Version]
  3. Li, W.; Todorov, E. An Iterative Optimal Control and Estimation Design for Nonlinear Stochastic System. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 3242–3247. [Google Scholar] [CrossRef]
  4. Li, W.; Todorov, E. Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system. Int. J. Control. 2007, 80, 1439–1453. [Google Scholar] [CrossRef]
  5. Nakamura, K.; Kobayashi, T.J. Connection between the Bacterial Chemotactic Network and Optimal Filtering. Phys. Rev. Lett. 2021, 126, 128102. [Google Scholar] [CrossRef] [PubMed]
  6. Nakamura, K.; Kobayashi, T.J. Optimal sensing and control of run-and-tumble chemotaxis. Phys. Rev. Res. 2022, 4, 013120. [Google Scholar] [CrossRef]
  7. Pezzotta, A.; Adorisio, M.; Celani, A. Chemotaxis emerges as the optimal solution to cooperative search games. Phys. Rev. E 2018, 98, 042401. [Google Scholar] [CrossRef] [Green Version]
  8. Borra, F.; Cencini, M.; Celani, A. Optimal collision avoidance in swarms of active Brownian particles. J. Stat. Mech. Theory Exp. 2021, 2021, 083401. [Google Scholar] [CrossRef]
  9. Davis, M.H.A.; Varaiya, P. Dynamic Programming Conditions for Partially Observable Stochastic Systems. SIAM J. Control. 1973, 11, 226–261. [Google Scholar] [CrossRef]
  10. Bensoussan, A. Stochastic Control of Partially Observable Systems; Cambridge University Press: Cambridge, UK, 1992. [Google Scholar] [CrossRef]
  11. Fabbri, G.; Gozzi, F.; Święch, A. Stochastic Optimal Control in Infinite Dimension. In Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2017; Volume 82. [Google Scholar] [CrossRef]
  12. Wang, G.; Wu, Z.; Xiong, J. An Introduction to Optimal Control of FBSDE with Incomplete Information; Springer Briefs in Mathematics; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  13. Bensoussan, A.; Yam, S.C.P. Mean field approach to stochastic control with partial information. ESAIM Control. Optim. Calc. Var. 2021, 27, 89. [Google Scholar] [CrossRef]
  14. Tottori, T.; Kobayashi, T.J. Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach. Entropy 2022, 24, 1599. [Google Scholar] [CrossRef]
  15. Kushner, H. Optimal stochastic control. IRE Trans. Autom. Control. 1962, 7, 120–122. [Google Scholar] [CrossRef]
  16. Yong, J.; Zhou, X.Y. Stochastic Controls; Springer: New York, NY, USA, 1999. [Google Scholar] [CrossRef]
  17. Nisio, M. Stochastic Control Theory. In Probability Theory and Stochastic Modelling; Springer: Tokyo, Japan, 2015; Volume 72. [Google Scholar] [CrossRef]
  18. Bensoussan, A. Estimation and Control of Dynamical Systems. In Interdisciplinary Applied Mathematics; Springer International Publishing: Cham, Switzerland, 2018; Volume 48. [Google Scholar] [CrossRef]
  19. Kushner, H.J.; Dupuis, P.G. Numerical Methods for Stochastic Control Problems in Continuous Time; Springer: New York, NY, USA, 1992. [Google Scholar] [CrossRef]
  20. Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions, 2nd ed.; Number 25 in Applications of Mathematics; Springer: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
  21. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley-Interscience: Hoboken, NJ, USA, 2014. [Google Scholar]
  22. Pontryagin, L.S. Mathematical Theory of Optimal Processes; CRC Press: Boca Raton, FL, USA, 1987. [Google Scholar]
  23. Vinter, R. Optimal Control; Birkhäuser Boston: Boston, MA, USA, 2010. [Google Scholar] [CrossRef]
  24. Lewis, F.L.; Vrabie, D.; Syrmos, V.L. Optimal Control; John Wiley & Sons: New York, NY, USA, 2012. [Google Scholar]
  25. Aschepkov, L.T.; Dolgy, D.V.; Kim, T.; Agarwal, R.P. Optimal Control; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
  26. Bensoussan, A.; Frehse, J.; Yam, P. Mean Field Games and Mean Field Type Control Theory; Springer Briefs in Mathematics; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  27. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications I; Volume 83, Probability Theory and Stochastic Modelling; Springer Nature: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  28. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications II; Volume 84, Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  29. Carmona, R.; Delarue, F. The Master Equation for Large Population Equilibriums. In Stochastic Analysis and Applications 2014; Crisan, D., Hambly, B., Zariphopoulou, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; Volume 100, pp. 77–128. [Google Scholar] [CrossRef] [Green Version]
  30. Bensoussan, A.; Frehse, J.; Yam, S.C.P. The Master equation in mean field theory. J. Math. Pures Appl. 2015, 103, 1441–1474. [Google Scholar] [CrossRef]
  31. Bensoussan, A.; Frehse, J.; Yam, S.C.P. On the interpretation of the Master Equation. Stoch. Process. Their Appl. 2017, 127, 2093–2137. [Google Scholar] [CrossRef] [Green Version]
  32. Krylov, I.; Chernous’ko, F. On a method of successive approximations for the solution of problems of optimal control. USSR Comput. Math. Math. Phys. 1963, 2, 1371–1382. [Google Scholar] [CrossRef]
  33. Mitter, S.K. Successive approximation methods for the solution of optimal control problems. Automatica 1966, 3, 135–149. [Google Scholar] [CrossRef]
  34. Chernousko, F.L.; Lyubushin, A.A. Method of successive approximations for solution of optimal control problems. Optim. Control. Appl. Methods 1982, 3, 101–114. [Google Scholar] [CrossRef]
  35. Lenhart, S.; Workman, J.T. Optimal Control Applied to Biological Models; Chapman and Hall/CRC: New York, NY, USA, 2007. [Google Scholar] [CrossRef]
  36. Sharp, J.A.; Burrage, K.; Simpson, M.J. Implementation and acceleration of optimal control for systems biology. J. R. Soc. Interface 2021, 18, 20210241. [Google Scholar] [CrossRef]
  37. Hackbusch, W. A numerical method for solving parabolic equations with opposite orientations. Computing 1978, 20, 229–240. [Google Scholar] [CrossRef]
  38. McAsey, M.; Mou, L.; Han, W. Convergence of the forward-backward sweep method in optimal control. Comput. Optim. Appl. 2012, 53, 207–226. [Google Scholar] [CrossRef]
  39. Carlini, E.; Silva, F.J. Semi-Lagrangian schemes for mean field game models. In Proceedings of the 52nd IEEE Conference on Decision and Control, Firenze, Italy, 10–13 December 2013; pp. 3115–3120. [Google Scholar] [CrossRef] [Green Version]
  40. Carlini, E.; Silva, F.J. A Fully Discrete Semi-Lagrangian Scheme for a First Order Mean Field Game Problem. SIAM J. Numer. Anal. 2014, 52, 45–67. [Google Scholar] [CrossRef] [Green Version]
  41. Carlini, E.; Silva, F.J. A semi-Lagrangian scheme for a degenerate second order mean field game system. Discret. Contin. Dyn. Syst. 2015, 35, 4269. [Google Scholar] [CrossRef]
  42. Lauriere, M. Numerical Methods for Mean Field Games and Mean Field Type Control. arXiv 2021, arXiv:2106.06231. [Google Scholar]
  43. Wonham, W.M. On the Separation Theorem of Stochastic Control. SIAM J. Control. 1968, 6, 312–326. [Google Scholar] [CrossRef]
  44. Li, Q.; Chen, L.; Tai, C.; E, W. Maximum Principle Based Algorithms for Deep Learning. J. Mach. Learn. Res. 2018, 18, 1–29. [Google Scholar]
  45. Liu, X.; Frank, J. Symplectic Runge–Kutta discretization of a regularized forward–backward sweep iteration for optimal control problems. J. Comput. Appl. Math. 2021, 383, 113133. [Google Scholar] [CrossRef]
  46. Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
  47. Howard, R.A. Dynamic Programming and Markov Processes; John Wiley: Oxford, UK, 1960. [Google Scholar]
  48. Kappen, H.J. Linear Theory for Control of Nonlinear Stochastic Systems. Phys. Rev. Lett. 2005, 95, 200201. [Google Scholar] [CrossRef] [Green Version]
  49. Kappen, H.J. Path integrals and symmetry breaking for optimal control theory. J. Stat. Mech. Theory Exp. 2005, 2005, P11011. [Google Scholar] [CrossRef] [Green Version]
  50. Satoh, S.; Kappen, H.J.; Saeki, M. An Iterative Method for Nonlinear Stochastic Optimal Control Based on Path Integrals. IEEE Trans. Autom. Control. 2017, 62, 262–276. [Google Scholar] [CrossRef]
  51. Cacace, S.; Camilli, F.; Goffi, A. A policy iteration method for Mean Field Games. arXiv 2021, arXiv:2007.04818. [Google Scholar] [CrossRef]
  52. Laurière, M.; Song, J.; Tang, Q. Policy iteration method for time-dependent Mean Field Games systems with non-separable Hamiltonians. arXiv 2021, arXiv:2110.02552. [Google Scholar] [CrossRef]
  53. Camilli, F.; Tang, Q. Rates of convergence for the policy iteration method for Mean Field Games systems. arXiv 2022, arXiv:2108.00755. [Google Scholar] [CrossRef]
  54. Ruthotto, L.; Osher, S.J.; Li, W.; Nurbekyan, L.; Fung, S.W. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci. USA 2020, 117, 9183–9193. [Google Scholar] [CrossRef] [Green Version]
  55. Lin, A.T.; Fung, S.W.; Li, W.; Nurbekyan, L.; Osher, S.J. Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games. Proc. Natl. Acad. Sci. USA 2021, 118, e2024713118. [Google Scholar] [CrossRef]
  56. Laurière, M.; Pironneau, O. Dynamic programming for mean-field type control. C. R. Math. 2014, 352, 707–713. [Google Scholar] [CrossRef] [Green Version]
  57. Laurière, M.; Pironneau, O. Dynamic programming for mean-field type control. J. Optim. Theory Appl. 2016, 169, 902–924. [Google Scholar] [CrossRef]
  58. Pham, H.; Wei, X. Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM Control. Optim. Calc. Var. 2018, 24, 437–461. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the relationship between the backward dynamics, the optimal control function, and the forward dynamics in (a) COSC, (b) ML-POSC, (c) deterministic control, and (d) MFSC. w * , p * , λ * , and s * are the solutions of the HJB equation, the FP equation, the adjoint equation, and the state equation, respectively. u * is the optimal control function. The arrows indicate the dependence of variables. The variable at the head of an arrow depends on the variable at the tail of the arrow. (a) In COSC, because the optimal control function u * depends only on the HJB equation w * , it can be obtained by solving the HJB equation w * backward in time from the terminal condition, which is called the value iteration method. (b) In ML-POSC, because the optimal control function u * depends on the FP equation p * as well as the HJB equation w * (orange), it cannot be obtained by the value iteration method. In this paper, we propose FBSM for ML-POSC, which computes the HJB equation w * and the FP equation p * alternately. Because the coupling of the HJB equation w * and the FP equation p * is limited only to the optimal control function u * , the convergence of FBSM is guaranteed in ML-POSC. (c) In deterministic control, because the coupling of the adjoint equation λ * and the state equation s * is not limited to the optimal control function u * (green), the convergence of FBSM is not guaranteed. (d) In MFSC, because the coupling of the HJB equation w * and the FP equation p * is not limited to the optimal control function u * (green), the convergence of FBSM is not guaranteed.
Figure 2. The relationship between Bellman’s dynamic programming principle (top) and Pontryagin’s minimum principle (bottom) on the state space (left) and on the probability density function space (right). The left-hand side corresponds to deterministic control, which is briefly reviewed in Appendix A. The right-hand side corresponds to ML-POSC and MFSC, which are shown in Section 3 and Appendix B, respectively. The conventional approach in ML-POSC [14] and MFSC [30,31] can be interpreted as the conversion from Bellman’s dynamic programming principle (top right) to Pontryagin’s minimum principle (bottom right) on the probability density function space.
Figure 3. Schematic diagram of the effect of updating the control function to the forward and backward dynamics in (a) ML-POSC, (b) deterministic control, and (c) MFSC. w 0 : T , p 0 : T , λ 0 : T , and s 0 : T are the solutions of the HJB equation, the FP equation, the adjoint equation, and the state equation, respectively. u 0 : T d t is a given control function. The arrows indicate the dependence of variables. The variable at the head of an arrow depends on the variable at the tail of the arrow. (a) In ML-POSC, while the update from u t to u t (yellow) changes w 0 : t and p t + d t : T to w 0 : t and p t + d t : T , respectively (red), it does not change p 0 : t and w t + d t : T (blue). From this property, the convergence of FBSM is guaranteed in ML-POSC. (b) In deterministic control, the update from u t to u t (yellow) changes λ t + d t : T to λ t + d t : T as well (red) because the adjoint equation depends on the state equation (green). Because FBSM does not take into account the change of λ t + d t : T , the convergence of FBSM is not guaranteed in deterministic control. (c) In MFSC, the update from u t to u t (yellow) changes w t + d t : T to w t + d t : T as well (red) because the HJB equation depends on the FP equation (green). Because FBSM does not take into account the change of w t + d t : T , the convergence of FBSM is not guaranteed in MFSC.
Figure 4. The elements of the control gain matrix Π k ( t ) R 2 × 2 (ac) and the precision matrix Λ k ( t ) R 2 × 2 (df) obtained by FBSM (Algorithm 2) in the numerical experiment of the LQG problem of ML-POSC. Because Π z x k ( t ) = Π x z k ( t ) and Λ z x k ( t ) = Λ x z k ( t ) , Π z x k ( t ) and Λ z x k ( t ) are not visualized. The darkest curve corresponds to the first iteration k = 0 , and the brightest curve corresponds to the last iteration k = 50 . Π 0 ( t ) is initialized by Π 0 ( t ) = O .
Figure 5. Performance of FBSM in the numerical experiment of the LQG problem of ML-POSC. (a) The objective function J [ u k ] with respect to the iteration k. (bd) Stochastic simulation of state x t (b), memory z t (c), and the cumulative cost (d) for 100 samples. The expectation of the cumulative cost at t = 10 corresponds to the objective function (49). Blue and orange curves correspond to the first iteration k = 0 and the last iteration k = 50 , respectively.
Figure 6. The solutions of the HJB equation $w^k(t, s)$ (a–j) and the FP equation $p^k(t, s)$ (k–t) at the first iteration $k = 0$ (a–e, k–o) and at the last iteration $k = 50$ (f–j, p–t) of FBSM (Algorithm 1) in the numerical experiment of the non-LQG problem of ML-POSC. $u^0(t, z)$ is initialized as $u^0(t, z) = 0$.
Figure 7. Performance of FBSM in the numerical experiment of the non-LQG problem of ML-POSC. (a) The objective function $J[u^k]$ with respect to the iteration $k$. (b) Stochastic simulation of the state $x_t$ for 100 samples. The black rectangles and the cross represent the obstacles and the goal, respectively. Blue and orange curves correspond to the first iteration $k = 0$ and the last iteration $k = 50$, respectively. (c) The objective function (55), computed from 100 samples.

Tottori, T.; Kobayashi, T.J. Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control. Entropy 2023, 25, 208. https://doi.org/10.3390/e25020208

