Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control

Memory-limited partially observable stochastic control (ML-POSC) is the stochastic optimal control problem under incomplete information and memory limitation. To obtain the optimal control function of ML-POSC, a system of the forward Fokker–Planck (FP) equation and the backward Hamilton–Jacobi–Bellman (HJB) equation needs to be solved. In this work, we first show that the system of HJB-FP equations can be interpreted via Pontryagin’s minimum principle on the probability density function space. Based on this interpretation, we then propose the forward-backward sweep method (FBSM) for ML-POSC. FBSM is one of the most basic algorithms for Pontryagin’s minimum principle, which alternately computes the forward FP equation and the backward HJB equation in ML-POSC. Although the convergence of FBSM is generally not guaranteed in deterministic control and mean-field stochastic control, it is guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited to the optimal control function in ML-POSC.


Introduction
In many practical applications of stochastic optimal control theory, several constraints need to be considered. In the cases of small devices [1,2] and biological systems [3][4][5][6][7][8], for example, incomplete information and memory limitation become predominant because their sensors are extremely noisy and their memory resources are severely limited. To take the first of these constraints, incomplete information, into account, partially observable stochastic control (POSC) has been studied extensively in stochastic optimal control theory [9][10][11][12][13]. However, because POSC cannot account for the other constraint, memory limitation, it is not practical enough for designing memory-limited controllers for small devices and biological systems. To resolve this problem, memory-limited POSC (ML-POSC) has recently been proposed [14]. Because ML-POSC formulates noisy observation and limited memory explicitly, it can take both incomplete information and memory limitation into account in the stochastic optimal control problem.
However, ML-POSC cannot be solved in the same way as completely observable stochastic control (COSC), which is the most basic stochastic optimal control problem [15][16][17][18]. In COSC, the optimal control function depends only on the Hamilton–Jacobi–Bellman (HJB) equation, which is a time-backward partial differential equation with a terminal condition (Figure 1a) [15][16][17][18]. Therefore, the optimal control function of COSC can be obtained by solving the HJB equation backward in time from the terminal condition, which is called the value iteration method [19][20][21]. In contrast, the optimal control function of ML-POSC depends not only on the HJB equation but also on the Fokker–Planck (FP) equation, which is a time-forward partial differential equation with an initial condition (Figure 1b) [14]. Because the HJB equation and the FP equation interact with each other through the optimal control function in ML-POSC, the optimal control function of ML-POSC cannot be obtained by the value iteration method.
To propose an algorithm to solve ML-POSC, we first show that the system of HJB-FP equations can be interpreted via Pontryagin's minimum principle on the probability density function space. Pontryagin's minimum principle is one of the most representative approaches to the deterministic optimal control problem; it converts the problem into a two-point boundary value problem of the forward state equation and the backward adjoint equation [22][23][24][25]. We formally show that the system of HJB-FP equations is an extension of the system of state and adjoint equations from the deterministic optimal control problem to the stochastic optimal control problem.
The system of HJB-FP equations also appears in the mean-field stochastic control (MFSC) [26][27][28]. Although the relationship between the system of HJB-FP equations and Pontryagin's minimum principle has been briefly mentioned in MFSC [29][30][31], its details have not yet been investigated. In this work, we investigate it in more detail by deriving the system of HJB-FP equations in a similar way to Pontryagin's minimum principle. We note that our derivations are formal, not analytical, and more mathematically rigorous proofs remain future challenges. However, our results are consistent with many conventional results and also provide a useful perspective in proposing an algorithm.
We then propose the forward-backward sweep method (FBSM) for ML-POSC. FBSM is an algorithm that computes the forward FP equation and the backward HJB equation alternately, which can be interpreted as an extension of the value iteration method. FBSM was originally proposed for Pontryagin's minimum principle in the deterministic optimal control problem, where it computes the forward state equation and the backward adjoint equation alternately [32][33][34]. Because FBSM is easy to implement, it has been used in many applications [35,36]. However, the convergence of FBSM is not guaranteed in deterministic control except for special cases [37,38] because the coupling of the adjoint and state equations is not limited to the optimal control function (Figure 1c). In contrast, we show that the convergence of FBSM is generally guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited only to the optimal control function (Figure 1b).
FBSM is called the fixed-point iteration method in MFSC [39][40][41][42]. Although the fixed-point iteration method is the most basic algorithm to solve MFSC, its convergence is not guaranteed for the same reason as in deterministic control (Figure 1d). Therefore, ML-POSC is a special and nice class of optimal control problems where FBSM, i.e., the fixed-point iteration method, is guaranteed to converge.

This paper is organized as follows: In Section 2, we formulate ML-POSC. In Section 3, we derive the system of HJB-FP equations of ML-POSC from the viewpoint of Pontryagin's minimum principle. In Section 4, we propose FBSM for ML-POSC and prove its convergence. In Section 5, we apply FBSM to the linear-quadratic-Gaussian (LQG) problem. In Section 6, we verify the convergence of FBSM by numerical experiments. In Section 7, we discuss our work. In Appendix A, we briefly review Pontryagin's minimum principle of deterministic control. In Appendix B, we derive the system of HJB-FP equations of MFSC from the viewpoint of Pontryagin's minimum principle. In Appendix C, we show the detailed derivations of our results.

Figure 1. Schematic diagram of the relationship between the backward dynamics, the optimal control function, and the forward dynamics in (a) COSC, (b) ML-POSC, (c) deterministic control, and (d) MFSC. w*, p*, λ*, and s* are the solutions of the HJB equation, the FP equation, the adjoint equation, and the state equation, respectively. u* is the optimal control function. The arrows indicate the dependence of variables. The variable at the head of an arrow depends on the variable at the tail of the arrow. (a) In COSC, because the optimal control function u* depends only on the HJB equation w*, it can be obtained by solving the HJB equation w* backward in time from the terminal condition, which is called the value iteration method.
(b) In ML-POSC, because the optimal control function u * depends on the FP equation p * as well as the HJB equation w * (orange), it cannot be obtained by the value iteration method. In this paper, we propose FBSM for ML-POSC, which computes the HJB equation w * and the FP equation p * alternately. Because the coupling of the HJB equation w * and the FP equation p * is limited only to the optimal control function u * , the convergence of FBSM is guaranteed in ML-POSC. (c) In deterministic control, because the coupling of the adjoint equation λ * and the state equation s * is not limited to the optimal control function u * (green), the convergence of FBSM is not guaranteed. (d) In MFSC, because the coupling of the HJB equation w * and the FP equation p * is not limited to the optimal control function u * (green), the convergence of FBSM is not guaranteed.

Memory-Limited Partially Observable Stochastic Control
In this section, we briefly review the formulation of ML-POSC [14], which is the stochastic optimal control problem under incomplete information and memory limitation.

Problem Formulation
This subsection outlines the formulation of ML-POSC [14]. The state of the system x_t ∈ R^{d_x} at time t ∈ [0, T] evolves by the following stochastic differential equation (SDE): where x_0 obeys p_0(x_0), u_t ∈ R^{d_u} is the control, and ω_t ∈ R^{d_ω} is the standard Wiener process. In COSC [15][16][17][18], because the controller can completely observe the state x_t, it determines the control u_t based on the state x_t as u_t = u(t, x_t). By contrast, in POSC [9][10][11][12][13] and ML-POSC [14], the controller cannot directly observe the state x_t and instead obtains the observation y_t ∈ R^{d_y}, which evolves by the following SDE: where y_0 obeys p_0(y_0), and ν_t ∈ R^{d_ν} is the standard Wiener process. In POSC [9][10][11][12][13], because the controller can completely memorize the observation history y_{0:t} := {y_τ | τ ∈ [0, t]}, it determines the control u_t based on the observation history y_{0:t} as u_t = u(t, y_{0:t}). In ML-POSC [14], by contrast, because the controller cannot completely memorize the observation history y_{0:t}, it compresses the observation history y_{0:t} into the finite-dimensional memory z_t ∈ R^{d_z}, which evolves by the following SDE: where z_0 obeys p_0(z_0), v_t ∈ R^{d_v} is the control, and ξ_t ∈ R^{d_ξ} is the standard Wiener process. The memory dimension d_z is determined by the available memory size of the controller. In addition, the memory noise ξ_t represents the intrinsic stochasticity of the memory device. Therefore, unlike conventional POSC, ML-POSC can explicitly take into account the memory size and noise of the controller. Furthermore, because the memory dynamics (3) depend on the memory control v_t, they can be optimized through the memory control v_t, which is expected to realize the optimal compression of the observation history y_{0:t} into the limited memory z_t.
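Although the general coefficients of the SDEs above are left abstract, the information structure of ML-POSC (the control sees only the memory, and the memory is driven only by the observation) can be illustrated with a minimal Euler–Maruyama simulation. All drifts, noise intensities, and gains below are illustrative stand-ins, not the paper's model:

```python
import numpy as np

def simulate_ml_posc(T=1.0, dt=1e-3, seed=0):
    """Euler-Maruyama sketch of state/observation/memory dynamics.

    The coefficients are illustrative stand-ins for the paper's general
    b, sigma, etc.; only the information structure is faithful:
    u and v depend on z alone, and z is driven by the observation dy.
    """
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x, z = rng.normal(), rng.normal()   # state and memory initial conditions
    xs, zs = [x], [z]
    for _ in range(n):
        u = -z   # state control u(t, z): feedback on the memory only
        v = 0.0  # memory control v(t, z)
        # noisy observation increment (stand-in observation SDE)
        dy = x * dt + 0.1 * rng.normal() * np.sqrt(dt)
        # stand-in state SDE
        x += (x + u) * dt + 0.5 * rng.normal() * np.sqrt(dt)
        # stand-in memory SDE: driven by the observation, not the state
        z += (v - z) * dt + 10.0 * dy
        xs.append(x)
        zs.append(z)
    return np.array(xs), np.array(zs)

xs, zs = simulate_ml_posc()
print(len(xs), np.isfinite(xs).all() and np.isfinite(zs).all())
```

Note that the controller never reads `x` directly: the state enters the loop only through the observation increment `dy`, which is exactly the constraint that distinguishes ML-POSC from COSC.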
In ML-POSC [14], the controller determines the state control u_t and the memory control v_t based only on the memory z_t, i.e., u_t = u(t, z_t) and v_t = v(t, z_t). The objective function of ML-POSC is given by the following expected cumulative cost function: where f is the cost function, g is the terminal cost function, p(x_{0:T}, y_{0:T}, z_{0:T}; u, v) is the probability of x_{0:T}, y_{0:T}, and z_{0:T} given u and v as parameters, and E_p[·] is the expectation with respect to the probability p. Because the cost function f depends on the memory control v_t, ML-POSC can explicitly take into account the memory control cost, which is also impossible with conventional POSC. ML-POSC is the problem of finding the optimal state control function u* and the optimal memory control function v* that minimize the expected cumulative cost function J[u, v]. ML-POSC first formulates the finite-dimensional and stochastic memory dynamics explicitly, and then optimizes the memory control while accounting for the memory control cost. As a result, unlike conventional POSC, ML-POSC is a practical framework for memory-limited controllers where the memory size, noise, and cost are imposed and non-negligible.
The previous work [14] has shown the validity and effectiveness of ML-POSC. In the LQG problem of conventional POSC, the observation history y_{0:T} can be compressed into the Kalman filter without a loss of performance [10,18,43]. Because the Kalman filter is finite-dimensional, it can be interpreted as the finite-dimensional memory z_t and discussed in terms of ML-POSC. The previous work [14] has proven that the optimal memory dynamics of ML-POSC become the Kalman filter in this problem, which indicates that ML-POSC is consistent with conventional POSC. Furthermore, the previous work [14] has demonstrated the effectiveness of ML-POSC in the LQG problem with memory limitation and in a non-LQG problem by numerical experiments.

Problem Reformulation
Although the formulation of ML-POSC in the previous subsection is intuitive, it is inconvenient for further mathematical investigations. To address this problem, we reformulate ML-POSC in this subsection. The formulation in this subsection is simpler and more general than that in the previous subsection.
First, we define the extended state s_t := (x_t, z_t) ∈ R^{d_s}, where d_s = d_x + d_z. The extended state s_t evolves by the following SDE:

ds_t = b̃(t, s_t, ũ_t)dt + σ̃(t, s_t, ũ_t)dω̃_t, (8)

where s_0 obeys p_0(s_0), ũ_t ∈ R^{d_ũ} is the control, and ω̃_t ∈ R^{d_ω̃} is the standard Wiener process. ML-POSC determines the control ũ_t based on the memory z_t as ũ_t = ũ(t, z_t). The extended state SDE (8) includes the previous SDEs (1)-(3) as a special case because they can be represented as follows: where p_0(s_0) = p_0(x_0)p_0(z_0). The objective function of ML-POSC is given by the following expected cumulative cost function: where f̃ is the cost function and g̃ is the terminal cost function. This objective function (11) is obviously more general than that in the previous subsection (5). ML-POSC is the problem of finding the optimal control function ũ* that minimizes the expected cumulative cost function J[ũ]. In the following sections, we mainly consider the formulation of this subsection because it is simpler and more general than that in the previous subsection. Moreover, we omit the tilde ˜ for simplicity of notation.

Pontryagin's Minimum Principle
If the control u_t were determined based on the extended state s_t as u_t = u(t, s_t), ML-POSC would be the same problem as the COSC of the extended state, and its optimality conditions could be obtained in the conventional way [15][16][17][18]. In reality, however, because ML-POSC determines the control u_t based only on the memory z_t as u_t = u(t, z_t), its optimality conditions cannot be obtained in the same way as in COSC. In the previous work [14], the optimality conditions of ML-POSC were obtained by employing a mathematical technique from MFSC [30,31].
In this section, we obtain the optimality conditions of ML-POSC by employing Pontryagin's minimum principle [22][23][24][25] on the probability density function space (Figure 2 (bottom right)). The conventional approach in ML-POSC [14] and MFSC [30,31] can be interpreted as a conversion from Bellman's dynamic programming principle (Figure 2 (top right)) to Pontryagin's minimum principle (Figure 2 (bottom right)) on the probability density function space.
In Appendix A, we briefly review Pontryagin's minimum principle in deterministic control (Figure 2 (left)). In this section, we obtain the optimality conditions of ML-POSC in a similar way to Appendix A (Figure 2 (right)). Furthermore, in Appendix B, we obtain the optimality conditions of MFSC in a similar way (Figure 2 (right)). MFSC is more general than ML-POSC except for the partial observability. In particular, the expected Hamiltonian is non-linear with respect to the probability density function in MFSC, while it is linear in ML-POSC.
Although our derivations are formal, not analytical, and more mathematically rigorous proofs remain future challenges, our results are consistent with the conventional results of COSC [15][16][17][18], ML-POSC [14], and MFSC [26][27][28][30][31], and also provide a useful perspective in proposing an algorithm.

Figure 2. The relationship between Bellman's dynamic programming principle (top) and Pontryagin's minimum principle (bottom) on the state space (left) and on the probability density function space (right). The left-hand side corresponds to deterministic control, which is briefly reviewed in Appendix A. The right-hand side corresponds to ML-POSC and MFSC, which are shown in Section 3 and Appendix B, respectively. The conventional approach in ML-POSC [14] and MFSC [30,31] can be interpreted as the conversion from Bellman's dynamic programming principle (top right) to Pontryagin's minimum principle (bottom right) on the probability density function space.

Preliminary
In this subsection, we show a useful result for obtaining Pontryagin's minimum principle. Given arbitrary control functions u and u′, J[u] − J[u′] can be calculated as follows:

J[u] − J[u′] = ∫₀ᵀ E_{p(t,s)}[H(t, s, u(t, z), w′) − H(t, s, u′(t, z), w′)] dt, (13)

where H is the Hamiltonian, which is defined as follows:

H(t, s, u, w) := f(t, s, u) + L_u w(t, s). (14)

L_u is the backward diffusion operator, which is defined as follows:

L_u w(t, s) := Σᵢ bᵢ(t, s, u) ∂w(t, s)/∂sᵢ + (1/2) Σᵢⱼ Dᵢⱼ(t, s, u) ∂²w(t, s)/∂sᵢ∂sⱼ, (15)

where D(t, s, u) := σ(t, s, u)σᵀ(t, s, u). w′(t, s) is the solution of the following Hamilton–Jacobi–Bellman (HJB) equation driven by u′:

−∂w′(t, s)/∂t = H(t, s, u′(t, z), w′), (16)

where w′(T, s) = g(s). p(t, s) is the solution of the following Fokker–Planck (FP) equation driven by u:

∂p(t, s)/∂t = L†_u p(t, s), (17)

where p(0, s) = p_0(s). L†_u is the forward diffusion operator, which is defined as follows:

L†_u p(t, s) := −Σᵢ ∂[bᵢ(t, s, u)p(t, s)]/∂sᵢ + (1/2) Σᵢⱼ ∂²[Dᵢⱼ(t, s, u)p(t, s)]/∂sᵢ∂sⱼ. (18)

L†_u is the conjugate of L_u in the following sense:

∫ w(s)(L†_u p(s)) ds = ∫ (L_u w(s))p(s) ds. (19)

We derive Equation (13) in Appendix C.1.
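The conjugacy between L_u and L†_u can be checked numerically. The sketch below discretizes both operators on a one-dimensional periodic grid with an illustrative drift b(s) = sin(s) and constant diffusion D (our choices, not the paper's model) and verifies that ∫(L_u w)p ds and ∫ w(L†_u p) ds agree up to discretization error:

```python
import numpy as np

# 1D numerical check that the forward operator L† is the adjoint of the
# backward operator L on a periodic domain:  ∫ (L_u w) p ds = ∫ w (L†_u p) ds.
N = 512
h = 2 * np.pi / N
s = np.arange(N) * h
b = np.sin(s)   # illustrative drift b(s)
D = 1.0         # illustrative constant diffusion D = sigma^2

def d1(f):  # periodic central first difference
    return (np.roll(f, -1) - np.roll(f, 1)) / (2 * h)

def d2(f):  # periodic central second difference
    return (np.roll(f, -1) - 2 * f + np.roll(f, 1)) / h**2

w = np.cos(s)
p = np.exp(np.cos(s))

Lw = b * d1(w) + 0.5 * D * d2(w)       # backward diffusion operator L_u w
Ldag_p = -d1(b * p) + 0.5 * D * d2(p)  # forward diffusion operator L†_u p

lhs = np.sum(Lw * p) * h       # ∫ (L_u w) p ds
rhs = np.sum(w * Ldag_p) * h   # ∫ w (L†_u p) ds
print(abs(lhs - rhs))          # ≈ 0: the discrete operators are adjoint
```

With these central differences the discrete operators are exactly adjoint (the first-difference matrix is antisymmetric and the second-difference matrix is symmetric), so the two sums agree to rounding error, not merely to O(h²).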

Necessary Condition
In this subsection, we show the necessary condition on the optimal control function of ML-POSC. It corresponds to Pontryagin's minimum principle on the probability density function space (Figure 2 (bottom right)). If u* is the optimal control function of ML-POSC (12), then the following equation is satisfied:

u*(t, z) = argmin_u E_{p*_t(x|z)}[H(t, s, u, w*)], (20)

where w*(t, s) is the solution of the following HJB equation driven by u*:

−∂w*(t, s)/∂t = H(t, s, u*(t, z), w*), (21)

where w*(T, s) = g(s). p*_t(x|z) := p*(t, s)/∫p*(t, s)dx is the conditional probability density function of the state x given the memory z, and p*(t, s) is the solution of the following FP equation driven by u*:

∂p*(t, s)/∂t = L†_{u*} p*(t, s), (22)

where p*(0, s) = p_0(s). We derive this result in Appendix C.2.
In deterministic control, Pontryagin's minimum principle can be expressed by the derivatives of the Hamiltonian (Figure 2 (bottom left)). Similarly, the system of HJB-FP Equations (21) and (22) can be expressed by the variations of the expected Hamiltonian H̄(t, p, u, w) := E_{p(s)}[H(t, s, u(t, z), w)] as follows:

∂p*(t, s)/∂t = δH̄(t, p*, u*, w*)/δw(s), (23)

−∂w*(t, s)/∂t = δH̄(t, p*, u*, w*)/δp(s), (24)

where p*(0, s) = p_0(s) and w*(T, s) = g(s) (Figure 2 (bottom right)). Therefore, the system of HJB-FP equations can be interpreted via Pontryagin's minimum principle on the probability density function space.

Sufficient Condition
Pontryagin's minimum principle (20) is only a necessary condition and generally not a sufficient condition. It becomes a necessary and sufficient condition if the expected Hamiltonian H̄(t, p, u, w) is convex with respect to p and u. We obtain this result in Appendix C.3.

Relationship with Bellman's Dynamic Programming Principle
From Bellman's dynamic programming principle on the probability density function space (Figure 2 (top right)) [14], the optimal control function of ML-POSC is given by the following equation: where V*(t, p) is the value function on the probability density function space, which is the solution of the following Bellman equation: where V*(T, p) = E_{p(s)}[g(s)]. More specifically, the optimal control function of ML-POSC is given by u*(t, z) = u*(t, z, p*), where p* is the solution of the FP Equation (22). Because the Bellman Equation (27) is a functional differential equation, it cannot be solved even numerically. To resolve this problem, the previous work [14] converted the Bellman Equation (27) into the HJB Equation (21) by defining

w*(t, s) := δV*(t, p*)/δp(s), (28)

where p* is the solution of the FP Equation (22). This approach can be interpreted as the conversion from Bellman's dynamic programming principle (Figure 2 (top right)) to Pontryagin's minimum principle (Figure 2 (bottom right)) on the probability density function space.

Relationship with Completely Observable Stochastic Control
In the COSC of the extended state, the control u_t is determined based on the extended state s_t as u_t = u(t, s_t). Therefore, in the COSC of the extended state, Pontryagin's minimum principle on the probability density function space is given by the following equation:

u*(t, s) = argmin_u H(t, s, u, w*), (29)

where w*(t, s) is the solution of the HJB Equation (21). Because this proof is almost identical to that of Section 3.2, it is omitted in this paper. While the optimal control function of ML-POSC (20) depends on both the FP equation and the HJB equation, the optimal control function of COSC (29) depends only on the HJB equation. Owing to this nice property of COSC, Equation (29) is not only a necessary condition but also a sufficient condition without assuming the convexity of the expected Hamiltonian. We derive this result in Appendix C.4.
This result is consistent with the conventional result of COSC [15][16][17][18]. Unlike ML-POSC and MFSC, COSC can be solved by Bellman's dynamic programming principle on the state space. In COSC, Pontryagin's minimum principle on the probability density function space is equivalent to Bellman's dynamic programming principle on the state space. Because Bellman's dynamic programming principle on the state space is a necessary and sufficient condition, Pontryagin's minimum principle on the probability density function space also becomes a necessary and sufficient condition.

Forward-Backward Sweep Method
In this section, we propose FBSM for ML-POSC and then prove its convergence by employing the interpretation of the system of HJB-FP equations by Pontryagin's minimum principle introduced in the previous section.

Forward-Backward Sweep Method
In this subsection, we propose FBSM for ML-POSC, which is summarized in Algorithm 1. Pontryagin's minimum principle is only a necessary condition of the optimal control function, not a sufficient condition. Therefore, the control function obtained by FBSM is not necessarily the global optimum except in the case where the expected Hamiltonian is convex. Nevertheless, the control function obtained by FBSM is expected to be superior to most control functions because it is locally optimal.
FBSM has been used in deterministic control [32,34,35,38] and MFSC [39][40][41][42]. However, the convergence of FBSM for these problems is not guaranteed because the backward dynamics depend on the forward dynamics even without the optimal control function (Figure 1c,d). In contrast, the convergence of FBSM is guaranteed in ML-POSC because the backward HJB equation does not depend on the forward FP equation except through the optimal control function (Figure 1b). More specifically, in FBSM for ML-POSC, the objective function J[u^k_{0:T−dt}] monotonically decreases and finally converges, and the converged control function satisfies Pontryagin's minimum principle. In the following subsections, we prove this nice property of FBSM for ML-POSC.
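The mechanism behind this guarantee can be illustrated with a toy coordinate-descent analogue: if each per-time-step control update exactly minimizes the objective given all other controls, then backward and forward sweeps can only decrease the objective. The quadratic objective and updates below are illustrative stand-ins, not Algorithm 1 itself:

```python
import numpy as np

# Toy analogue of FBSM's sweeps: J is minimized exactly in one control u_t at
# a time (all other controls fixed), first in a backward sweep over t, then in
# a forward sweep.  Each exact per-time-step minimization can only decrease J,
# so J is monotonically non-increasing -- the structure behind Section 4.3.

def J(u, c):
    # stand-in objective: tracking cost plus a coupling between adjacent times
    return np.sum((u - c) ** 2) + np.sum((u[1:] - u[:-1]) ** 2)

def exact_update(u, c, t):
    # argmin over u_t of J with the other controls fixed (J is quadratic in u_t)
    num, den = c[t], 1.0
    if t > 0:
        num += u[t - 1]; den += 1.0
    if t < len(u) - 1:
        num += u[t + 1]; den += 1.0
    u[t] = num / den

rng = np.random.default_rng(0)
c = rng.normal(size=50)   # stand-in targets
u = np.zeros(50)
costs = [J(u, c)]
for _ in range(10):                   # FBSM-like iterations
    for t in reversed(range(50)):     # backward sweep
        exact_update(u, c, t)
    for t in range(50):               # forward sweep
        exact_update(u, c, t)
    costs.append(J(u, c))
print(costs[0], costs[-1])            # non-increasing sequence
```

In deterministic control and MFSC the analogue of `exact_update` would be computed with a stale backward quantity (λ or w) that itself depends on u_t, so the update is no longer an exact minimization and monotonicity is lost.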

Preliminary
In this subsection, we show an important result for proving the convergence of FBSM for ML-POSC. We suppose that u_{0:t−dt,t+dt:T−dt} := {u_0, ..., u_{t−dt}, u_{t+dt}, ..., u_{T−dt}} is given and only u_t is optimized. In ML-POSC, u*_t can be calculated as follows:

u*_t(z) = argmin_{u_t} E_{p_t(x|z)}[H(t, s, u_t, w_{t+dt})], (31)

where w_{t+dt}(s) is the solution of the following time-discretized HJB equation driven by u_{t+dt:T−dt}: where w_T(s) = g(s). p_t(x|z) := p_t(s)/∫p_t(s)dx is the conditional probability density function of the state x given the memory z, and p_t(s) is the solution of the following time-discretized FP equation driven by u_{0:t−dt}: where the initial condition is p_0(s). Equation (31) is obtained in a similar way to Pontryagin's minimum principle in Appendix C.5 and also by the time discretization method in Appendix C.6. Importantly, w_{t+dt} does not depend on u_t in ML-POSC (Figure 3a), while λ_{t+dt} and w_{t+dt} depend on u_t in deterministic control (Figure 3b) and MFSC (Figure 3c), respectively. Therefore, u*_t can be obtained without modifying w_{t+dt} in ML-POSC, which is essentially different from deterministic control and MFSC. From this nice property, the convergence of FBSM is guaranteed in ML-POSC.

Figure 3. The variable at the head of an arrow depends on the variable at the tail of the arrow. (a) In ML-POSC, while the update from u_t to u′_t (yellow) changes w_{0:t} and p_{t+dt:T} to w′_{0:t} and p′_{t+dt:T}, respectively (red), it does not change p_{0:t} and w_{t+dt:T} (blue). From this property, the convergence of FBSM is guaranteed in ML-POSC. (b) In deterministic control, the update from u_t to u′_t (yellow) changes λ_{t+dt:T} to λ′_{t+dt:T} as well (red) because the adjoint equation depends on the state equation (green). Because FBSM does not take into account the change of λ_{t+dt:T}, the convergence of FBSM is not guaranteed in deterministic control. (c) In MFSC, the update from u_t to u′_t (yellow) changes w_{t+dt:T} to w′_{t+dt:T} as well (red) because the HJB equation depends on the FP equation (green).
Because FBSM does not take into account the change of w_{t+dt:T}, the convergence of FBSM is not guaranteed in MFSC.

Monotonicity
In FBSM for ML-POSC, the objective function is monotonically non-increasing with respect to the update of the control function at each time step. More specifically, inequality (34) is satisfied in the backward step, and inequality (35) is satisfied in the forward step. We prove this result in Appendix C.7. Furthermore, in FBSM for ML-POSC, the objective function is monotonically non-increasing with respect to the update of the control function at each iteration step:

J[u^{k+1}_{0:T−dt}] ≤ J[u^k_{0:T−dt}]. (36)

Equation (36) follows immediately from Equations (34) and (35).

Convergence to Pontryagin's Minimum Principle
We assume that J[u_{0:T−dt}] has a lower bound. From Equation (36), the objective function in FBSM for ML-POSC is then guaranteed to converge. Furthermore, we assume that if the set of candidates for u^{k+1}_t includes u^k_t, then u^{k+1}_t is set to u^k_t. Under these assumptions, FBSM for ML-POSC converges to Pontryagin's minimum principle (20). More specifically, if J[u^{k+1}_{0:T−dt}] = J[u^k_{0:T−dt}] holds, then u^{k+1}_{0:T−dt} satisfies Pontryagin's minimum principle (20). We prove this result in Appendix C.8.
Therefore, unlike in deterministic control and MFSC, in FBSM for ML-POSC, the objective function J[u^k_{0:T−dt}] monotonically decreases and finally converges to a local minimum at which the control function u^k_{0:T−dt} satisfies Pontryagin's minimum principle (20).

Linear-Quadratic-Gaussian Problem
In this section, we apply FBSM to the LQG problem of ML-POSC [14]. In the LQG problem of ML-POSC, the system of HJB-FP equations is reduced from partial differential equations to ordinary differential equations.

Problem Formulation
In the LQG problem of ML-POSC, the extended state SDE (8) is given as follows [14]:

ds_t = {A(t)s_t + B(t)u_t}dt + σ(t)dω_t, (37)

where s_0 obeys the Gaussian distribution p_0(s_0) := N(s_0|μ_0, Λ_0), where μ_0 is the mean vector and Λ_0 is the precision matrix. The objective function (11) is given as follows:

J[u] := E[∫₀ᵀ (s_tᵀQ(t)s_t + u_tᵀR(t)u_t)dt + s_TᵀPs_T], (38)

where Q(t) ⪰ O, R(t) ≻ O, and P ⪰ O. The LQG problem of ML-POSC is the problem of finding the optimal control function u* that minimizes the objective function J[u].

Pontryagin's Minimum Principle
In the LQG problem of ML-POSC, Pontryagin's minimum principle (20) can be calculated as follows [14]: where K(Λ) is defined as follows: where μ(t) and Λ(t) are the mean vector and the precision matrix of the extended state, respectively, which correspond to the solution of the FP Equation (22). We note that E_{p*_t(x|z)}[s] = K(Λ)(s − μ) + μ is satisfied. μ(t) and Λ(t) are the solutions of the following ordinary differential equations (ODEs): where μ(0) = μ_0 and Λ(0) = Λ_0. Ψ(t) and Π(t) are the control gain matrices of the deterministic and stochastic extended state, respectively, which correspond to the solution of the HJB Equation (21). Ψ(t) and Π(t) are the solutions of the following ODEs: where Ψ(T) = Π(T) = P. The ODE of Ψ (44) is the Riccati equation [16][17][18], which also appears in the LQG problem of COSC. In contrast, the ODE of Π (45) is the partially observable Riccati equation [14], which appears only in the LQG problem of ML-POSC. The above result is obtained in [14]. The ODE of Ψ (44) can be solved backward in time from the terminal condition. Using Ψ, the ODE of μ (42) can be solved forward in time from the initial condition. In contrast, the ODEs of Π (45) and Λ (43) cannot be solved in the same way as the ODEs of Ψ (44) and μ (42) because they interact with each other, which is the same difficulty as in the system of HJB-FP equations.
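The backward solve of the Riccati equation can be sketched as follows. We assume the standard scalar form −dΨ/dt = Q + 2AΨ − Ψ²B²/R with Ψ(T) = P (a stand-in for the paper's matrix-valued Equation (44)), and integrate it backward in time by the fourth-order Runge–Kutta method:

```python
import numpy as np

# Backward RK4 integration of a scalar Riccati equation, assuming the
# standard form  -dPsi/dt = Q + 2*A*Psi - Psi*B*(1/R)*B*Psi,  Psi(T) = P.
A, B, Q, R, P, T = 0.0, 1.0, 1.0, 1.0, 0.0, 1.0

def rhs(psi):
    # dPsi/dt; the ODE is integrated backward from t = T to t = 0
    return -(Q + 2 * A * psi - psi * B * (1 / R) * B * psi)

n = 1000
dt = T / n
psi = P                        # terminal condition Psi(T) = P
for _ in range(n):             # RK4 steps with negative time increment
    k1 = rhs(psi)
    k2 = rhs(psi - 0.5 * dt * k1)
    k3 = rhs(psi - 0.5 * dt * k2)
    k4 = rhs(psi - dt * k3)
    psi -= dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# For A=0, B=Q=R=1, P=0 the exact solution is Psi(t) = tanh(T - t).
print(psi, np.tanh(T))         # Psi(0) ≈ tanh(1) ≈ 0.7616
```

The same backward sweep applies verbatim to the matrix case with `psi` a NumPy array and the products replaced by matrix products.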

Forward-Backward Sweep Method
In the LQG problem of ML-POSC, FBSM reduces from Algorithm 1 to Algorithm 2. F(Λ, Π) and G(Λ, Π) are defined by the right-hand sides of the ODEs of Λ (43) and Π (45), respectively. This result is obtained in Appendix C.9. Importantly, in the LQG problem of ML-POSC, FBSM computes the ODEs of Λ (43) and Π (45) instead of the FP Equation (22) and the HJB Equation (21).
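The alternation in Algorithm 2 can be sketched generically: integrate Λ forward using the previous Π, integrate Π backward using the new Λ, and repeat until the iterates stop changing. The right-hand sides F and G below are illustrative scalar stand-ins, not the paper's definitions from the ODEs (43) and (45):

```python
import numpy as np

# Sketch of the Lambda/Pi alternation: forward sweep for Lambda with the
# previous Pi, backward sweep for Pi with the new Lambda.  F and G are
# weakly coupled scalar stand-ins chosen so the alternation contracts.
n, T = 200, 1.0
dt = T / n

def F(lam, pi):  # stand-in for dLambda/dt
    return 1.0 - lam + 0.1 * pi

def G(lam, pi):  # stand-in for dPi/dt
    return pi - 1.0 - 0.1 * lam

def sweep(pi_prev, lam0=1.0, piT=0.0):
    lam = np.empty(n + 1)
    lam[0] = lam0                          # initial condition Lambda(0)
    for i in range(n):                     # forward Euler sweep for Lambda
        lam[i + 1] = lam[i] + dt * F(lam[i], pi_prev[i])
    pi = np.empty(n + 1)
    pi[-1] = piT                           # terminal condition Pi(T)
    for i in reversed(range(n)):           # backward Euler sweep for Pi
        pi[i] = pi[i + 1] - dt * G(lam[i + 1], pi[i + 1])
    return lam, pi

pi = np.zeros(n + 1)                       # initialization Pi^0 = O
for k in range(30):
    lam, pi_new = sweep(pi)
    gap = np.max(np.abs(pi_new - pi))      # change across one iteration
    pi = pi_new
print(gap)                                 # small gap => alternation converged
```

With the weak coupling above the alternation is a contraction and `gap` shrinks geometrically; substituting the paper's F and G (and matrix-valued Λ, Π) into `sweep` recovers the structure of Algorithm 2.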

Numerical Experiments
In this section, we verify the convergence of FBSM in ML-POSC by performing numerical experiments on the LQG and non-LQG problems. The setting of the numerical experiments is the same as the previous work [14].

LQG Problem
In this subsection, we verify the convergence of FBSM for ML-POSC by conducting a numerical experiment on the LQG problem. We consider the state x_t ∈ R, the observation y_t ∈ R, and the memory z_t ∈ R, which evolve by the following SDEs: where x_0 and z_0 obey the standard Gaussian distributions, y_0 is an arbitrary real number, ω_t ∈ R and ν_t ∈ R are independent standard Wiener processes, and u_t = u(t, z_t) ∈ R and v_t = v(t, z_t) ∈ R are the controls. The objective function to be minimized is the expected cumulative cost J[u, v] := E_{p(x_{0:10}, y_{0:10}, z_{0:10}; u, v)}[∫₀¹⁰ ··· dt] (49). Therefore, the objective of this problem is to minimize the state variance with small state and memory controls. This problem corresponds to the LQG problem defined by (37) and (38). By defining s_t := (x_t, z_t) ∈ R², ũ_t := (u_t, v_t) ∈ R², and ω̃_t := (ω_t, ν_t) ∈ R², the SDEs (46)-(48) can be rewritten in the form corresponding to (37). Furthermore, the objective function (49) can be rewritten in the form corresponding to (38). We apply the FBSM of the LQG problem (Algorithm 2) to this problem. Π^0(t) is initialized as Π^0(t) = O. To solve the ODEs of Π^k(t) and Λ^k(t), we use the fourth-order Runge-Kutta method. Figure 4 shows the control gain matrix Π^k(t) ∈ R^{2×2} and the precision matrix Λ^k(t) ∈ R^{2×2} obtained by FBSM. The color of each curve represents the iteration k. The darkest curve corresponds to the first iteration k = 0, and the brightest curve corresponds to the last iteration k = 50. Importantly, Π^k(t) and Λ^k(t) converge with respect to the iteration k.

Figure 4. The elements of the control gain matrix Π^k(t) ∈ R^{2×2} (a-c) and the precision matrix Λ^k(t) ∈ R^{2×2} (d-f) obtained by FBSM (Algorithm 2) in the numerical experiment of the LQG problem of ML-POSC. Because Π^k_{zx}(t) = Π^k_{xz}(t) and Λ^k_{zx}(t) = Λ^k_{xz}(t), Π^k_{zx}(t) and Λ^k_{zx}(t) are not visualized. The darkest curve corresponds to the first iteration k = 0, and the brightest curve corresponds to the last iteration k = 50.
Π^0(t) is initialized as Π^0(t) = O. Figure 5a shows the objective function J[u^k] with respect to the iteration k. The objective function J[u^k] monotonically decreases with respect to the iteration k, which is consistent with Section 4.3. This monotonicity of FBSM is a nice property of ML-POSC that is not guaranteed in deterministic control and MFSC. The objective function J[u^k] finally converges, and u^k satisfies Pontryagin's minimum principle from Section 4.4. Figure 5b-d compare the performance of the control function u^k at the first iteration k = 0 and at the last iteration k = 50 by performing a stochastic simulation. At the first iteration k = 0, the distributions of the state and the memory are unstable, and the cumulative cost diverges. In contrast, at the last iteration k = 50, the distributions of the state and the memory are stabilized, and the cumulative cost is smaller. This result indicates that FBSM improves the performance in ML-POSC. Although Figure 5b-d look similar to the corresponding figures in the previous work [14], they are comparing different things. While Figure 5b-d demonstrate the performance improvement by the FBSM iteration, the previous work [14] compares the performance of the partially observable Riccati Equation (45) with that of the conventional Riccati Equation (44).

Non-LQG Problem
In this subsection, we verify the convergence of FBSM in ML-POSC by conducting a numerical experiment on a non-LQG problem. We consider the state x_t ∈ R, the observation y_t ∈ R, and the memory z_t ∈ R, which evolve by the following SDEs: where x_0 and z_0 obey the Gaussian distributions p_0(x_0) = N(x_0|0, 0.01) and p_0(z_0) = N(z_0|0, 0.01), respectively, y_0 is an arbitrary real number, ω_t ∈ R and ν_t ∈ R are independent standard Wiener processes, and u_t = u(t, z_t) ∈ R is the control. For the sake of simplicity, memory control is not considered. The objective function to be minimized is given as follows: where the cost function is high in 0.3 ≤ t ≤ 0.6 and 0.1 ≤ |x| ≤ 2.0, which represents obstacles. In addition, the terminal cost function is lowest at x = 0, which represents the desirable goal. Therefore, the system should avoid the obstacles and reach the goal with a small control. Because the cost function is non-quadratic, this is a non-LQG problem. We apply FBSM (Algorithm 1) to this problem. u^0(t, z) is initialized as u^0(t, z) = 0. To solve the HJB equation and the FP equation, we use the finite-difference method. Figure 6 shows w^k(t, s) and p^k(t, s) obtained by FBSM at the first iteration k = 0 and at the last iteration k = 50. From Appendix C.6, w^k(t, s) is given as follows: Because u^0(t, z) = 0, w^0(t, s) reflects the cost function corresponding to the obstacles and the goal (Figure 6a-e). In contrast, because u^50(t, z) ≠ 0, w^50(t, s) becomes more complex (Figure 6f-j). In particular, while w^0(t, s) does not depend on the memory z, w^50(t, s) depends on the memory z, which indicates that the control function u^50(t, z) is adjusted by the memory z. We note that w^0(1, s) (Figure 6e) and w^50(1, s) (Figure 6j) are the same because they are given by the terminal cost function as w^0(1, s) = w^50(1, s) = 10x².
Furthermore, while p^0(t, s) is a unimodal distribution (Figure 6k–o), p^50(t, s) is a bimodal distribution (Figure 6p–t), which can avoid the obstacles. Figure 7a shows the objective function J[u^k] with respect to the iteration k. The objective function J[u^k] decreases monotonically with the iteration k, which is consistent with Section 4.3. This monotonicity of FBSM is a nice property of ML-POSC that is not guaranteed in deterministic control and MFSC. The objective function J[u^k] finally converges, and u^k satisfies Pontryagin's minimum principle from Section 4.4. Figure 7b,c compare the performance of the control function u^k at the first iteration k = 0 and the last iteration k = 50 by conducting a stochastic simulation. At the first iteration k = 0, the obstacles cannot be avoided, which results in a higher objective function. In contrast, at the last iteration k = 50, the obstacles can be avoided, which results in a lower objective function. This result indicates that FBSM improves the performance in ML-POSC. Although Figure 7b,c look similar to Figure 3a,b in the previous work [14], they compare different things: while Figure 7b,c demonstrate the performance improvement by the FBSM iteration, the previous work [14] compares the performance of ML-POSC with the local LQG approximation of the conventional POSC.
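The alternating structure of FBSM described above can be illustrated on a deliberately minimal, discrete toy problem. The sketch below is not the paper's continuous-state algorithm: the hidden state, memory, and control are all binary, the names (`fbsm_toy`, `acc`) are our own, and the problem is chosen so that the control enters only the cost, which makes the forward distribution control-independent and the sweep easy to check by hand. It nevertheless exercises the same loop: a forward step that supplies the joint state-memory distribution, and a backward step that solves the value recursion while updating the control from the conditional distribution of the state given the memory.

```python
import numpy as np

def fbsm_toy(T=10, acc=0.8, n_iter=5):
    """Forward-backward sweep on a minimal discrete ML-POSC-style toy.

    The hidden state x in {0,1} is static; the memory z in {0,1} is the
    latest noisy observation of x (correct with probability `acc`). The
    control u(t, z) in {0,1} enters only through the cost
    c(x, a) = 1{a != x}, so the forward distribution p_t(x, z) is the
    same at every t and every iteration.
    """
    P_z_x = np.array([[acc, 1 - acc], [1 - acc, acc]])  # P(z | x)
    p_xz = 0.5 * P_z_x                                  # joint p_t(x, z)
    cost = np.array([[0.0, 1.0], [1.0, 0.0]])           # cost[x, a] = 1{a != x}
    u = np.zeros((T, 2), dtype=int)                     # initial control u^0(t, z) = 0

    def J(u):
        # expected cumulative cost under the (control-independent) distribution
        return sum(p_xz[x, z] * cost[x, u[t, z]]
                   for t in range(T) for x in (0, 1) for z in (0, 1))

    Js = [J(u)]
    for _ in range(n_iter):
        # forward step: here p^k_t(x, z) = p_xz for every t
        # backward step: solve w backward while updating the control (FBSM)
        w = np.zeros((2, 2))                            # terminal value w_T(x, z) = 0
        for t in reversed(range(T)):
            future = (P_z_x * w).sum(axis=1)            # E[w_{t+1}(x, z') | x]
            for z in (0, 1):
                p_x_z = p_xz[:, z] / p_xz[:, z].sum()   # conditional p^k_t(x | z)
                # minimize the expected Hamiltonian over the action a
                vals = [(p_x_z * (cost[:, a] + future)).sum() for a in (0, 1)]
                u[t, z] = int(np.argmin(vals))
            w = np.array([[cost[x, u[t, z]] + future[x] for z in (0, 1)]
                          for x in (0, 1)])
        Js.append(J(u))
    return Js
```

Running `fbsm_toy()` returns the objective value at each iteration; in this toy it drops from 5.0 (the blind initial policy) to 2.0 (the posterior-matching policy) after a single sweep and then stays constant, mirroring the monotone decrease observed in Figures 5a and 7a.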

Discussion
In this work, we first showed that the system of HJB-FP equations corresponds to Pontryagin's minimum principle on the probability density function space. Although the relationship between the system of HJB-FP equations and Pontryagin's minimum principle has been briefly mentioned in MFSC [29][30][31], its details have not yet been investigated. We addressed this problem by deriving the system of HJB-FP equations in a similar way to Pontryagin's minimum principle. We then proposed FBSM for ML-POSC. Although the convergence of FBSM is generally not guaranteed in deterministic control [32,34,35,38] and MFSC [39][40][41][42], we proved its convergence in ML-POSC by noting that the update of the current control function does not affect the future HJB equation in ML-POSC. Therefore, ML-POSC is a special class of problems in which FBSM is guaranteed to converge.
Our derivation of Pontryagin's minimum principle on the probability density function space is formal rather than rigorous. Therefore, more mathematically rigorous proofs should be pursued in future work. Nevertheless, because our results are consistent with the conventional results of COSC [15][16][17][18], ML-POSC [14], and MFSC [26][27][28][30][31], they should be reliable except in special cases. Furthermore, our results provide a unified perspective on FBSM in deterministic control [32,34,35,38] and the fixed-point iteration method in MFSC [39][40][41][42], which have been studied independently. This perspective clarifies the properties that distinguish ML-POSC from deterministic control and MFSC and that ensure the convergence of FBSM.
The regularized FBSM has recently been proposed in deterministic control; it is guaranteed to converge even for general deterministic control problems [44,45]. Our work gives an intuitive reason why the regularized FBSM is guaranteed to converge. In the regularized FBSM, the Hamiltonian is regularized, which makes the update of the control function smaller. When the regularization is sufficiently strong, the effect of the current control function on the future backward dynamics becomes negligible. Therefore, the regularized FBSM of deterministic control is guaranteed to converge for a reason similar to that of the FBSM of ML-POSC. However, the convergence of the regularized FBSM is much slower because the stronger regularization makes the update of the control function smaller. The FBSM of ML-POSC does not suffer from this problem because the future backward dynamics do not depend on the current control function even without regularization.
Our work also suggests a modification of the fixed-point iteration method to ensure its convergence in MFSC. Although the fixed-point iteration method is the most basic algorithm in MFSC, its convergence is not guaranteed [39][40][41][42]. Our work showed that the fixed-point iteration method is equivalent to FBSM on the probability density function space. Therefore, the idea of the regularized FBSM may also be applied to the fixed-point iteration method. More specifically, the fixed-point iteration method may be guaranteed to converge by regularizing the expected Hamiltonian.
In FBSM, we solve the HJB equation and the FP equation using the finite-difference method. However, because the finite-difference method is prone to the curse of dimensionality, it is difficult to solve high-dimensional ML-POSC. To resolve this problem, two directions can be considered. One direction is the policy iteration method [21,46,47]. The policy iteration method is almost the same as FBSM; only the update of the control function is different. While FBSM updates the system of HJB-FP equations and the control function simultaneously, the policy iteration method updates them separately. In the policy iteration method, the system of HJB-FP equations becomes linear, which can be solved by the sampling method [48][49][50]. Because the sampling method is more tractable than the finite-difference method, the policy iteration method may allow high-dimensional ML-POSC to be solved. Furthermore, the policy iteration method has recently been studied in MFSC [51][52][53], but its convergence is not guaranteed there except in special cases. In a similar way to FBSM, the convergence of the policy iteration method may be guaranteed in ML-POSC.
The other direction is machine learning. Neural network-based algorithms have recently been proposed in MFSC, which can solve high-dimensional problems efficiently [54,55]. By extending these algorithms, high-dimensional ML-POSC may be solved efficiently. Furthermore, unlike in MFSC, the coupling of the HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this nice property, more efficient algorithms may be devised for ML-POSC.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

POSC	Partially Observable Stochastic Control
ML-POSC	Memory-Limited Partially Observable Stochastic Control
MFSC	Mean-Field Stochastic Control
HJB	Hamilton–Jacobi–Bellman
FP	Fokker–Planck
FBSM	Forward-Backward Sweep Method
SDE	Stochastic Differential Equation
ODE	Ordinary Differential Equation
LQG	Linear-Quadratic-Gaussian

Appendix A.1. Problem Formulation
In this subsection, we formulate deterministic control [22][23][24][25]. The state of the system s_t ∈ R^{d_s} at time t ∈ [0, T] evolves according to the following ordinary differential equation (ODE): where the initial state is s_0 and the control is u_t = u(t) ∈ R^{d_u}. The objective function is given by the following cumulative cost function: where f is the cost function and g is the terminal cost function. Deterministic control is the problem of finding the optimal control function u* that minimizes the cumulative cost function J[u] as follows:
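The display equations referenced in this paragraph did not survive extraction. A standard formulation consistent with the surrounding text can be written as follows; the drift symbol b is our assumption for the lost dynamics, while f, g, and the equation numbers (A1)-(A3) follow the text:

```latex
\dot{s}_t = b(t, s_t, u_t), \qquad s_{t=0} = s_0, \tag{A1} \\
J[u] := \int_0^T f(t, s_t, u_t)\,\mathrm{d}t + g(s_T), \tag{A2} \\
u^* := \operatorname*{arg\,min}_{u} J[u]. \tag{A3}
```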

Appendix A.2. Preliminary
In this subsection, we show a useful result for deriving Pontryagin's minimum principle. Given arbitrary control functions u and u′, J[u] − J[u′] can be calculated as follows [16]: where H is the Hamiltonian, which is defined as follows: λ′_t is the solution of the following adjoint equation driven by u′: where λ′_T = ∂g(s′_T)/∂s. s_t and s′_t are the solutions of the state Equation (A1) driven by u and u′, respectively. In the following, we derive Equation (A4). J[u] − J[u′] can be calculated as follows: From the state Equation (A1), From the integration by parts and s_0 − s′_0 = 0, From the adjoint Equation (A6), Equation (A4) is obtained.
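The lost display equations of this passage can be reconstructed in their standard form; the drift b for the state ODE is an assumed symbol, primes denote quantities driven by u′, and the equation numbers follow the text:

```latex
H(t, s, u, \lambda) := f(t, s, u) + \lambda^{\top} b(t, s, u), \tag{A5} \\
-\dot{\lambda}'_t = \frac{\partial H(t, s'_t, u'_t, \lambda'_t)}{\partial s},
\qquad \lambda'_T = \frac{\partial g(s'_T)}{\partial s}, \tag{A6} \\
J[u] - J[u'] = \int_0^T \Big[ H(t, s_t, u_t, \lambda'_t) - H(t, s'_t, u'_t, \lambda'_t)
- \Big(\frac{\partial H(t, s'_t, u'_t, \lambda'_t)}{\partial s}\Big)^{\!\top} (s_t - s'_t) \Big]\,\mathrm{d}t \\
\qquad\qquad + \; g(s_T) - g(s'_T) - \lambda'^{\top}_T (s_T - s'_T). \tag{A4}
```

The identity (A4) follows exactly the steps listed in the text: substituting f = H − λ′ᵀ(ds/dt), integrating λ′ᵀ(ṡ_t − ṡ′_t) by parts with s_0 − s′_0 = 0, and eliminating the λ̇′ term with the adjoint Equation (A6).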

Appendix A.3. Necessary Condition
In this subsection, we show the necessary condition of the optimal control function of deterministic control. It corresponds to Pontryagin's minimum principle on the state space (Figure 2 (bottom left)). If u* is the optimal control function of deterministic control (A3), then the following equation is satisfied [16]: where λ*_t is the solution of the following adjoint equation driven by u*: where λ*_T = ∂g(s*_T)/∂s. s*_t is the solution of the following state equation driven by u*: In the following, we show that Equation (A10) is the necessary condition of the optimal control function of deterministic control. We define a perturbed control function: then the difference of the objective functions can be calculated as follows:
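In standard form, the necessary condition and its accompanying equations read as follows; the equation numbers follow the text, and the drift b is an assumed symbol:

```latex
u^*(t) = \operatorname*{arg\,min}_{u} H(t, s^*_t, u, \lambda^*_t) \qquad \forall t \in [0, T], \tag{A10} \\
-\dot{\lambda}^*_t = \frac{\partial H(t, s^*_t, u^*(t), \lambda^*_t)}{\partial s},
\qquad \lambda^*_T = \frac{\partial g(s^*_T)}{\partial s}, \tag{A11} \\
\dot{s}^*_t = b(t, s^*_t, u^*(t)), \qquad s^*_{t=0} = s_0. \tag{A12}
```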

Appendix A.4. Sufficient Condition
Pontryagin's minimum principle (A10) is a necessary condition and generally not a sufficient condition. Pontryagin's minimum principle (A10) becomes a necessary and sufficient condition if the Hamiltonian H(t, s, u, λ) is convex with respect to s and u and the terminal cost function g(s) is convex with respect to s.
In the following, we show this result. For an arbitrary control function u, J[u] − J[u*] is given by the following equation: Since H(t, s, u, λ) is convex with respect to s and u and g(s) is convex with respect to s, the following inequalities are satisfied: Hence, the following inequality is satisfied: Because u* satisfies (A10), the following stationary condition is satisfied: Hence, the following inequality is satisfied: Therefore, Equation (A10) is a sufficient condition of the optimal control function of deterministic control if H(t, s, u, λ) is convex with respect to s and u and g(s) is convex with respect to s.
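The convexity argument sketched above can be written out in standard form; here H*_t := H(t, s*_t, u*_t, λ*_t) is our shorthand, and the inequalities are a reconstruction consistent with the difference formula (A4):

```latex
H(t, s_t, u_t, \lambda^*_t) \ge H^*_t
+ \Big(\frac{\partial H^*_t}{\partial s}\Big)^{\!\top}(s_t - s^*_t)
+ \Big(\frac{\partial H^*_t}{\partial u}\Big)^{\!\top}(u_t - u^*_t), \\
g(s_T) \ge g(s^*_T) + \Big(\frac{\partial g(s^*_T)}{\partial s}\Big)^{\!\top}(s_T - s^*_T), \\
\frac{\partial H^*_t}{\partial u} = 0
\quad \Longrightarrow \quad J[u] - J[u^*] \ge 0.
```

Substituting the two convexity inequalities into (A4) leaves only the term involving ∂H*_t/∂u, which vanishes by the stationary condition implied by (A10); hence J[u] ≥ J[u*] for every u.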
where w*(T, s) = g(s). More specifically, the optimal control function of deterministic control is given by u*(t) = u*(t, s*_t), where s*_t is the solution of the state Equation (A12). The HJB Equation (A24) can be converted into the adjoint Equation (A11) by defining λ*_t := ∂w*(t, s*_t)/∂s, where s*_t is the solution of the state Equation (A12). This approach can be interpreted as the conversion from Bellman's dynamic programming principle (Figure 2 (top left)) to Pontryagin's minimum principle (Figure 2 (bottom left)) on the state space.
In the following, we obtain this result. First, we define Λ*(t, s) := ∂w*(t, s)/∂s. By differentiating the HJB Equation (A24) with respect to s, the following equation is obtained: where Λ*(T, s) = ∂g(s)/∂s. The derivative of λ*_t = Λ*(t, s*_t) with respect to t can then be calculated as follows: By substituting Equation (A27) into Equation (A28), the following equation is obtained: From the state Equation (A12), (∗) = 0 is satisfied. Therefore, λ*_t satisfies the adjoint Equation (A11).
Appendix B.1. Problem Formulation
where f is the cost function, g is the terminal cost function, p(s_{0:T}; u) is the probability of the trajectory s_{0:T} := {s_τ | τ ∈ [0, T]} given u as a parameter, and E_p[·] is the expectation with respect to the probability p. MFSC is the problem of finding the optimal control function u* that minimizes the expected cumulative cost function J[u] as follows:
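The elided problem statement can be written in a standard mean-field form; the drift b and diffusion σ are assumed symbols, the mean-field coupling enters through the marginal density p_t, and the equation number (A32) follows the text:

```latex
\mathrm{d}s_t = b(t, s_t, p_t, u_t)\,\mathrm{d}t + \sigma(t, s_t, p_t, u_t)\,\mathrm{d}\omega_t,
\qquad s_0 \sim p_0, \\
J[u] := \mathbb{E}_{p(s_{0:T};\, u)}\!\left[ \int_0^T f(t, s_t, p_t, u_t)\,\mathrm{d}t + g(s_T, p_T) \right], \\
u^* := \operatorname*{arg\,min}_{u} J[u]. \tag{A32}
```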

Appendix B.2. Preliminary
In this subsection, we show a useful result for deriving Pontryagin's minimum principle. Given arbitrary control functions u and u′, J[u] − J[u′] can be calculated as follows: where H̄ and ḡ are the expected Hamiltonian and the expected terminal cost function, respectively, which are defined as follows: H is the Hamiltonian, which is defined as follows:

H(t, s, p, u, w) := f(t, s, p, u) + L_u w(t, s). (A36)
L_u is the backward diffusion operator, which is defined as follows: where D(t, s, p, u) := σ(t, s, p, u)σᵀ(t, s, p, u). w′ is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation driven by u′: where w′(T, s) = (δḡ(p′)/δp)(s). p is the solution of the following Fokker-Planck (FP) equation driven by u: where p(0, s) = p_0(s). p′ is the solution of the FP Equation (A39) driven by u′. L†_u is the forward diffusion operator, which is defined as follows: L†_u is the conjugate of L_u in the following sense:

∫ w(t, s) L†_u p(t, s) ds = ∫ p(t, s) L_u w(t, s) ds.
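The elided definitions of the two operators take the standard generator/adjoint form, consistent with D := σσᵀ above:

```latex
\mathcal{L}_u w(t, s) := \sum_{i} b_i(t, s, p, u)\, \frac{\partial w(t, s)}{\partial s_i}
+ \frac{1}{2} \sum_{i, j} D_{ij}(t, s, p, u)\, \frac{\partial^2 w(t, s)}{\partial s_i \partial s_j}, \\
\mathcal{L}^{\dagger}_u p(t, s) := -\sum_{i} \frac{\partial}{\partial s_i}\big[ b_i(t, s, p, u)\, p(t, s) \big]
+ \frac{1}{2} \sum_{i, j} \frac{\partial^2}{\partial s_i \partial s_j}\big[ D_{ij}(t, s, p, u)\, p(t, s) \big].
```

With these definitions, the conjugacy identity ∫ w L†_u p ds = ∫ p L_u w ds follows by integration by parts under vanishing boundary terms.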
In the following, we derive Equation (A33). J[u] − J[u′] can be calculated as follows: Because L_{u_t} and L_{u′_t} are the conjugates of L†_{u_t} and L†_{u′_t}, respectively, From the FP Equation (A39), From the integration by parts and p(0, s) − p′(0, s) = p_0(s) − p_0(s) = 0, From the HJB Equation (A38), Equation (A33) is obtained.

Appendix B.3. Necessary Condition
In this subsection, we show the necessary condition of the optimal control function of MFSC. It corresponds to Pontryagin's minimum principle on the probability density function space (Figure 2 (bottom right)). If u * is the optimal control function of MFSC (A32), then the following equation is satisfied: where w * is the solution of the following HJB equation driven by u * : where w * (T, s) = (δḡ(p * )/δp)(s). p * is the solution of the following FP equation driven by u * : where p * (0, s) = p 0 (s).
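In standard form, the minimum condition (A46) states that the optimal control minimizes the expected Hamiltonian under the optimal density; the following is a reconstruction consistent with the perturbation argument in the text:

```latex
u^*(t, \cdot) = \operatorname*{arg\,min}_{u(t, \cdot)}
\int H\big(t, s, p^*_t, u(t, s), w^*(t, \cdot)\big)\, p^*(t, s)\,\mathrm{d}s
\qquad \forall t \in [0, T]. \tag{A46}
```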
In the following, we show that Equation (A46) is the necessary condition of the optimal control function of MFSC. We define a perturbed control function u^ε, where ε ≪ 1. Then, the difference of the objective functions can be calculated as follows:

∫∫ (H(t, s, p^ε, u, w*) − H(t, s, p^ε, u*, w*)) p^ε(t, s) ds dt. (A51)

Because u* is the optimal control function, the following inequality is satisfied: Therefore, Equation (A46) is the necessary condition of the optimal control function of MFSC.

Appendix B.4. Sufficient Condition
Pontryagin's minimum principle (A46) is a necessary condition and generally not a sufficient condition. Pontryagin's minimum principle (A46) becomes a necessary and sufficient condition if the expected Hamiltonian H̄(t, p, u, w) is convex with respect to p and u and the expected terminal cost function ḡ(p) is convex with respect to p.
Hence, the following inequality is satisfied: Therefore, Equation (A46) is a sufficient condition of the optimal control function of MFSC if the expected Hamiltonian H̄(t, p, u, w) is convex with respect to p and u and the expected terminal cost function ḡ(p) is convex with respect to p.
Because the Bellman Equation (A60) is a functional differential equation, it cannot be solved even numerically. To resolve this problem, the previous works [30,31] converted the Bellman Equation (A60) into the HJB Equation (A47) by defining w*(t, s) := (δV*(t, p*)/δp)(s), where p* is the solution of the FP Equation (A48). This approach can be interpreted as the conversion from Bellman's dynamic programming principle (Figure 2 (top right)) to Pontryagin's minimum principle (Figure 2 (bottom right)) on the probability density function space.

Appendix C. Derivation of Main Results
Appendix C.1. Derivation of Result in Section 3.1
In this subsection, we derive Equation (13).

In this subsection, we show that FBSM for ML-POSC converges to Pontryagin's minimum principle (20). More specifically, we prove that if J[u^{k+1}_{0:T−dt}] = J[u^k_{0:T−dt}] holds, then u^{k+1}_{0:T−dt} satisfies Pontryagin's minimum principle (20). We mainly consider the forward step; a similar argument applies to the backward step.