Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach

Control problems with incomplete information and memory limitation appear in many practical situations. Although partially observable stochastic control (POSC) is a conventional theoretical framework that considers the optimal control problem with incomplete information, it cannot consider memory limitation. Furthermore, POSC cannot be solved in practice except in special cases. In order to address these issues, we propose an alternative theoretical framework, memory-limited POSC (ML-POSC). ML-POSC directly considers memory limitation as well as incomplete information, and it can be solved in practice by employing the technique of mean-field control theory. ML-POSC can generalize the linear-quadratic-Gaussian (LQG) problem to include memory limitation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation is modified to the partially observable Riccati equation, which improves estimation as well as control. Furthermore, we demonstrate the effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation.


Introduction
Control problems of systems with incomplete information and memory limitation appear in many practical situations. These constraints become especially prominent when designing the control of small devices [1,2], and are important for understanding the control mechanisms of biological systems [3][4][5][6][7][8], because their sensors are extremely noisy and their controllers can only have severely limited memories.
Partially observable stochastic control (POSC) is a conventional theoretical framework that considers the optimal control problem with one of these constraints, namely, the incomplete information of the system state ( Figure 1b) [9]. Because the POSC controller cannot completely observe the state of the system, it determines the control based on the noisy observation history of the state. POSC can be solved in principle [10][11][12] by converting it to a completely observable stochastic control (COSC) of the posterior probability of the state, as the posterior probability represents the sufficient statistics of the observation history. The posterior probability and the optimal control are obtained by solving the Zakai equation and the Bellman equation, respectively.
However, POSC has three practical problems with respect to the implementation of the controller, which originate from ignoring the other constraint, namely, the memory limitation of the controller [1,2]. First, a controller designed by POSC should ideally have an infinite-dimensional memory to store and compute the posterior probability from the observation history. Second, the memory of the controller cannot have intrinsic stochasticity, because the posterior probability must be computed accurately from the observation. Third, POSC does not consider the cost of updating the memory. Furthermore, because the Bellman equation of POSC is a functional differential equation, it is generally intractable, even numerically. A conventional workaround is the local LQG approximation; because it reduces the Bellman equation to the Riccati equation (an ordinary differential equation), the local LQG approximation can be solved numerically. However, the performance of the local LQG approximation may be poor in a highly non-LQG problem, as the local LQG approximation ignores non-LQG information. In contrast, ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information. We demonstrate that ML-POSC can provide a better result than the local LQG approximation in a non-LQG problem.
This paper is organized as follows: In Section 2, we briefly review the conventional POSC. In Section 3, we formulate ML-POSC. In Section 4, we propose the mean-field control approach to ML-POSC. In Section 5, we investigate the LQG problem of the conventional POSC based on ML-POSC. In Section 6, we generalize the LQG problem to include memory limitation. In Section 7, we show numerical experiments involving an LQG problem with memory limitation and a non-LQG problem. Finally, in Section 8, we discuss our work.

Figure 1. Schematic diagram of (a) completely observable stochastic control (COSC), (b) partially observable stochastic control (POSC), and (c) memory-limited partially observable stochastic control (ML-POSC). The top and bottom figures represent the system and controller, respectively; x t ∈ R d x is the state of the system; y t ∈ R d y , z t ∈ R d z , and u t ∈ R d u are the observation, memory, and control of the controller, respectively. (a) In COSC, the controller can completely observe the state x t , and determines the control u t based on the state x t , i.e., u t = u(t, x t ). Only a finite-dimensional memory is required to store the state x t , and the optimal control u * t is obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation, which is a partial differential equation. (b) In POSC, the controller cannot completely observe the state x t ; instead, it obtains the noisy observation y t of the state x t . The control u t is determined based on the observation history y 0:t := {y τ |τ ∈ [0, t]}, i.e., u t = u(t, y 0:t ). An infinite-dimensional memory is implicitly assumed to store the observation history y 0:t . Furthermore, to obtain the optimal control u * t , the Bellman equation (a functional differential equation) needs to be solved, which is generally intractable, even numerically. (c) In ML-POSC, the controller only has access to the noisy observation y t of the state x t , as in POSC.
In addition, it has only a finite-dimensional memory z t , which cannot completely memorize the observation history y 0:t . The controller of ML-POSC compresses the observation history y 0:t into the finite-dimensional memory z t , then determines the control u t based on the memory z t , i.e., u t = u(t, z t ). The optimal control u * t is obtained by solving the HJB equation (a partial differential equation), as in COSC.

Review of Partially Observable Stochastic Control
In this section, we briefly review the conventional POSC [11,15].

Problem Formulation
In this subsection, we formulate the conventional POSC [11,15]. The state x t ∈ R d x and the observation y t ∈ R d y at time t ∈ [0, T] evolve by the following stochastic differential equations (SDEs): where x 0 and y 0 obey p 0 (x 0 ) and p 0 (y 0 ), respectively, ω t ∈ R d ω and ν t ∈ R d ν are independent standard Wiener processes, and u t ∈ R d u is the control. Here, γ(t)γ ⊤ (t) is assumed to be invertible. In POSC, because the controller cannot completely observe the state x t , the control u t is determined based on the observation history y 0:t := {y τ |τ ∈ [0, t]}, as follows: The objective function of POSC is provided by the following expected cumulative cost function: where f is the cost function, g is the terminal cost function, p(x 0:T , y 0:T ; u) is the probability of x 0:T and y 0:T given u as a parameter, and E p [·] is the expectation with respect to probability p. Throughout this paper, the time horizon T is assumed to be finite. POSC is the problem of finding the optimal control function u * that minimizes the objective function J[u] as follows:
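The equations referenced in this subsection are not reproduced in this excerpt. A standard form consistent with the surrounding definitions is sketched below; the drift b, diffusion σ, and observation function h are assumed names, while the remaining symbols are defined in the text.

```latex
% Hedged reconstruction of the POSC dynamics and objective.
dx_t = b(t, x_t, u_t)\,dt + \sigma(t, x_t, u_t)\,d\omega_t, \quad (1)
dy_t = h(t, x_t)\,dt + \gamma(t)\,d\nu_t, \quad (2)
u_t = u(t, y_{0:t}), \quad (3)
J[u] := \mathbb{E}_{p(x_{0:T}, y_{0:T}; u)}
  \left[ \int_0^T f(t, x_t, u_t)\,dt + g(x_T) \right]. \quad (4)
```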

Derivation of Optimal Control Function
In this subsection, we briefly review the derivation of the optimal control function of the conventional POSC [11,15]. We first define the unnormalized posterior probability density function q t (x) := p(x t = x, y 0:t ). We omit y 0:t for notational simplicity. Here, q t (x) obeys the following Zakai equation: where q 0 (x) = p 0 (x)p 0 (y) and L † is the forward diffusion operator, which is defined by where D(t, x, u) := σ(t, x, u)σ ⊤ (t, x, u). Then, the objective function (4) can be calculated as follows: From (6) and (8), POSC is converted into a COSC of q t . As a result, POSC can be approached in a similar way to COSC, and the optimal control function is provided by the following proposition.
Proposition 1 ([11,15]). The optimal control function of POSC is provided by where H is the Hamiltonian, which is defined by L is the backward diffusion operator, which is defined by We note that L is the conjugate of L † ; furthermore, V(t, q) is the value function, which is the solution of the following Bellman equation: Proof. The proof is shown in [11,15].
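The Zakai equation and the diffusion operators referenced above are likewise missing from this excerpt. Under the usual conventions, and consistent with the stated definitions D := σσ⊤ and invertible γγ⊤, they take the following form (a hedged reconstruction, with b and h the assumed drift and observation functions):

```latex
% Hedged reconstruction of (6), (7), and (11).
dq_t(x) = \mathcal{L}^{\dagger} q_t(x)\,dt
  + q_t(x)\, h^{\top}(t, x)\left(\gamma(t)\gamma^{\top}(t)\right)^{-1} dy_t, \quad (6)
\mathcal{L}^{\dagger} p := -\sum_i \frac{\partial (b_i p)}{\partial x_i}
  + \frac{1}{2}\sum_{i,j} \frac{\partial^2 (D_{ij} p)}{\partial x_i \partial x_j}, \quad (7)
\mathcal{L} w := \sum_i b_i \frac{\partial w}{\partial x_i}
  + \frac{1}{2}\sum_{i,j} D_{ij} \frac{\partial^2 w}{\partial x_i \partial x_j}. \quad (11)
```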
The optimal control function u * (t, q) is obtained by solving the Bellman Equation (12). The controller determines the optimal control u * t = u * (t, q t ) based on the posterior probability q t . The posterior probability q t is obtained by solving the Zakai Equation (6). As a result, POSC can be solved in principle.
However, POSC has three practical problems with respect to the memory of the controller. First, the controller should have an infinite-dimensional memory to store and compute the posterior probability q t from the observation history y 0:t . Second, the memory of the controller cannot have intrinsic stochasticity other than the observation dy t in order to accurately compute the posterior probability q t via the Zakai Equation (6). Third, POSC does not consider the cost originating from the memory update, which can be regarded as a cost of estimation. Given the dual roles of estimation and control, considering only the control cost while ignoring the estimation cost is asymmetric. As a result, POSC is not practical for control problems where the memory size, noise, and cost are non-negligible.
Furthermore, POSC has another crucial problem in obtaining the optimal control function u * (t, q) by solving the Bellman Equation (12). Because the posterior probability q is infinite-dimensional, the associated Bellman Equation (12) becomes a functional differential equation. However, solving a functional differential equation is generally intractable, even numerically. As a result, POSC cannot be solved in practice.

Memory-Limited Partially Observable Stochastic Control
In order to address the above-mentioned problems, we propose an alternative theoretical framework to the conventional POSC called ML-POSC. In this section, we formulate ML-POSC.

Problem Formulation
In this subsection, we formulate ML-POSC. ML-POSC determines the control u t based on the finite-dimensional memory z t ∈ R d z as follows: The memory dimension d z is determined not by the optimization but by the prescribed memory limitation of the controller to be used. Comparing (3) and (13), the memory z t can be interpreted as the compression of the observation history y 0:t . While the conventional POSC compresses the observation history y 0:t into the infinite-dimensional posterior probability q t , ML-POSC compresses it into the finite-dimensional memory z t . ML-POSC formulates the memory dynamics with the following SDE: where z 0 obeys p 0 (z 0 ), ξ t ∈ R d ξ is the standard Wiener process, and v t = v(t, z t ) ∈ R d v is the control for the memory dynamics. This memory dynamics has three important properties: (i) because it depends on the observation dy t , the memory z t can be interpreted as the compression of the observation history y 0:t ; (ii) because it depends on the standard Wiener process dξ t , ML-POSC can consider the memory noise explicitly; (iii) because it depends on the control v t , it can be optimized through the control v t . The objective function of ML-POSC is provided by the following expected cumulative cost function: Because the cost function f depends on the memory control v t as well as the state control u t , ML-POSC can consider the memory control cost (state estimation cost) as well as the state control cost explicitly. ML-POSC optimizes the state control function u and the memory control function v based on the objective function J[u, v], as follows: ML-POSC first postulates the finite-dimensional and stochastic memory dynamics explicitly, then jointly optimizes the state and memory control function by considering the state and memory control cost. As a result, unlike the conventional POSC, ML-POSC can consider memory limitation as well as incomplete information.
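The memory SDE (14) and the objective (15) are not reproduced in this excerpt. Given the three properties listed above (dependence on dy t , on dξ t , and on v t ), a plausible reconstruction is the following, where κ and η are assumed names for the observation and noise gains, and the exact arguments of f and g may differ from those shown:

```latex
% Hedged reconstruction of the ML-POSC memory dynamics and objective.
dz_t = v(t, z_t)\,dt + \kappa(t, z_t)\,dy_t + \eta(t, z_t)\,d\xi_t, \quad (14)
J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}
  \left[ \int_0^T f(t, x_t, z_t, u_t, v_t)\,dt + g(x_T, z_T) \right]. \quad (15)
```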

Problem Reformulation
Although the formulation of ML-POSC in the previous subsection clarifies its relationship with that of the conventional POSC, it is inconvenient for further mathematical investigations. In order to resolve this problem, we reformulate ML-POSC in this subsection. The formulation in this subsection is simpler and more general than that in the previous subsection.
We first define the extended state s t as follows: where d s = d x + d z . The extended state s t evolves by the following SDE: where s 0 obeys p 0 (s 0 ), ω̃ t ∈ R d ω̃ is the standard Wiener process, and ũ t ∈ R d ũ is the control. ML-POSC determines the control ũ t ∈ R d ũ based solely on the memory z t , as follows: The extended state SDE (18) includes the previous state, observation, and memory SDEs (1), (2) and (14) as a special case; they can be represented as follows: where p 0 (s 0 ) = p 0 (x 0 )p 0 (z 0 ). The objective function of ML-POSC is provided by the following expected cumulative cost function: where f̃ is the cost function and g̃ is the terminal cost function. This objective function (21) is more general than the previous one (15). ML-POSC is the problem of finding the optimal control function ũ * that minimizes the objective function J[ũ] as follows: In the following sections, we mainly consider the formulation in this subsection rather than that of the previous subsection, as it is simpler and more general. Moreover, we omit the tilde ˜ for notational simplicity.
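The extended-state definitions (17)-(19) are missing from this excerpt as well; a reconstruction consistent with d s = d x + d z and with the control depending solely on the memory is:

```latex
% Hedged reconstruction; the block structure of the drift and diffusion
% recovering (1), (2), and (14) is given by (20) in the original.
s_t := \begin{pmatrix} x_t \\ z_t \end{pmatrix} \in \mathbb{R}^{d_s}, \quad (17)
ds_t = \tilde{b}(t, s_t, \tilde{u}_t)\,dt
  + \tilde{\sigma}(t, s_t, \tilde{u}_t)\,d\tilde{\omega}_t, \quad (18)
\tilde{u}_t = \tilde{u}(t, z_t). \quad (19)
```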

Mean-Field Control Approach
If the control u t is determined based on the extended state s t , i.e., u t = u(t, s t ), ML-POSC is the same as COSC of the extended state s t , and can be solved by the conventional COSC approach [10]. However, because ML-POSC determines the control u t based solely on the memory z t , i.e., u t = u(t, z t ), ML-POSC cannot be solved in a similar way as COSC. In order to solve ML-POSC, we propose the mean-field control approach in this section. Because the mean-field control approach is more general than the COSC approach, it can solve COSC and ML-POSC in a unified way.

Derivation of Optimal Control Function
In this subsection, we propose the mean-field control approach to ML-POSC. We first show that ML-POSC can be converted into a deterministic control of the probability density function, which is similar to the conventional POSC [11,15]. This approach is used in the mean-field control as well [13,14,24,25]. The extended state SDE (18) can be converted into the following Fokker-Planck (FP) equation: where the initial condition is provided by p 0 (s) and the forward diffusion operator L † is defined by (7). The objective function of ML-POSC (21) can be calculated as follows: From (23) and (24), ML-POSC is converted into a deterministic control of p t . As a result, ML-POSC can be approached in a similar way to deterministic control, and the optimal control function is provided by the following lemma.

Lemma 1.
The optimal control function of ML-POSC is provided by where H is the Hamiltonian (10), p t (x|z) = p t (s)/∫ p t (s)dx is the conditional probability density function of a state x given memory z, p t (s) is the solution of the FP Equation (23), and V(t, p) is the solution of the following Bellman equation: Proof. The proof is shown in Appendix A.
The controller of ML-POSC determines the optimal control u * t = u * (t, z t ) based on the memory z t , not the posterior probability q t . Therefore, ML-POSC can consider memory limitation as well as incomplete information.
However, because the Bellman Equation (26) is a functional differential equation, it cannot be solved, even numerically, which is the same problem as in the conventional POSC. We resolve this problem by employing the technique of the mean-field control theory [13,14] as follows. Theorem 1. The optimal control function of ML-POSC is provided by where H is the Hamiltonian (10), p t (x|z) = p t (s)/∫ p t (s)dx is the conditional probability density function of a state x given memory z, p t (s) is the solution of the FP Equation (23), and w(t, s) is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation: where w(T, s) = g(s).

Proof. The proof is shown in Appendix B.
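The key objects of Theorem 1 can be written out as follows. This is a hedged reconstruction from the surrounding description, with the Hamiltonian defined via the backward diffusion operator L as indicated by the text around (10) (which is stated in the POSC section with x in place of s):

```latex
% Hedged reconstruction of (10), (27), and (28).
H(t, s, u, w) := f(t, s, u) + \mathcal{L} w(t, s), \quad (10)
u^*(t, z) = \underset{u}{\arg\min}\;
  \mathbb{E}_{p_t(x|z)}\left[ H(t, s_t, u, w) \right], \quad (27)
-\frac{\partial w(t, s)}{\partial t} = H\left(t, s, u^*(t, z), w\right),
  \qquad w(T, s) = g(s). \quad (28)
```

Note that the minimizer depends on s only through the conditional expectation over p t (x|z), so u * is indeed a function of the memory z alone.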
While the Bellman Equation (26) is a functional differential equation, the HJB Equation (28) is a partial differential equation. As a result, unlike the conventional POSC, ML-POSC can be solved in practice.
We note that the mean-field control technique is applicable to the conventional POSC as well, and we obtain the HJB equation of the conventional POSC [15]. However, the HJB equation of the conventional POSC is not closed by a partial differential equation due to the last term of the Bellman Equation (12). As a result, the mean-field control technique is not effective with the conventional POSC except in a special case [15].
In the conventional POSC, the state estimation (memory control) and the state control are clearly separated. As a result, the state estimation and the state control are optimized by the Zakai Equation (6) and the Bellman Equation (12), respectively. In contrast, because ML-POSC considers memory limitation as well as incomplete information, the state estimation and the state control are not clearly separated. As a result, ML-POSC jointly optimizes the state estimation and the state control based on the FP Equation (23) and the HJB Equation (28).

Comparison with Completely Observable Stochastic Control
In this subsection, we show the similarities and differences between ML-POSC and COSC of the extended state. While ML-POSC determines the control u t based solely on the memory z t , i.e., u t = u(t, z t ), COSC of the extended state determines the control u t based on the extended state s t , i.e., u t = u(t, s t ). The optimal control function of COSC of the extended state is provided by the following proposition.

Proposition 2 ([10]). The optimal control function of COSC of the extended state is provided by where H is the Hamiltonian (10) and w(t, s) is the solution of the HJB Equation (28).
Proof. The conventional proof is shown in [10]. We note that it can be proven in a similar way as ML-POSC, which is shown in Appendix C.
Although the HJB Equation (28) is the same between ML-POSC and COSC, the optimal control function is different. While the optimal control function of COSC is provided by the minimization of the Hamiltonian (29), that of ML-POSC is provided by the minimization of the conditional expectation of the Hamiltonian (27). This is reasonable, as the controller of ML-POSC needs to estimate the state from the memory.

Numerical Algorithm
In this subsection, we briefly explain a numerical algorithm to obtain the optimal control function of ML-POSC (27). Because the optimal control function of COSC (29) depends only on the backward HJB Equation (28), it can be obtained by solving the HJB equation backwards from the terminal condition [10,26,27]. In contrast, because the optimal control function of ML-POSC (27) depends on the forward FP Equation (23) as well as the backward HJB Equation (28), it cannot be obtained in a similar way as COSC. Because the backward HJB equation depends on the forward FP equation through the optimal control function of ML-POSC, the HJB equation cannot be solved backwards from the terminal condition. As a result, ML-POSC needs to solve the system of HJB-FP equations.
The system of HJB-FP equations appears in the mean-field game and control [28][29][30], and many numerical algorithms have been developed [31][32][33]. Therefore, unlike the conventional POSC, ML-POSC can be solved in practice using these algorithms. Furthermore, unlike the mean-field game and control, the coupling of HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this property, more efficient algorithms may be proposed for ML-POSC [34].
In this paper, we use the forward-backward sweep method (the fixed-point iteration method) to obtain the optimal control function of ML-POSC [33][34][35][36][37], which is one of the most basic algorithms for the system of HJB-FP equations. The forward-backward sweep method computes the forward FP Equation (23) and the backward HJB Equation (28) alternately. In the mean-field game and control, the convergence of the forward-backward sweep method is not guaranteed. In contrast, it is guaranteed in ML-POSC because the coupling of HJB-FP equations is limited to the optimal control function [34].
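To make the alternating structure concrete, the following is a minimal sketch of a forward-backward sweep on a time grid. The right-hand sides are toy placeholders (not the FP Equation (23) or the HJB Equation (28)); only the fixed-point iteration pattern mirrors the method described above.

```python
import numpy as np

def sweep(forward_rhs, backward_rhs, p0, wT, ts, n_iter=50, tol=1e-10):
    """Alternately integrate a forward (FP-like) equation given the current
    backward solution, and a backward (HJB-like) equation given the current
    forward solution, with explicit Euler steps, until the iterates settle."""
    dt = ts[1] - ts[0]
    n = len(ts)
    w = np.full(n, wT, dtype=float)  # initial guess for the backward solution
    for _ in range(n_iter):
        # forward pass: p_{k+1} = p_k + dt * F(p_k, w_k)
        p = np.empty(n)
        p[0] = p0
        for k in range(n - 1):
            p[k + 1] = p[k] + dt * forward_rhs(p[k], w[k])
        # backward pass: w_{k-1} = w_k + dt * G(p_k, w_k), from the terminal condition
        w_new = np.empty(n)
        w_new[-1] = wT
        for k in range(n - 1, 0, -1):
            w_new[k - 1] = w_new[k] + dt * backward_rhs(p[k], w_new[k])
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    return p, w

# Toy coupled pair (placeholders for the actual FP/HJB right-hand sides).
ts = np.linspace(0.0, 1.0, 101)
p, w = sweep(lambda p_, w_: -w_ * p_,
             lambda p_, w_: -p_ * w_,
             p0=1.0, wT=1.0, ts=ts)
```

In ML-POSC, the convergence of this iteration is guaranteed because the coupling between the two equations is limited to the optimal control function [34]; the toy example above converges for the same structural reason, since each pass perturbs the other only weakly.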

Linear-Quadratic-Gaussian Problem without Memory Limitation
In the LQG problem of the conventional POSC, the Zakai Equation (6) and the Bellman Equation (12) are reduced to the Kalman filter and the Riccati equation, respectively [9,23]. Because the infinite-dimensional Zakai equation is reduced to the finite-dimensional Kalman filter, the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. In this section, we briefly review the LQG problem of the conventional POSC, then reproduce the Kalman filter and the Riccati equation from the viewpoint of ML-POSC. The LQG problem of the conventional POSC corresponds to the LQG problem without memory limitation, as it does not consider the memory noise and cost.

Review of Partially Observable Stochastic Control
In this subsection, we briefly review the LQG problem of the conventional POSC [9,23].
The state x t ∈ R d x and the observation y t ∈ R d y at time t ∈ [0, T] evolve by the following SDEs: where x 0 obeys the Gaussian distribution p 0 (x 0 ) = N (x 0 |µ x,0 , Σ xx,0 ), y 0 is an arbitrary real vector, ω t ∈ R d ω and ν t ∈ R d ν are independent standard Wiener processes, and u t = u(t, y 0:t ) ∈ R d u is the control. Here, γ(t)γ ⊤ (t) is assumed to be invertible. The objective function is provided by the following expected cumulative cost function: where Q(t) ⪰ O, R(t) ≻ O, and P ⪰ O. The LQG problem of the conventional POSC is to find the optimal control function u * that minimizes the objective function J[u], as follows: In the LQG problem of the conventional POSC, the posterior probability is provided by the Gaussian distribution p(x t |y 0:t ) = N (x t |μ̂(t), Σ̂(t)), and u t = u(t, y 0:t ) is reduced to u t = u(t, μ̂ t ) without loss of performance.
In the LQG problem of the conventional POSC, the Zakai Equation (6) and the Bellman Equation (12) are reduced to the Kalman filter (35) and (36) and the Riccati Equation (37), respectively.
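As an illustration of this reduction, the following sketch integrates the two finite-dimensional ODEs of a scalar LQG problem: the forward covariance equation of the Kalman-Bucy filter and the backward control Riccati equation. The dynamics and coefficient names (a, b, c, sigma, gamma, q, r) are illustrative stand-ins, not the paper's notation.

```python
import numpy as np

# Scalar LQG sketch: dx = (a x + b u) dt + sigma dW, dy = c x dt + gamma dV,
# cost = integral of (q x^2 + r u^2) dt + p_T x_T^2 (assumed coefficients).
a, b_, c, sigma, gamma = -1.0, 1.0, 1.0, 0.5, 0.2
q, r, p_T = 1.0, 1.0, 1.0
T, n = 1.0, 1000
dt = T / n

# Forward: filter covariance of the Kalman-Bucy filter,
# dS/dt = 2 a S + sigma^2 - (c S / gamma)^2, integrated by explicit Euler.
S = np.empty(n + 1)
S[0] = 1.0
for k in range(n):
    S[k + 1] = S[k] + dt * (2 * a * S[k] + sigma**2 - (c * S[k] / gamma)**2)

# Backward: control Riccati equation, -dPsi/dt = q + 2 a Psi - (b Psi)^2 / r,
# integrated from the terminal condition Psi(T) = p_T.
Psi = np.empty(n + 1)
Psi[-1] = p_T
for k in range(n, 0, -1):
    Psi[k - 1] = Psi[k] + dt * (q + 2 * a * Psi[k] - (b_ * Psi[k])**2 / r)

# Certainty-equivalent feedback on the posterior mean: u*(t) = -(b/r) Psi(t) mu_hat(t)
gain = -(b_ / r) * Psi
```

The separation principle is visible in the code: the filter covariance S is integrated forward without reference to Psi, and the control gain is integrated backward without reference to S; the two only meet in the feedback law applied to the posterior mean.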

Memory-Limited Partially Observable Stochastic Control
Because the infinite-dimensional Zakai Equation (6) is reduced to the finite-dimensional Kalman filter (35) and (36), the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. In this subsection, we reproduce the Kalman filter (35) and (36) and the Riccati Equation (37) from the viewpoint of ML-POSC.
ML-POSC defines the finite-dimensional memory z t ∈ R d z . In the LQG problem of the conventional POSC, the memory dimension d z is the same as the state dimension d x . The controller of ML-POSC determines the control u t based on the memory z t , i.e., u t = u(t, z t ). The memory z t is assumed to evolve by the following SDE: where z 0 = µ x,0 , while v t = v(t, z t ) ∈ R d z and κ t = κ(t, z t ) ∈ R d z ×d y are the memory controls. We note that the LQG problem of the conventional POSC does not consider the memory noise. The objective function of ML-POSC is provided by the following expected cumulative cost function: We note that the LQG problem of the conventional POSC does not consider the memory control cost. ML-POSC optimizes u, v, and κ based on J[u, v, κ], as follows: In the LQG problem of the conventional POSC, the probability of the extended state s t (17) is provided by the Gaussian distribution p t (s t ) = N (s t |µ(t), Σ(t)). The posterior probability of the state x t given the memory z t is provided by the Gaussian distribution p t (x t |z t ) = N (x t |µ x|z (t, z t ), Σ x|z (t)), where µ x|z (t, z t ) and Σ x|z (t) are provided as follows: Theorem 2. In the LQG problem without memory limitation, the optimal control functions of ML-POSC (40) are provided by From v * (t, z) and κ * (t, z), z t and Σ x|z (t) obey the following equations: where z 0 = µ x,0 and Σ x|z (0) = Σ xx,0 . Furthermore, µ x|z (t, z t ) = z t holds in this problem. Ψ(t) is the solution of the Riccati Equation (37).
Proof. The proof is shown in Appendix D.

Linear-Quadratic-Gaussian Problem with Memory Limitation
The LQG problem of the conventional POSC does not consider memory limitation because it does not consider the memory noise and cost. Furthermore, because the memory dimension is restricted to the state dimension, it cannot be chosen to match the memory limitation of a given controller. ML-POSC can generalize the LQG problem to include memory limitation. In this section, we discuss the LQG problem with memory limitation based on ML-POSC.

Problem Formulation
In this subsection, we formulate the LQG problem with memory limitation. The state and observation SDEs are the same as in the previous section, which are provided by (30) and (31), respectively. The controller of ML-POSC determines the control u t ∈ R d u based on the memory z t ∈ R d z , i.e., u t = u(t, z t ). Unlike the LQG problem of the conventional POSC, the memory dimension d z is not necessarily the same as the state dimension d x .
The memory z t is assumed to evolve according to the following SDE: where z 0 obeys the Gaussian distribution p 0 (z 0 ) = N (z 0 |µ z,0 , Σ zz,0 ), ξ t ∈ R d ξ is the standard Wiener process, and v t = v(t, z t ) ∈ R d v is the control. Because the initial condition z 0 is stochastic and the memory SDE (48) includes the intrinsic stochasticity dξ t , the LQG problem of ML-POSC can consider the memory noise explicitly. We note that κ(t) is independent of the memory z t . If κ(t) depended on the memory z t , the memory SDE (48) would become non-linear and non-Gaussian, and the optimal control functions could not be derived explicitly. In order to keep the memory SDE (48) linear and Gaussian so that the optimal control functions can be obtained explicitly, we restrict κ(t) to be independent of the memory z t in the LQG problem with memory limitation. The LQG problem without memory limitation is the special case in which the optimal control κ * t = κ * (t, z t ) in (45) does not depend on the memory z t .
The objective function is provided by the following expected cumulative cost function: J[u, v] := E p(x 0:T ,y 0:T ,z 0:T ;u,v) For the sake of simplicity, we do not optimize κ(t), although this can be accomplished by considering unobservable stochastic control.

Problem Reformulation
Although the formulation of the LQG problem with memory limitation in the previous subsection clarifies its relationship with that of the LQG problem without memory limitation, it is inconvenient for further mathematical investigations. In order to resolve this problem, we reformulate the LQG problem with memory limitation based on the extended state s t (17). The formulation in this subsection is simpler and more general than that in the previous subsection.
In the LQG problem with memory limitation, the extended state SDE (18) is provided as follows: where s 0 obeys the Gaussian distribution p 0 (s 0 ) := N (s 0 |µ 0 , Σ 0 ), ω̃ t ∈ R d ω̃ is the standard Wiener process, and ũ t = ũ(t, z t ) ∈ R d ũ is the control. The extended state SDE (51) includes the previous state, observation, and memory SDEs (30), (31) and (48) as a special case because they can be represented as follows: where p 0 (s 0 ) = p 0 (x 0 )p 0 (z 0 ). The objective function (21) is provided by the following expected cumulative cost function: This objective function (53) includes the previous objective function (49) as a special case because it can be represented as follows: The objective of the LQG problem with memory limitation is to find the optimal control function ũ * that minimizes the objective function J[ũ], as follows: In the following subsections, we mainly consider the formulation of this subsection rather than that of the previous subsection because it is simpler and more general. Moreover, we omit the tilde ˜ for notational simplicity.

Derivation of Optimal Control Function
In this subsection, we derive the optimal control function of the LQG problem with memory limitation by applying Theorem 1. In the LQG problem with memory limitation, the probability of the extended state s at time t is provided by the Gaussian distribution p t (s) = N (s|µ(t), Σ(t)). By defining the stochastic extended state ŝ := s − µ, E p t (x|z) [ŝ] is provided as follows: where K(t) is defined by By applying Theorem 1 to the LQG problem with memory limitation, we obtain the following theorem: Theorem 3. In the LQG problem with memory limitation, the optimal control function of ML-POSC is provided by where K(t) (57) depends on Σ(t), and µ(t) and Σ(t) are the solutions of the following ordinary differential equations: where µ(0) = µ 0 and Σ(0) = Σ 0 , while Ψ(t) and Π(t) are the solutions of the following ordinary differential equations: where Ψ(T) = Π(T) = P.

Proof. The proof is shown in Appendix E.
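The optimal control function (58) and the gain K(t) (57) are not reproduced in this excerpt. A hedged reconstruction, consistent with the comparison with COSC in the following subsection (gain Π acting on the estimated stochastic part, gain Ψ on the mean) and with the Gaussian conditional expectation E[x̂|ẑ] = Σ xz Σ zz −1 ẑ, is:

```latex
% Hedged reconstruction; signs follow the standard LQG convention
% u = -R^{-1} B^T (gain) s.
u^*(t, z) = -R^{-1}(t) B^{\top}(t)
  \left( \Pi(t) K(t) \hat{s} + \Psi(t) \mu(t) \right),
  \qquad \hat{s} := s - \mu(t), \quad (58)
K(t)\hat{s} = \mathbb{E}_{p_t(x|z)}[\hat{s}]
  = \begin{pmatrix} \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t)\hat{z} \\ \hat{z} \end{pmatrix}. \quad (57)
```

Because K(t)ŝ depends on s only through ẑ, the right-hand side of (58) is a function of the memory z alone, as required.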
Here, (61) is the Riccati equation [9,10,23], which appears in the LQG problem without memory limitation as well (37). In contrast, (62) is a new equation of the LQG problem with memory limitation, which in this paper we call the partially observable Riccati equation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati Equation (61) for control is modified to include estimation, which corresponds to the partially observable Riccati Equation (62). As a result, the partially observable Riccati Equation (62) is able to improve estimation as well as control.
In order to support this interpretation, we analyze the partially observable Riccati Equation (62) by comparing it with the Riccati Equation (61). Because only the last term of (62) is different from (61), we denote it by Q̃, which can be calculated as follows: where P xx := (ΠBR −1 B ⊤ Π) xx . Because P xx ⪰ O and Σ −1 zz Σ zx P xx Σ xz Σ −1 zz ⪰ O, Π xx and Π zz may be larger than Ψ xx and Ψ zz , respectively. Because Π xx and Π zz are the negative feedback gains of the state x and the memory z, respectively, Q̃ may decrease Σ xx and Σ zz . Moreover, when Σ xz is positive/negative, Π xz may be smaller/larger than Ψ xz , which may increase/decrease Σ xz . A similar discussion is possible for Σ zx , Π zx , and Ψ zx , as Σ, Π, and Ψ are symmetric matrices. As a result, Q̃ may decrease the following conditional covariance matrix: which corresponds to the estimation error of the state from the memory. Therefore, the partially observable Riccati Equation (62) may improve estimation as well as control, which is different from the Riccati Equation (61).

Because the problem in Section 6.1 is more specialized than that in Section 6.2, we can carry out a more specific discussion. In the problem in Section 6.1, Ψ xx is the same as the solution of the Riccati equation of the conventional POSC (37), and Ψ xz = O, Ψ zx = O, and Ψ zz = O are satisfied. As a result, the memory control does not appear in the Riccati equation of ML-POSC (61). In contrast, because of the last term of the partially observable Riccati Equation (62), Π xx is not the solution of the Riccati Equation (37), and Π xz ≠ O, Π zx ≠ O, and Π zz ≠ O hold in general. As a result, the memory control appears in the partially observable Riccati Equation (62), which may improve the state estimation.

Comparison with Completely Observable Stochastic Control
In this subsection, we compare ML-POSC with COSC of the extended state. By applying Proposition 2 in the LQG problem, the optimal control function of COSC of the extended state can be obtained as follows: Proposition 4 ( [10,23]). In the LQG problem, the optimal control function of COSC of the extended state is provided by where Ψ(t) is the solution of the Riccati Equation (61).
The optimal control function of COSC of the extended state (66) can be derived intuitively from that of ML-POSC (58). In ML-POSC, Kŝ = E p t (x|z) [ŝ] is the estimator of the stochastic extended state. In COSC of the extended state, because the stochastic extended state is completely observable, its estimator is provided by ŝ, which corresponds to K = I. By changing the definition of K from (57) to K = I, the partially observable Riccati Equation (62) is reduced to the Riccati Equation (61), and the optimal control function of ML-POSC (58) is reduced to that of COSC (66). As a result, the optimal control function of ML-POSC (58) can be interpreted as a generalization of that of COSC (66).
While the second term is the same between (58) and (66), the first term is different. The second term is the control of the expected extended state µ, which does not depend on the realization. In contrast, the first term is the control of the stochastic extended state ŝ, which depends on the realization. The first term differs in two respects: (i) the estimators of the stochastic extended state in COSC and ML-POSC are ŝ and Kŝ = E p t (x|z) [ŝ], respectively, which is reasonable because ML-POSC needs to estimate the state from the memory; and (ii) the control gains of the stochastic extended state in COSC and ML-POSC are Ψ and Π, respectively. While Ψ improves only control, Π improves estimation as well as control.

Numerical Algorithm
In the LQG problem, the partial differential equations reduce to ordinary differential equations: the FP Equation (23) reduces to (59) and (60), and the HJB Equation (28) reduces to (61) and (62). As a result, the optimal control function (58) can be obtained more easily in the LQG problem.
The Riccati Equation (61) can be solved backwards from the terminal condition. In contrast, the partially observable Riccati Equation (62) cannot be solved in the same way as the Riccati Equation (61), as it depends on the forward equation of Σ (60) through K (57). Because the forward equation of Σ (60) depends on the backward equation of Π (62) as well, they must be solved simultaneously.
A similar problem appears in mean-field games and control, and numerous numerical methods have been developed to address it [33]. In this paper, we solve the coupled system of (60) and (62) using the forward-backward sweep method, which computes (60) and (62) alternately [33,34]. In ML-POSC, the convergence of the forward-backward sweep method is guaranteed [34].
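The alternating structure of the forward-backward sweep method can be sketched as follows. The scalar right-hand sides below are illustrative stand-ins of ours, not the actual Equations (60) and (62); only the forward/backward alternation and the convergence check carry over to the real system.

```python
import numpy as np

# Forward-backward sweep for a coupled forward/backward ODE pair. The forward
# variable sigma (cf. the covariance ODE) depends on the backward variable pi
# (cf. the partially observable Riccati equation), and vice versa.
T, N = 1.0, 200
dt = T / N

def forward_rhs(sigma, pi):
    # stand-in for the forward covariance dynamics dSigma/dt = F(Sigma, Pi)
    return -2.0 * pi * sigma + 1.0

def backward_rhs(pi, sigma):
    # stand-in for the backward Riccati-type dynamics; in the actual (62),
    # the coupling to sigma enters through the gain K
    return -2.0 * pi + pi ** 2 - 1.0 / (1.0 + sigma)

sigma = np.ones(N + 1)  # forward trajectory; sigma[0] is the initial condition
pi = np.zeros(N + 1)    # backward trajectory; pi[N] is the terminal condition

for sweep in range(100):
    pi_old = pi.copy()
    for k in range(N):                  # forward sweep with the current pi
        sigma[k + 1] = sigma[k] + dt * forward_rhs(sigma[k], pi[k])
    for k in range(N, 0, -1):           # backward sweep with the updated sigma
        pi[k - 1] = pi[k] + dt * backward_rhs(pi[k], sigma[k])
    if np.max(np.abs(pi - pi_old)) < 1e-8:
        break
```

Each sweep leaves the initial condition of the forward variable and the terminal condition of the backward variable untouched, and the iteration stops when the backward trajectory no longer changes between sweeps.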

Numerical Experiments
In this section, we demonstrate the effectiveness of ML-POSC through numerical experiments on the LQG problem with memory limitation and on a non-LQG problem.

LQG Problem with Memory Limitation
In this subsection, we show the significance of the partially observable Riccati Equation (62) through a numerical experiment on the LQG problem with memory limitation. We consider the state x t ∈ R, the observation y t ∈ R, and the memory z t ∈ R, which evolve by the following SDEs: where x 0 and z 0 obey standard Gaussian distributions, y 0 is an arbitrary real number, ω t ∈ R and ν t ∈ R are independent standard Wiener processes, and u t = u(t, z t ) ∈ R and v t = v(t, z t ) ∈ R are the controls. The objective function to be minimized is provided as follows: Therefore, the objective of this problem is to minimize the state variance with small state and memory controls. Because this problem includes the memory control cost, it corresponds to the LQG problem with memory limitation.

Figure 2a-c shows the trajectories of Ψ and Π. Π xx and Π zz are larger than Ψ xx and Ψ zz , respectively, and Π xz is smaller than Ψ xz , which is consistent with our discussion in Section 6.3. Therefore, the partially observable Riccati equation may reduce the estimation error of the state from the memory. Moreover, while the memory control does not appear in the Riccati equation (Ψ xz = Ψ zz = 0), it appears in the partially observable Riccati equation (Π xz ≠ 0, Π zz ≠ 0), which is again consistent with our discussion in Section 6.3. As a result, the memory control plays an important role in estimating the state from the memory.

In order to clarify the significance of the partially observable Riccati Equation (62), we compare the performance of the optimal control function (58) with that of the following control function, in which Π is replaced with Ψ: The result is shown in Figure 2d-f. Under the control function (71), the distributions of the state and the memory are unstable, and the cumulative cost diverges. By contrast, under the optimal control function (58), the distributions of the state and the memory are stable, and the cumulative cost is smaller.
This result indicates that the partially observable Riccati Equation (62) plays an important role in the LQG problem with memory limitation.

Non-LQG Problem
In this subsection, we investigate the potential effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation of the conventional POSC [3,4]. We consider the state x t ∈ R and the observation y t ∈ R, which evolve according to the following SDEs: where x 0 obeys the Gaussian distribution p 0 (x 0 ) = N (x 0 |0, 0.01), y 0 is an arbitrary real number, ω t ∈ R and ν t ∈ R are independent standard Wiener processes, and u t = u(t, y 0:t ) ∈ R is the control. The objective function to be minimized is provided as follows: where Q(t, x) := 1000 if 0.3 ≤ t ≤ 0.6 and 0.1 ≤ |x| ≤ 2.0, and Q(t, x) := 0 otherwise.
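Since the cost function Q(t, x) is fully specified above, it can be implemented directly; a minimal sketch (the function name and signature are ours):

```python
def Q(t: float, x: float) -> float:
    """Obstacle cost from the non-LQG example: 1000 inside the region
    0.3 <= t <= 0.6, 0.1 <= |x| <= 2.0, and 0 elsewhere."""
    if 0.3 <= t <= 0.6 and 0.1 <= abs(x) <= 2.0:
        return 1000.0
    return 0.0
```

Note that the obstacle region is symmetric in x and leaves a narrow corridor |x| < 0.1 open, which is why a controller with good state estimation can pass between the obstacles.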
The cost function is high on the black rectangles in Figure 3a, which represent the obstacles. In addition, the terminal cost function is lowest on the black cross in Figure 3a, which represents the desired goal. Therefore, the system should avoid the obstacles and reach the goal with small control. Because the cost function is non-quadratic, this is a non-LQG problem, which cannot be solved exactly by the conventional POSC. In the local LQG approximation of the conventional POSC [3,4], the Zakai equation and the Bellman equation are locally approximated by the Kalman filter and the Riccati equation, respectively. Because the Bellman equation is reduced to the Riccati equation, the local LQG approximation can be solved numerically even in the non-LQG problem.
ML-POSC determines the control u t ∈ R based on the memory z t ∈ R, i.e., u t = u(t, z t ). The memory dynamics is formulated by the following SDE: where p 0 (z 0 ) = N (z 0 |0, 0.01). For the sake of simplicity, the memory control is not considered.

Figure 3 shows the numerical result comparing the local LQG approximation and ML-POSC. Because the local LQG approximation reduces the Bellman equation to the Riccati equation by discarding non-LQG information, it cannot avoid the obstacles, which results in a higher objective function. In contrast, because ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information, it can avoid the obstacles, which results in a lower objective function. Therefore, our numerical experiment shows that ML-POSC can be superior to the local LQG approximation.

Discussion
In this work, we propose ML-POSC, an alternative theoretical framework to the conventional POSC. ML-POSC first formulates the finite-dimensional and stochastic memory dynamics explicitly, and then optimizes the memory dynamics taking the memory cost into account. As a result, unlike the conventional POSC, ML-POSC can consider memory limitation as well as incomplete information. Furthermore, because the optimal control function of ML-POSC is obtained by solving the system of HJB-FP equations, ML-POSC can be solved in practice even in non-LQG problems.

ML-POSC generalizes the LQG problem to include memory limitation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation is modified to the partially observable Riccati equation, which improves estimation as well as control. Furthermore, ML-POSC provides a better result than the local LQG approximation in a non-LQG problem, as ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information.
ML-POSC is effective for the state estimation problem as well, which is a part of the POSC problem. Although the state estimation problem can be solved in principle by the Zakai equation [38-40], it cannot be solved directly, as the Zakai equation is infinite-dimensional. In order to resolve this problem, a particle filter is often used, which approximates the infinite-dimensional Zakai equation by a finite number of particles [38-40]. However, because the performance of the particle filter is guaranteed only in the limit of a large number of particles, it may not be practical when the available memory size is severely limited. Furthermore, a particle filter cannot take the memory noise and cost into account. ML-POSC resolves these problems, as it can optimize the state estimation under memory limitation.
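To make the memory argument concrete, the following is a minimal bootstrap particle filter for a scalar model; the dynamics, noise levels, and particle budget are our own illustrative assumptions, not taken from this paper. Each particle is one real number the controller must store, so the particle count directly measures the memory requirement that ML-POSC avoids.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrap particle filter for an illustrative scalar model:
#   dx = -x dt + dW,   y = x + 0.1 * noise  (all parameters assumed).
dt, steps, n_particles = 0.01, 100, 500
x = 0.0                                   # latent state (hidden from the filter)
particles = rng.normal(0.0, 1.0, n_particles)
for _ in range(steps):
    x += -x * dt + np.sqrt(dt) * rng.normal()          # state transition
    y = x + 0.1 * rng.normal()                         # noisy observation
    # propagate each particle through the same transition model
    particles += -particles * dt + np.sqrt(dt) * rng.normal(size=n_particles)
    # reweight by the Gaussian observation likelihood, then resample
    w = np.exp(-0.5 * ((y - particles) / 0.1) ** 2)
    w /= w.sum()
    particles = rng.choice(particles, size=n_particles, p=w)
x_hat = particles.mean()                  # posterior mean estimate of x
```

The estimate is accurate here only because 500 particles are affordable; shrinking the budget to a handful of particles degrades the approximation of the posterior, which is exactly the regime where a directly optimized finite-dimensional memory is preferable.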
ML-POSC may be extended from a single-agent system to a multi-agent system. POSC of a multi-agent system is called decentralized stochastic control (DSC) [41-43], which consists of a system and multiple controllers. In DSC, each controller needs to estimate the controls of the other controllers as well as the state of the system, which is essentially different from the conventional POSC. Because the estimation among the controllers is generally intractable, the conventional POSC approach cannot be straightforwardly extended to DSC. In contrast, ML-POSC compresses the observation history into a finite-dimensional memory, which simplifies the estimation among the controllers. Therefore, ML-POSC may provide an effective approach to DSC. Indeed, the finite-state controller, whose idea is similar to that of ML-POSC, plays a key role in extending POMDP from a single-agent system to a multi-agent system [22,44-48]. ML-POSC may be extended to a multi-agent system in a similar way to the finite-state controller.
ML-POSC can be naturally extended to the mean-field control setting [28-30] because ML-POSC is solved based on mean-field control theory. Therefore, ML-POSC can be applied to an infinite number of homogeneous agents. Furthermore, ML-POSC can be extended to a risk-sensitive setting, which is a special case of the mean-field control setting [28-30]. Therefore, ML-POSC can consider the variance of the cost as well as its expectation.
Nonetheless, more efficient algorithms are needed in order to solve ML-POSC with a high-dimensional state and memory. In mean-field games and control, neural network-based algorithms have recently been proposed which can solve high-dimensional problems efficiently [49,50]. By extending these algorithms, it might be possible to solve high-dimensional ML-POSC efficiently. Furthermore, unlike mean-field games and control, the coupling of the HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this property, more efficient algorithms for ML-POSC may be proposed [34].
the following equation is obtained: From the definition of the Hamiltonian H (10), the following Bellman equation is obtained: Because the control u is a function of the memory z in ML-POSC, the minimization over u can be exchanged with the expectation over p(z) as follows: Because the optimal control function is provided by the right-hand side of the Bellman Equation (A7) [10], the optimal control function is provided as follows. Because the FP Equation (23) is deterministic, the optimal control function satisfies u * (t, z) = u * (t, z, p t ).

Appendix B. Proof of Theorem 1
We first define the following, which satisfies W(T, p, s) = g(s). Differentiating the Bellman Equation (26) with respect to p, the following equation is obtained: where p t is the solution of the FP Equation (23). The time derivative of w(t, s) can be calculated as follows: By substituting (A12) into (A14), the following equation is obtained: From the FP Equation (23), ( * ) = 0 holds. Therefore, the HJB Equation (28) is obtained, and the optimal control functions are provided as follows.

We assume that p t (s) is given by the Gaussian distribution p t (s) = N (s|µ(t), Σ(t)) and that w(t, s) is given by the quadratic function From the initial condition of the FP equation, are satisfied. From the terminal condition of the HJB equation, Ψ(T) = P, Φ(T) = O, and β(T) = 0 are satisfied. In this case, u * (t, z), E p t (x|z) [∂w(t, s)/∂z], and κ * (t, z) can be calculated as follows: We then assume that the following equations are satisfied: In this case, µ x|z , Σ x|z , u * (t, z), E p t (x|z) [∂w(t, s)/∂z], and κ * (t, z) can be calculated as follows: κ * (t, z) = Σ x|z H (γγ ) −1 .
Because v * (t, z) is arbitrary when E p t (x|z) [∂w(t, s)/∂z] = 0, we consider v * (t, z) satisfying the following equation: In this case, the extended state SDE is provided by the following equation: where p 0 (s) = N (s|µ(0), Σ(0)), and Because the drift and diffusion coefficients of (A41) are linear and constant with respect to s, respectively, p t (s) becomes a Gaussian distribution, which is consistent with our assumption (A26), and µ(t) and Σ(t) evolve by the following ordinary differential equations: If µ x = µ z and Σ zz = Σ xz are satisfied, then dµ x /dt = dµ z /dt and dΣ xz /dt = dΣ zz /dt are satisfied as well, which is consistent with our assumptions µ x = µ z and Σ zz = Σ xz .