Article

Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach

by Takehiro Tottori 1,* and Tetsuya J. Kobayashi 1,2,3,4
1 Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan
2 Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
3 Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8654, Japan
4 Universal Biology Institute, The University of Tokyo, Tokyo 113-8654, Japan
* Author to whom correspondence should be addressed.
Entropy 2022, 24(11), 1599; https://doi.org/10.3390/e24111599
Submission received: 22 September 2022 / Revised: 28 October 2022 / Accepted: 28 October 2022 / Published: 3 November 2022
(This article belongs to the Special Issue Information Theory in Control Systems)

Abstract
Control problems with incomplete information and memory limitation appear in many practical situations. Although partially observable stochastic control (POSC) is a conventional theoretical framework that considers the optimal control problem with incomplete information, it cannot consider memory limitation. Furthermore, POSC cannot be solved in practice except in special cases. In order to address these issues, we propose an alternative theoretical framework, memory-limited POSC (ML-POSC). ML-POSC directly considers memory limitation as well as incomplete information, and it can be solved in practice by employing the technique of mean-field control theory. ML-POSC can generalize the linear-quadratic-Gaussian (LQG) problem to include memory limitation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation is modified to the partially observable Riccati equation, which improves estimation as well as control. Furthermore, we demonstrate the effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation.

1. Introduction

Control problems of systems with incomplete information and memory limitation appear in many practical situations. These constraints become especially predominant when designing the control of small devices [1,2], and are important for understanding the control mechanisms of biological systems [3,4,5,6,7,8] because their sensors are extremely noisy and their controllers can only have severely limited memories.
Partially observable stochastic control (POSC) is a conventional theoretical framework that considers the optimal control problem with one of these constraints, namely, the incomplete information of the system state (Figure 1b) [9]. Because the POSC controller cannot completely observe the state of the system, it determines the control based on the noisy observation history of the state. POSC can be solved in principle [10,11,12] by converting it to a completely observable stochastic control (COSC) of the posterior probability of the state, as the posterior probability represents the sufficient statistics of the observation history. The posterior probability and the optimal control are obtained by solving the Zakai equation and the Bellman equation, respectively.
However, POSC has three practical problems with respect to the implementation of the controller, which originate from ignoring the other constraint, namely, the memory limitation of the controller [1,2]. First, a controller designed by POSC should ideally have an infinite-dimensional memory to store and compute the posterior probability from the observation history. Second, the memory of the controller cannot have intrinsic stochasticity other than the observation noise to accurately compute the posterior probability via the Zakai equation. Third, POSC does not consider the cost originating from the memory update, which can be regarded as a cost of estimation. In light of the dualistic roles played by estimation and control, considering only control cost by ignoring estimation cost is asymmetric. As a result, POSC is not practical for control problems where the memory size, noise, and cost are non-negligible. Therefore, we need an alternative theoretical framework considering memory limitation to circumvent these three problems.
Furthermore, POSC has another crucial problem in obtaining the optimal state control by solving the Bellman equation [3,4]. Because the posterior probability of the state is infinite-dimensional, POSC corresponds to an infinite-dimensional COSC. In the infinite-dimensional COSC, the Bellman equation becomes a functional differential equation, which needs to be solved in order to obtain the optimal state control. However, solving a functional differential equation is generally intractable, even numerically.
In this work, we propose an alternative theoretical framework to the conventional POSC which can address the above-mentioned two issues. We call it memory-limited POSC (ML-POSC), in which memory limitation as well as incomplete information are directly accounted for (Figure 1c). The conventional POSC derives the Zakai equation without considering memory limitations. Then, the optimal state control is supposed to be derived by solving the Bellman equation, even though we do not have any practical way to do this. In contrast, ML-POSC first postulates the finite-dimensional and stochastic memory dynamics explicitly by taking the memory limitation into account and then jointly optimizes the memory dynamics and state control by considering the memory and control costs. As a result, unlike the conventional POSC, ML-POSC finds both the optimal state control and the optimal memory dynamics under given memory limitations. Furthermore, we show that the Bellman equation of ML-POSC can be reduced to the Hamilton–Jacobi–Bellman (HJB) equation by employing a trick from mean-field control theory [13,14,15]. While the Bellman equation is a functional differential equation, the HJB equation is a partial differential equation. As a result, ML-POSC can be solved, at least numerically.
The idea behind ML-POSC is closely related to that of the finite-state controller [16,17,18,19,20,21,22]. Finite-state controllers have been studied using the partially observable Markov decision process (POMDP), that is, the discrete-time and discrete-state counterpart of POSC. The finite-dimensional memory of ML-POSC can be regarded as an extension of the finite-state controller of POMDP to the continuous time and state setting. Nonetheless, the algorithms for the finite-state controller cannot be directly extended to the continuous setting, as they strongly depend on the discreteness. Although Fox and Tishby extended the finite-state controller to the continuous setting, their algorithm is restricted to a special case [1,2]. ML-POSC resolves this problem by employing the technique of mean-field control theory.
In the linear-quadratic-Gaussian (LQG) problem of the conventional POSC, the Zakai equation and the Bellman equation are reduced to the Kalman filter and the Riccati equation, respectively [9,23]. Because the infinite-dimensional Zakai equation is reduced to the finite-dimensional Kalman filter, the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. We show that the Kalman filter corresponds to the optimal memory dynamics of ML-POSC. Moreover, ML-POSC can generalize the LQG problem to include memory limitations such as the memory noise and cost. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation for control is modified to include estimation, which in this paper is called the partially observable Riccati equation. We demonstrate that the partially observable Riccati equation is superior to the conventional Riccati equation as concerns the LQG problem with memory limitation.
Then, we investigate the potential effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation of the conventional POSC [3,4]. In the local LQG approximation, the Zakai equation and the Bellman equation are locally approximated by the Kalman filter and the Riccati equation, respectively. Because the Bellman equation (a functional differential equation) is reduced to the Riccati equation (an ordinary differential equation), the local LQG approximation can be solved numerically. However, the performance of the local LQG approximation may be poor in a highly non-LQG problem, as the local LQG approximation ignores non-LQG information. In contrast, ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information. We demonstrate that ML-POSC can provide a better result than the local LQG approximation in a non-LQG problem.
This paper is organized as follows: In Section 2, we briefly review the conventional POSC. In Section 3, we formulate ML-POSC. In Section 4, we propose the mean-field control approach to ML-POSC. In Section 5, we investigate the LQG problem of the conventional POSC based on ML-POSC. In Section 6, we generalize the LQG problem to include memory limitation. In Section 7, we show numerical experiments involving an LQG problem with memory limitation and a non-LQG problem. Finally, in Section 8, we discuss our work.

2. Review of Partially Observable Stochastic Control

In this section, we briefly review the conventional POSC [11,15].

2.1. Problem Formulation

In this subsection, we formulate the conventional POSC [11,15]. The state $x_t \in \mathbb{R}^{d_x}$ and the observation $y_t \in \mathbb{R}^{d_y}$ at time $t \in [0, T]$ evolve by the following stochastic differential equations (SDEs):
$dx_t = b(t, x_t, u_t)\,dt + \sigma(t, x_t, u_t)\,d\omega_t,$ (1)
$dy_t = h(t, x_t)\,dt + \gamma(t)\,d\nu_t,$ (2)
where $x_0$ and $y_0$ obey $p_0(x_0)$ and $p_0(y_0)$, respectively, $\omega_t \in \mathbb{R}^{d_\omega}$ and $\nu_t \in \mathbb{R}^{d_\nu}$ are independent standard Wiener processes, and $u_t \in \mathbb{R}^{d_u}$ is the control. Here, $\gamma(t)\gamma^{\top}(t)$ is assumed to be invertible. In POSC, because the controller cannot completely observe the state $x_t$, the control $u_t$ is determined based on the observation history $y_{0:t} := \{ y_\tau \mid \tau \in [0, t] \}$, as follows:
$u_t = u(t, y_{0:t}).$ (3)
The objective function of POSC is provided by the following expected cumulative cost function:
$J[u] := \mathbb{E}_{p(x_{0:T}, y_{0:T}; u)}\left[ \int_0^T f(t, x_t, u_t)\,dt + g(x_T) \right],$ (4)
where $f$ is the cost function, $g$ is the terminal cost function, $p(x_{0:T}, y_{0:T}; u)$ is the probability of $x_{0:T}$ and $y_{0:T}$ given $u$ as a parameter, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to the probability $p$. Throughout this paper, the time horizon $T$ is assumed to be finite.
POSC is the problem of finding the optimal control function $u^*$ that minimizes the objective function $J[u]$ as follows:
$u^* := \operatorname*{argmin}_u J[u].$ (5)

2.2. Derivation of Optimal Control Function

In this subsection, we briefly review the derivation of the optimal control function of the conventional POSC [11,15]. We first define the unnormalized posterior probability density function $q_t(x) := p(x_t = x, y_{0:t})$, where we omit $y_{0:t}$ for notational simplicity. Here, $q_t(x)$ obeys the following Zakai equation:
$dq_t(x) = \mathcal{L}^{\dagger} q_t(x)\,dt + q_t(x)\,h^{\top}(t, x)\,(\gamma(t)\gamma^{\top}(t))^{-1}\,dy_t,$ (6)
where $q_0(x) = p_0(x)\,p_0(y)$ and $\mathcal{L}^{\dagger}$ is the forward diffusion operator, which is defined by
$\mathcal{L}^{\dagger} q(x) := -\sum_{i=1}^{d_x} \frac{\partial\left( b_i(t, x, u)\,q(x) \right)}{\partial x_i} + \frac{1}{2}\sum_{i,j=1}^{d_x} \frac{\partial^2\left( D_{ij}(t, x, u)\,q(x) \right)}{\partial x_i \partial x_j},$ (7)
where $D(t, x, u) := \sigma(t, x, u)\,\sigma^{\top}(t, x, u)$. Then, the objective function (4) can be calculated as follows:
$J[u] = \mathbb{E}_{p(q_{0:T}; u)}\left[ \int_0^T \bar{f}(t, q_t, u_t)\,dt + \bar{g}(q_T) \right],$ (8)
where $\bar{f}(t, q, u) := \mathbb{E}_{q(x)}[f(t, x, u)]$ and $\bar{g}(q) := \mathbb{E}_{q(x)}[g(x)]$. From (6) and (8), POSC is converted into a COSC of $q_t$. As a result, POSC can be approached in a similar way to COSC, and the optimal control function is provided by the following proposition.
Proposition 1 
([11,15]). The optimal control function of POSC is provided by
$u^*(t, q) = \operatorname*{argmin}_u \mathbb{E}_{q(x)}\left[ H\!\left(t, x, u, \frac{\delta V(t, q)}{\delta q(x)}\right) \right],$ (9)
where $H$ is the Hamiltonian, which is defined by
$H\!\left(t, x, u, \frac{\delta V(t, q)}{\delta q(x)}\right) := f(t, x, u) + \mathcal{L}\,\frac{\delta V(t, q)}{\delta q(x)},$ (10)
and $\mathcal{L}$ is the backward diffusion operator, which is defined by
$\mathcal{L} q(x) := \sum_{i=1}^{d_x} b_i(t, x, u)\frac{\partial q(x)}{\partial x_i} + \frac{1}{2}\sum_{i,j=1}^{d_x} D_{ij}(t, x, u)\frac{\partial^2 q(x)}{\partial x_i \partial x_j}.$ (11)
We note that $\mathcal{L}$ is the conjugate of $\mathcal{L}^{\dagger}$; furthermore, $V(t, q)$ is the value function, which is the solution of the following Bellman equation:
$-\frac{\partial V(t, q)}{\partial t} = \mathbb{E}_{q(x)}\left[ H\!\left(t, x, u^*, \frac{\delta V(t, q)}{\delta q(x)}\right) \right] + \frac{1}{2}\,\mathbb{E}_{q(x) q(x')}\left[ \frac{\delta^2 V(t, q)}{\delta q(x)\,\delta q(x')}\, h^{\top}(t, x)\,(\gamma(t)\gamma^{\top}(t))^{-1}\, h(t, x') \right],$ (12)
where $V(T, q) = \mathbb{E}_{q(x)}[g(x)]$.
Proof. 
The proof is shown in [11,15]. □
The optimal control function $u^*(t, q)$ is obtained by solving the Bellman Equation (12). The controller determines the optimal control $u_t^* = u^*(t, q_t)$ based on the posterior probability $q_t$. The posterior probability $q_t$ is obtained by solving the Zakai Equation (6). As a result, POSC can be solved in principle.
However, POSC has three practical problems with respect to the memory of the controller. First, the controller should have an infinite-dimensional memory to store and compute the posterior probability $q_t$ from the observation history $y_{0:t}$. Second, the memory of the controller cannot have intrinsic stochasticity other than the observation $dy_t$ to accurately compute the posterior probability $q_t$ via the Zakai Equation (6). Third, POSC does not consider the cost originating from the memory update, which can be regarded as a cost of estimation. In light of the dualistic roles played by estimation and control, considering only control cost by ignoring estimation cost is asymmetric. As a result, POSC is not practical for control problems where the memory size, noise, and cost are non-negligible.
Furthermore, POSC has another crucial problem in obtaining the optimal control function $u^*(t, q)$ by solving the Bellman Equation (12). Because the posterior probability $q$ is infinite-dimensional, the associated Bellman Equation (12) becomes a functional differential equation. However, solving a functional differential equation is generally intractable even numerically. As a result, POSC cannot be solved in practice.

3. Memory-Limited Partially Observable Stochastic Control

In order to address the above-mentioned problems, we propose an alternative theoretical framework to the conventional POSC called ML-POSC. In this section, we formulate ML-POSC.

3.1. Problem Formulation

In this subsection, we formulate ML-POSC. ML-POSC determines the control $u_t$ based on the finite-dimensional memory $z_t \in \mathbb{R}^{d_z}$ as follows:
$u_t = u(t, z_t).$ (13)
The memory dimension $d_z$ is determined not by the optimization but by the prescribed memory limitation of the controller to be used. Comparing (3) and (13), the memory $z_t$ can be interpreted as the compression of the observation history $y_{0:t}$. While the conventional POSC compresses the observation history $y_{0:t}$ into the infinite-dimensional posterior probability $q_t$, ML-POSC compresses it into the finite-dimensional memory $z_t$.
ML-POSC formulates the memory dynamics with the following SDE:
$dz_t = c(t, z_t, v_t)\,dt + \kappa(t, z_t, v_t)\,dy_t + \eta(t, z_t, v_t)\,d\xi_t,$ (14)
where $z_0$ obeys $p_0(z_0)$, $\xi_t \in \mathbb{R}^{d_\xi}$ is the standard Wiener process, and $v_t = v(t, z_t) \in \mathbb{R}^{d_v}$ is the control for the memory dynamics. This memory dynamics has three important properties: (i) because it depends on the observation $dy_t$, the memory $z_t$ can be interpreted as the compression of the observation history $y_{0:t}$; (ii) because it depends on the standard Wiener process $d\xi_t$, ML-POSC can consider the memory noise explicitly; (iii) because it depends on the control $v_t$, it can be optimized through the control $v_t$.
The objective function of ML-POSC is provided by the following expected cumulative cost function:
$J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[ \int_0^T f(t, x_t, u_t, v_t)\,dt + g(x_T) \right].$ (15)
Because the cost function $f$ depends on the memory control $v_t$ as well as the state control $u_t$, ML-POSC can consider the memory control cost (state estimation cost) as well as the state control cost explicitly.
ML-POSC optimizes the state control function $u$ and the memory control function $v$ based on the objective function $J[u, v]$, as follows:
$(u^*, v^*) := \operatorname*{argmin}_{u, v} J[u, v].$ (16)
ML-POSC first postulates the finite-dimensional and stochastic memory dynamics explicitly, then jointly optimizes the state and memory control function by considering the state and memory control cost. As a result, unlike the conventional POSC, ML-POSC can consider memory limitation as well as incomplete information.
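To make the closed-loop structure of (1), (2), and (14) concrete, the following minimal sketch simulates one sample path with an Euler–Maruyama discretization. All drift, gain, and control functions below are illustrative placeholders rather than quantities from the paper; they only indicate where the state, observation, and memory updates enter.

```python
import numpy as np

# Minimal Euler-Maruyama sketch of the ML-POSC closed loop (1), (2), (14):
# the controller sees only the memory z_t, which is driven by the observation
# increments dy_t.  All model and control functions are illustrative placeholders.

def b(t, x, u):      return -x + u        # state drift (assumed)
def h(t, x):         return x             # observation drift (assumed)
def c(t, z, v):      return v             # memory drift (assumed)
def kappa(t, z, v):  return 1.0           # observation gain (assumed)
def u_ctrl(t, z):    return -z            # state control u(t, z) (assumed)
def v_ctrl(t, z):    return -0.1 * z      # memory control v(t, z) (assumed)

def simulate(T=1.0, dt=1e-3, sigma=1.0, gamma=1.0, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x, z = rng.normal(), 0.0              # x_0 ~ p_0(x_0), z_0 = 0 (assumed)
    xs, zs = [x], [z]
    for k in range(int(T / dt)):
        t = k * dt
        u, v = u_ctrl(t, z), v_ctrl(t, z)
        dw, dnu, dxi = rng.normal(scale=np.sqrt(dt), size=3)
        dy = h(t, x) * dt + gamma * dnu                          # observation increment (2)
        x += b(t, x, u) * dt + sigma * dw                        # state update (1)
        z += c(t, z, v) * dt + kappa(t, z, v) * dy + eta * dxi   # memory update (14)
        xs.append(x); zs.append(z)
    return np.array(xs), np.array(zs)

xs, zs = simulate()
print(xs[-1], zs[-1])
```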

3.2. Problem Reformulation

Although the formulation of ML-POSC in the previous subsection clarifies its relationship with that of the conventional POSC, it is inconvenient for further mathematical investigations. In order to resolve this problem, we reformulate ML-POSC in this subsection. The formulation in this subsection is simpler and more general than that in the previous subsection.
We first define the extended state $s_t$ as follows:
$s_t := \begin{pmatrix} x_t \\ z_t \end{pmatrix} \in \mathbb{R}^{d_s},$ (17)
where $d_s = d_x + d_z$. The extended state $s_t$ evolves by the following SDE:
$ds_t = \tilde{b}(t, s_t, \tilde{u}_t)\,dt + \tilde{\sigma}(t, s_t, \tilde{u}_t)\,d\tilde{\omega}_t,$ (18)
where $s_0$ obeys $p_0(s_0)$, $\tilde{\omega}_t \in \mathbb{R}^{d_{\tilde{\omega}}}$ is the standard Wiener process, and $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ is the control. ML-POSC determines the control $\tilde{u}_t$ based solely on the memory $z_t$, as follows:
$\tilde{u}_t = \tilde{u}(t, z_t).$ (19)
The extended state SDE (18) includes the previous state, observation, and memory SDEs (1), (2), and (14) as a special case; they can be represented as follows:
$ds_t = \begin{pmatrix} b(t, x_t, u_t) \\ c(t, z_t, v_t) + \kappa(t, z_t, v_t)\,h(t, x_t) \end{pmatrix} dt + \begin{pmatrix} \sigma(t, x_t, u_t) & O & O \\ O & \kappa(t, z_t, v_t)\gamma(t) & \eta(t, z_t, v_t) \end{pmatrix} \begin{pmatrix} d\omega_t \\ d\nu_t \\ d\xi_t \end{pmatrix},$ (20)
where $p_0(s_0) = p_0(x_0)\,p_0(z_0)$.
The objective function of ML-POSC is provided by the following expected cumulative cost function:
$J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[ \int_0^T \tilde{f}(t, s_t, \tilde{u}_t)\,dt + \tilde{g}(s_T) \right],$ (21)
where $\tilde{f}$ is the cost function and $\tilde{g}$ is the terminal cost function. It is obvious that this objective function (21) is more general than the previous one (15).
ML-POSC is the problem of finding the optimal control function $\tilde{u}^*$ that minimizes the objective function $J[\tilde{u}]$ as follows:
$\tilde{u}^* := \operatorname*{argmin}_{\tilde{u}} J[\tilde{u}].$ (22)
In the following section, we mainly consider the formulation in this subsection rather than that of the previous subsection, as it is simpler and more general. Moreover, we omit $\tilde{\cdot}$ for notational simplicity.

4. Mean-Field Control Approach

If the control $u_t$ is determined based on the extended state $s_t$, i.e., $u_t = u(t, s_t)$, ML-POSC is the same as COSC of the extended state $s_t$ and can be solved by the conventional COSC approach [10]. However, because ML-POSC determines the control $u_t$ based solely on the memory $z_t$, i.e., $u_t = u(t, z_t)$, ML-POSC cannot be solved in the same way as COSC. In order to solve ML-POSC, we propose the mean-field control approach in this section. Because the mean-field control approach is more general than the COSC approach, it can solve COSC and ML-POSC in a unified way.

4.1. Derivation of Optimal Control Function

In this subsection, we propose the mean-field control approach to ML-POSC. We first show that ML-POSC can be converted into a deterministic control of the probability density function, which is similar to the conventional POSC [11,15]. This approach is used in mean-field control as well [13,14,24,25]. The extended state SDE (18) can be converted into the following Fokker–Planck (FP) equation:
$\frac{\partial p_t(s)}{\partial t} = \mathcal{L}^{\dagger} p_t(s),$ (23)
where the initial condition is provided by $p_0(s)$ and the forward diffusion operator $\mathcal{L}^{\dagger}$ is defined by (7). The objective function of ML-POSC (21) can be calculated as follows:
$J[u] = \int_0^T \bar{f}(t, p_t, u_t)\,dt + \bar{g}(p_T),$ (24)
where $\bar{f}(t, p, u) := \mathbb{E}_{p(s)}[f(t, s, u)]$ and $\bar{g}(p) := \mathbb{E}_{p(s)}[g(s)]$. From (23) and (24), ML-POSC is converted into a deterministic control of $p_t$. As a result, ML-POSC can be approached in a similar way to deterministic control, and the optimal control function is provided by the following lemma.
Lemma 1. 
The optimal control function of ML-POSC is provided by
$u^*(t, z) = \operatorname*{argmin}_u \mathbb{E}_{p_t(x|z)}\left[ H\!\left(t, s, u, \frac{\delta V(t, p_t)}{\delta p(s)}\right) \right],$ (25)
where $H$ is the Hamiltonian (10), $p_t(x|z) = p_t(s) / \int p_t(s)\,dx$ is the conditional probability density function of the state $x$ given the memory $z$, $p_t(s)$ is the solution of the FP Equation (23), and $V(t, p)$ is the solution of the following Bellman equation:
$-\frac{\partial V(t, p)}{\partial t} = \mathbb{E}_{p(s)}\left[ H\!\left(t, s, u^*, \frac{\delta V(t, p)}{\delta p(s)}\right) \right],$ (26)
where $V(T, p) = \mathbb{E}_{p(s)}[g(s)]$.
Proof. 
The proof is shown in Appendix A. □
The controller of ML-POSC determines the optimal control $u_t^* = u^*(t, z_t)$ based on the memory $z_t$, not the posterior probability $q_t$. Therefore, ML-POSC can consider memory limitation as well as incomplete information.
However, because the Bellman Equation (26) is a functional differential equation, it cannot be solved, even numerically, which is the same problem as the conventional POSC. We resolve this problem by employing the technique of the mean-field control theory [13,14] as follows.
Theorem 1. 
The optimal control function of ML-POSC is provided by
$u^*(t, z) = \operatorname*{argmin}_u \mathbb{E}_{p_t(x|z)}\left[ H(t, s, u, w(t, s)) \right],$ (27)
where $H$ is the Hamiltonian (10), $p_t(x|z) = p_t(s) / \int p_t(s)\,dx$ is the conditional probability density function of the state $x$ given the memory $z$, $p_t(s)$ is the solution of the FP Equation (23), and $w(t, s)$ is the solution of the following Hamilton–Jacobi–Bellman (HJB) equation:
$-\frac{\partial w(t, s)}{\partial t} = H(t, s, u^*, w(t, s)),$ (28)
where $w(T, s) = g(s)$.
Proof. 
The proof is shown in Appendix B. □
While the Bellman Equation (26) is a functional differential equation, the HJB Equation (28) is a partial differential equation. As a result, unlike the conventional POSC, ML-POSC can be solved in practice.
We note that the mean-field control technique is applicable to the conventional POSC as well, and we obtain the HJB equation of the conventional POSC [15]. However, the HJB equation of the conventional POSC is not closed by a partial differential equation due to the last term of the Bellman Equation (12). As a result, the mean-field control technique is not effective with the conventional POSC except in a special case [15].
In the conventional POSC, the state estimation (memory control) and the state control are clearly separated. As a result, the state estimation and the state control are optimized by the Zakai Equation (6) and the Bellman Equation (12), respectively. In contrast, because ML-POSC considers memory limitation as well as incomplete information, the state estimation and the state control are not clearly separated. As a result, ML-POSC jointly optimizes the state estimation and the state control based on the FP Equation (23) and the HJB Equation (28).

4.2. Comparison with Completely Observable Stochastic Control

In this subsection, we show the similarities and differences between ML-POSC and COSC of the extended state. While ML-POSC determines the control $u_t$ based solely on the memory $z_t$, i.e., $u_t = u(t, z_t)$, COSC of the extended state determines the control $u_t$ based on the extended state $s_t$, i.e., $u_t = u(t, s_t)$. The optimal control function of COSC of the extended state is provided by the following proposition.
Proposition 2 
([10]). The optimal control function of COSC of the extended state is provided by
$u^*(t, s) = \operatorname*{argmin}_u H(t, s, u, w(t, s)),$ (29)
where $H$ is the Hamiltonian (10) and $w(t, s)$ is the solution of the HJB Equation (28).
Proof. 
The conventional proof is shown in [10]. We note that it can be proven in a similar way as ML-POSC, which is shown in Appendix C. □
Although the HJB Equation (28) is the same between ML-POSC and COSC, the optimal control function is different. While the optimal control function of COSC is provided by the minimization of the Hamiltonian (29), that of ML-POSC is provided by the minimization of the conditional expectation of the Hamiltonian (27). This is reasonable, as the controller of ML-POSC needs to estimate the state from the memory.

4.3. Numerical Algorithm

In this subsection, we briefly explain a numerical algorithm to obtain the optimal control function of ML-POSC (27). Because the optimal control function of COSC (29) depends only on the backward HJB Equation (28), it can be obtained by solving the HJB equation backwards from the terminal condition [10,26,27]. In contrast, because the optimal control function of ML-POSC (27) depends on the forward FP Equation (23) as well as the backward HJB Equation (28), it cannot be obtained in a similar way as COSC. Because the backward HJB equation depends on the forward FP equation through the optimal control function of ML-POSC, the HJB equation cannot be solved backwards from the terminal condition. As a result, ML-POSC needs to solve the system of HJB-FP equations.
The system of HJB-FP equations appears in the mean-field game and control [28,29,30], and many numerical algorithms have been developed [31,32,33]. Therefore, unlike the conventional POSC, ML-POSC can be solved in practice using these algorithms. Furthermore, unlike the mean-field game and control, the coupling of HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this property, more efficient algorithms may be proposed for ML-POSC [34].
In this paper, we use the forward–backward sweep method (the fixed-point iteration method) to obtain the optimal control function of ML-POSC [33,34,35,36,37], which is one of the most basic algorithms for the system of HJB-FP equations. The forward–backward sweep method computes the forward FP Equation (23) and the backward HJB Equation (28) alternately. In the mean-field game and control, the convergence of the forward–backward sweep method is not guaranteed. In contrast, it is guaranteed in ML-POSC because the coupling of HJB-FP equations is limited to the optimal control function [34].
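The sketch below shows only the structure of this forward–backward sweep; the three routines are placeholders standing in for problem-specific integrators of the FP Equation (23), the HJB Equation (28), and the minimization (27), and are assumptions of this illustration rather than part of the paper.

```python
import numpy as np

# Structural sketch of the forward-backward sweep (fixed-point iteration) for the
# coupled HJB-FP system.  The solver bodies are trivial placeholders; only the
# alternation between the forward and backward passes is the point here.

def solve_fp_forward(control, p0, ts):
    """Integrate the FP equation (23) forward under the given control (placeholder)."""
    return [p0 for _ in ts]

def solve_hjb_backward(p_traj, gT, ts):
    """Integrate the HJB equation (28) backward along the density trajectory (placeholder)."""
    return [gT for _ in ts]

def optimal_control(p_traj, w_traj):
    """Minimize the conditional expectation of the Hamiltonian (27) (placeholder)."""
    return lambda t, z: 0.0

def forward_backward_sweep(p0, gT, ts, n_iter=20):
    control = lambda t, z: 0.0                          # initial guess
    for _ in range(n_iter):
        p_traj = solve_fp_forward(control, p0, ts)      # forward pass: FP (23)
        w_traj = solve_hjb_backward(p_traj, gT, ts)     # backward pass: HJB (28)
        control = optimal_control(p_traj, w_traj)       # control update from (27)
    return control

ctrl = forward_backward_sweep(p0=None, gT=0.0, ts=np.linspace(0.0, 1.0, 101))
print(ctrl(0.0, 0.0))
```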

5. Linear-Quadratic-Gaussian Problem without Memory Limitation

In the LQG problem of the conventional POSC, the Zakai Equation (6) and the Bellman Equation (12) are reduced to the Kalman filter and the Riccati equation, respectively [9,23]. Because the infinite-dimensional Zakai equation is reduced to the finite-dimensional Kalman filter, the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. In this section, we briefly review the LQG problem of the conventional POSC, then reproduce the Kalman filter and the Riccati equation from the viewpoint of ML-POSC. The LQG problem of the conventional POSC corresponds to the LQG problem without memory limitation, as it does not consider the memory noise and cost.

5.1. Review of Partially Observable Stochastic Control

In this subsection, we briefly review the LQG problem of the conventional POSC [9,23]. The state $x_t \in \mathbb{R}^{d_x}$ and the observation $y_t \in \mathbb{R}^{d_y}$ at time $t \in [0, T]$ evolve by the following SDEs:
$dx_t = \left( A(t)\,x_t + B(t)\,u_t \right) dt + \sigma(t)\,d\omega_t,$ (30)
$dy_t = H(t)\,x_t\,dt + \gamma(t)\,d\nu_t,$ (31)
where $x_0$ obeys the Gaussian distribution $p_0(x_0) = \mathcal{N}(x_0 \mid \mu_{x,0}, \Sigma_{xx,0})$, $y_0$ is an arbitrary real vector, $\omega_t \in \mathbb{R}^{d_\omega}$ and $\nu_t \in \mathbb{R}^{d_\nu}$ are independent standard Wiener processes, and $u_t = u(t, y_{0:t}) \in \mathbb{R}^{d_u}$ is the control. Here, $\gamma(t)\gamma^{\top}(t)$ is assumed to be invertible. The objective function is provided by the following expected cumulative cost function:
$J[u] := \mathbb{E}_{p(x_{0:T}, y_{0:T}; u)}\left[ \int_0^T \left( x_t^{\top} Q(t) x_t + u_t^{\top} R(t) u_t \right) dt + x_T^{\top} P x_T \right],$ (32)
where $Q(t) \succeq O$, $R(t) \succ O$, and $P \succeq O$. The LQG problem of the conventional POSC is to find the optimal control function $u^*$ that minimizes the objective function $J[u]$, as follows:
$u^* := \operatorname*{argmin}_u J[u].$ (33)
In the LQG problem of the conventional POSC, the posterior probability is provided by the Gaussian distribution $p(x_t \mid y_{0:t}) = \mathcal{N}(x_t \mid \check{\mu}(t), \check{\Sigma}(t))$, and $u_t = u(t, y_{0:t})$ is reduced to $u_t = u(t, \check{\mu}_t)$ without loss of performance.
Proposition 3 
([9,23]). In the LQG problem without memory limitation, the optimal control function of POSC (33) is provided by
$u^*(t, \check{\mu}) = -R^{-1} B^{\top} \Psi \check{\mu},$ (34)
where $\check{\mu}(t)$ and $\check{\Sigma}(t)$ are the solutions of the following Kalman filter:
$d\check{\mu} = \left( A - B R^{-1} B^{\top} \Psi \right) \check{\mu}\,dt + \check{\Sigma} H^{\top} (\gamma\gamma^{\top})^{-1} \left( dy_t - H \check{\mu}\,dt \right),$ (35)
$\frac{d\check{\Sigma}}{dt} = \sigma\sigma^{\top} + A\check{\Sigma} + \check{\Sigma}A^{\top} - \check{\Sigma} H^{\top} (\gamma\gamma^{\top})^{-1} H \check{\Sigma},$ (36)
with $\check{\mu}(0) = \mu_{x,0}$ and $\check{\Sigma}(0) = \Sigma_{xx,0}$, and where $\Psi(t)$ is the solution of the following Riccati equation:
$-\frac{d\Psi}{dt} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi,$ (37)
where $\Psi(T) = P$.
Proof. 
The proof is shown in [9,23]. □
In the LQG problem of the conventional POSC, the Zakai Equation (6) and the Bellman Equation (12) are reduced to the Kalman filter (35) and (36) and the Riccati Equation (37), respectively.
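As a concrete illustration of Proposition 3, the following scalar sketch integrates the Riccati Equation (37) backward and the filter covariance (36) forward with explicit Euler steps, then runs the Kalman filter (35) with the optimal control (34) on one simulated path. All system parameters and step sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Scalar sketch of Proposition 3: Riccati equation (37) backward, filter
# covariance (36) forward, then Kalman filter (35) and optimal control (34)
# on a simulated path.  All parameters below are illustrative assumptions.

A, B, H, sig, gam = 1.0, 1.0, 1.0, 1.0, 1.0
Q, R, P = 1.0, 1.0, 1.0
T, dt = 1.0, 1e-3
n = int(T / dt)

# Riccati equation (37), integrated backward from Psi(T) = P.
Psi = np.empty(n + 1); Psi[n] = P
for k in range(n, 0, -1):
    dPsi = Q + 2 * A * Psi[k] - Psi[k] * B / R * B * Psi[k]
    Psi[k - 1] = Psi[k] + dPsi * dt

# Filter covariance (36), integrated forward from Sigma(0) = Sigma_{xx,0}.
Sig = np.empty(n + 1); Sig[0] = 1.0
for k in range(n):
    dSig = sig**2 + 2 * A * Sig[k] - Sig[k] * H / gam**2 * H * Sig[k]
    Sig[k + 1] = Sig[k] + dSig * dt

# Kalman filter (35) and optimal control (34) on one sample path.
rng = np.random.default_rng(0)
x, mu = rng.normal(), 0.0
for k in range(n):
    u = -B * Psi[k] * mu / R                                   # (34)
    dy = H * x * dt + gam * rng.normal(scale=np.sqrt(dt))
    x += (A * x + B * u) * dt + sig * rng.normal(scale=np.sqrt(dt))
    mu += (A - B * B * Psi[k] / R) * mu * dt \
          + Sig[k] * H / gam**2 * (dy - H * mu * dt)           # (35)
print(x, mu)
```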

5.2. Memory-Limited Partially Observable Stochastic Control

Because the infinite-dimensional Zakai Equation (6) is reduced to the finite-dimensional Kalman filter (35) and (36), the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. In this subsection, we reproduce the Kalman filter (35) and (36) and the Riccati Equation (37) from the viewpoint of ML-POSC.
ML-POSC defines the finite-dimensional memory $z_t \in \mathbb{R}^{d_z}$. In the LQG problem of the conventional POSC, the memory dimension $d_z$ is the same as the state dimension $d_x$. The controller of ML-POSC determines the control $u_t$ based on the memory $z_t$, i.e., $u_t = u(t, z_t)$. The memory $z_t$ is assumed to evolve by the following SDE:
$dz_t = v_t\,dt + \kappa_t\,dy_t,$ (38)
where $z_0 = \mu_{x,0}$, while $v_t = v(t, z_t) \in \mathbb{R}^{d_z}$ and $\kappa_t = \kappa(t, z_t) \in \mathbb{R}^{d_z \times d_y}$ are the memory controls. We note that the LQG problem of the conventional POSC does not consider the memory noise. The objective function of ML-POSC is provided by the following expected cumulative cost function:
$J[u, v, \kappa] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v, \kappa)}\left[ \int_0^T \left( x_t^{\top} Q(t) x_t + u_t^{\top} R(t) u_t \right) dt + x_T^{\top} P x_T \right].$ (39)
We note that the LQG problem of the conventional POSC does not consider the memory control cost. ML-POSC optimizes $u$, $v$, and $\kappa$ based on $J[u, v, \kappa]$, as follows:
$(u^*, v^*, \kappa^*) := \operatorname*{argmin}_{u, v, \kappa} J[u, v, \kappa].$ (40)
In the LQG problem of the conventional POSC, the probability of the extended state $s_t$ (17) is provided by the Gaussian distribution $p_t(s_t) = \mathcal{N}(s_t \mid \mu(t), \Sigma(t))$. The posterior probability of the state $x_t$ given the memory $z_t$ is provided by the Gaussian distribution $p_t(x_t \mid z_t) = \mathcal{N}(x_t \mid \mu_{x|z}(t, z_t), \Sigma_{x|z}(t))$, where $\mu_{x|z}(t, z_t)$ and $\Sigma_{x|z}(t)$ are provided as follows:
$\mu_{x|z}(t, z_t) = \mu_x(t) + \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t)\left( z_t - \mu_z(t) \right),$ (41)
$\Sigma_{x|z}(t) = \Sigma_{xx}(t) - \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t)\Sigma_{zx}(t).$ (42)
Theorem 2. 
In the LQG problem without memory limitation, the optimal control functions of ML-POSC (40) are provided by
$u^*(t, z) = -R^{-1} B^{\top} \Psi z,$ (43)
$v^*(t, z) = \left( A - B R^{-1} B^{\top} \Psi - \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1} H \right) z,$ (44)
$\kappa^*(t, z) = \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1}.$ (45)
From $v^*(t, z)$ and $\kappa^*(t, z)$, $z_t$ and $\Sigma_{x|z}(t)$ obey the following equations:
$dz_t = \left( A - B R^{-1} B^{\top} \Psi \right) z_t\,dt + \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1} \left( dy_t - H z_t\,dt \right),$ (46)
$\frac{d\Sigma_{x|z}}{dt} = \sigma\sigma^{\top} + A\Sigma_{x|z} + \Sigma_{x|z}A^{\top} - \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1} H \Sigma_{x|z},$ (47)
where $z_0 = \mu_{x,0}$ and $\Sigma_{x|z}(0) = \Sigma_{xx,0}$. Furthermore, $\mu_{x|z}(t, z_t) = z_t$ holds in this problem. $\Psi(t)$ is the solution of the Riccati Equation (37).
Proof. 
The proof is shown in Appendix D. □
In the LQG problem of the conventional POSC, the optimal memory dynamics of ML-POSC (46) and (47) corresponds to the Kalman filter (35) and (36). Furthermore, ML-POSC reproduces the Riccati Equation (37).

6. Linear-Quadratic-Gaussian Problem with Memory Limitation

The LQG problem of the conventional POSC does not consider memory limitation because it does not consider the memory noise and cost. Furthermore, because the memory dimension is restricted to the state dimension, the memory dimension cannot be determined according to a given controller. ML-POSC can generalize the LQG problem to include the memory limitation. In this section, we discuss the LQG problem with memory limitation based on ML-POSC.

6.1. Problem Formulation

In this subsection, we formulate the LQG problem with memory limitation. The state and observation SDEs are the same as in the previous section and are provided by (30) and (31), respectively. The controller of ML-POSC determines the control $u_t \in \mathbb{R}^{d_u}$ based on the memory $z_t \in \mathbb{R}^{d_z}$, i.e., $u_t = u(t, z_t)$. Unlike the LQG problem of the conventional POSC, the memory dimension $d_z$ is not necessarily the same as the state dimension $d_x$.
The memory $z_t$ is assumed to evolve according to the following SDE:
$dz_t = v_t\,dt + \kappa(t)\,dy_t + \eta(t)\,d\xi_t,$ (48)
where $z_0$ obeys the Gaussian distribution $p_0(z_0) = \mathcal{N}(z_0 \mid \mu_{z,0}, \Sigma_{zz,0})$, $\xi_t \in \mathbb{R}^{d_\xi}$ is the standard Wiener process, and $v_t = v(t, z_t) \in \mathbb{R}^{d_v}$ is the control. Because the initial condition $z_0$ is stochastic and the memory SDE (48) includes the intrinsic stochasticity $d\xi_t$, the LQG problem of ML-POSC can consider the memory noise explicitly. We note that $\kappa(t)$ is independent of the memory $z_t$. If $\kappa(t)$ depended on the memory $z_t$, the memory SDE (48) would become non-linear and non-Gaussian, and the optimal control functions could not be derived explicitly. In order to keep the memory SDE (48) linear and Gaussian so that the optimal control functions can be obtained explicitly, we restrict $\kappa(t)$ to be independent of the memory $z_t$ in the LQG problem with memory limitation. The LQG problem without memory limitation is the special case in which the optimal control $\kappa_t^* = \kappa^*(t, z_t)$ in (45) does not depend on the memory $z_t$.
The objective function is provided by the following expected cumulative cost function:
$J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[ \int_0^T \left( x_t^{\top} Q(t) x_t + u_t^{\top} R(t) u_t + v_t^{\top} M(t) v_t \right) dt + x_T^{\top} P x_T \right],$ (49)
where $Q(t) \succeq O$, $R(t) \succ O$, $M(t) \succ O$, and $P \succeq O$. Because the cost function includes $v_t^{\top} M(t) v_t$, the LQG problem of ML-POSC can consider the memory control cost explicitly. ML-POSC optimizes the state control function $u$ and the memory control function $v$ based on the objective function $J[u, v]$, as follows:
$(u^*, v^*) := \operatorname*{argmin}_{u, v} J[u, v].$ (50)
For the sake of simplicity, we do not optimize $\kappa(t)$, although this can be accomplished by considering unobservable stochastic control.

6.2. Problem Reformulation

Although the formulation of the LQG problem with memory limitation in the previous subsection clarifies its relationship with that of the LQG problem without memory limitation, it is inconvenient for further mathematical investigations. In order to resolve this problem, we reformulate the LQG problem with memory limitation based on the extended state s t (17). The formulation in this subsection is simpler and more general than that in the previous subsection.
In the LQG problem with memory limitation, the extended state SDE (18) is provided as follows:
$ds_t = \left( \tilde{A}(t)\,s_t + \tilde{B}(t)\,\tilde{u}_t \right) dt + \tilde{\sigma}(t)\,d\tilde{\omega}_t,$ (51)
where $s_0$ obeys the Gaussian distribution $p_0(s_0) := \mathcal{N}(s_0 \mid \mu_0, \Sigma_0)$, $\tilde{\omega}_t \in \mathbb{R}^{d_{\tilde{\omega}}}$ is the standard Wiener process, and $\tilde{u}_t = \tilde{u}(t, z_t) \in \mathbb{R}^{d_{\tilde{u}}}$ is the control. The extended state SDE (51) includes the previous state, observation, and memory SDEs (30), (31), and (48) as a special case because they can be represented as follows:
$ds_t = \left[ \begin{pmatrix} A & O \\ \kappa H & O \end{pmatrix} s_t + \begin{pmatrix} B & O \\ O & I \end{pmatrix} \tilde{u}_t \right] dt + \begin{pmatrix} \sigma & O & O \\ O & \kappa\gamma & \eta \end{pmatrix} \begin{pmatrix} d\omega_t \\ d\nu_t \\ d\xi_t \end{pmatrix},$ (52)
where $p_0(s_0) = p_0(x_0)\,p_0(z_0)$.
The objective function (21) is provided by the following expected cumulative cost function:
$J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[ \int_0^T \left( s_t^{\top} \tilde{Q}(t) s_t + \tilde{u}_t^{\top} \tilde{R}(t) \tilde{u}_t \right) dt + s_T^{\top} \tilde{P} s_T \right],$ (53)
where $\tilde{Q}(t) \succeq O$, $\tilde{R}(t) \succ O$, and $\tilde{P} \succeq O$. This objective function (53) includes the previous objective function (49) as a special case because it can be represented as follows:
$J[\tilde{u}] = \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[ \int_0^T \left( s_t^{\top} \begin{pmatrix} Q & O \\ O & O \end{pmatrix} s_t + \tilde{u}_t^{\top} \begin{pmatrix} R & O \\ O & M \end{pmatrix} \tilde{u}_t \right) dt + s_T^{\top} \begin{pmatrix} P & O \\ O & O \end{pmatrix} s_T \right].$ (54)
The objective of the LQG problem with memory limitation is to find the optimal control function $\tilde{u}^*$ that minimizes the objective function $J[\tilde{u}]$, as follows:
$\tilde{u}^* := \operatorname*{argmin}_{\tilde{u}} J[\tilde{u}].$ (55)
In the following subsection, we mainly consider the formulation of this subsection rather than that of the previous subsection because it is simpler and more general. Moreover, we omit · ˜ for notational simplicity.

6.3. Derivation of Optimal Control Function

In this subsection, we derive the optimal control function of the LQG problem with memory limitation by applying Theorem 1. In the LQG problem with memory limitation, the probability of the extended state $s$ at time $t$ is provided by the Gaussian distribution $p_t(s) = \mathcal{N}(s \mid \mu(t), \Sigma(t))$. By defining the stochastic extended state $\hat{s} := s - \mu$, $\mathbb{E}_{p_t(x|z)}[s]$ is provided as follows:
$\mathbb{E}_{p_t(x|z)}[s] = K(t)\hat{s} + \mu(t),$ (56)
where $K(t)$ is defined by
$K(t) := \begin{pmatrix} O & \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t) \\ O & I \end{pmatrix}.$ (57)
By applying Theorem 1 to the LQG problem with memory limitation, we obtain the following theorem:
Theorem 3. 
In the LQG problem with memory limitation, the optimal control function of ML-POSC is provided by
$u^*(t, z) = -R^{-1} B^{\top} \left( \Pi K \hat{s} + \Psi \mu \right),$ (58)
where $K(t)$ (57) depends on $\Sigma(t)$, and $\mu(t)$ and $\Sigma(t)$ are the solutions of the following ordinary differential equations:
$\frac{d\mu}{dt} = \left( A - B R^{-1} B^{\top} \Psi \right) \mu,$ (59)
$\frac{d\Sigma}{dt} = \sigma\sigma^{\top} + \left( A - B R^{-1} B^{\top} \Pi K \right) \Sigma + \Sigma \left( A - B R^{-1} B^{\top} \Pi K \right)^{\top},$ (60)
where $\mu(0) = \mu_0$ and $\Sigma(0) = \Sigma_0$, while $\Psi(t)$ and $\Pi(t)$ are the solutions of the following ordinary differential equations:
$-\frac{d\Psi}{dt} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi,$ (61)
$-\frac{d\Pi}{dt} = Q + A^{\top}\Pi + \Pi A - \Pi B R^{-1} B^{\top} \Pi + (I - K)^{\top} \Pi B R^{-1} B^{\top} \Pi (I - K),$ (62)
where $\Psi(T) = \Pi(T) = P$.
Proof. 
The proof is shown in Appendix E. □
Here, (61) is the Riccati equation [9,10,23], which appears in the LQG problem without memory limitation as well (37). In contrast, (62) is a new equation of the LQG problem with memory limitation, which in this paper we call the partially observable Riccati equation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati Equation (61) for control is modified to include estimation, which corresponds to the partially observable Riccati Equation (62). As a result, the partially observable Riccati Equation (62) is able to improve estimation as well as control.
In order to support this interpretation, we analyze the partially observable Riccati Equation (62) by comparing it with the Riccati Equation (61). Because only the last term of (62) is different from (61), we denote it as follows:
$Q' := (I - K)^{\top} \Pi B R^{-1} B^{\top} \Pi (I - K).$ (63)
$Q'$ can be calculated as follows:
$Q' = \begin{pmatrix} P'_{xx} & -P'_{xx}\Sigma_{xz}\Sigma_{zz}^{-1} \\ -\Sigma_{zz}^{-1}\Sigma_{zx}P'_{xx} & \Sigma_{zz}^{-1}\Sigma_{zx}P'_{xx}\Sigma_{xz}\Sigma_{zz}^{-1} \end{pmatrix},$ (64)
where $P'_{xx} := \left( \Pi B R^{-1} B^{\top} \Pi \right)_{xx}$. Because $P'_{xx} \succeq O$ and $\Sigma_{zz}^{-1}\Sigma_{zx}P'_{xx}\Sigma_{xz}\Sigma_{zz}^{-1} \succeq O$, $\Pi_{xx}$ and $\Pi_{zz}$ may be larger than $\Psi_{xx}$ and $\Psi_{zz}$, respectively. Because $\Pi_{xx}$ and $\Pi_{zz}$ are the negative feedback gains of the state $x$ and the memory $z$, respectively, $Q'$ may decrease $\Sigma_{xx}$ and $\Sigma_{zz}$. Moreover, when $\Sigma_{xz}$ is positive/negative, $\Pi_{xz}$ may be smaller/larger than $\Psi_{xz}$, which may increase/decrease $\Sigma_{xz}$. A similar discussion is possible for $\Sigma_{zx}$, $\Pi_{zx}$, and $\Psi_{zx}$, as $\Sigma$, $\Pi$, and $\Psi$ are symmetric matrices. As a result, $Q'$ may decrease the following conditional covariance matrix:
$\Sigma_{x|z} := \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx},$ (65)
which corresponds to the estimation error of the state from the memory. Therefore, the partially observable Riccati Equation (62) may improve estimation as well as control, which is different from the Riccati Equation (61).
Because the problem in Section 6.1 is more specialized than that in Section 6.2, we can carry out a more specific discussion. In the problem in Section 6.1, $\Psi_{xx}$ is the same as the solution of the Riccati equation of the conventional POSC (37), and $\Psi_{xz} = O$, $\Psi_{zx} = O$, and $\Psi_{zz} = O$ are satisfied. As a result, the memory control does not appear in the Riccati equation of ML-POSC (61). In contrast, because of the last term of the partially observable Riccati Equation (62), $\Pi_{xx}$ is not the solution of the Riccati Equation (37), and $\Pi_{xz} \neq O$, $\Pi_{zx} \neq O$, and $\Pi_{zz} \neq O$ are satisfied. As a result, the memory control appears in the partially observable Riccati Equation (62), which may improve the state estimation.

6.4. Comparison with Completely Observable Stochastic Control

In this subsection, we compare ML-POSC with COSC of the extended state. By applying Proposition 2 in the LQG problem, the optimal control function of COSC of the extended state can be obtained as follows:
Proposition 4 
([10,23]). In the LQG problem, the optimal control function of COSC of the extended state is provided by
$u^*(t, s) = -R^{-1} B^{\top} \Psi s = -R^{-1} B^{\top} \left( \Psi \hat{s} + \Psi \mu \right),$ (66)
where $\Psi(t)$ is the solution of the Riccati Equation (61).
Proof. 
The proof is shown in [10,23]. □
The optimal control function of COSC of the extended state (66) can be derived intuitively from that of ML-POSC (58). In ML-POSC, $K\hat{s} = \mathbb{E}_{p_t(x|z)}[\hat{s}]$ is the estimator of the stochastic extended state. In COSC of the extended state, because the stochastic extended state is completely observable, its estimator is provided by $\hat{s}$, which corresponds to $K = I$. By changing the definition of $K$ from (57) to $K = I$, the partially observable Riccati Equation (62) is reduced to the Riccati Equation (61), and the optimal control function of ML-POSC (58) is reduced to that of COSC (66). As a result, the optimal control function of ML-POSC (58) can be interpreted as a generalization of that of COSC (66).
While the second term is the same between (58) and (66), the first term is different. The second term is the control of the expected extended state $\mu$, which does not depend on the realization. In contrast, the first term is the control of the stochastic extended state $\hat{s}$, which depends on the realization. The first term differs in two ways: (i) the estimators of the stochastic extended state in COSC and ML-POSC are provided by $\hat{s}$ and $K\hat{s} = \mathbb{E}_{p_t(x|z)}[\hat{s}]$, respectively, which is reasonable because ML-POSC needs to estimate the state from the memory; and (ii) the control gains of the stochastic extended state in COSC and ML-POSC are provided by $\Psi$ and $\Pi$, respectively. While $\Psi$ improves only control, $\Pi$ improves estimation as well as control.

6.5. Numerical Algorithm

In the LQG problem, the partial differential equations are reduced to the ordinary differential equations. The FP Equation (23) is reduced to (59) and (60), and the HJB Equation (28) is reduced to (61) and (62). As a result, the optimal control function (58) can be obtained more easily in the LQG problem.
The Riccati Equation (61) can be solved backwards from the terminal condition. In contrast, the partially observable Riccati Equation (62) cannot be solved in the same way as the Riccati Equation (61), as it depends on the forward equation of Σ (60) through K (57). Because the forward equation of Σ (60) depends on the backward equation of Π (62) as well, they must be solved simultaneously.
A similar problem appears in the mean-field game and control, and numerous numerical methods have been developed to deal with it [33]. In this paper, we solve the system of (60) and (62) using the forward–backward sweep method, which computes (60) and (62) alternately [33,34]. In ML-POSC, the convergence of the forward–backward sweep method is guaranteed [34].
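A minimal sketch of this forward–backward sweep for the coupled equations (60) and (62) is given below, assuming a scalar state and a scalar memory (an extended state of dimension two) with illustrative system matrices patterned after the experiment in Section 7.1; the step size and number of sweeps are arbitrary choices for illustration.

```python
import numpy as np

# Forward-backward sweep for the coupled equations (60) and (62) in the LQG
# problem with memory limitation, with scalar state and memory.  The matrices
# follow the extended-state form (52); all numerical values are illustrative.

A = np.array([[1.0, 0.0], [1.0, 0.0]])              # [[A, O], [kappa*H, O]]
B = np.array([[1.0, 0.0], [0.0, 1.0]])              # [[B, O], [O, I]]
sig = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # [[sigma, O, O], [O, kappa*gamma, eta]]
Q = np.diag([1.0, 0.0]); R = np.diag([1.0, 1.0]); P = np.zeros((2, 2))
Sigma0 = np.eye(2)
T, dt = 10.0, 1e-2
n = int(T / dt)
Rinv = np.linalg.inv(R)
I2 = np.eye(2)

def K_of(Sigma):
    # Estimator gain K(t) from (57), built from the joint covariance Sigma(t).
    return np.array([[0.0, Sigma[0, 1] / Sigma[1, 1]], [0.0, 1.0]])

Sigma = np.array([Sigma0.copy() for _ in range(n + 1)])
Pi = np.array([P.copy() for _ in range(n + 1)])

for sweep in range(50):
    # Backward pass: partially observable Riccati equation (62), Pi(T) = P.
    for k in range(n, 0, -1):
        K = K_of(Sigma[k])
        G = Pi[k] @ B @ Rinv @ B.T @ Pi[k]
        dPi = Q + A.T @ Pi[k] + Pi[k] @ A - G + (I2 - K).T @ G @ (I2 - K)
        Pi[k - 1] = Pi[k] + dPi * dt
    # Forward pass: covariance equation (60), Sigma(0) = Sigma0.
    for k in range(n):
        K = K_of(Sigma[k])
        Acl = A - B @ Rinv @ B.T @ Pi[k] @ K
        dSigma = sig @ sig.T + Acl @ Sigma[k] + Sigma[k] @ Acl.T
        Sigma[k + 1] = Sigma[k] + dSigma * dt

print(Pi[0]); print(Sigma[n])
```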

7. Numerical Experiments

In this section, we demonstrate the effectiveness of ML-POSC using numerical experiments on the LQG problem with memory limitation as well as on the non-LQG problem.

7.1. LQG Problem with Memory Limitation

In this subsection, we show the significance of the partially observable Riccati Equation (62) by a numerical experiment on the LQG problem with memory limitation. We consider the state $x_t \in \mathbb{R}$, the observation $y_t \in \mathbb{R}$, and the memory $z_t \in \mathbb{R}$, which evolve by the following SDEs:
$dx_t = \left( x_t + u_t \right) dt + d\omega_t,$ (67)
$dy_t = x_t\,dt + d\nu_t,$ (68)
$dz_t = v_t\,dt + dy_t,$ (69)
where $x_0$ and $z_0$ obey standard Gaussian distributions, $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t, z_t) \in \mathbb{R}$ and $v_t = v(t, z_t) \in \mathbb{R}$ are the controls. The objective function to be minimized is provided as follows:
$J[u, v] := \mathbb{E}\left[ \int_0^{10} \left( x_t^2 + u_t^2 + v_t^2 \right) dt \right].$ (70)
Therefore, the objective of this problem is to minimize the state variance using small state and memory controls. Because this problem includes the memory control cost, it corresponds to the LQG problem with memory limitation.
Figure 2a–c shows the trajectories of $\Psi$ and $\Pi$; $\Pi_{xx}$ and $\Pi_{zz}$ are larger than $\Psi_{xx}$ and $\Psi_{zz}$, respectively, and $\Pi_{xz}$ is smaller than $\Psi_{xz}$, which is consistent with our discussion in Section 6.3. Therefore, the partially observable Riccati equation may reduce the estimation error of the state from the memory. Moreover, while the memory control does not appear in the Riccati equation ($\Psi_{xz} = \Psi_{zz} = 0$), it appears in the partially observable Riccati equation ($\Pi_{xz} \neq 0$, $\Pi_{zz} \neq 0$), which is consistent with our discussion in Section 6.3. As a result, the memory control plays an important role in estimating the state from the memory.
In order to clarify the significance of the partially observable Riccati Equation (62), we compare the performance of the optimal control function (58) with that of the following control function:
$u_{\Psi}(t, z) = -R^{-1} B^{\top} \left( \Psi K \hat{s} + \Psi \mu \right),$ (71)
in which $\Pi$ is replaced with $\Psi$. This result is shown in Figure 2d–f. With the control function (71), the distributions of the state and the memory are unstable, and the cumulative cost diverges. By contrast, with the optimal control function (58), the distributions of the state and the memory are stable, and the cumulative cost is smaller. This result indicates that the partially observable Riccati Equation (62) plays an important role in the LQG problem with memory limitation.

7.2. Non-LQG Problem

In this subsection, we investigate the potential effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation of the conventional POSC [3,4]. We consider the state $x_t \in \mathbb{R}$ and the observation $y_t \in \mathbb{R}$, which evolve according to the following SDEs:
$dx_t = u_t\,dt + d\omega_t,$ (72)
$dy_t = x_t\,dt + d\nu_t,$ (73)
where $x_0$ obeys the Gaussian distribution $p_0(x_0) = \mathcal{N}(x_0 \mid 0, 0.01)$, $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t, y_{0:t}) \in \mathbb{R}$ is the control. The objective function to be minimized is provided as follows:
$J[u] := \mathbb{E}\left[ \int_0^1 \left( Q(t, x_t) + u_t^2 \right) dt + 10 x_1^2 \right],$ (74)
where
$Q(t, x) := \begin{cases} 1000 & (0.3 \leq t \leq 0.6,\ 0.1 \leq |x| \leq 2.0), \\ 0 & (\text{otherwise}). \end{cases}$ (75)
The cost function is high on the black rectangles in Figure 3a, which represent the obstacles. In addition, the terminal cost function is the lowest on the black cross in Figure 3a, which represents the desirable goal. Therefore, the system should avoid the obstacles and reach the goal with the small control. Because the cost function is non-quadratic, it is a non-LQG problem, which cannot be solved exactly by the conventional POSC.
In the local LQG approximation of the conventional POSC [3,4], the Zakai equation and the Bellman equation are locally approximated by the Kalman filter and the Riccati equation, respectively. Because the Bellman equation is reduced to the Riccati equation, the local LQG approximation can be solved numerically even in the non-LQG problem.
ML-POSC determines the control $u_t \in \mathbb{R}$ based on the memory $z_t \in \mathbb{R}$, i.e., $u_t = u(t, z_t)$. The memory dynamics is formulated with the following SDE:
$dz_t = dy_t,$ (76)
where $p_0(z_0) = \mathcal{N}(z_0 \mid 0, 0.01)$. For the sake of simplicity, the memory control is not considered.
Figure 3 shows the numerical result comparing the local LQG approximation and ML-POSC. Because the local LQG approximation reduces the Bellman equation to the Riccati equation by ignoring non-LQG information, it cannot avoid the obstacles, which results in a higher objective function. In contrast, because ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information, it can avoid the obstacles, which results in a lower objective function. Therefore, our numerical experiment shows that ML-POSC can be superior to the local LQG approximation.
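For completeness, the sketch below shows how a candidate memory-feedback controller can be scored on this example by Monte Carlo evaluation of the objective (74) under the dynamics (72), (73), and (76). The linear controller passed in at the end is only a hypothetical baseline for illustration; it is not the ML-POSC or local LQG solution compared in Figure 3.

```python
import numpy as np

# Monte Carlo evaluation of the objective (74) for the non-LQG example, for any
# memory-feedback controller u(t, z) with memory dynamics dz = dy (76).  The
# linear controller at the bottom is a hypothetical baseline, not a result.

def stage_cost(t, x):
    # Obstacle cost (75): 1000 inside the obstacle region, 0 elsewhere.
    return 1000.0 if (0.3 <= t <= 0.6) and (0.1 <= abs(x) <= 2.0) else 0.0

def evaluate(u_ctrl, n_paths=200, dt=1e-3, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_paths):
        x = rng.normal(scale=0.1)          # x_0 ~ N(0, 0.01)
        z = rng.normal(scale=0.1)          # z_0 ~ N(0, 0.01)
        J = 0.0
        for k in range(int(T / dt)):
            t = k * dt
            u = u_ctrl(t, z)
            J += (stage_cost(t, x) + u ** 2) * dt
            dy = x * dt + rng.normal(scale=np.sqrt(dt))   # observation increment (73)
            x += u * dt + rng.normal(scale=np.sqrt(dt))   # state update (72)
            z += dy                                       # memory update (76)
        total += J + 10.0 * x ** 2                        # terminal cost in (74)
    return total / n_paths

print(evaluate(lambda t, z: -2.0 * z))     # hypothetical linear baseline
```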

8. Discussion

In this work, we propose ML-POSC, which is an alternative theoretical framework to the conventional POSC. ML-POSC first formulates the finite-dimensional and stochastic memory dynamics explicitly, then optimizes the memory dynamics considering the memory cost. As a result, unlike the conventional POSC, ML-POSC can consider memory limitation as well as incomplete information. Furthermore, because the optimal control function of ML-POSC is obtained by solving the system of HJB-FP equations, ML-POSC can be solved in practice even in non-LQG problems. ML-POSC can generalize the LQG problem to include memory limitation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation can be modified to the partially observable Riccati equation, which improves estimation as well as control. Furthermore, ML-POSC can provide a better result than the local LQG approximation in a non-LQG problem, as ML-POSC reduces the Bellman equation while maintaining non-LQG information.
ML-POSC is effective for the state estimation problem as well, which is a part of the POSC problem. Although the state estimation problem can be solved in principle by the Zakai equation [38,39,40], it cannot be solved directly, as the Zakai equation is infinite-dimensional. In order to resolve this problem, a particle filter is often used to approximate the infinite-dimensional Zakai equation as a finite number of particles [38,39,40]. However, because the performance of the particle filter is guaranteed only in the limit of a large number of particles, a particle filter may not be practical in cases where the available memory size is severely limited. Furthermore, a particle filter cannot take the memory noise and cost into account. ML-POSC resolves these problems, as it can optimize the state estimation under memory limitation.
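For reference, the following sketch shows a bootstrap particle filter of the kind referred to above, for a scalar model discretized with step dt. The drift functions, parameters, and the particle count are illustrative assumptions, not taken from the paper; the point is that the posterior is carried by a finite but typically large set of particles.

```python
import numpy as np

# Bootstrap particle filter sketch for a scalar filtering model
# dx = a(x) dt + sigma dW, dy = h(x) dt + gamma dV, discretized with step dt.
# Model functions and parameters are illustrative assumptions.

def particle_filter(dy_increments, dt, n_p=1000, sigma=1.0, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    a = lambda x: -x                       # state drift (assumed)
    h = lambda x: x                        # observation drift (assumed)
    particles = rng.normal(size=n_p)       # samples from the prior p_0(x) (assumed)
    means = []
    for dy in dy_increments:
        # Propagate each particle through the state dynamics.
        particles = particles + a(particles) * dt \
                    + sigma * rng.normal(scale=np.sqrt(dt), size=n_p)
        # Reweight by the likelihood of the observation increment dy.
        logw = -0.5 * (dy - h(particles) * dt) ** 2 / (gamma ** 2 * dt)
        w = np.exp(logw - logw.max()); w /= w.sum()
        # Resample to avoid weight degeneracy.
        particles = rng.choice(particles, size=n_p, p=w)
        means.append(particles.mean())
    return np.array(means)

# Example with synthetic pure-noise observation increments (assumed):
dt = 1e-2
rng = np.random.default_rng(1)
dy_obs = rng.normal(scale=np.sqrt(dt), size=100)
print(particle_filter(dy_obs, dt)[-1])
```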
ML-POSC may be extended from a single-agent system to a multi-agent system. POSC of a multi-agent system is called decentralized stochastic control (DSC) [41,42,43], which consists of a system and multiple controllers. In DSC, each controller needs to estimate the controls of the other controllers as well as the state of the system, which is essentially different from the conventional POSC. Because the estimation among the controllers is generally intractable, the conventional POSC approach cannot be straightforwardly extended to DSC. In contrast, ML-POSC compresses the observation history into the finite-dimensional memory, which simplifies the estimation among the controllers. Therefore, ML-POSC may provide an effective approach to DSC. Indeed, the finite-state controller, whose idea is similar to that of ML-POSC, plays a key role in extending POMDP from a single-agent system to a multi-agent system [22,44,45,46,47,48]. ML-POSC may be extended to a multi-agent system in a similar way to the finite-state controller.
ML-POSC can be naturally extended to the mean-field control setting [28,29,30] because ML-POSC is solved based on the mean-field control theory. Therefore, ML-POSC can be applied to an infinite number of homogeneous agents. Furthermore, ML-POSC can be extended to a risk-sensitive setting, as this is a special case of the mean-field control setting [28,29,30]. Therefore, ML-POSC can consider the variance of the cost as well as its expectation.
Nonetheless, more efficient algorithms are needed in order to solve ML-POSC with a high-dimensional state and memory. In the mean-field game and control, neural network-based algorithms have recently been proposed which can solve high-dimensional problems efficiently [49,50]. By extending these algorithms, it might be possible to solve high-dimensional ML-POSC efficiently. Furthermore, unlike the mean-field game and control, the coupling of HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this property, more efficient algorithms for ML-POSC may be proposed [34].

Author Contributions

Conceptualization, Formal analysis, Funding acquisition, Writing—original draft: T.T. and T.J.K.; Software, Visualization: T.T. All authors have read and agreed to the published version of the manuscript.

Funding

The first author received a JSPS Research Fellowship (Grant No. 21J20436). This work was supported by JSPS KAKENHI (Grant No. 19H05799) and JST CREST (Grant No. JPMJCR2011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Kenji Kashima and Kaito Ito for useful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
COSC: Completely Observable Stochastic Control
POSC: Partially Observable Stochastic Control
ML-POSC: Memory-Limited Partially Observable Stochastic Control
POMDP: Partially Observable Markov Decision Process
DSC: Decentralized Stochastic Control
LQG: Linear-Quadratic-Gaussian
HJB: Hamilton–Jacobi–Bellman
FP: Fokker–Planck
SDE: Stochastic Differential Equation

Appendix A. Proof of Lemma 1

We define the value function $V(t, p)$ as follows:
$V(t, p) := \min_{u_{t:T}} \left[ \int_t^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right],$ (A1)
where $\{ p_\tau \mid \tau \in [t, T] \}$ is the solution of the FP Equation (23) with $p_t = p$. Then, $V(t, p)$ can be calculated as follows:
$V(t, p) = \min_u \left[ \bar{f}(t, p, u)\,dt + V(t + dt, p + \mathcal{L}^{\dagger} p\,dt) \right] = \min_u \left[ \bar{f}(t, p, u)\,dt + V(t, p) + \frac{\partial V(t, p)}{\partial t}\,dt + \int \frac{\delta V(t, p)}{\delta p(s)}\,\mathcal{L}^{\dagger} p(s)\,ds\,dt \right].$ (A2)
By rearranging the above equation, the following equation is obtained:
$-\frac{\partial V(t, p)}{\partial t} = \min_u \left[ \bar{f}(t, p, u) + \int \frac{\delta V(t, p)}{\delta p(s)}\,\mathcal{L}^{\dagger} p(s)\,ds \right].$ (A3)
Because
$\int \frac{\delta V(t, p)}{\delta p(s)}\,\mathcal{L}^{\dagger} p(s)\,ds = \int p(s)\,\mathcal{L}\,\frac{\delta V(t, p)}{\delta p(s)}\,ds,$ (A4)
the following equation is obtained:
$-\frac{\partial V(t, p)}{\partial t} = \min_u \int p(s)\left[ f(t, s, u) + \mathcal{L}\,\frac{\delta V(t, p)}{\delta p(s)} \right] ds.$ (A5)
From the definition of the Hamiltonian $H$ (10), the following Bellman equation is obtained:
$-\frac{\partial V(t, p)}{\partial t} = \min_u \mathbb{E}_{p(s)}\left[ H\!\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right) \right].$ (A6)
Because the control $u$ is a function of the memory $z$ in ML-POSC, the minimization with respect to $u$ can be exchanged with the expectation with respect to $p(z)$ as follows:
$-\frac{\partial V(t, p)}{\partial t} = \mathbb{E}_{p(z)}\left[ \min_u \mathbb{E}_{p(x|z)}\left[ H\!\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right) \right] \right].$ (A7)
Because the optimal control function is provided by the right-hand side of the Bellman Equation (A7) [10], the optimal control function is provided by
$u^*(t, z, p) = \operatorname*{argmin}_u \mathbb{E}_{p(x|z)}\left[ H\!\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right) \right].$ (A8)
Because the FP Equation (23) is deterministic, the optimal control function is provided by $u^*(t, z) = u^*(t, z, p_t)$.

Appendix B. Proof of Theorem 1

We first define
$W(t,p,s) := \frac{\delta V(t,p)}{\delta p(s)},$  (A9)
which satisfies $W(T,p,s) = g(s)$. Differentiating the Bellman Equation (26) with respect to p, the following equation is obtained:
$-\frac{\partial W(t,p,s)}{\partial t} = H\!\left(t,s,u^*,W\right) + E_{p(s')} \left[ \mathcal{L} \frac{\delta W(t,p,s)}{\delta p(s')} \right].$  (A10)
Because
$\int p(s')\, \mathcal{L} \frac{\delta W(t,p,s)}{\delta p(s')}\, ds' = \int \frac{\delta W(t,p,s)}{\delta p(s')}\, \mathcal{L}^{\dagger} p(s')\, ds',$  (A11)
the following equation is obtained:
$-\frac{\partial W(t,p,s)}{\partial t} = H\!\left(t,s,u^*,W\right) + \int \frac{\delta W(t,p,s)}{\delta p(s')}\, \mathcal{L}^{\dagger} p(s')\, ds'.$  (A12)
We then define
$w(t,s) := W(t,p_t,s),$  (A13)
where $p_t$ is the solution of the FP Equation (23). The time derivative of $w(t,s)$ can be calculated as follows:
$\frac{\partial w(t,s)}{\partial t} = \frac{\partial W(t,p_t,s)}{\partial t} + \int \frac{\delta W(t,p_t,s)}{\delta p(s')}\, \frac{\partial p_t(s')}{\partial t}\, ds'.$  (A14)
By substituting (A12) into (A14), the following equation is obtained:
$-\frac{\partial w(t,s)}{\partial t} = H\!\left(t,s,u^*,w\right) - \int \frac{\delta W(t,p_t,s)}{\delta p(s')} \underbrace{\left[ \frac{\partial p_t(s')}{\partial t} - \mathcal{L}^{\dagger} p_t(s') \right]}_{(*)} ds'.$  (A15)
From the FP Equation (23), $(*) = 0$ holds. Therefore, the HJB Equation (28) is obtained.
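For completeness, the conclusion of this proof can be restated in the notation of this appendix; the terminal condition follows from $W(T,p,s) = g(s)$. This is only a restatement of Equation (28), not an additional result.

```latex
% HJB equation obtained above, together with its terminal condition:
-\frac{\partial w(t,s)}{\partial t} = H\bigl(t,s,u^{*},w\bigr),
\qquad
w(T,s) = g(s).
```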

Appendix C. Proof of Proposition 2

From the proof of Lemma 1 (Appendix A), the Bellman Equation (A6) is obtained. Because the control u is a function of the extended state s in the COSC of the extended state, the minimization with respect to u can be exchanged with the expectation over p(s) as follows:
$-\frac{\partial V(t,p)}{\partial t} = E_{p(s)} \left[ \min_u H\!\left(t,s,u,\frac{\delta V(t,p)}{\delta p(s)}\right) \right].$  (A16)
Because the optimal control function minimizes the right-hand side of the Bellman Equation (A16) [10], it is given by
$u^*(t,s,p) = \operatorname*{argmin}_u H\!\left(t,s,u,\frac{\delta V(t,p)}{\delta p(s)}\right).$  (A17)
Because the FP Equation (23) is deterministic, the optimal control function is given by $u^*(t,s) = u^*(t,s,p_t)$. The rest of the proof is the same as the proof of Theorem 1 (Appendix B).

Appendix D. Proof of Theorem 2

From Theorem 1, the optimal control functions $u^*$, $v^*$, and $\kappa^*$ are provided by the minimization of the conditional expectation of the Hamiltonian, as follows:
$\left(u^*(t,z),\, v^*(t,z),\, \kappa^*(t,z)\right) = \operatorname*{argmin}_{u,v,\kappa} E_{p_t(x|z)} \left[ H(t,s,u,v,\kappa,w) \right].$  (A18)
In the LQG problem of the conventional POSC, the Hamiltonian (10) is given by
$H(t,s,u,v,\kappa,w) = x^{\top} Q x + u^{\top} R u + \left(\frac{\partial w(t,s)}{\partial x}\right)^{\!\top} \left(Ax + Bu\right) + \left(\frac{\partial w(t,s)}{\partial z}\right)^{\!\top} \left(v + \kappa H x\right) + \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial^2 w(t,s)}{\partial x \partial x^{\top}}\, \sigma \sigma^{\top}\right) + \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial^2 w(t,s)}{\partial z \partial z^{\top}}\, \kappa \gamma \gamma^{\top} \kappa^{\top}\right).$  (A19)
From
$\frac{\partial}{\partial u} E_{p_t(x|z)}\left[H\right] = 2Ru + B^{\top} E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial x} \right],$  (A20)
$\frac{\partial}{\partial v} E_{p_t(x|z)}\left[H\right] = E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z} \right],$  (A21)
$\frac{\partial}{\partial \kappa} E_{p_t(x|z)}\left[H\right] = E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z}\, x^{\top} \right] H^{\top} + E_{p_t(x|z)} \left[ \frac{\partial^2 w(t,s)}{\partial z \partial z^{\top}} \right] \kappa \gamma \gamma^{\top},$  (A22)
the optimal control functions are given by
$u^*(t,z) = -\frac{1}{2} R^{-1} B^{\top} E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial x} \right],$  (A23)
$v^*(t,z) = \begin{cases} +\infty, & E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z} \right] < 0, \\ \mathrm{arbitrary}, & E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z} \right] = 0, \\ -\infty, & E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z} \right] > 0, \end{cases}$  (A24)
$\kappa^*(t,z) = -\left( E_{p_t(x|z)} \left[ \frac{\partial^2 w(t,s)}{\partial z \partial z^{\top}} \right] \right)^{-1} E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z}\, x^{\top} \right] H^{\top} \left(\gamma \gamma^{\top}\right)^{-1},$  (A25)
where (A24) arises because the Hamiltonian is linear in v, so the minimizer diverges elementwise unless its coefficient $E_{p_t(x|z)}[\partial w(t,s)/\partial z]$ vanishes.
We assume that $p_t(s)$ is given by the Gaussian distribution
$p_t(s) = \mathcal{N}\!\left(s \mid \mu(t), \Sigma(t)\right),$  (A26)
and $w(t,s)$ is given by the quadratic function
$w(t,s) = x^{\top} \Psi(t)\, x + (x-z)^{\top} \Phi(t)\, (x-z) + \beta(t).$  (A27)
From the initial condition of the FP equation,
$\mu(0) = \begin{pmatrix} \mu_x(0) \\ \mu_z(0) \end{pmatrix} = \begin{pmatrix} \mu_{x,0} \\ \mu_{x,0} \end{pmatrix},$  (A28)
$\Sigma(0) = \begin{pmatrix} \Sigma_{xx}(0) & \Sigma_{xz}(0) \\ \Sigma_{zx}(0) & \Sigma_{zz}(0) \end{pmatrix} = \begin{pmatrix} \Sigma_{xx,0} & O \\ O & O \end{pmatrix}$  (A29)
are satisfied. From the terminal condition of the HJB equation, $\Psi(T) = P$, $\Phi(T) = O$, and $\beta(T) = 0$ are satisfied. In this case, $u^*(t,z)$, $E_{p_t(x|z)}[\partial w(t,s)/\partial z]$, and $\kappa^*(t,z)$ can be calculated as follows:
$u^*(t,z) = -R^{-1} B^{\top} \left[ \left(\Psi + \Phi\right) \mu_{x|z} - \Phi z \right],$  (A30)
$E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z} \right] = 2 \Phi \left(z - \mu_{x|z}\right),$  (A31)
$\kappa^*(t,z) = \left[ \Sigma_{x|z} + \left(\mu_{x|z} - z\right) \mu_{x|z}^{\top} \right] H^{\top} \left(\gamma \gamma^{\top}\right)^{-1}.$  (A32)
We then assume that the following equations are satisfied:
$\mu_x = \mu_z,$  (A33)
$\Sigma_{zz} = \Sigma_{xz}.$  (A34)
In this case, $\mu_{x|z}$, $\Sigma_{x|z}$, $u^*(t,z)$, $E_{p_t(x|z)}[\partial w(t,s)/\partial z]$, and $\kappa^*(t,z)$ can be calculated as follows:
$\mu_{x|z} = z,$  (A35)
$\Sigma_{x|z} = \Sigma_{xx} - \Sigma_{zz},$  (A36)
$u^*(t,z) = -R^{-1} B^{\top} \Psi z,$  (A37)
$E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial z} \right] = 0,$  (A38)
$\kappa^*(t,z) = \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1}.$  (A39)
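Equations (A35) and (A36) follow from the standard conditional Gaussian formulas; under the assumptions (A33) and (A34) they reduce as follows (a clarifying remark, not part of the original derivation):

```latex
\mu_{x|z} = \mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}\,(z - \mu_z)
          = \mu_x + (z - \mu_z) = z,
\qquad
\Sigma_{x|z} = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}
             = \Sigma_{xx} - \Sigma_{zz}.
```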
Because $v^*(t,z)$ is arbitrary when $E_{p_t(x|z)}[\partial w(t,s)/\partial z] = 0$, we choose $v^*(t,z)$ as follows:
$v^*(t,z) = \left( A - B R^{-1} B^{\top} \Psi - \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H \right) z.$  (A40)
In this case, the extended state SDE is given by the following equation:
$ds_t = \tilde{A}(t)\, s_t\, dt + \tilde{\sigma}(t)\, d\tilde{\omega}_t,$  (A41)
where $p_0(s) = \mathcal{N}\!\left(s \mid \mu(0), \Sigma(0)\right)$, and
$\tilde{A} := \begin{pmatrix} A & -B R^{-1} B^{\top} \Psi \\ \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H & A - B R^{-1} B^{\top} \Psi - \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H \end{pmatrix},$  (A42)
$\tilde{\sigma} := \begin{pmatrix} \sigma & O \\ O & \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} \gamma \end{pmatrix}, \qquad d\tilde{\omega}_t := \begin{pmatrix} d\omega_t \\ d\nu_t \end{pmatrix}.$  (A43)
Because the drift and diffusion coefficients of (A41) are linear and constant with respect to s, respectively, $p_t(s)$ remains Gaussian, which is consistent with our assumption (A26), while $\mu(t)$ and $\Sigma(t)$ evolve according to the following ordinary differential equations:
$\frac{d\mu}{dt} = \tilde{A}\mu,$  (A44)
$\frac{d\Sigma}{dt} = \tilde{\sigma}\tilde{\sigma}^{\top} + \tilde{A}\Sigma + \Sigma\tilde{A}^{\top}.$  (A45)
If $\mu_x = \mu_z$ and $\Sigma_{zz} = \Sigma_{xz}$ are satisfied, $d\mu_x/dt = d\mu_z/dt$ and $d\Sigma_{xz}/dt = d\Sigma_{zz}/dt$ are satisfied as well, which is consistent with our assumptions $\mu_x = \mu_z$ and $\Sigma_{zz} = \Sigma_{xz}$; the computation for the mean is spelled out below.
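As a clarifying check (not in the original text), writing (A44) blockwise with $\mu_x = \mu_z$ shows that the two blocks of $d\mu/dt$ coincide; the covariance blocks can be verified in the same way from (A45):

```latex
\frac{d\mu_x}{dt} = A\mu_x - BR^{-1}B^{\top}\Psi\,\mu_z
                  = \bigl(A - BR^{-1}B^{\top}\Psi\bigr)\mu_x,
\\[2pt]
\frac{d\mu_z}{dt} = \Sigma_{x|z}H^{\top}(\gamma\gamma^{\top})^{-1}H\,\mu_x
  + \bigl(A - BR^{-1}B^{\top}\Psi - \Sigma_{x|z}H^{\top}(\gamma\gamma^{\top})^{-1}H\bigr)\mu_z
                  = \bigl(A - BR^{-1}B^{\top}\Psi\bigr)\mu_x .
```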
From $v^*$ and $\kappa^*$, the dynamics of $\mu_{x|z}(t,z_t) = z_t$ is given by
$dz_t = \left( A - B R^{-1} B^{\top} \Psi \right) z_t\, dt + \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} \left( dy_t - H z_t\, dt \right),$  (A46)
where $z_0 = \mu_{x,0}$. From $d\Sigma_{xx}/dt$ and $d\Sigma_{zz}/dt$, the dynamics of $\Sigma_{x|z} = \Sigma_{xx} - \Sigma_{zz}$ is given by
$\frac{d\Sigma_{x|z}}{dt} = \sigma\sigma^{\top} + A\Sigma_{x|z} + \Sigma_{x|z}A^{\top} - \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H \Sigma_{x|z},$  (A47)
where $\Sigma_{x|z}(0) = \Sigma_{xx,0}$. We note that (A46) and (A47) correspond to the Kalman filter (35) and (36).
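The covariance dynamics (A47) is an ordinary matrix Riccati differential equation and can be integrated forward by any standard ODE scheme. The following minimal Python sketch illustrates this with forward-Euler steps; the matrices A, H, sigma, gamma and the initial covariance are illustrative placeholders, not the values used in the numerical experiments of the paper.

```python
import numpy as np

def integrate_filter_covariance(A, H, sigma, gamma, Sigma0, T=10.0, dt=1e-3):
    """Forward-Euler integration of the filter covariance ODE (A47):
    dS/dt = sigma sigma^T + A S + S A^T - S H^T (gamma gamma^T)^{-1} H S."""
    S = Sigma0.copy()
    R_inv = np.linalg.inv(gamma @ gamma.T)  # (gamma gamma^T)^{-1}
    for _ in range(int(T / dt)):
        dS = sigma @ sigma.T + A @ S + S @ A.T - S @ H.T @ R_inv @ H @ S
        S = S + dt * dS
    return S

# Illustrative placeholder matrices (not taken from the paper).
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
H = np.eye(2)
sigma = 0.1 * np.eye(2)
gamma = 0.2 * np.eye(2)
Sigma0 = np.eye(2)
print(integrate_filter_covariance(A, H, sigma, gamma, Sigma0))
```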
By substituting $w(t,s)$, $u^*(t,z)$, $v^*(t,z)$, and $\kappa^*(t,z)$ into the HJB Equation (28), we obtain the following ordinary differential equations:
$-\frac{d\Psi}{dt} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi,$  (A48)
$-\frac{d\Phi}{dt} = \left( A - \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H \right)^{\!\top} \Phi + \Phi \left( A - \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H \right) + \Psi B R^{-1} B^{\top} \Psi,$  (A49)
$-\frac{d\beta}{dt} = \mathrm{tr}\!\left[ \left(\Psi + \Phi\right) \sigma\sigma^{\top} \right] + \mathrm{tr}\!\left[ \Phi\, \Sigma_{x|z} H^{\top} \left(\gamma \gamma^{\top}\right)^{-1} H \Sigma_{x|z} \right],$  (A50)
where $\Psi(T) = P$, $\Phi(T) = O$, and $\beta(T) = 0$. If $\Psi(t)$, $\Phi(t)$, and $\beta(t)$ satisfy (A48), (A49), and (A50), respectively, the HJB Equation (28) is satisfied, which is consistent with our assumption (A27). We note that (A48) corresponds to the Riccati Equation (37).

Appendix E. Proof of Theorem 3

From Theorem 1, the optimal control function $u^*$ is provided by the minimization of the conditional expectation of the Hamiltonian, as follows:
$u^*(t,z) = \operatorname*{argmin}_u E_{p_t(x|z)} \left[ H(t,s,u,w) \right].$  (A51)
In the LQG problem with memory limitation, the Hamiltonian (10) is given as follows:
$H(t,s,u,w) = s^{\top} Q s + u^{\top} R u + \left(\frac{\partial w(t,s)}{\partial s}\right)^{\!\top} \left(As + Bu\right) + \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial^2 w(t,s)}{\partial s \partial s^{\top}}\, \sigma\sigma^{\top}\right).$  (A52)
From
$\frac{\partial}{\partial u} E_{p_t(x|z)}\left[H(t,s,u,w)\right] = 2Ru + B^{\top} E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial s} \right],$  (A53)
the optimal control function is given by
$u^*(t,z) = -\frac{1}{2} R^{-1} B^{\top} E_{p_t(x|z)} \left[ \frac{\partial w(t,s)}{\partial s} \right].$  (A54)
We assume that $p_t(s)$ is given by the Gaussian distribution
$p_t(s) = \mathcal{N}\!\left(s \mid \mu(t), \Sigma(t)\right),$  (A55)
and $w(t,s)$ is given by the quadratic function
$w(t,s) = s^{\top} \Pi(t)\, s + \alpha^{\top}(t)\, s + \beta(t).$  (A56)
From the initial condition of the FP equation, $\mu(0) = \mu_0$ and $\Sigma(0) = \Sigma_0$ are satisfied. From the terminal condition of the HJB equation, $\Pi(T) = P$, $\alpha(T) = 0$, and $\beta(T) = 0$ are satisfied. In this case, the optimal control function (A54) can be calculated as follows:
$u^*(t,z) = -\frac{1}{2} R^{-1} B^{\top} \left[ 2\Pi K \hat{s} + 2\Pi\mu + \alpha \right],$  (A57)
where we use (56). Because the optimal control function (A57) is linear with respect to $\hat{s}$, $p_t(s)$ remains Gaussian, which is consistent with our assumption (A55).
By substituting (A56) and (A57) into the HJB Equation (28), we obtain the following ordinary differential equations:
$-\frac{d\Pi}{dt} = Q + A^{\top}\Pi + \Pi A - \Pi B R^{-1} B^{\top} \Pi + Q',$  (A58)
$-\frac{d\alpha}{dt} = \left( A - B R^{-1} B^{\top} \Pi \right)^{\!\top} \alpha - 2 Q' \mu,$  (A59)
$-\frac{d\beta}{dt} = \mathrm{tr}\!\left(\Pi\sigma\sigma^{\top}\right) - \frac{1}{4} \alpha^{\top} B R^{-1} B^{\top} \alpha + \mu^{\top} Q' \mu,$  (A60)
where $Q' := \left(I - K\right)^{\top} \Pi B R^{-1} B^{\top} \Pi \left(I - K\right)$. If $\Pi(t)$, $\alpha(t)$, and $\beta(t)$ satisfy (A58), (A59), and (A60), respectively, the HJB Equation (28) is satisfied, which is consistent with our assumption (A56).
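Equation (A58) is the partially observable Riccati equation: it differs from the standard Riccati Equation (A48) only through the additional term $Q'$, which vanishes when $K = I$. The following Python sketch integrates (A58) backward in time by Euler steps in reversed time; the matrices A, B, Q, R, P and the gain K are illustrative placeholders (in the paper, K is determined by Equation (56) and is generally time-dependent), so this is only a structural illustration, not the paper's algorithm.

```python
import numpy as np

def integrate_po_riccati(A, B, Q, R, P, K, T=10.0, dt=1e-3):
    """Backward integration of the partially observable Riccati Equation (A58):
    -dPi/dt = Q + A^T Pi + Pi A - Pi B R^{-1} B^T Pi + Q',
    with Q' = (I - K)^T Pi B R^{-1} B^T Pi (I - K) and Pi(T) = P."""
    n = A.shape[0]
    I = np.eye(n)
    R_inv = np.linalg.inv(R)
    Pi = P.copy()
    for _ in range(int(T / dt)):
        G = Pi @ B @ R_inv @ B.T @ Pi          # Pi B R^{-1} B^T Pi
        Q_prime = (I - K).T @ G @ (I - K)       # memory-limitation correction
        dPi = -(Q + A.T @ Pi + Pi @ A - G + Q_prime)   # dPi/dt
        Pi = Pi - dt * dPi                      # step backward in time
    return Pi

# Illustrative placeholders (not the paper's experimental values).
A = np.array([[0.0, 1.0], [0.0, -0.2]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); P = np.eye(2)
K = np.diag([0.0, 1.0])   # hypothetical gain; K = I recovers the standard Riccati equation
print(integrate_po_riccati(A, B, Q, R, P, K))
```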
By defining $Y(t)$ through $\alpha(t) = 2 Y(t) \mu(t)$, the optimal control function (A57) can be calculated as follows:
$u^*(t,z) = -R^{-1} B^{\top} \left[ \Pi K \hat{s} + \left(\Pi + Y\right) \mu \right].$  (A61)
In this case, $\mu(t)$ obeys the following ordinary differential equation:
$\frac{d\mu}{dt} = \left( A - B R^{-1} B^{\top} \left(\Pi + Y\right) \right) \mu.$  (A62)
From $\alpha(t) = 2 Y(t) \mu(t)$, (A59), and (A62), $Y(t)$ obeys the following ordinary differential equation:
$-\frac{dY}{dt} = \left( A - B R^{-1} B^{\top} \Pi \right)^{\!\top} Y + Y \left( A - B R^{-1} B^{\top} \Pi \right) - Y B R^{-1} B^{\top} Y - Q',$  (A63)
where $Y(T) = O$.
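The step from (A59) and (A62) to (A63) is the product rule applied to $\alpha = 2Y\mu$; written out (as a clarifying remark, not part of the original text), it reads:

```latex
% Product rule for \alpha = 2Y\mu, combined with (A59) and (A62):
2\frac{dY}{dt}\mu
  = \frac{d\alpha}{dt} - 2Y\frac{d\mu}{dt}
  = -2\bigl(A - BR^{-1}B^{\top}\Pi\bigr)^{\top}Y\mu + 2Q'\mu
    - 2Y\bigl(A - BR^{-1}B^{\top}(\Pi+Y)\bigr)\mu .
% Requiring this to hold along the trajectory \mu(t) yields (A63).
```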
By defining $\Psi(t) := \Pi(t) + Y(t)$, the optimal control function (A61) can be calculated as follows:
$u^*(t,z) = -R^{-1} B^{\top} \left[ \Pi K \hat{s} + \Psi \mu \right].$  (A64)
From $\Psi(t) = \Pi(t) + Y(t)$, (A58), and (A63), $\Psi(t)$ obeys the following ordinary differential equation:
$-\frac{d\Psi}{dt} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi,$  (A65)
where $\Psi(T) = P$. Therefore, the optimal control function (58) is obtained.

References

1. Fox, R.; Tishby, N. Minimum-information LQG control Part I: Memoryless controllers. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 5610–5616.
2. Fox, R.; Tishby, N. Minimum-information LQG control Part II: Retentive controllers. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 5603–5609.
3. Li, W.; Todorov, E. An Iterative Optimal Control and Estimation Design for Nonlinear Stochastic System. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 3242–3247.
4. Li, W.; Todorov, E. Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system. Int. J. Control 2007, 80, 1439–1453.
5. Nakamura, K.; Kobayashi, T.J. Connection between the Bacterial Chemotactic Network and Optimal Filtering. Phys. Rev. Lett. 2021, 126, 128102.
6. Nakamura, K.; Kobayashi, T.J. Optimal sensing and control of run-and-tumble chemotaxis. Phys. Rev. Res. 2022, 4, 013120.
7. Pezzotta, A.; Adorisio, M.; Celani, A. Chemotaxis emerges as the optimal solution to cooperative search games. Phys. Rev. E 2018, 98, 042401.
8. Borra, F.; Cencini, M.; Celani, A. Optimal collision avoidance in swarms of active Brownian particles. J. Stat. Mech. Theory Exp. 2021, 2021, 083401.
9. Bensoussan, A. Stochastic Control of Partially Observable Systems; Cambridge University Press: Cambridge, UK, 1992.
10. Yong, J.; Zhou, X.Y. Stochastic Controls; Springer: New York, NY, USA, 1999.
11. Nisio, M. Stochastic Control Theory. In Probability Theory and Stochastic Modelling; Springer: Tokyo, Japan, 2015; Volume 72.
12. Fabbri, G.; Gozzi, F.; Święch, A. Stochastic Optimal Control in Infinite Dimension. In Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2017; Volume 82.
13. Bensoussan, A.; Frehse, J.; Yam, S.C.P. The Master equation in mean field theory. J. de Math. Pures et Appl. 2015, 103, 1441–1474.
14. Bensoussan, A.; Frehse, J.; Yam, S.C.P. On the interpretation of the Master Equation. Stoch. Process. Their Appl. 2017, 127, 2093–2137.
15. Bensoussan, A.; Yam, S.C.P. Mean field approach to stochastic control with partial information. ESAIM Control Optim. Calc. Var. 2021, 27, 89.
16. Hansen, E. An Improved Policy Iteration Algorithm for Partially Observable MDPs. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; Volume 10.
17. Hansen, E.A. Solving POMDPs by Searching in Policy Space. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, USA, 24–26 July 1998; pp. 211–219.
18. Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. 1998, 101, 99–134.
19. Meuleau, N.; Kim, K.E.; Kaelbling, L.P.; Cassandra, A.R. Solving POMDPs by Searching the Space of Finite Policies. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 30 July–1 August 1999; pp. 417–426.
20. Meuleau, N.; Peshkin, L.; Kim, K.E.; Kaelbling, L.P. Learning Finite-State Controllers for Partially Observable Environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 30 July–1 August 1999; pp. 427–436.
21. Poupart, P.; Boutilier, C. Bounded Finite State Controllers. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 16.
22. Amato, C.; Bonet, B.; Zilberstein, S. Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs. Proc. AAAI Conf. Artif. Intell. 2010, 24, 1052–1058.
23. Bensoussan, A. Estimation and Control of Dynamical Systems. In Interdisciplinary Applied Mathematics; Springer International Publishing: Cham, Switzerland, 2018; Volume 48.
24. Laurière, M.; Pironneau, O. Dynamic Programming for Mean-Field Type Control. J. Optim. Theory Appl. 2016, 169, 902–924.
25. Pham, H.; Wei, X. Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM Control Optim. Calc. Var. 2018, 24, 437–461.
26. Kushner, H.J.; Dupuis, P.G. Numerical Methods for Stochastic Control Problems in Continuous Time; Springer: New York, NY, USA, 1992.
27. Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions, 2nd ed.; Applications of Mathematics, Volume 25; Springer: New York, NY, USA, 2006.
28. Bensoussan, A.; Frehse, J.; Yam, P. Mean Field Games and Mean Field Type Control Theory; SpringerBriefs in Mathematics; Springer: New York, NY, USA, 2013.
29. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications I; Probability Theory and Stochastic Modelling, Volume 83; Springer Nature: Cham, Switzerland, 2018.
30. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications II. In Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2018; Volume 84.
31. Achdou, Y. Finite Difference Methods for Mean Field Games. In Hamilton-Jacobi Equations: Approximations, Numerical Analysis and Applications: Cetraro, Italy 2011; Lecture Notes in Mathematics; Loreti, P., Tchou, N.A., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–47.
32. Achdou, Y.; Laurière, M. Mean Field Games and Applications: Numerical Aspects. In Mean Field Games: Cetraro, Italy 2019; Lecture Notes in Mathematics; Cardaliaguet, P., Porretta, A., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 249–307.
33. Laurière, M. Numerical Methods for Mean Field Games and Mean Field Type Control. Mean Field Games 2021, 78, 221.
34. Tottori, T.; Kobayashi, T.J. Pontryagin's Minimum Principle and Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control. arXiv 2022, arXiv:2210.13040.
35. Carlini, E.; Silva, F.J. Semi-Lagrangian schemes for mean field game models. In Proceedings of the 52nd IEEE Conference on Decision and Control, Firenze, Italy, 10–13 December 2013; pp. 3115–3120.
36. Carlini, E.; Silva, F.J. A Fully Discrete Semi-Lagrangian Scheme for a First Order Mean Field Game Problem. SIAM J. Numer. Anal. 2014, 52, 45–67.
37. Carlini, E.; Silva, F.J. A semi-Lagrangian scheme for a degenerate second order mean field game system. Discret. Contin. Dyn. Syst. 2015, 35, 4269.
38. Crisan, D.; Doucet, A. A survey of convergence results on particle filtering methods for practitioners. IEEE Trans. Signal Process. 2002, 50, 736–746.
39. Budhiraja, A.; Chen, L.; Lee, C. A survey of numerical methods for nonlinear filtering problems. Phys. D Nonlinear Phenom. 2007, 230, 27–36.
40. Bain, A.; Crisan, D. Fundamentals of Stochastic Filtering. In Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2009; Volume 60.
41. Nayyar, A.; Mahajan, A.; Teneketzis, D. Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach. IEEE Trans. Autom. Control 2013, 58, 1644–1658.
42. Charalambous, C.D.; Ahmed, N.U. Centralized Versus Decentralized Optimization of Distributed Stochastic Differential Decision Systems With Different Information Structures, Part I: A General Theory. IEEE Trans. Autom. Control 2017, 62, 1194–1209.
43. Charalambous, C.D.; Ahmed, N.U. Centralized Versus Decentralized Optimization of Distributed Stochastic Differential Decision Systems With Different Information Structures, Part II: Applications. IEEE Trans. Autom. Control 2018, 63, 1913–1928.
44. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; SpringerBriefs in Intelligent Systems; Springer International Publishing: Cham, Switzerland, 2016.
45. Bernstein, D.S. Bounded Policy Iteration for Decentralized POMDPs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, UK, 30 July–5 August 2005; pp. 1287–1292.
46. Bernstein, D.S.; Amato, C.; Hansen, E.A.; Zilberstein, S. Policy Iteration for Decentralized Control of Markov Decision Processes. J. Artif. Intell. Res. 2009, 34, 89–132.
47. Amato, C.; Bernstein, D.S.; Zilberstein, S. Optimizing Memory-Bounded Controllers for Decentralized POMDPs. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, 19–22 July 2007; pp. 1–8.
48. Tottori, T.; Kobayashi, T.J. Forward and Backward Bellman Equations Improve the Efficiency of the EM Algorithm for DEC-POMDP. Entropy 2021, 23, 551.
49. Ruthotto, L.; Osher, S.J.; Li, W.; Nurbekyan, L.; Fung, S.W. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci. USA 2020, 117, 9183–9193.
50. Lin, A.T.; Fung, S.W.; Li, W.; Nurbekyan, L.; Osher, S.J. Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games. Proc. Natl. Acad. Sci. USA 2021, 118, e2024713118.
Figure 1. Schematic diagram of (a) completely observable stochastic control (COSC), (b) partially observable stochastic control (POSC), and (c) memory-limited partially observable stochastic control (ML-POSC). The top and bottom figures represent the system and controller, respectively; $x_t \in \mathbb{R}^{d_x}$ is the state of the system; $y_t \in \mathbb{R}^{d_y}$, $z_t \in \mathbb{R}^{d_z}$, and $u_t \in \mathbb{R}^{d_u}$ are the observation, memory, and control of the controller, respectively. (a) In COSC, the controller can completely observe the state $x_t$ and determines the control $u_t$ based on the state $x_t$, i.e., $u_t = u(t,x_t)$. Only finite-dimensional memory is required to store the state $x_t$, and the optimal control $u_t^*$ is obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation, which is a partial differential equation. (b) In POSC, the controller cannot completely observe the state $x_t$; instead, it obtains the noisy observation $y_t$ of the state $x_t$. The control $u_t$ is determined based on the observation history $y_{0:t} := \{y_\tau \mid \tau \in [0,t]\}$, i.e., $u_t = u(t,y_{0:t})$. An infinite-dimensional memory is implicitly assumed to store the observation history $y_{0:t}$. Furthermore, to obtain the optimal control $u_t^*$, the Bellman equation (a functional differential equation) needs to be solved, which is generally intractable, even numerically. (c) In ML-POSC, the controller only has access to the noisy observation $y_t$ of the state $x_t$, as in POSC. In addition, it has only finite-dimensional memory $z_t$, which cannot completely memorize the observation history $y_{0:t}$. The controller of ML-POSC compresses the observation history $y_{0:t}$ into the finite-dimensional memory $z_t$, then determines the control $u_t$ based on the memory $z_t$, i.e., $u_t = u(t,z_t)$. The optimal control $u_t^*$ is obtained by solving the HJB equation (a partial differential equation), as in COSC.
Figure 2. Numerical simulation of the LQG problem with memory limitation. (a–c) Trajectories of the elements of $\Psi(t) \in \mathbb{R}^{2 \times 2}$ and $\Pi(t) \in \mathbb{R}^{2 \times 2}$. Because $\Psi_{zx}(t) = \Psi_{xz}(t)$ and $\Pi_{zx}(t) = \Pi_{xz}(t)$, $\Psi_{zx}(t)$ and $\Pi_{zx}(t)$ are not visualized. (d–f) Stochastic behaviors of the state $x_t$ (d), the memory $z_t$ (e), and the cumulative cost (f) for 100 samples. The expectation of the cumulative cost at $t = 10$ corresponds to the objective function (70). Blue and orange curves are controlled by (71) and (58), respectively.
Figure 3. Numerical simulation of the non-LQG problem for the local LQG approximation (blue) and ML-POSC (orange). (a) Stochastic behaviors of the state $x_t$ for 100 samples. The black rectangles and cross represent the obstacles and the goal, respectively. (b) The objective function (74), computed from 100 samples.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
