
Decentralized Stochastic Control with Finite-Dimensional Memories: A Memory Limitation Approach

by Takehiro Tottori 1,* and Tetsuya J. Kobayashi 1,2,3,4
1 Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan
2 Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
3 Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8654, Japan
4 Universal Biology Institute, The University of Tokyo, Tokyo 113-8654, Japan
* Author to whom correspondence should be addressed.
Entropy 2023, 25(5), 791; https://doi.org/10.3390/e25050791
Submission received: 20 February 2023 / Revised: 4 April 2023 / Accepted: 9 May 2023 / Published: 12 May 2023
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract: Decentralized stochastic control (DSC) is a stochastic optimal control problem consisting of multiple controllers. DSC assumes that each controller is unable to accurately observe the target system and the other controllers. This setup results in two difficulties. One is that each controller has to memorize the infinite-dimensional observation history, which is not practical because the memory of actual controllers is limited. The other is that the reduction of the infinite-dimensional sequential Bayesian estimation to the finite-dimensional Kalman filter is impossible in general DSC, even for linear-quadratic-Gaussian (LQG) problems. In order to address these issues, we propose an alternative theoretical framework to DSC, memory-limited DSC (ML-DSC), which explicitly formulates the finite-dimensional memories of the controllers. Each controller is jointly optimized to compress the infinite-dimensional observation history into the prescribed finite-dimensional memory and to determine the control based on it. Therefore, ML-DSC can be a practical formulation for actual memory-limited controllers. We demonstrate how ML-DSC works in the LQG problem. The conventional DSC cannot be solved except in special LQG problems where the information of the controllers is independent or partially nested. We show that ML-DSC can be solved in more general LQG problems where the interaction among the controllers is not restricted.

1. Introduction

Optimal control problems of a stochastic dynamical system by multiple decentralized controllers appear in various practical applications, including real-time communication [1,2], decentralized hypothesis testing [3], and networked control [4]. Such problems have been extensively studied in stochastic optimal control theory as decentralized stochastic control (DSC) [5,6,7,8,9,10]. DSC consists of a target system and multiple controllers (Figure 1a) and assumes that each controller cannot completely observe the state of the system or the controls of the other controllers. Information about the target system and the other controllers is obtained only via noisy observations. Thus, each controller should be optimized to determine its control solely from its own observation history. Even for a pair of finite-dimensional state and observation, the observation history is infinite-dimensional. As a result, the theoretically optimal controller should ideally possess infinite-dimensional memory. In practical applications, however, the available memory size of each controller is finite and often severely limited. Thus, we have to obtain solutions based on finite-dimensional memory by employing approximation methods heuristically, which may impair the optimality of the ideal solution, especially when the available memory size is not sufficient.
Moreover, another difficulty arises in DSC due to the decentralized setting. If the number of controllers is one, or if all controllers share their observation histories, DSC is reduced to partially observable stochastic control (POSC), in which the observation histories of all controllers can be summarized optimally as the posterior probability of the state by sequential Bayesian estimation [11,12,13,14,15,16]. The posterior probability of the state is also infinite-dimensional, and thus the same problem as in DSC still survives, even for POSC. Nevertheless, in POSC, this difficulty can be circumvented by focusing on the linear-quadratic-Gaussian (LQG) setting, under which the posterior probability of the state can be represented by the finite-dimensional mean vector and covariance matrix of a Gaussian distribution and the sequential Bayesian estimation can be computed by the Kalman filter [11,12,14]. Therefore, POSC is practically solvable, at least in the LQG problem. The difficulty in DSC is that this nice property of the LQG setting is not retained.
In DSC, each controller cannot access the state of the system or the observation histories of the other controllers, and thus each controller has to estimate all of them from its own observation history. This hampers sequential computation of the Bayesian posterior and prevents the infinite-dimensional observation history from being compressed into finite-dimensional sufficient statistics, even for the LQG problem. Some theoretical studies have addressed this issue by restricting the interaction among the controllers. If the information the controllers have is independent [8,9,10] or partially nested [17,18,19,20,21,22], DSC can enjoy the nice property of the LQG problem and be solved explicitly and optimally with finite-dimensional memory. However, LQG problems with more general interactions, as well as non-LQG problems, remain open in DSC.
In order to address these issues, we propose an alternative theoretical framework to DSC, memory-limited DSC (ML-DSC), which is the decentralized version of memory-limited POSC (ML-POSC) [23,24]. The two major difficulties in DSC originate from ignoring the memory constraints of the controllers when the optimal estimation and control are derived. Unlike the conventional DSC, ML-DSC explicitly formulates the finite-dimensional memories of the controllers and their capacities (Figure 1b). In ML-DSC, each controller is optimized to compress the infinite-dimensional observation history into the prescribed finite-dimensional memory and to determine the control based on it. In other words, each controller controls not only the dynamics of the target system but also the dynamics of its own memory. This formulation enables ML-DSC to evade the difficulties mentioned above.
Furthermore, we provide a way to solve the optimization problem associated with the ML-DSC formulation. Specifically, we address it by converting ML-DSC in the state space into a deterministic optimal control problem in the probability density function space. This technique has recently been used in mean-field stochastic control [25,26] and ML-POSC [23,24], and it is also effective for ML-DSC. ML-DSC can then be solved in a similar way to the deterministic optimal control problem on the probability density function space; the optimal control function of ML-DSC is obtained by jointly solving the Hamilton–Jacobi–Bellman (HJB) equation and the Fokker–Planck (FP) equation. HJB–FP equations also appear in mean-field stochastic games and control [25,26,27,28,29] and ML-POSC [23,24], and numerous numerical algorithms have been proposed [24,30,31,32]. Using these numerical algorithms, ML-DSC may be solved effectively, even in general problems. It should be noted that a similar idea to ML-DSC has been employed for more than a decade in the decentralized partially observable Markov decision process (DEC-POMDP) with finite-state controllers [33,34,35,36,37,38,39]. However, the finite-state controller algorithms for DEC-POMDP strongly depend on discreteness, and thus they are not applicable to ML-DSC, where continuous time and state are considered.
We apply ML-DSC and our algorithm to the LQG problem. The conventional DSC can only be solved for special LQG problems where the information of the controllers is independent [8,9,10] or partially nested [17,18,19,20,21,22]. In contrast, ML-DSC can be solved even in LQG problems with more general interactions among the controllers. In the LQG problem of POSC, estimation and control are clearly separated and are optimized by the Kalman filter and the Riccati equation, respectively [11,14]. In the LQG problems of ML-DSC, estimation and control are not clearly separated and are jointly optimized by a modified Riccati equation, which we call the decentralized Riccati equation in this paper. We note that this coupling of estimation and control also appears in the conventional DSC [17,18,19,20,21,22] and ML-POSC [23,24]. Therefore, it may be induced by the decentralized structure and the memory limitation. Finally, we conduct two numerical experiments for the LQG problems of ML-DSC. One controls one-dimensional divergent state dynamics, and the other controls two-dimensional oscillatory state dynamics. These numerical experiments demonstrate that the decentralized Riccati equation is superior to the Riccati equation in the LQG problems of ML-DSC.
The rest of this paper is organized as follows. In Section 2, we briefly review the conventional DSC. In Section 3, we formulate ML-DSC. In Section 4, we solve ML-DSC. In Section 5, we apply ML-DSC to the LQG problem. In Section 6, we conduct numerical experiments on two LQG problems of ML-DSC. In Section 7, we conclude with a discussion.

2. Review of Decentralized Stochastic Control

In this section, we briefly review the conventional DSC (Figure 1a) [8,9,10]. DSC consists of a target system and N controllers. The state of the system at time $t \in [0, T]$ is $x_t \in \mathbb{R}^{d_x}$, which evolves by the following stochastic differential equation (SDE):
$$ dx_t = b(t, x_t, u_t)\,dt + \sigma(t, x_t, u_t)\,d\omega_t, $$
where $x_0$ obeys $p_0(x_0)$, $\omega_t \in \mathbb{R}^{d_\omega}$ is the standard Wiener process, $u_t^i \in \mathbb{R}^{d_{u^i}}$ is the control of controller i, and $u_t := (u_t^1, u_t^2, \ldots, u_t^N) \in \mathbb{R}^{d_u}$ is the joint control of all controllers. We note that $d_u := \sum_{i=1}^N d_{u^i}$. DSC often assumes that the system is composed of N agents and that the state of the system is decomposed into $x_t := (x_t^1, x_t^2, \ldots, x_t^N) \in \mathbb{R}^{d_x}$, where $x_t^i \in \mathbb{R}^{d_{x^i}}$ is the state of agent i. In this paper, we do not assume such a situation because our formulation of the state of the system includes it as a special case.
In DSC, controller i cannot completely observe the state of the system $x_t$ or the joint control of all controllers $u_t$. It can only obtain the noisy observation $y_t^i \in \mathbb{R}^{d_{y^i}}$, which evolves by the following SDE:
$$ dy_t^i = h^i(t, x_t, u_t)\,dt + \gamma^i(t, x_t, u_t)\,d\nu_t^i, $$
where $y_0^i$ obeys $p_0^i(y_0^i)$, and $\nu_t^i \in \mathbb{R}^{d_{\nu^i}}$ is the standard Wiener process. Controller i's observation $y_t^i$ is affected by the other controllers through the joint control $u_t$, which expresses the communication among the controllers. Controller i determines its control $u_t^i$ based on the observation history $y_{0:t}^i := \{ y_\tau^i \mid \tau \in [0, t] \}$ as follows:
$$ u_t^i = u^i(t, y_{0:t}^i). $$
The objective function of DSC is given by the following expected cumulative cost function:
$$ J[u] := \mathbb{E}_{p(x_{0:T}, y_{0:T}; u)}\left[ \int_0^T f(t, x_t, u_t)\,dt + g(x_T) \right], $$
where f is the running cost function, g is the terminal cost function, $p(x_{0:T}, y_{0:T}; u)$ is the joint probability of $x_{0:T}$ and $y_{0:T}$ with u as a parameter, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to p. DSC is the problem of finding the optimal joint control function $u^*$ that minimizes the objective function $J[u]$:
$$ u^* := \arg\min_u J[u]. $$
In DSC, controller i needs to memorize the infinite-dimensional observation history $y_{0:t}^i$ to determine the optimal control $u_t^{i*} = u^{i*}(t, y_{0:t}^i)$. This is one of the major obstacles in DSC to implementing controllers with finite and limited memory.

3. Memory-Limited Decentralized Stochastic Control

In this section, we formulate ML-DSC, which can circumvent the difficulty in DSC by explicitly formulating finite-dimensional memory of the controllers.

3.1. Problem Formulation

In this subsection, we formulate ML-DSC (Figure 1b). ML-DSC explicitly formulates the finite-dimensional memory of controller i as $z_t^i \in \mathbb{R}^{d_{z^i}}$. The memory dimension $d_{z^i}$ is prescribed by the available memory size of controller i. Controller i compresses the infinite-dimensional observation history $y_{0:t}^i$ into the finite-dimensional memory $z_t^i$ by the following SDE:
$$ dz_t^i = c^i(t, z_t^i, v_t^i)\,dt + \kappa^i(t, z_t^i, v_t^i)\,dy_t^i + \eta^i(t, z_t^i, v_t^i)\,d\xi_t^i, $$
where $z_0^i$ obeys $p_0^i(z_0^i)$, $\xi_t^i \in \mathbb{R}^{d_{\xi^i}}$ is the standard Wiener process, and $v_t^i \in \mathbb{R}^{d_{v^i}}$ is the control of the memory. Unlike the conventional DSC, ML-DSC can take into account the intrinsic stochasticity of the memory, which is modeled by the standard Wiener process $d\xi_t^i$ in the memory dynamics (6). In addition, the compression of the infinite-dimensional observation history $y_{0:t}^i$ into the finite-dimensional memory $z_t^i$ is optimized by the memory control $v_t^i$. In ML-DSC, controller i determines the state control $u_t^i$ and the memory control $v_t^i$ based on the finite-dimensional memory $z_t^i$ as follows:
$$ u_t^i = u^i(t, z_t^i), \qquad v_t^i = v^i(t, z_t^i). $$
The objective function of ML-DSC is given by the following expected cumulative cost function:
$$ J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[ \int_0^T f(t, x_t, u_t, v_t)\,dt + g(x_T) \right], $$
where f is the running cost function, g is the terminal cost function, $p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)$ is the joint probability of $x_{0:T}$, $y_{0:T}$, and $z_{0:T}$ with u and v as parameters, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to p. Unlike the cost function f of DSC in Equation (4), the cost function f of ML-DSC in Equation (8) depends on the memory control $v_t$ as well as the state control $u_t$. From a practical viewpoint, it is natural to consider the costs of both control and memory. ML-DSC optimizes the state control function u and the memory control function v based on the objective function $J[u, v]$:
$$ (u^*, v^*) := \arg\min_{u, v} J[u, v]. $$
The optimal memory control function $v^* := (v^{1*}, v^{2*}, \ldots, v^{N*})$ optimizes the memory dynamics (6), which can be interpreted as optimizing the compression of the observation history into the finite-dimensional memory. In the LQG problem of POSC, the optimal memory control function $v^*$ turns the memory dynamics into the Kalman filter, which is the optimal compression of the observation history in that problem [23]. We expect that the optimal memory control function $v^*$ is also effective for more general problems of ML-DSC.
In ML-DSC, controller i determines the optimal controls $u_t^{i*}$ and $v_t^{i*}$ based only on the finite-dimensional memory $z_t^i$. In addition, ML-DSC can take into account the intrinsic stochasticity and the control cost of the memory. Thus, ML-DSC can explicitly accommodate various realistic constraints of the controllers, such as memory size, noise in the controllers, and the cost of updating memory, none of which can be explicitly addressed in the conventional DSC.
It should be noted that here we consider memory size only in the sense of storing a continuous time series in a finite-dimensional vector. While memory size also matters when quantization and the storage of real-valued observations are considered, these topics are out of the scope of this work.
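To make the closed-loop structure of Equations (1), (2), (6), and (7) concrete, the following minimal sketch simulates one sample path with the Euler-Maruyama method for two controllers with scalar states and memories. The drift, observation, and memory functions and the linear control gains are illustrative stand-ins chosen for this sketch, not the optimized functions derived later in the paper.

import numpy as np

rng = np.random.default_rng(0)
T, nt = 10.0, 10000
dt = T / nt
sq = np.sqrt(dt)

x = 1.0                       # state x_t (Equation (1))
z = np.array([0.0, 0.0])      # memories z_t^1, z_t^2 (Equation (6))
u_gain, v_gain = -1.5, -2.0   # illustrative linear control rules (Equation (7))

xs = np.empty(nt)
for k in range(nt):
    u = u_gain * z            # state controls u_t^i = u^i(t, z_t^i)
    v = v_gain * z            # memory controls v_t^i = v^i(t, z_t^i)
    # observation increments dy^i = h^i dt + gamma^i dnu^i (Equation (2)), with h^i = x
    dy = x * dt + sq * rng.standard_normal(2)
    # memory updates dz^i = c^i dt + kappa^i dy^i + eta^i dxi^i (Equation (6))
    z = z + v * dt + dy + 0.1 * sq * rng.standard_normal(2)
    # state update dx = b dt + sigma domega (Equation (1)), with b = x + u^1 + u^2
    x = x + (x + u.sum()) * dt + sq * rng.standard_normal()
    xs[k] = x

print("time-averaged squared state:", np.mean(xs ** 2))

Each controller updates its memory only from its own observation increment and chooses its controls only from its own memory, which is the defining constraint of ML-DSC.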

3.2. Extended State

In this subsection, we generalize the formulation of ML-DSC based on the extended state. This generalization is useful for mathematical investigations by simplifying the notation of ML-DSC. Furthermore, it clarifies the difference between ML-DSC and the conventional stochastic optimal control problems.
We define the extended state $s_t \in \mathbb{R}^{d_s}$, the extended control $\tilde{u}_t^i \in \mathbb{R}^{d_{\tilde{u}^i}}$, the extended joint control $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$, and the extended standard Wiener process $\tilde{\omega}_t \in \mathbb{R}^{d_{\tilde{\omega}}}$ as follows:
$$ s_t := \begin{pmatrix} x_t \\ z_t^1 \\ \vdots \\ z_t^N \end{pmatrix}, \quad \tilde{u}_t^i := \begin{pmatrix} u_t^i \\ v_t^i \end{pmatrix}, \quad \tilde{u}_t := \begin{pmatrix} \tilde{u}_t^1 \\ \vdots \\ \tilde{u}_t^N \end{pmatrix}, \quad \tilde{\omega}_t := \begin{pmatrix} \omega_t \\ \nu_t^1 \\ \vdots \\ \nu_t^N \\ \xi_t^1 \\ \vdots \\ \xi_t^N \end{pmatrix}, $$
where $d_s := d_x + \sum_{i=1}^N d_{z^i}$, $d_{\tilde{u}^i} := d_{u^i} + d_{v^i}$, $d_{\tilde{u}} := \sum_{i=1}^N d_{\tilde{u}^i}$, and $d_{\tilde{\omega}} := d_\omega + \sum_{i=1}^N d_{\nu^i} + \sum_{i=1}^N d_{\xi^i}$.
Based on the extended state $s_t$, the extended joint control $\tilde{u}_t$, and the extended standard Wiener process $\tilde{\omega}_t$, the state, observation, and memory SDEs, i.e., Equations (1), (2), and (6), are summarized as follows:
$$ ds_t = \underbrace{\begin{pmatrix} b \\ c^1 + \kappa^1 h^1 \\ \vdots \\ c^N + \kappa^N h^N \end{pmatrix}}_{=: \tilde{b}(t, s_t, \tilde{u}_t)} dt + \underbrace{\begin{pmatrix} \sigma & O & \cdots & O & O & \cdots & O \\ O & \kappa^1 \gamma^1 & \cdots & O & \eta^1 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & \kappa^N \gamma^N & O & \cdots & \eta^N \end{pmatrix}}_{=: \tilde{\sigma}(t, s_t, \tilde{u}_t)} d\tilde{\omega}_t, $$
where $p_0(s_0) = p_0(x_0) \prod_{i=1}^N p_0^i(z_0^i)$. Thus, the SDE of ML-DSC can be generalized as follows:
$$ ds_t = \tilde{b}(t, s_t, \tilde{u}_t)\,dt + \tilde{\sigma}(t, s_t, \tilde{u}_t)\,d\tilde{\omega}_t, $$
where $s_0$ obeys $p_0(s_0)$. We note that the structures of $\tilde{b}(t, s_t, \tilde{u}_t)$ and $\tilde{\sigma}(t, s_t, \tilde{u}_t)$ in Equation (12) are not necessarily restricted to those in Equation (11). Importantly, in ML-DSC, controller i determines the extended control $\tilde{u}_t^i$ based solely on the memory $z_t^i$ as follows:
$$ \tilde{u}_t^i = \tilde{u}^i(t, z_t^i). $$
The objective function of ML-DSC (8) is generalized as follows:
$$ J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[ \int_0^T \tilde{f}(t, s_t, \tilde{u}_t)\,dt + \tilde{g}(s_T) \right], $$
where $\tilde{f}$ is the running cost function and $\tilde{g}$ is the terminal cost function. Therefore, the generalized ML-DSC is the problem of finding the optimal extended joint control function $\tilde{u}^*$ that minimizes the objective function $J[\tilde{u}]$:
$$ \tilde{u}^* := \arg\min_{\tilde{u}} J[\tilde{u}] $$
under the constraint of Equation (13).
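To illustrate the bookkeeping behind the block structure of Equation (11), the following sketch assembles the extended drift $\tilde{b}$ and diffusion $\tilde{\sigma}$ for N = 2 controllers with scalar state, observations, and memories; the component functions and coefficients are illustrative stand-ins, not quantities taken from the paper.

import numpy as np

def b(t, x, u):            # state drift b(t, x, u)
    return x + u[0] + u[1]

def h(i, t, x, u):         # observation drift h^i(t, x, u)
    return x

def c(i, t, z_i, v_i):     # memory drift c^i(t, z^i, v^i)
    return v_i

kappa = [1.0, 1.0]         # observation gains kappa^i
gamma = [1.0, 1.0]         # observation noise intensities gamma^i
eta   = [0.1, 0.1]         # intrinsic memory noise intensities eta^i
sigma = 1.0                # state noise intensity

def extended_drift(t, s, u_tilde):
    """b~(t, s, u~) of Equation (11) for s = (x, z^1, z^2) and u~^i = (u^i, v^i)."""
    x, z = s[0], s[1:]
    u = [u_tilde[0][0], u_tilde[1][0]]
    v = [u_tilde[0][1], u_tilde[1][1]]
    rows = [b(t, x, u)]
    for i in range(2):
        rows.append(c(i, t, z[i], v[i]) + kappa[i] * h(i, t, x, u))
    return np.array(rows)

def extended_diffusion(t, s, u_tilde):
    """sigma~(t, s, u~) of Equation (11), acting on (omega, nu^1, nu^2, xi^1, xi^2)."""
    S = np.zeros((3, 5))
    S[0, 0] = sigma                             # state noise
    for i in range(2):
        S[1 + i, 1 + i] = kappa[i] * gamma[i]   # observation noise entering memory i
        S[1 + i, 3 + i] = eta[i]                # intrinsic memory noise
    return S

s = np.array([1.0, 0.2, -0.3])
u_tilde = [np.array([-1.5 * s[1], -2.0 * s[1]]), np.array([-1.5 * s[2], -2.0 * s[2]])]
print(extended_drift(0.0, s, u_tilde))
print(extended_diffusion(0.0, s, u_tilde))

Stacking the memories next to the state in this way is what allows ML-DSC to be treated as a stochastic control problem on the extended state in the subsequent sections.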
This generalization (12)–(15) clarifies the difference between ML-DSC and the conventional stochastic optimal control problems. If controller i determines the extended control $\tilde{u}_t^i$ based on the whole extended state $s_t := (x_t, z_t^1, \ldots, z_t^N)$ as $\tilde{u}_t^i = \tilde{u}^i(t, s_t)$, the problem becomes equivalent to completely observable stochastic control (COSC), which is the most basic stochastic optimal control problem (Figure 2a) [13,14,40]. Furthermore, if controller i determines the extended control $\tilde{u}_t^i$ based on the joint memory $z_t := (z_t^1, \ldots, z_t^N)$ as $\tilde{u}_t^i = \tilde{u}^i(t, z_t)$, the problem is reduced to ML-POSC, in which all controllers share their information (Figure 2b) [23,24]. ML-DSC determines the extended control $\tilde{u}_t^i$ based solely on its own memory $z_t^i$ as $\tilde{u}_t^i = \tilde{u}^i(t, z_t^i)$ (13), which is different from COSC and ML-POSC (Figure 2c). While ML-DSC cannot be solved in a similar way to COSC [14,40,41], as shown in the next section, it can be solved in a similar way to ML-POSC [23,24] because the method of ML-POSC is more general than that of COSC.
In the following sections, we mainly consider the formulation of this subsection rather than that of Section 3.1 because it is simpler and more general. Moreover, we omit $\tilde{\cdot}$ for notational simplicity.

4. Derivation of Optimal Control Function

In this section, we solve ML-DSC by employing the technique in mean-field stochastic control [25,26] and ML-POSC [23,24].

4.1. Derivation of Optimal Control Function

In this subsection, we derive the optimal control function of ML-DSC. In ML-DSC, each controller cannot directly access the state of the system or the memories of the other controllers. This constraint makes ML-DSC unsolvable by the conventional methods of COSC, such as Bellman's dynamic programming principle on the extended state space [13,14,40]. In order to address this issue, we convert ML-DSC on the extended state space into a deterministic optimal control problem on the probability density function space. A similar technique has also been used in mean-field stochastic control [25,26] and ML-POSC [23,24], and it is effective for a broader class of stochastic optimal control problems than the conventional methods of COSC.
The extended state SDE (12) can be converted into the following Fokker–Planck (FP) equation:
$$ \frac{\partial p_t(s)}{\partial t} = \mathcal{L}_u^{\dagger} p_t(s), $$
where the initial condition is given by $p_0(s)$, and $\mathcal{L}_u^{\dagger}$ is the forward diffusion operator, which is defined by
$$ \mathcal{L}_u^{\dagger} p_t(s) := -\sum_{i=1}^{d_s} \frac{\partial \left( b_i(t, s, u) p_t(s) \right)}{\partial s_i} + \frac{1}{2} \sum_{i,j=1}^{d_s} \frac{\partial^2 \left( D_{ij}(t, s, u) p_t(s) \right)}{\partial s_i \partial s_j}, $$
where $D(t, s, u) := \sigma(t, s, u) \sigma^{\top}(t, s, u)$. The objective function (14) can be calculated as follows:
$$ J[u] = \int_0^T \bar{f}(t, p_t, u_t)\,dt + \bar{g}(p_T), $$
where $\bar{f}(t, p, u) := \mathbb{E}_{p(s)}[f(t, s, u)]$ and $\bar{g}(p) := \mathbb{E}_{p(s)}[g(s)]$. We note that $\tilde{\cdot}$ is omitted for notational simplicity. From Equations (16) and (18), ML-DSC on the extended state space is converted into a deterministic optimal control problem on the probability density function space.
When represented using the extended state, each controller cannot completely access the extended state in ML-DSC, which hampers the conventional methods of COSC. By lifting the state variable from the extended state to its probability density function, this difficulty can be avoided because every controller can completely access the probability density function owing to its deterministic nature. As a result, the optimality condition of ML-DSC is obtained in a similar way to the deterministic optimal control problem, i.e., by Pontryagin's minimum principle on the probability density function space, which can be interpreted as a generalization of Bellman's dynamic programming principle on the extended state space [23,24]:
Theorem 1.
The optimal control function of ML-DSC satisfies the following equation:
$$ u^{i*}(t, z^i) = \arg\min_{u^i} \mathbb{E}_{p_t(s^{-i} \mid z^i)}\left[ H\left(t, s, (u^i, u^{-i*}), w\right) \right], \quad i \in \{1, 2, \ldots, N\}, $$
where H is the Hamiltonian, which is defined as follows:
$$ H(t, s, u, w) := f(t, s, u) + \mathcal{L}_u w(t, s), $$
where $\mathcal{L}_u$ is the backward diffusion operator, which is defined as follows:
$$ \mathcal{L}_u w(t, s) := \sum_{i=1}^{d_s} b_i(t, s, u) \frac{\partial w(t, s)}{\partial s_i} + \frac{1}{2} \sum_{i,j=1}^{d_s} D_{ij}(t, s, u) \frac{\partial^2 w(t, s)}{\partial s_i \partial s_j}, $$
and which is the conjugate of $\mathcal{L}_u^{\dagger}$ in the sense that
$$ \int w(t, s) \mathcal{L}_u^{\dagger} p(t, s)\,ds = \int p(t, s) \mathcal{L}_u w(t, s)\,ds. $$
Variables $s^{-i} \in \mathbb{R}^{d_{s^{-i}}}$, $u^{-i} \in \mathbb{R}^{d_{u^{-i}}}$, and $(u^i, u^{-i}) \in \mathbb{R}^{d_u}$ are defined as follows:
$$ s^{-i} := \begin{pmatrix} x \\ z^1 \\ \vdots \\ z^{i-1} \\ z^{i+1} \\ \vdots \\ z^N \end{pmatrix}, \quad u^{-i} := \begin{pmatrix} u^1 \\ \vdots \\ u^{i-1} \\ u^{i+1} \\ \vdots \\ u^N \end{pmatrix}, \quad (u^i, u^{-i}) := \begin{pmatrix} u^1 \\ \vdots \\ u^{i-1} \\ u^i \\ u^{i+1} \\ \vdots \\ u^N \end{pmatrix}, $$
where $d_{s^{-i}} := d_s - d_{z^i}$ and $d_{u^{-i}} := d_u - d_{u^i}$. Function $w(t, s)$ is the solution of the following Hamilton–Jacobi–Bellman (HJB) equation:
$$ -\frac{\partial w(t, s)}{\partial t} = H\left(t, s, u^*, w\right), $$
where $w(T, s) = g(s)$. Function $p_t(s^{-i} \mid z^i) := p_t(s) / \int p_t(s)\,ds^{-i}$ is the conditional probability density function of $s^{-i}$ given $z^i$, and $p_t(s)$ is the solution of FP Equation (16) driven by $u^*$. We note that $\tilde{\cdot}$ is omitted for notational simplicity.
Proof. 
The proof is shown in Appendix A. □
We note that the optimality condition (19) is a necessary condition for the optimal control function of ML-DSC, not a sufficient one. The optimality condition (19) becomes a necessary and sufficient condition when the expected Hamiltonian $\bar{H}(t, p, u, w) := \mathbb{E}_{p(s)}[H(t, s, u, w)]$ is convex with respect to p and u. The proof is almost the same as that in Reference [24]. In the following, the control function of ML-DSC that satisfies the optimality condition (19) is called the optimal control function of ML-DSC.

4.2. Numerical Algorithm

The optimal control function of ML-DSC (19) is obtained by jointly solving FP Equation (16) and HJB Equation (24). HJB-FP equations also appear in mean-field stochastic game and control [25,26,27,28,29] and ML-POSC [23,24], and numerous numerical algorithms have been developed [24,32]. As a result, ML-DSC may be solved practically by employing these numerical algorithms.
One of the most basic numerical algorithms for solving HJB-FP equations is the forward-backward sweep method (the fixed-point iteration method) [24,32,42,43,44], which computes the forward FP Equation (16) and the backward HJB Equation (24) alternately. While the convergence of the forward-backward sweep method is not guaranteed in mean-field stochastic games and control [32,42,43,44], it is guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited to the optimal control function [24]. The convergence of the forward-backward sweep method is also guaranteed in ML-DSC for the same reason; the proof is almost the same as that in Reference [24].
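To make the structure of the sweep explicit, the following minimal sketch applies it to a one-dimensional toy problem in which the controller has a zero-dimensional memory, so the control depends only on time and must minimize the Hamiltonian averaged over $p_t(s)$. The problem data, grid, and explicit Euler discretization are illustrative choices for this sketch and are not taken from the paper or from References [24,32].

import numpy as np

# Toy problem: ds = u dt + dW, cost = int (s^2 + u^2) dt + s_T^2, u = u(t) only.
T, nt = 1.0, 200
dt = T / nt
s = np.linspace(-6.0, 6.0, 121)
ds = s[1] - s[0]

def d1(a):    # first derivative on the grid
    return np.gradient(a, ds)

def d2(a):    # second derivative on the grid
    return np.gradient(np.gradient(a, ds), ds)

p0 = np.exp(-0.5 * (s - 1.0) ** 2 / 0.25)
p0 /= p0.sum() * ds            # initial density N(1, 0.5^2)
g = s ** 2                     # terminal cost g(s)
u = np.zeros(nt)               # initial guess of the control trajectory u(t)

for sweep in range(8):
    # forward FP equation with the current control
    p = np.empty((nt + 1, len(s)))
    p[0] = p0
    for k in range(nt):
        p[k + 1] = np.maximum(p[k] + dt * (-d1(u[k] * p[k]) + 0.5 * d2(p[k])), 0.0)
        p[k + 1] /= p[k + 1].sum() * ds
    # backward HJB equation; the minimizer u(t) = -0.5 * E_{p_t}[dw/ds] of the
    # expected Hamiltonian couples the two equations
    w = g.copy()
    for k in reversed(range(nt)):
        u[k] = -0.5 * np.sum(p[k] * d1(w)) * ds
        w = w + dt * (s ** 2 + u[k] ** 2 + u[k] * d1(w) + 0.5 * d2(w))

J = sum(dt * np.sum(p[k] * (s ** 2 + u[k] ** 2)) * ds for k in range(nt))
J += np.sum(p[nt] * g) * ds
print("expected cumulative cost:", J)

The backward pass uses the density from the latest forward pass to evaluate the expectation, and the forward pass uses the control obtained in the latest backward pass, which is precisely the alternation described above.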

4.3. Comparison with Completely Observable Stochastic Control or Memory-Limited Partially Observable Stochastic Control

COSC and ML-POSC can be solved in a similar way to ML-DSC [23,24]. In COSC, controller i can completely observe the state of the target system $x_t$ and the memories of the other controllers $z_t^j$ ($j \neq i$), as well as its own memory $z_t^i$ (Figure 2a) [13,14,40]. As a result, the control $u_t^i$ is determined based on the whole extended state $s_t := (x_t, z_t^1, \ldots, z_t^N)$ as $u_t^i = u^i(t, s_t)$. From Pontryagin's minimum principle on the probability density function space, the optimal control function of COSC is given by the following equation:
$$ u^{i*}(t, s) = \arg\min_{u^i} H\left(t, s, (u^i, u^{-i*}), w\right), $$
where $w(t, s)$ is the solution of HJB Equation (24). This result coincides with that of Bellman's dynamic programming principle on the state space. Thus, Pontryagin's minimum principle on the probability density function space can be interpreted as a generalization of Bellman's dynamic programming principle on the state space.
In ML-POSC, controller i can observe the memories of the other controllers $z_t^j$ ($j \neq i$) as well as its own memory $z_t^i$ (Figure 2b) [23,24]. As a result, the control $u_t^i$ is determined based on the joint memory $z_t := (z_t^1, \ldots, z_t^N)$ as $u_t^i = u^i(t, z_t)$. From Pontryagin's minimum principle on the probability density function space, the optimal control function of ML-POSC is given by the following equation:
$$ u^{i*}(t, z) = \arg\min_{u^i} \mathbb{E}_{p_t(x \mid z)}\left[ H\left(t, s, (u^i, u^{-i*}), w\right) \right], $$
where $w(t, s)$ is the solution of HJB Equation (24), $p_t(x \mid z) := p_t(s) / \int p_t(s)\,dx$ is the conditional probability density function of x given z, and $p_t(s)$ is the solution of FP Equation (16) driven by $u^*$.
Although HJB Equation (24) is the same for COSC, ML-POSC, and ML-DSC, the optimal control functions are different. Notably, the optimal control functions of ML-POSC and ML-DSC depend on FP Equation (16) because they need to estimate unobservables from observables. ML-POSC needs to estimate the state of the system $x_t$ from the joint memory of all controllers $z_t$. ML-DSC needs to estimate the memories of the other controllers $z_t^j$ ($j \neq i$) as well as the state of the system $x_t$ from its own memory $z_t^i$.
In COSC, the optimal control function depends only on HJB Equation (24). As a result, it can be obtained by solving the HJB Equation (24) backward in time from the terminal condition, which is called the value iteration method [45,46,47]. By contrast, in ML-POSC and ML-DSC, the optimal control function cannot be obtained by the value iteration method because it depends on FP Equation (16) as well as HJB Equation (24). Instead, it can be obtained by the forward-backward sweep method, which computes the forward FP Equation (16) and the backward HJB Equation (24) alternately [24].

5. Linear-Quadratic-Gaussian Problem

In this section, we demonstrate how ML-DSC works by applying it to the LQG problem. In the conventional DSC, the LQG problem can be solved only when the information of the controllers is independent [8,9,10] or partially nested [17,18,19,20,21,22]. By contrast, in ML-DSC, the LQG problem can be solved without restricting the interaction among the controllers.

5.1. Problem Formulation

In this subsection, we formulate the LQG problem of ML-DSC. In this problem, the extended state SDE (12) is given as follows:
$$ ds_t = \left( A(t) s_t + B(t) u_t \right) dt + \sigma(t)\,d\omega_t = \left( A(t) s_t + \sum_{i=1}^N B_i(t) u_t^i \right) dt + \sigma(t)\,d\omega_t, $$
where the initial condition is given by the Gaussian distribution $p_0(s) := \mathcal{N}(s \mid \mu_0, \Sigma_0)$. Furthermore, we note that $\tilde{\cdot}$ is omitted for notational simplicity. In ML-DSC, controller i determines the control $u_t^i$ based on the memory $z_t^i$ as follows:
$$ u_t^i = u^i(t, z_t^i). $$
In the extended state SDE (27), the interaction among the controllers is not restricted. For example, each controller is allowed to control the memories of the other controllers from its own memory through the state or the observations. In this case, the memories of the controllers are neither independent nor partially nested. This becomes obvious in the numerical experiments in Section 6.
The objective function (14) is given as follows:
$$ J[u] := \mathbb{E}_{p(s_{0:T}; u)}\left[ \int_0^T \left( s_t^{\top} Q(t) s_t + u_t^{\top} R(t) u_t \right) dt + s_T^{\top} P s_T \right] = \mathbb{E}_{p(s_{0:T}; u)}\left[ \int_0^T \left( s_t^{\top} Q(t) s_t + \sum_{i=1}^N \sum_{j=1}^N (u_t^i)^{\top} R_{ij}(t) u_t^j \right) dt + s_T^{\top} P s_T \right], $$
where $Q(t) \succeq O$, $R(t) \succ O$, and $P \succeq O$. The objective of this problem is to find the optimal control function $u^*$ that minimizes the objective function $J[u]$:
$$ u^* := \arg\min_u J[u]. $$
In this paper, we assume that $R(t)$ is a block diagonal matrix, i.e., the costs of different controllers are independent, as follows:
$$ R(t) = \begin{pmatrix} R_{11}(t) & O & \cdots & O \\ O & R_{22}(t) & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & R_{NN}(t) \end{pmatrix}, $$
where $R_{ii}(t) \succ O$. In this case, the objective function (29) can be calculated as follows:
$$ J[u] = \mathbb{E}_{p(s_{0:T}; u)}\left[ \int_0^T \left( s_t^{\top} Q(t) s_t + \sum_{i=1}^N (u_t^i)^{\top} R_{ii}(t) u_t^i \right) dt + s_T^{\top} P s_T \right]. $$
If this assumption does not hold, the optimal control function cannot be derived explicitly. This problem is similar to Witsenhausen's counterexample, which demonstrates the difficulty of DSC and is historically recognized as an important problem [5,48,49]. However, this assumption is not critical in many applications because the control cost matrix is usually diagonal.

5.2. Derivation of Optimal Control Function

In this subsection, we derive the optimal control function of the LQG problem of ML-DSC by applying Theorem 1. In this problem, the probability density function of the extended state s at time t is given by the Gaussian distribution $p_t(s) := \mathcal{N}(s \mid \mu(t), \Sigma(t))$, in which $\mu(t)$ is the mean vector and $\Sigma(t)$ is the covariance matrix. Defining the stochastic extended state $\hat{s} := s - \mu$, the conditional expectation $\mathbb{E}_{p_t(s^{-i} \mid z^i)}[s]$ can be calculated as follows:
$$ \mathbb{E}_{p_t(s^{-i} \mid z^i)}[s] = \mu(t) + K^i(t) \hat{s}, $$
where $K^i(t) \in \mathbb{R}^{d_s \times d_s}$ is defined as follows:
$$ K^i(t) := \begin{pmatrix} O & \Sigma_{x z^i}(t) \Sigma_{z^i z^i}^{-1}(t) & O \\ O & \Sigma_{z^1 z^i}(t) \Sigma_{z^i z^i}^{-1}(t) & O \\ \vdots & \vdots & \vdots \\ O & \Sigma_{z^{i-1} z^i}(t) \Sigma_{z^i z^i}^{-1}(t) & O \\ O & I & O \\ O & \Sigma_{z^{i+1} z^i}(t) \Sigma_{z^i z^i}^{-1}(t) & O \\ \vdots & \vdots & \vdots \\ O & \Sigma_{z^N z^i}(t) \Sigma_{z^i z^i}^{-1}(t) & O \end{pmatrix}, $$
where the nonzero column block corresponds to $z^i$. Matrix $K^i(t)$ is zero except for the columns corresponding to $z^i$. By applying Theorem 1 to the LQG problem of ML-DSC, we obtain the following theorem:
Theorem 2.
In the LQG problem, the optimal control function of ML-DSC satisfies the following equation:
$$ u^{i*}(t, z^i) = -R_{ii}^{-1} B_i^{\top} \left( \Psi \mu + \Phi K^i \hat{s} \right), $$
where $K^i(t)$ is defined by Equation (34), which depends on $\Sigma(t)$. Functions $\mu(t)$ and $\Sigma(t)$ are the solutions of the following ordinary differential equations (ODEs):
$$ \frac{d\mu}{dt} = \left( A - B R^{-1} B^{\top} \Psi \right) \mu, $$
$$ \frac{d\Sigma}{dt} = \sigma \sigma^{\top} + \left( A - \sum_{i=1}^N B_i R_{ii}^{-1} B_i^{\top} \Phi K^i \right) \Sigma + \Sigma \left( A - \sum_{i=1}^N B_i R_{ii}^{-1} B_i^{\top} \Phi K^i \right)^{\top}, $$
where $\mu(0) = \mu_0$ and $\Sigma(0) = \Sigma_0$. Functions $\Psi(t)$ and $\Phi(t)$ are the solutions of the following ODEs:
$$ -\frac{d\Psi}{dt} = Q + A^{\top} \Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi, $$
$$ -\frac{d\Phi}{dt} = Q + A^{\top} \Phi + \Phi A - \Phi B R^{-1} B^{\top} \Phi + \sum_{i=1}^N (I - K^i)^{\top} \Phi B_i R_{ii}^{-1} B_i^{\top} \Phi (I - K^i), $$
where $\Psi(T) = \Phi(T) = P$.
Proof. 
The proof is shown in Appendix B. □
In the LQG problem of ML-DSC, FP Equation (16) is reduced to Equations (36) and (37), and HJB Equation (24) is reduced to Equations (38) and (39). The optimal control function (35) is decomposed into a deterministic part and a stochastic part, which correspond to the first term and the second term, respectively. The first term of the optimal control function (35) also appears in the linear-quadratic (LQ) problem of deterministic control, and Equation (38) is called the Riccati equation [14,40]. By contrast, the second term of the optimal control function (35) is new in the LQG problem of ML-DSC, and Equation (39) is called the decentralized Riccati equation in this paper.
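As a point of reference, the Riccati Equation (38) alone can be integrated backward in time from its terminal condition; the following minimal sketch does so with an explicit Euler step and illustrative 2-by-2 matrices. The decentralized Riccati Equation (39), by contrast, also requires the forward ODE (37), which is handled by the forward-backward sweep described in Section 6.

import numpy as np

T, nt = 10.0, 10000
dt = T / nt
A = np.array([[1.0, 0.0], [0.0, -1.0]])   # illustrative system matrices
B = np.eye(2)
Q = np.eye(2)
Rinv = np.eye(2)                          # R = I
P = np.zeros((2, 2))                      # terminal condition Psi(T) = P

Psi = P.copy()
for _ in range(nt):
    dPsi = Q + A.T @ Psi + Psi @ A - Psi @ B @ Rinv @ B.T @ Psi
    Psi = Psi + dt * dPsi                 # explicit Euler step from t to t - dt
print("Psi(0):\n", Psi)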

5.3. Comparison with Completely Observable Stochastic Control or Memory-Limited Partially Observable Stochastic Control

In the LQG problem, the optimal control function of COSC is given as follows [14,40]:
$$ u^{i*}(t, s) = -R_{ii}^{-1} B_i^{\top} \left( \Psi \mu + \Psi \hat{s} \right), $$
where $\Psi(t)$ is the solution of the Riccati Equation (38). In COSC, $\Psi(t)$ appears in both the first and the second term. In addition, the optimal control function of ML-POSC is given as follows [23,24]:
$$ u^{i*}(t, z) = -R_{ii}^{-1} B_i^{\top} \left( \Psi \mu + \Pi K \hat{s} \right), $$
where $\Pi(t)$ is the solution of the following partially observable Riccati equation:
$$ -\frac{d\Pi}{dt} = Q + A^{\top} \Pi + \Pi A - \Pi B R^{-1} B^{\top} \Pi + (I - K)^{\top} \Pi B R^{-1} B^{\top} \Pi (I - K), $$
where $\Pi(T) = P$ and $K(t)$ is defined by
$$ K(t) := \begin{pmatrix} O & \Sigma_{xz}(t) \Sigma_{zz}^{-1}(t) \\ O & I \end{pmatrix}. $$
Thus, the decentralized Riccati Equation (39) is a natural extension of the partially observable Riccati Equation (42) from a centralized problem to a decentralized problem.
While the first deterministic term of the optimal control function is the same for COSC (40), ML-POSC (41), and ML-DSC (35), the second stochastic term is different, reflecting the fact that the observable part of the stochastic extended state is different.

5.4. Decentralized Riccati Equation

In this subsection, we analyze the decentralized Riccati Equation (39) by comparing it with the Riccati Equation (38). The Riccati Equation (38) determines the control gain matrix of deterministic control and COSC. While the Riccati Equation (38) optimizes control, it does not improve estimation, because estimation is not needed in deterministic control and COSC. In contrast, the decentralized Riccati Equation (39) may improve estimation as well as control because the controllers in ML-DSC need to estimate the state of the system and the memories of the other controllers from their own memories.
In order to support this discussion, we analyze the last term of the decentralized Riccati Equation (39), which is denoted as follows:
$$ Q^i := (I - K^i)^{\top} \Phi B_i R_{ii}^{-1} B_i^{\top} \Phi (I - K^i). $$
This term is the main difference between the Riccati Equation (38) and the decentralized Riccati Equation (39), and thus accounts for the contribution of estimation in ML-DSC. For the sake of simplicity, we focus on $Q^N$; a similar discussion is possible for $Q^i$ by permuting the controllers' indices. We also denote $a := s^{-N}$ and $b := z^N$ for notational simplicity; a is unobservable and b is observable for controller N. $Q^N$ can be calculated as follows:
$$ Q^N = \begin{pmatrix} P_{aa} & -P_{aa} \Sigma_{ab} \Sigma_{bb}^{-1} \\ -\Sigma_{bb}^{-1} \Sigma_{ba} P_{aa} & \Sigma_{bb}^{-1} \Sigma_{ba} P_{aa} \Sigma_{ab} \Sigma_{bb}^{-1} \end{pmatrix}, $$
where $P_{aa} := (\Phi B_N R_{NN}^{-1} B_N^{\top} \Phi)_{aa}$. Because $P_{aa} \succeq O$ and $\Sigma_{bb}^{-1} \Sigma_{ba} P_{aa} \Sigma_{ab} \Sigma_{bb}^{-1} \succeq O$, $\Phi_{aa}$ and $\Phi_{bb}$ may be larger than $\Psi_{aa}$ and $\Psi_{bb}$, respectively. Because $\Phi_{aa}$ and $\Phi_{bb}$ are the negative feedback gains of a and b, respectively, $Q^N$ may contribute to decreasing $\Sigma_{aa}$ and $\Sigma_{bb}$. Moreover, when $\Sigma_{ab}$ is positive/negative, $\Phi_{ab}$ may be smaller/larger than $\Psi_{ab}$, which may increase/decrease $\Sigma_{ab}$. A similar discussion is possible for $\Sigma_{ba}$, $\Phi_{ba}$, and $\Psi_{ba}$ because $\Sigma$, $\Phi$, and $\Psi$ are symmetric matrices. As a result, $Q^N$ may contribute to decreasing the following conditional covariance matrix:
$$ \Sigma_{a \mid b} := \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}, $$
which corresponds to the estimation error of the unobservable a given the observable b. This indicates that the decentralized Riccati Equation (39) may improve estimation as well as control.
It should be noted that estimation and control are not clearly separated in the LQG problem of ML-DSC. In the LQG problem of POSC, estimation and control are clearly separated, and they are optimized by the Kalman filter and the Riccati equation, respectively [11,14]. By contrast, in the LQG problem of ML-DSC, both estimation and control are optimized by the decentralized Riccati equation. This coupling of estimation and control also appears in the conventional DSC [17,18,19,20,21,22] and ML-POSC [23,24], which indicates that it may be caused by a decentralized structure and memory limitation.

6. Numerical Experiments

In this section, we demonstrate the significance of the decentralized Riccati Equation (39) by conducting numerical experiments on two LQG problems of ML-DSC. One is a one-dimensional state case, and the other is a two-dimensional state case.

6.1. One-Dimensional State Case

In this subsection, we conduct a numerical experiment in the one-dimensional state case (Figure 3a). In this case, we consider the state $x_t \in \mathbb{R}$, the observation $y_t^i \in \mathbb{R}$, and the memory $z_t^i \in \mathbb{R}$ of controller $i \in \{1, 2\}$, which evolve by the following SDEs:
$$ dx_t = \left( x_t + u_{x,t}^1 + u_{x,t}^2 \right) dt + d\omega_t, $$
$$ dy_t^1 = \left( x_t + u_{y,t}^2 \right) dt + d\nu_t^1, $$
$$ dy_t^2 = \left( x_t + u_{y,t}^1 \right) dt + d\nu_t^2, $$
$$ dz_t^1 = v_t^1\,dt + dy_t^1, $$
$$ dz_t^2 = v_t^2\,dt + dy_t^2. $$
The initial conditions are given by the standard Gaussian distributions. $\omega_t \in \mathbb{R}$, $\nu_t^1 \in \mathbb{R}$, and $\nu_t^2 \in \mathbb{R}$ are independent standard Wiener processes. $u_t^i := (u_{x,t}^i, u_{y,t}^i) = u^i(t, z_t^i) \in \mathbb{R}^2$ and $v_t^i := v^i(t, z_t^i) \in \mathbb{R}$ are the controls of controller i. Each controller can control the memory of the other controller through $u_{y,t}^i$, which can be interpreted as communication between the controllers. The objective function to be minimized is given as follows:
$$ J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[ \int_0^{10} \left( x_t^2 + \sum_{i=1}^2 \left( (u_{x,t}^i)^2 + (u_{y,t}^i)^2 + (v_t^i)^2 \right) \right) dt \right], $$
where $u := (u^1, u^2)$ and $v := (v^1, v^2)$. Therefore, the objective of this problem is to minimize the state variance with small controls (Figure 3b).
In this problem, the information of the controllers is neither independent nor partially nested. The information of controller 1's memory $z_t^1$ propagates to controller 2's memory $z_t^2$ through the state control $u_{x,t}^1$ and the observation control $u_{y,t}^1$, and vice versa (Figure 3a). While such a problem cannot be solved by the conventional DSC, ML-DSC can address it.
Figure 3. Schematic diagram of the LQG problem in Section 6.1. (a) The state of the system $x_t$ is one-dimensional. (b) The two controllers control the state of the system to be close to 0.
Using the extended state $s_t := (x_t, z_t^1, z_t^2) \in \mathbb{R}^3$, the extended control $\tilde{u}_t^i := (u_{x,t}^i, u_{y,t}^i, v_t^i) = \tilde{u}^i(t, z_t^i) \in \mathbb{R}^3$, and the extended standard Wiener process $\tilde{\omega}_t := (\omega_t, \nu_t^1, \nu_t^2) \in \mathbb{R}^3$ as in Equations (27) and (32), the SDEs defined by Equations (47)–(51) can be described as follows:
$$ ds_t = \left( \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix} s_t + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \tilde{u}_t^1 + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \tilde{u}_t^2 \right) dt + d\tilde{\omega}_t, $$
which corresponds to Equation (27). The objective function (52) can be rewritten as follows:
$$ J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[ \int_0^{10} \left( s_t^{\top} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} s_t + \sum_{i=1}^2 (\tilde{u}_t^i)^{\top} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \tilde{u}_t^i \right) dt \right], $$
which corresponds to Equation (32). In addition, it satisfies the block diagonal assumption on $R(t)$ (31).
The Riccati Equation (38) can be solved backward in time from the terminal condition. In contrast, the decentralized Riccati Equation (39) cannot be solved in the same way because it depends on the covariance matrix $\Sigma(t)$ through the estimation gain matrix $K^i(t)$, where $\Sigma(t)$ is the solution of the time-forward ODE (37). In order to solve the decentralized Riccati Equation (39), which is a time-backward ODE for $\Phi(t)$, we use the forward-backward sweep method, which computes the time-forward ODE of $\Sigma(t)$ (37) and the time-backward ODE of $\Phi(t)$ (39) alternately [24]. We note that the partially observable Riccati Equation (42) can also be solved by the forward-backward sweep method [24].
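A minimal sketch of this forward-backward sweep for the present example is given below. The matrices follow Equations (53) and (54), while the explicit Euler discretization, step counts, number of sweeps, and the initialization of $\Phi(t)$ by the Riccati solution $\Psi(t)$ are illustrative choices for this sketch.

import numpy as np

T, nt = 10.0, 5000
dt = T / nt
I3 = np.eye(3)

A  = np.array([[1., 0., 0.], [1., 0., 0.], [1., 0., 0.]])   # extended state (x, z^1, z^2)
B1 = np.array([[1., 0., 0.], [0., 0., 1.], [0., 1., 0.]])   # controller 1: (u_x^1, u_y^1, v^1)
B2 = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])   # controller 2: (u_x^2, u_y^2, v^2)
B  = np.hstack([B1, B2])
Q  = np.diag([1., 0., 0.])
# R_11 = R_22 = I, so R^{-1} = I and B_i R_ii^{-1} B_i^T = B_i B_i^T
P      = np.zeros((3, 3))    # no terminal cost
Sigma0 = I3                  # standard Gaussian initial conditions

def gains(S):
    """Estimation gain matrices K^1(t), K^2(t) of Equation (34)."""
    K1, K2 = np.zeros((3, 3)), np.zeros((3, 3))
    K1[:, 1] = S[:, 1] / S[1, 1]    # column corresponding to z^1
    K2[:, 2] = S[:, 2] / S[2, 2]    # column corresponding to z^2
    return K1, K2

# initial guess for Phi(t): the solution Psi(t) of the Riccati equation (38)
Phi = np.empty((nt + 1, 3, 3))
Phi[nt] = P
for k in reversed(range(nt)):
    Ph = Phi[k + 1]
    Phi[k] = Ph + dt * (Q + A.T @ Ph + Ph @ A - Ph @ B @ B.T @ Ph)

for sweep in range(20):
    # forward covariance ODE (37) with the current Phi(t)
    Sigma = np.empty((nt + 1, 3, 3))
    Sigma[0] = Sigma0
    Ks = []
    for k in range(nt):
        K1, K2 = gains(Sigma[k])
        Ks.append((K1, K2))
        F = A - B1 @ B1.T @ Phi[k] @ K1 - B2 @ B2.T @ Phi[k] @ K2
        Sigma[k + 1] = Sigma[k] + dt * (I3 + F @ Sigma[k] + Sigma[k] @ F.T)
    # backward decentralized Riccati ODE (39) with the current K^i(t)
    Phi[nt] = P
    for k in reversed(range(nt)):
        K1, K2 = Ks[k]
        Ph = Phi[k + 1]
        extra = ((I3 - K1).T @ Ph @ B1 @ B1.T @ Ph @ (I3 - K1)
                 + (I3 - K2).T @ Ph @ B2 @ B2.T @ Ph @ (I3 - K2))
        Phi[k] = Ph + dt * (Q + A.T @ Ph + Ph @ A - Ph @ B @ B.T @ Ph + extra)

print("Phi(0):\n", Phi[0])
print("Sigma(T):\n", Sigma[nt])

Alternating the forward covariance pass and the backward Riccati pass in this way is exactly the sweep described above: the backward pass uses the estimation gains from the latest forward pass, and the forward pass uses the gain matrix from the latest backward pass.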
Figure 4 shows the trajectories of $\Psi(t)$, $\Pi(t)$, and $\Phi(t)$, which are the optimal control gain matrices of COSC, ML-POSC, and ML-DSC, respectively. $\Psi_{ab}(t)$, $\Pi_{ab}(t)$, and $\Phi_{ab}(t)$ are the negative control gains from b to a. We note that they are also the negative control gains from a to b because $\Psi(t)$, $\Pi(t)$, and $\Phi(t)$ are symmetric matrices. While the elements of $\Psi(t)$ related to the memories $z^1$ and $z^2$ are always 0, those of $\Pi(t)$ and $\Phi(t)$ are not (Figure 4b–f). Thus, the controls of the memories do not appear in COSC, but they do appear in ML-POSC and ML-DSC. This result indicates that the controls of the memories play an important role in estimation.
We first compare $\Phi$ with $\Psi$ in more detail. $\Phi_{xx}$ and $\Phi_{z^i z^i}$ are larger than $\Psi_{xx}$ and $\Psi_{z^i z^i}$ (Figure 4a,d,f), which may decrease $\Sigma_{xx}$ and $\Sigma_{z^i z^i}$. Moreover, $\Phi_{x z^i}$ is smaller than $\Psi_{x z^i}$ (Figure 4b,c), which may strengthen the positive correlation between x and $z^i$. Therefore, $\Phi_{xx}$, $\Phi_{z^i z^i}$, and $\Phi_{x z^i}$ may improve estimation, which is consistent with our discussion in Section 5.4. However, $\Phi_{z^1 z^2}$ is larger than $\Psi_{z^1 z^2}$ (Figure 4e), which may weaken the positive correlation between $z^1$ and $z^2$. This seems contrary to our discussion because it may worsen estimation.
In order to understand $\Phi_{z^1 z^2}$, we also compare $\Phi$ with $\Pi$. Estimation in ML-DSC is more challenging than that in ML-POSC because the controllers cannot completely share their information. Thus, except for $\Phi_{z^1 z^2}$, the absolute values of the elements of $\Phi$ are larger than those of $\Pi$ (Figure 4a–d,f), for the same reason as in the comparison with $\Psi$. The exception is $\Phi_{z^1 z^2}$ (Figure 4e). In ML-POSC, the estimation between $z^1$ and $z^2$ is not needed because the controllers share their information. As a result, $\Pi_{z^1 z^2}$ is determined only from a control perspective, not an estimation perspective. $\Pi_{z^1 z^2}$ is almost the same as $\Pi_{z^i z^i}$ (Figure 4d–f), presumably because cooperative control by controllers 1 and 2 is more efficient than independent control. By contrast, in ML-DSC, the estimation between $z^1$ and $z^2$ is necessary because each controller cannot directly access the other controller. $\Phi_{z^1 z^2}$ is smaller than $\Pi_{z^1 z^2}$ (Figure 4e), which may strengthen the positive correlation between $z^1$ and $z^2$. Therefore, $\Phi_{z^1 z^2}$ may be determined by a trade-off between control and estimation.
In order to clarify the significance of the decentralized Riccati Equation (39), we compared the performance of the optimal control function of ML-DSC (35) with that of the following control functions:
$$ u^{i,\Psi}(t, z^i) = -R_{ii}^{-1} B_i^{\top} \left( \Psi \mu + \Psi K^i \hat{s} \right), $$
$$ u^{i,\Pi}(t, z^i) = -R_{ii}^{-1} B_i^{\top} \left( \Psi \mu + \Pi K^i \hat{s} \right), $$
which replace $\Phi$ with $\Psi$ and $\Pi$, respectively. We note that the first terms are not important because $\mu(t) = 0$ is satisfied in this setup. The result is shown in Figure 5. The variances of the state and the memories are ordered as $u^{\Psi} > u^{\Pi} > u^*$ (Figure 5a–c). Similarly, the expected cumulative costs are ordered as $u^{\Psi} > u^{\Pi} > u^*$ (Figure 5d). These orders are consistent with the extent to which estimation is optimized. $u^{\Psi}$ does not take estimation into account at all, and its performance is the worst. $u^{\Pi}$ takes into account only the state estimation, and it performs better than $u^{\Psi}$, but not optimally. $u^*$ takes into account the estimation of the other memory as well as the state, and its performance is the best.

6.2. Two-Dimensional State Case

In this subsection, we conduct a numerical experiment in the two-dimensional state case (Figure 6a). In this case, we formulate the target state $x_t^{\mathrm{tar}} := (x_t^{\mathrm{tar},1}, x_t^{\mathrm{tar},2}) \in \mathbb{R}^2$, the actual state $x_t^{\mathrm{act}} := (x_t^{\mathrm{act},1}, x_t^{\mathrm{act},2}) \in \mathbb{R}^2$, the observation $y_t := (y_t^1, y_t^2) \in \mathbb{R}^2$, and the memory $z_t := (z_t^1, z_t^2) \in \mathbb{R}^2$ as follows (Figure 6b):
$$ dx_t^{\mathrm{tar},1} = -2\pi x_t^{\mathrm{tar},2}\,dt, $$
$$ dx_t^{\mathrm{tar},2} = 2\pi x_t^{\mathrm{tar},1}\,dt, $$
$$ dx_t^{\mathrm{act},1} = \left( -2\pi x_t^{\mathrm{act},2} + u_{x,t}^1 \right) dt + d\omega_t^1, $$
$$ dx_t^{\mathrm{act},2} = \left( 2\pi x_t^{\mathrm{act},1} + u_{x,t}^2 \right) dt + d\omega_t^2, $$
$$ dy_t^1 = \left( x_t^{\mathrm{act},1} - x_t^{\mathrm{tar},1} + u_{y,t}^2 \right) dt + d\nu_t^1, $$
$$ dy_t^2 = \left( x_t^{\mathrm{act},2} - x_t^{\mathrm{tar},2} + u_{y,t}^1 \right) dt + d\nu_t^2, $$
$$ dz_t^1 = v_t^1\,dt + dy_t^1, $$
$$ dz_t^2 = v_t^2\,dt + dy_t^2, $$
where $x_0^{\mathrm{tar}} = (10, 0)$, $x_0^{\mathrm{act}} \sim \mathcal{N}(x_0^{\mathrm{act}} \mid (10, 0), I)$, $y_0 \sim \mathcal{N}(y_0 \mid 0, I)$, and $z_0 \sim \mathcal{N}(z_0 \mid 0, I)$. $\omega_t^i \in \mathbb{R}$ and $\nu_t^i \in \mathbb{R}$ are independent standard Wiener processes. $u_t^i := (u_{x,t}^i, u_{y,t}^i) = u^i(t, z_t^i) \in \mathbb{R}^2$ and $v_t^i := v^i(t, z_t^i) \in \mathbb{R}$ are the controls of controller i. The objective function to be minimized is given as follows:
$$ J[u, v] := \mathbb{E}_{u, v}\left[ \int_0^{10} \sum_{i=1}^2 \left( 10 \left( x_t^{\mathrm{act},i} - x_t^{\mathrm{tar},i} \right)^2 + (u_{x,t}^i)^2 + (u_{y,t}^i)^2 + (v_t^i)^2 \right) dt \right]. $$
The objective of this problem is to keep the actual state $x_t^{\mathrm{act}}$ close to the rotating target state $x_t^{\mathrm{tar}}$ with small controls. The solution of Equations (57) and (58) is $x_t^{\mathrm{tar}} = (x_t^{\mathrm{tar},1}, x_t^{\mathrm{tar},2}) = (10\cos(2\pi t), 10\sin(2\pi t))$. If Equations (59) and (60) did not contain the standard Wiener processes $d\omega_t^1$ and $d\omega_t^2$, respectively, $u_{x,t}^1 = 0$ and $u_{x,t}^2 = 0$ would be optimal because $x_t^{\mathrm{tar}} = x_t^{\mathrm{act}}$ would hold. In practice, however, the actual state $x_t^{\mathrm{act}}$ does not coincide with the target state $x_t^{\mathrm{tar}}$ without the control $u_{x,t}$ because of the state noise $d\omega_t$, and thus it needs to be controlled. Controller i observes and controls the actual state $x_t^{\mathrm{act}}$ only in the $x_t^{\mathrm{act},i}$-axis direction. As a result, the communication between the controllers is more important in this problem.
Figure 6. Schematic diagram of the LQG problem in Section 6.2. (a) The state of the system $x_t = (x_t^1, x_t^2)$ is two-dimensional. (b) The two controllers control the actual state $x_t^{\mathrm{act}}$ to be close to the target state $x_t^{\mathrm{tar}} = (10\cos(2\pi t), 10\sin(2\pi t))$. Controller i observes and controls the actual state $x_t^{\mathrm{act}}$ only in the $x_t^{\mathrm{act},i}$-axis direction.
By defining the state $x_t := x_t^{\mathrm{act}} - x_t^{\mathrm{tar}}$, Equations (57)–(64) are converted as follows:
$$ dx_t^1 = \left( -2\pi x_t^2 + u_{x,t}^1 \right) dt + d\omega_t^1, $$
$$ dx_t^2 = \left( 2\pi x_t^1 + u_{x,t}^2 \right) dt + d\omega_t^2, $$
$$ dy_t^1 = \left( x_t^1 + u_{y,t}^2 \right) dt + d\nu_t^1, $$
$$ dy_t^2 = \left( x_t^2 + u_{y,t}^1 \right) dt + d\nu_t^2, $$
$$ dz_t^1 = v_t^1\,dt + dy_t^1, $$
$$ dz_t^2 = v_t^2\,dt + dy_t^2, $$
where the initial conditions are given by the standard Gaussian distributions. Furthermore, the objective function (65) is converted as follows:
$$ J[u, v] := \mathbb{E}_{u, v}\left[ \int_0^{10} \sum_{i=1}^2 \left( 10 (x_t^i)^2 + (u_{x,t}^i)^2 + (u_{y,t}^i)^2 + (v_t^i)^2 \right) dt \right]. $$
As a result, the problem of controlling the actual state x t act to be close to the target state x t tar is equivalent to the problem of controlling the state x t to be close to 0.
Using the extended state $s_t := (x_t^1, x_t^2, z_t^1, z_t^2) \in \mathbb{R}^4$, the extended control $\tilde{u}_t^i := (u_{x,t}^i, u_{y,t}^i, v_t^i) = \tilde{u}^i(t, z_t^i) \in \mathbb{R}^3$, and the extended standard Wiener process $\tilde{\omega}_t := (\omega_t^1, \omega_t^2, \nu_t^1, \nu_t^2) \in \mathbb{R}^4$ as in Equations (27) and (32), the SDEs defined by Equations (66)–(71) can be described as follows:
$$ ds_t = \left( \begin{pmatrix} 0 & -2\pi & 0 & 0 \\ 2\pi & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} s_t + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \tilde{u}_t^1 + \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \tilde{u}_t^2 \right) dt + d\tilde{\omega}_t, $$
which corresponds to Equation (27). The objective function (72) can be rewritten as follows:
$$ J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[ \int_0^{10} \left( s_t^{\top} \begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} s_t + \sum_{i=1}^2 (\tilde{u}_t^i)^{\top} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \tilde{u}_t^i \right) dt \right], $$
which corresponds to Equation (32). In addition, it satisfies the block diagonal assumption on $R(t)$ (31).
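For completeness, the matrices of Equations (73) and (74) can be transcribed directly and, as a sketch, substituted into the sweep of Section 6.1 after changing the extended state dimension from three to four and taking the gain matrices $K^1$ and $K^2$ to have their nonzero entries in the columns of $z^1$ and $z^2$ (the third and fourth columns); the listing below is only this transcription.

import numpy as np

two_pi = 2.0 * np.pi
A = np.array([[0.0, -two_pi, 0.0, 0.0],   # extended state (x^1, x^2, z^1, z^2)
              [two_pi, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
B1 = np.array([[1.0, 0.0, 0.0],           # controller 1: (u_x^1, u_y^1, v^1)
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])
B2 = np.array([[0.0, 0.0, 0.0],           # controller 2: (u_x^2, u_y^2, v^2)
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
Q = np.diag([10.0, 10.0, 0.0, 0.0])
R11 = R22 = np.eye(3)                      # block diagonal R(t)
P = np.zeros((4, 4))                       # no terminal cost
Sigma0 = np.eye(4)                         # standard Gaussian initial conditions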
Figure 7 shows the trajectories of $\Psi(t)$, $\Pi(t)$, and $\Phi(t)$. Unlike $\Psi(t)$, the elements of $\Pi(t)$ and $\Phi(t)$ related to the memories $z^1$ and $z^2$ are not always 0 (Figure 7e–j), which indicates that the controls of the memories appear in ML-POSC and ML-DSC. The elements of $\Phi(t)$ deviate substantially from those of $\Pi(t)$ only in Figure 7g–j. Figure 7g,h show the negative feedback control gains of the memory $z^i$. $\Phi_{z^i z^i}(t)$ is larger than $\Pi_{z^i z^i}(t)$, which indicates that the memories in ML-DSC are controlled more strongly than those in ML-POSC. Figure 7i shows the control gain between the state $x_t^1$ and the memory $z_t^2$. While controller 1 can control the state $x_t^1$ based on the memory $z_t^2$ in ML-POSC, it cannot in ML-DSC because the controllers do not share their memories in ML-DSC. Furthermore, while controller 1 does not need to send the information of the state $x_t^1$ to controller 2's memory $z_t^2$ in ML-POSC, this is required in ML-DSC. As a result, $\Phi_{x^1 z^2}(t)$ differs greatly from $\Pi_{x^1 z^2}(t)$. A similar discussion is possible for Figure 7j.
In order to clarify the significance of the decentralized Riccati Equation (39), we compare the performance of the optimal control function $u^*$ (35) with that of the control functions $u^{\Psi}$ (55) and $u^{\Pi}$ (56). The result is shown in Figure 8. The actual state $x^{\mathrm{act}}$ faithfully tracks the target state $x^{\mathrm{tar}}$ under the optimal control function $u^*$ (Figure 8a–c (green)). Similarly, the memory z is stably controlled under the optimal control function $u^*$ (Figure 8d,e (green)). As a result, the performance of the optimal control function $u^*$ is the best (Figure 8f (green)).

7. Discussion

In this paper, we proposed ML-DSC, which explicitly formulates the finite-dimensional memories of the controllers. In ML-DSC, each controller is designed to compress the infinite-dimensional observation history appropriately into the finite-dimensional memory and to determine the optimal control based on it. As a result, ML-DSC can handle the difficulty of the conventional DSC that arises from the finiteness of the actual memory of controllers. We demonstrated the effectiveness of ML-DSC in the LQG problem. While the conventional DSC needs to restrict the interaction among the controllers to solve the LQG problem, ML-DSC is free from such a restriction. We found that estimation and control are jointly optimized by the decentralized Riccati equation in the LQG problem of ML-DSC. Our numerical experiments showed that the decentralized Riccati equation is superior to the Riccati equation and the partially observable Riccati equation in this problem.
ML-DSC can also address non-LQG problems. In DSC, non-LQG problems cannot be solved numerically even if the number of controllers is one, which corresponds to POSC [12,13]. This is because a functional differential equation needs to be solved in the non-LQG problem of POSC, which is generally intractable, even numerically. ML-POSC and ML-DSC are more tractable than the conventional POSC and DSC because only the HJB-FP equations, which are partial differential equations, need to be solved. Previous work showed that ML-POSC is more effective than the conventional POSC in a non-LQG problem [23]. Therefore, unlike the conventional DSC, ML-DSC may also be effective for non-LQG problems.
In order to solve ML-DSC with a large number of controllers, more efficient numerical algorithms are needed because HJB-FP equations become high-dimensional partial differential equations. In order to solve high-dimensional HJB-FP equations, neural network-based algorithms have been proposed in mean-field stochastic game and control [50,51]. Therefore, by exploiting these neural network-based algorithms, we may efficiently solve ML-DSC with a large number of controllers.

Author Contributions

Conceptualization, Formal analysis, Funding acquisition, Writing— original draft, T.T. and T.J.K.; Software, Visualization, T.T. All authors have read and agreed to the published version of the manuscript.

Funding

The first author received a JSPS Research Fellowship (Grant No. 21J20436). This work was supported by JSPS KAKENHI (Grant No. 19H05799) and JST CREST (Grant No. JPMJCR2011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
COSC: Completely Observable Stochastic Control
POSC: Partially Observable Stochastic Control
DSC: Decentralized Stochastic Control
ML-POSC: Memory-Limited Partially Observable Stochastic Control
ML-DSC: Memory-Limited Decentralized Stochastic Control
DEC-POMDP: Decentralized Partially Observable Markov Decision Process
HJB: Hamilton-Jacobi-Bellman
FP: Fokker-Planck
ODE: Ordinary Differential Equation
SDE: Stochastic Differential Equation
LQ: Linear-Quadratic
LQG: Linear-Quadratic-Gaussian

Appendix A. Proof of Theorem 1

In this section, we prove Theorem 1, which is Pontryagin's minimum principle on the probability density function space. In this paper, we prove Pontryagin's minimum principle via Bellman's dynamic programming principle, which is a similar approach to that of Reference [23]. We note that Pontryagin's minimum principle can also be proved directly; this proof is almost the same as that in Reference [24] and is omitted in this paper.

Appendix A.1. Bellman’s Dynamic Programming Principle

In this subsection, we obtain the optimality condition of ML-DSC from the viewpoint of Bellman’s dynamic programming principle on the probability density function space.
The minimization of the objective function can be calculated as follows:
$$ \min_u J[u] = \min_{u_{0:T}} \left[ \int_0^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right] = \min_{u_{0:t-dt}} \left[ \int_0^{t-dt} \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \min_{u_t} \left[ \bar{f}(t, p_t, u_t)\,dt + \min_{u_{t+dt:T}} \left[ \int_{t+dt}^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right] \right] \right]. $$
Therefore, the optimal joint control function at time t is given as follows:
$$ u_t^* = \arg\min_{u_t} \left[ \bar{f}(t, p_t, u_t)\,dt + \min_{u_{t+dt:T}} \left[ \int_{t+dt}^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right] \right]. $$
We define the value function
$$ V(t, p) := \min_{u_{t:T}} \left[ \int_t^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right], $$
where $\{ p_\tau \mid \tau \in [t, T] \}$ is the solution of FP Equation (16) with $p_t = p$. We note that $V(T, p) = \bar{g}(p)$ is satisfied. From the definition of the value function, the optimal joint control function at time t can be calculated as follows:
$$ u_t^* = \arg\min_{u_t} \left[ \bar{f}(t, p_t, u_t)\,dt + V\left(t + dt,\, p_t + \mathcal{L}_{u_t}^{\dagger} p_t\,dt\right) \right] = \arg\min_{u_t} \left[ \bar{f}(t, p_t, u_t)\,dt + V(t, p_t) + \frac{\partial V(t, p_t)}{\partial t}\,dt + \int \frac{\delta V(t, p_t)}{\delta p(s_t)} \mathcal{L}_{u_t}^{\dagger} p_t(s_t)\,ds_t\,dt \right] = \arg\min_{u_t} \left[ \bar{f}(t, p_t, u_t) + \int \frac{\delta V(t, p_t)}{\delta p(s_t)} \mathcal{L}_{u_t}^{\dagger} p_t(s_t)\,ds_t \right]. $$
Since $\mathcal{L}_{u_t}$ is the conjugate of $\mathcal{L}_{u_t}^{\dagger}$ (22),
$$ u_t^* = \arg\min_{u_t} \left[ \bar{f}(t, p_t, u_t) + \int p_t(s_t) \mathcal{L}_{u_t} \frac{\delta V(t, p_t)}{\delta p(s_t)}\,ds_t \right] = \arg\min_{u_t} \mathbb{E}_{p_t(s_t)}\left[ f(t, s_t, u_t) + \mathcal{L}_{u_t} \frac{\delta V(t, p_t)}{\delta p(s_t)} \right]. $$
From the definition of the Hamiltonian (20),
$$ u_t^* = \arg\min_{u_t} \mathbb{E}_{p_t(s_t)}\left[ H\left(t, s_t, u_t, \frac{\delta V(t, p_t)}{\delta p(s_t)}\right) \right]. $$
The minimization of the Hamiltonian can be calculated as follows:
$$ \min_{u_t} \mathbb{E}_{p_t(s_t)}\left[ H\left(t, s_t, u_t, \frac{\delta V(t, p_t)}{\delta p(s_t)}\right) \right] = \min_{u_t^{-i}} \min_{u_t^i} \mathbb{E}_{p_t(s_t)}\left[ H\left(t, s_t, (u_t^i, u_t^{-i}), \frac{\delta V(t, p_t)}{\delta p(s_t)}\right) \right]. $$
Since $u_t^i$ is a function of $z_t^i$ in ML-DSC, the minimization with respect to $u_t^i$ can be exchanged with the expectation with respect to $p(z_t^i)$ as follows:
$$ \min_{u_t^i} \mathbb{E}_{p_t(s_t)}\left[ H\left(t, s_t, (u_t^i, u_t^{-i}), \frac{\delta V(t, p_t)}{\delta p(s_t)}\right) \right] = \mathbb{E}_{p_t(z_t^i)}\left[ \min_{u_t^i} \mathbb{E}_{p_t(s_t^{-i} \mid z_t^i)}\left[ H\left(t, s_t, (u_t^i, u_t^{-i}), \frac{\delta V(t, p_t)}{\delta p(s_t)}\right) \right] \right]. $$
Therefore, the optimal control function of controller i at time t is given as follows:
$$ u_t^{i*}(z_t^i) = \arg\min_{u_t^i} \mathbb{E}_{p_t(s_t^{-i} \mid z_t^i)}\left[ H\left(t, s_t, (u_t^i, u_t^{-i*}), \frac{\delta V(t, p_t)}{\delta p(s_t)}\right) \right]. $$
In order to obtain the optimal control function (A9), we need to obtain the value function V ( t , p ) . The value function V ( t , p ) can be calculated as follows:
$$ V(t, p) = \min_{u_{t:T}} \left[ \int_t^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right] = \min_{u} \left[ \bar{f}(t, p, u)\,dt + \min_{u_{t+dt:T}} \left[ \int_{t+dt}^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T) \right] \right] = \min_{u} \left[ \bar{f}(t, p, u)\,dt + V\left(t + dt,\, p + \mathcal{L}_u^{\dagger} p\,dt\right) \right] = \min_{u} \left[ \bar{f}(t, p, u)\,dt + V(t, p) + \frac{\partial V(t, p)}{\partial t}\,dt + \int \frac{\delta V(t, p)}{\delta p(s)} \mathcal{L}_u^{\dagger} p(s)\,ds\,dt \right]. $$
Therefore, the following equation is obtained:
$$
-\frac{\partial V(t, p)}{\partial t} = \min_{u} \left[ \bar{f}(t, p, u) + \int \frac{\delta V(t, p)}{\delta p(s)}\, L_{u}^{\dagger} p(s)\, ds \right]. \tag{A11}
$$
Since $L_{u}$ is the conjugate of $L_{u}^{\dagger}$ (22),
$$
-\frac{\partial V(t, p)}{\partial t} = \min_{u} \left[ \bar{f}(t, p, u) + \int p(s)\, L_{u} \frac{\delta V(t, p)}{\delta p(s)}\, ds \right]
= \min_{u} \mathbb{E}_{p(s)} \left[ f(t, s, u) + L_{u} \frac{\delta V(t, p)}{\delta p(s)} \right]. \tag{A12}
$$
From the definition of the Hamiltonian H (20),
$$
-\frac{\partial V(t, p)}{\partial t} = \min_{u} \mathbb{E}_{p(s)} \left[ H\!\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right) \right]. \tag{A13}
$$
From Equation (A6),
$$
-\frac{\partial V(t, p)}{\partial t} = \mathbb{E}_{p(s)} \left[ H\!\left(t, s, u^{*}, \frac{\delta V(t, p)}{\delta p(s)}\right) \right], \tag{A14}
$$
where $V(T, p) = \bar{g}(p)$. The functional differential Equation (A14) is called the Bellman equation. Therefore, from Bellman’s dynamic programming principle on the probability density function space, the optimal control function of ML-DSC (A9) is obtained by solving FP Equation (16) and Bellman Equation (A14).
However, Bellman Equation (A14) is a functional differential equation, which cannot be solved even numerically. Therefore, Bellman’s dynamic programming principle on the probability density function space is not practical. In the next subsection, we resolve this problem by converting Bellman’s dynamic programming principle to Pontryagin’s minimum principle on the probability density function space. This conversion technique has also been used in mean-field stochastic control [25,26] and ML-POSC [23,24].

Appendix A.2. Conversion from Bellman’s Dynamic Programming Principle to Pontryagin’s Minimum Principle

In this subsection, we prove Theorem 1 by converting Bellman’s dynamic programming principle to Pontryagin’s minimum principle on the probability density function space.
We first define
$$
W(t, p, s) := \frac{\delta V(t, p)}{\delta p(s)}, \tag{A15}
$$
which satisfies $W(T, p, s) = g(s)$. Differentiating Bellman Equation (A14) with respect to $p$, the following equation is obtained:
$$
-\frac{\partial W(t, p, s)}{\partial t} = H\!\left(t, s, u^{*}, W\right) + \mathbb{E}_{p(s')} \left[ L_{u^{*}} \frac{\delta W(t, p, s')}{\delta p(s)} \right]. \tag{A16}
$$
Since $L_{u^{*}}^{\dagger}$ is the conjugate of $L_{u^{*}}$ (22),
$$
-\frac{\partial W(t, p, s)}{\partial t} = H\!\left(t, s, u^{*}, W\right) + \int \frac{\delta W(t, p, s)}{\delta p(s')}\, L_{u^{*}}^{\dagger} p(s')\, ds'. \tag{A17}
$$
We then define
$$
w(t, s) := W(t, p_{t}, s), \tag{A18}
$$
where $p_{t}$ is the solution of FP Equation (16). The time derivative of $w(t, s)$ can be calculated as follows:
$$
\frac{\partial w(t, s)}{\partial t} = \frac{\partial W(t, p_{t}, s)}{\partial t} + \int \frac{\delta W(t, p_{t}, s)}{\delta p(s')}\, \frac{\partial p_{t}(s')}{\partial t}\, ds'. \tag{A19}
$$
By substituting Equation (A17) into Equation (A19), the following equation is obtained:
$$
-\frac{\partial w(t, s)}{\partial t} = H\!\left(t, s, u^{*}, w\right) - \int \frac{\delta W(t, p_{t}, s)}{\delta p(s')}\, \underbrace{\left( \frac{\partial p_{t}(s')}{\partial t} - L_{u^{*}}^{\dagger} p_{t}(s') \right)}_{(*)}\, ds'. \tag{A20}
$$
From FP Equation (16), $(*) = 0$ holds. Therefore, HJB Equation (24) is obtained.
From Equations (A15) and (A18), the optimal control function of ML-DSC (A9) can be calculated as follows:
$$
u^{i*}(t, z^{i}) = \operatorname*{arg\,min}_{u^{i}} \mathbb{E}_{p_{t}(s^{-i}|z^{i})} \left[ H\!\left(t, s, (u^{i}, u^{-i*}), w\right) \right]. \tag{A21}
$$
Therefore, the optimal control function of ML-DSC (A9) is obtained by solving FP Equation (16) and HJB Equation (24).

Appendix B. Proof of Theorem 2

In the LQG problem, the Hamiltonian (20) is given as follows:
$$
H(t, s, u, w) = s^{\top} Q s + \sum_{i=1}^{N} (u^{i})^{\top} R_{ii}\, u^{i} + \left( \frac{\partial w(t, s)}{\partial s} \right)^{\top} \left( A s + \sum_{i=1}^{N} B_{i} u^{i} \right) + \frac{1}{2} \mathrm{tr}\!\left( \frac{\partial^{2} w(t, s)}{\partial s\, \partial s^{\top}}\, \sigma \sigma^{\top} \right). \tag{A22}
$$
Therefore, in the LQG problem, the optimal control function of ML-DSC (19) can be calculated as follows:
$$
u^{i*}(t, z^{i}) = -\frac{1}{2} R_{ii}^{-1} B_{i}^{\top}\, \mathbb{E}_{p_{t}(s^{-i}|z^{i})} \left[ \frac{\partial w(t, s)}{\partial s} \right]. \tag{A23}
$$
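For completeness, the stationarity condition behind Equation (A23) can be made explicit. Differentiating the conditional expectation of the Hamiltonian (A22) with respect to $u^{i}$ and setting it to zero (assuming, as is standard for LQ costs, that $R_{ii}$ is symmetric and positive definite) gives
$$
\frac{\partial}{\partial u^{i}}\, \mathbb{E}_{p_{t}(s^{-i}|z^{i})}\!\left[ H(t, s, u, w) \right]
= 2 R_{ii}\, u^{i} + B_{i}^{\top}\, \mathbb{E}_{p_{t}(s^{-i}|z^{i})}\!\left[ \frac{\partial w(t, s)}{\partial s} \right] = 0,
$$
which is solved by Equation (A23).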
We assume that $p_{t}(s)$ is a Gaussian distribution,
$$
p_{t}(s) := \mathcal{N}\!\left( s \,\middle|\, \mu(t), \Sigma(t) \right), \tag{A24}
$$
and that $w(t, s)$ is a quadratic function:
$$
w(t, s) = s^{\top} \Phi(t)\, s + \alpha(t)^{\top} s + \beta(t). \tag{A25}
$$
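Differentiating the quadratic ansatz (A25) (with $\Phi(t)$ symmetric, as in the LQG examples of the main text) gives the gradient and Hessian that enter Equation (A23) and the substitution into HJB Equation (24):
$$
\frac{\partial w(t, s)}{\partial s} = 2\Phi(t)\, s + \alpha(t), \qquad
\frac{\partial^{2} w(t, s)}{\partial s\, \partial s^{\top}} = 2\Phi(t).
$$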
In this case, the optimal control function of ML-DSC (A23) can be calculated as follows:
$$
u^{i*}(t, z^{i}) = -\frac{1}{2} R_{ii}^{-1} B_{i}^{\top} \left( 2 \Phi K^{i} \hat{s} + 2 \Phi \mu + \alpha \right). \tag{A26}
$$
Since Equation (A26) is linear with respect to $\hat{s}$, $p_{t}(s)$ remains a Gaussian distribution, which is consistent with our assumption (A24).
Substituting Equations (A25) and (A26) into HJB Equation (24), the following ODEs are obtained:
$$
-\frac{d\Phi}{dt} = Q + A^{\top}\Phi + \Phi A - \Phi B R^{-1} B^{\top} \Phi + \tilde{Q}, \tag{A27}
$$
$$
-\frac{d\alpha}{dt} = \left( A - B R^{-1} B^{\top} \Phi \right)^{\top} \alpha - 2 \tilde{Q} \mu, \tag{A28}
$$
$$
-\frac{d\beta}{dt} = \mathrm{tr}\!\left( \Phi \sigma \sigma^{\top} \right) - \frac{1}{4} \alpha^{\top} B R^{-1} B^{\top} \alpha + \mu^{\top} \tilde{Q} \mu, \tag{A29}
$$
where $\tilde{Q} := \sum_{i=1}^{N} (I - K^{i})^{\top} \Phi B_{i} R_{ii}^{-1} B_{i}^{\top} \Phi (I - K^{i})$, $\Phi(T) = P$, $\alpha(T) = 0$, and $\beta(T) = 0$. If $\Phi(t)$, $\alpha(t)$, and $\beta(t)$ satisfy ODEs (A27), (A28), and (A29), respectively, HJB Equation (24) is satisfied, which is consistent with our assumption (A25).
Defining $\Upsilon(t)$ by $\alpha(t) = 2\Upsilon(t)\mu(t)$ and $\Psi(t)$ by $\Psi(t) := \Phi(t) + \Upsilon(t)$, the optimal control function of ML-DSC (A26) can be calculated as follows:
$$
u^{i*}(t, z^{i}) = -R_{ii}^{-1} B_{i}^{\top} \left( \Psi \mu + \Phi K^{i} \hat{s} \right). \tag{A30}
$$
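Indeed, substituting $\alpha = 2\Upsilon\mu$ into Equation (A26) and using $\Psi = \Phi + \Upsilon$ gives
$$
u^{i*}(t, z^{i}) = -\frac{1}{2} R_{ii}^{-1} B_{i}^{\top} \left( 2\Phi K^{i}\hat{s} + 2\Phi\mu + 2\Upsilon\mu \right)
= -R_{ii}^{-1} B_{i}^{\top} \left( (\Phi + \Upsilon)\mu + \Phi K^{i}\hat{s} \right)
= -R_{ii}^{-1} B_{i}^{\top} \left( \Psi\mu + \Phi K^{i}\hat{s} \right).
$$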
From Equations (A27) and (A28), $\Psi(t)$ is the solution of the Riccati Equation (38). From Equation (A27), $\Phi(t)$ is the solution of the decentralized Riccati Equation (39). The detailed calculations are almost the same as those in reference [23].
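As a concrete illustration, the following is a minimal numerical sketch (in Python/NumPy) of how the Riccati Equation (38) and the decentralized Riccati Equation (39) (i.e., Equation (A27)) might be integrated backward in time with an explicit Euler scheme. All matrices ($A$, $B_i$, $Q$, $R_{ii}$, $P$, and the gains $K^{i}$) are hypothetical placeholders, and Equation (38) is assumed to have the standard Riccati form $-\dot{\Psi} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top}\Psi$, consistent with Equations (A27) and (A28) above; this is only a sketch of the numerical structure, not the authors' implementation.

```python
import numpy as np

# Hypothetical problem data (for illustration only): extended state of
# dimension 3, two controllers with scalar controls.
n, N = 3, 2
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, -1.0]])
B = [np.array([[1.0], [0.0], [0.0]]),       # B_1
     np.array([[0.0], [0.0], [1.0]])]       # B_2
R = [np.array([[1.0]]), np.array([[1.0]])]  # R_11, R_22
Q = np.eye(n)
P = np.eye(n)                               # terminal cost weight
K = [np.diag([0.0, 1.0, 0.0]),              # hypothetical gain K^1
     np.diag([0.0, 0.0, 1.0])]              # hypothetical gain K^2
I = np.eye(n)

B_all = np.hstack(B)                        # stacked B = [B_1, ..., B_N]
R_all = np.block([[R[0], np.zeros((1, 1))],
                  [np.zeros((1, 1)), R[1]]])
BRB = B_all @ np.linalg.inv(R_all) @ B_all.T

T, dt = 1.0, 1e-3
steps = int(T / dt)

# Backward Euler integration from t = T to t = 0 of the Riccati equation (38)
# for Psi and of the decentralized Riccati equation (A27)/(39) for Phi.
Psi = P.copy()
Phi = P.copy()
for _ in range(steps):
    # Standard Riccati: -dPsi/dt = Q + A^T Psi + Psi A - Psi B R^-1 B^T Psi
    dPsi = Q + A.T @ Psi + Psi @ A - Psi @ BRB @ Psi
    # Decentralized correction term Q~ of (A27), built from the gains K^i
    Q_tilde = sum(
        (I - K[i]).T @ Phi @ B[i] @ np.linalg.inv(R[i]) @ B[i].T @ Phi @ (I - K[i])
        for i in range(N)
    )
    dPhi = Q + A.T @ Phi + Phi @ A - Phi @ BRB @ Phi + Q_tilde
    # Since -dX/dt = (...), stepping backward gives X(t - dt) ≈ X(t) + (...) dt.
    Psi += dPsi * dt
    Phi += dPhi * dt

print("Psi(0) =\n", Psi)
print("Phi(0) =\n", Phi)
```

The resulting $\Psi(t)$ and $\Phi(t)$ would then enter the optimal control (A30) as $u^{i*}(t, z^{i}) = -R_{ii}^{-1}B_{i}^{\top}(\Psi\mu + \Phi K^{i}\hat{s})$.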

References

1. Mahajan, A.; Teneketzis, D. On the design of globally optimal communication strategies for real-time noisy communication systems with noisy feedback. IEEE J. Sel. Areas Commun. 2008, 26, 580–595.
2. Mahajan, A.; Teneketzis, D. Optimal Design of Sequential Real-Time Communication Systems. IEEE Trans. Inf. Theory 2009, 55, 5317–5338.
3. Nayyar, A.; Teneketzis, D. Sequential Problems in Decentralized Detection with Communication. IEEE Trans. Inf. Theory 2011, 57, 5410–5435.
4. Mahajan, A.; Teneketzis, D. Optimal Performance of Networked Control Systems with Nonclassical Information Structures. SIAM J. Control Optim. 2009, 48, 1377–1404.
5. Witsenhausen, H.S. A Counterexample in Stochastic Optimum Control. SIAM J. Control 1968, 6, 131–147.
6. Nayyar, A.; Mahajan, A.; Teneketzis, D. Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach. IEEE Trans. Autom. Control 2013, 58, 1644–1658.
7. Mahajan, A.; Nayyar, A. Sufficient Statistics for Linear Control Strategies in Decentralized Systems With Partial History Sharing. IEEE Trans. Autom. Control 2015, 60, 2046–2056.
8. Charalambous, C.D.; Ahmed, N.U. Team Optimality Conditions of Distributed Stochastic Differential Decision Systems with Decentralized Noisy Information Structures. IEEE Trans. Autom. Control 2017, 62, 708–723.
9. Charalambous, C.D.; Ahmed, N.U. Centralized Versus Decentralized Optimization of Distributed Stochastic Differential Decision Systems with Different Information Structures—Part I: A General Theory. IEEE Trans. Autom. Control 2017, 62, 1194–1209.
10. Charalambous, C.D.; Ahmed, N.U. Centralized Versus Decentralized Optimization of Distributed Stochastic Differential Decision Systems with Different Information Structures—Part II: Applications. IEEE Trans. Autom. Control 2018, 63, 1913–1928.
11. Wonham, W.M. On the Separation Theorem of Stochastic Control. SIAM J. Control 1968, 6, 312–326.
12. Bensoussan, A. Stochastic Control of Partially Observable Systems; Cambridge University Press: Cambridge, UK, 1992.
13. Nisio, M. Stochastic Control Theory. In Probability Theory and Stochastic Modelling; Springer: Tokyo, Japan, 2015; Volume 72.
14. Bensoussan, A. Estimation and Control of Dynamical Systems. In Interdisciplinary Applied Mathematics; Springer International Publishing: Cham, Switzerland, 2018; Volume 48.
15. Wang, G.; Wu, Z.; Xiong, J. An Introduction to Optimal Control of FBSDE with Incomplete Information; SpringerBriefs in Mathematics; Springer International Publishing: Cham, Switzerland, 2018.
16. Bensoussan, A.; Yam, S.C.P. Mean field approach to stochastic control with partial information. ESAIM Control Optim. Calc. Var. 2021, 27, 89.
17. Lessard, L.; Lall, S. A state-space solution to the two-player decentralized optimal control problem. In Proceedings of the 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 28–30 September 2011; pp. 1559–1564.
18. Lessard, L.; Lall, S. Optimal controller synthesis for the decentralized two-player problem with output feedback. In Proceedings of the 2012 American Control Conference (ACC), Montréal, QC, Canada, 27–29 June 2012; pp. 6314–6321.
19. Lessard, L. Decentralized LQG control of systems with a broadcast architecture. In Proceedings of the 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), Maui, HI, USA, 10–13 December 2012; pp. 6241–6246.
20. Lessard, L.; Nayyar, A. Structural results and explicit solution for two-player LQG systems on a finite time horizon. In Proceedings of the 52nd IEEE Conference on Decision and Control, Firenze, Italy, 10–13 December 2013; pp. 6542–6549.
21. Lessard, L.; Lall, S. Optimal Control of Two-Player Systems With Output Feedback. IEEE Trans. Autom. Control 2015, 60, 2129–2144.
22. Nayyar, A.; Lessard, L. Structural results for partially nested LQG systems over graphs. In Proceedings of the 2015 American Control Conference (ACC), Chicago, IL, USA, 1–3 July 2015; pp. 5457–5464.
23. Tottori, T.; Kobayashi, T.J. Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach. Entropy 2022, 24, 1599.
24. Tottori, T.; Kobayashi, T.J. Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control. Entropy 2023, 25, 208.
25. Bensoussan, A.; Frehse, J.; Yam, S.C.P. The Master equation in mean field theory. J. Math. Pures Appl. 2015, 103, 1441–1474.
26. Bensoussan, A.; Frehse, J.; Yam, S.C.P. On the interpretation of the Master Equation. Stoch. Process. Their Appl. 2017, 127, 2093–2137.
27. Bensoussan, A.; Frehse, J.; Yam, P. Mean Field Games and Mean Field Type Control Theory; Springer Briefs in Mathematics; Springer: New York, NY, USA, 2013.
28. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications I. In Probability Theory and Stochastic Modelling; Springer Nature: Cham, Switzerland, 2018; Volume 83.
29. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications II. In Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2018; Volume 84.
30. Achdou, Y. Finite Difference Methods for Mean Field Games. In Hamilton-Jacobi Equations: Approximations, Numerical Analysis and Applications: Cetraro, Italy 2011; Loreti, P., Tchou, N.A., Eds.; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–47.
31. Achdou, Y.; Laurière, M. Mean Field Games and Applications: Numerical Aspects. In Mean Field Games: Cetraro, Italy 2019; Achdou, Y., Cardaliaguet, P., Delarue, F., Porretta, A., Santambrogio, F., Cardaliaguet, P., Porretta, A., Eds.; Lecture Notes in Mathematics; Springer International Publishing: Cham, Switzerland, 2020; pp. 249–307.
32. Lauriere, M. Numerical Methods for Mean Field Games and Mean Field Type Control. arXiv 2021, arXiv:2106.06231.
33. Bernstein, D.S. Bounded Policy Iteration for Decentralized POMDPs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, UK, 30 July–5 August 2005; pp. 1287–1292.
34. Bernstein, D.S.; Amato, C.; Hansen, E.A.; Zilberstein, S. Policy Iteration for Decentralized Control of Markov Decision Processes. J. Artif. Intell. Res. 2009, 34, 89–132.
35. Amato, C.; Bernstein, D.S.; Zilberstein, S. Optimizing Memory-Bounded Controllers for Decentralized POMDPs. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, 19–22 July 2007.
36. Amato, C.; Bonet, B.; Zilberstein, S. Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs. Proc. AAAI Conf. Artif. Intell. 2010, 24, 1052–1058.
37. Kumar, A.; Zilberstein, S. Anytime Planning for Decentralized POMDPs using Expectation Maximization. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010; p. 9.
38. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; SpringerBriefs in Intelligent Systems; Springer International Publishing: Cham, Switzerland, 2016.
39. Tottori, T.; Kobayashi, T.J. Forward and Backward Bellman Equations Improve the Efficiency of the EM Algorithm for DEC-POMDP. Entropy 2021, 23, 551.
40. Yong, J.; Zhou, X.Y. Stochastic Controls; Springer: New York, NY, USA, 1999.
41. Kushner, H. Optimal stochastic control. IRE Trans. Autom. Control 1962, 7, 120–122.
42. Carlini, E.; Silva, F.J. Semi-Lagrangian schemes for mean field game models. In Proceedings of the 52nd IEEE Conference on Decision and Control, Firenze, Italy, 10–13 December 2013; pp. 3115–3120.
43. Carlini, E.; Silva, F.J. A Fully Discrete Semi-Lagrangian Scheme for a First Order Mean Field Game Problem. SIAM J. Numer. Anal. 2014, 52, 45–67.
44. Carlini, E.; Silva, F.J. A semi-Lagrangian scheme for a degenerate second order mean field game system. Discret. Contin. Dyn. Syst. 2015, 35, 4269.
45. Kushner, H.J.; Dupuis, P.G. Numerical Methods for Stochastic Control Problems in Continuous Time; Springer: New York, NY, USA, 1992.
46. Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions, 2nd ed.; Springer: New York, NY, USA, 2006.
47. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley-Interscience: New York, NY, USA, 2014.
48. Charalambous, C.D.; Ahmed, N. Equivalence of decentralized stochastic dynamic decision systems via Girsanov’s measure transformation. In Proceedings of the 53rd IEEE Conference on Decision and Control, Los Angeles, CA, USA, 15–17 December 2014; pp. 439–444.
49. Telsang, B.; Djouadi, S.; Charalambous, C. Numerical Evaluation of Exact Person-by-Person Optimal Nonlinear Control Strategies of the Witsenhausen Counterexample. In Proceedings of the 2021 American Control Conference (ACC), New Orleans, LA, USA, 25–28 May 2021; pp. 1250–1255.
50. Ruthotto, L.; Osher, S.J.; Li, W.; Nurbekyan, L.; Fung, S.W. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci. USA 2020, 117, 9183–9193.
51. Lin, A.T.; Fung, S.W.; Li, W.; Nurbekyan, L.; Osher, S.J. Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games. Proc. Natl. Acad. Sci. USA 2021, 118, e2024713118.
Figure 1. Schematic diagram of (a) decentralized stochastic control (DSC) and (b) memory-limited DSC (ML-DSC). (a) DSC consists of a system and $N$ controllers. $x_t \in \mathbb{R}^{d_x}$ is the state of the target system at time $t \in [0, T]$. $y_t^i \in \mathbb{R}^{d_{y^i}}$, $y_{0:t}^i := \{ y_\tau^i \mid \tau \in [0, t] \}$, and $u_t^i \in \mathbb{R}^{d_{u^i}}$ are the observation, the observation history, and the control of controller $i$, respectively. Controller $i$ cannot accurately observe the state of the system $x_t$ and the controls of the other controllers $u_t^j$ ($j \neq i$). It can only obtain their noisy observation $y_t^i$. Then, controller $i$ determines the control $u_t^i$ based on the noisy observation history $y_{0:t}^i$, which ideally requires infinite-dimensional memory to store the observation history $y_{0:t}^i$. (b) ML-DSC explicitly formulates the finite-dimensional memory $z_t^i \in \mathbb{R}^{d_{z^i}}$. Controller $i$ compresses the infinite-dimensional observation history $y_{0:t}^i$ into the finite-dimensional memory $z_t^i$ by optimally designing the control over the memory $v_t^i \in \mathbb{R}^{d_{v^i}}$ as well as the control over the state $u_t^i \in \mathbb{R}^{d_{u^i}}$.
Figure 2. Schematic diagram of (a) completely observable stochastic control (COSC), (b) memory-limited partially observable stochastic control (ML-POSC), and (c) memory-limited decentralized stochastic control (ML-DSC). Blue and gray regions indicate the observables and the unobservables for controller 1, respectively.
Figure 4. (a–f) Trajectories of the elements of $\Psi(t) \in \mathbb{R}^{3 \times 3}$ (blue), $\Pi(t) \in \mathbb{R}^{3 \times 3}$ (orange), and $\Phi(t) \in \mathbb{R}^{3 \times 3}$ (green) in Section 6.1. They are the solutions of the Riccati Equation (38), the partially observable Riccati Equation (42), and the decentralized Riccati Equation (39), respectively. Because $\Psi(t)$, $\Pi(t)$, and $\Phi(t)$ are symmetric matrices, $\Psi_{z^1 x}(t)$, $\Psi_{z^2 x}(t)$, $\Psi_{z^2 z^1}(t)$, $\Pi_{z^1 x}(t)$, $\Pi_{z^2 x}(t)$, $\Pi_{z^2 z^1}(t)$, $\Phi_{z^1 x}(t)$, $\Phi_{z^2 x}(t)$, and $\Phi_{z^2 z^1}(t)$ are not visualized.
Figure 5. Stochastic simulations in Section 6.1. (a–c) Stochastic behaviors of (a) the state $x_t$, (b) controller 1’s memory $z_t^1$, and (c) controller 2’s memory $z_t^2$ for 100 samples in Section 6.1. (d) The expected cumulative cost computed from the 100 samples. Blue, orange, and green curves are controlled by $u^{\Psi}$ (55), $u^{\Pi}$ (56), and $u^{*}$ (35), respectively.
Figure 7. (a–j) Trajectories of the elements of $\Psi(t) \in \mathbb{R}^{4 \times 4}$ (blue), $\Pi(t) \in \mathbb{R}^{4 \times 4}$ (orange), and $\Phi(t) \in \mathbb{R}^{4 \times 4}$ (green) in Section 6.2. They are the solutions of the Riccati Equation (38), the partially observable Riccati Equation (42), and the decentralized Riccati Equation (39), respectively. $\Psi(t)$, $\Pi(t)$, and $\Phi(t)$ are symmetric matrices, and duplicate elements are not visualized.
Figure 8. Stochastic simulations in Section 6.2. (a–e) Stochastic behaviors of the actual state $x_t^{\mathrm{act}} = (x_t^{\mathrm{act},1}, x_t^{\mathrm{act},2})$ (a–c) and the memory $z_t = (z_t^1, z_t^2)$ (d,e) for 100 samples. (f) The expected cumulative cost computed from the 100 samples. Black curves indicate the target state $x_t^{\mathrm{tar}}$. Blue, orange, and green curves are controlled by $u^{\Psi}$ (55), $u^{\Pi}$ (56), and $u^{*}$ (35), respectively.