Decentralized Stochastic Control with Finite-Dimensional Memories: A Memory Limitation Approach

Decentralized stochastic control (DSC) is a stochastic optimal control problem consisting of multiple controllers. DSC assumes that each controller cannot accurately observe the target system or the other controllers. This setup results in two difficulties. One is that each controller has to memorize the infinite-dimensional observation history, which is not practical because the memory of actual controllers is limited. The other is that the reduction of infinite-dimensional sequential Bayesian estimation to a finite-dimensional Kalman filter is impossible in general DSC, even for linear-quadratic-Gaussian (LQG) problems. In order to address these issues, we propose an alternative theoretical framework to DSC: memory-limited DSC (ML-DSC). ML-DSC explicitly formulates the finite-dimensional memories of the controllers. Each controller is jointly optimized to compress the infinite-dimensional observation history into the prescribed finite-dimensional memory and to determine the control based on it. Therefore, ML-DSC is a practical formulation for actual memory-limited controllers. We demonstrate how ML-DSC works in the LQG problem. The conventional DSC cannot be solved except in special LQG problems where the information available to the controllers is independent or partially nested. We show that ML-DSC can be solved in more general LQG problems where the interaction among the controllers is not restricted.


I. INTRODUCTION
Control problems of multi-agent systems have many practical applications including real-time communication [1], decentralized detection [2], and networked control [3].
Decentralized stochastic control (DSC) is a conventional theoretical framework for the optimal control of a multi-agent system [4], [5], [6]. DSC consists of a system and multiple controllers. Because each controller cannot completely observe the state of the system or the controls of the other controllers, it determines its control based on its noisy observation history.
In order to obtain the optimal control, each controller needs to estimate the state of the system and the observation histories of the other controllers from its own observation history. Although the estimation of the state of the system can be accomplished by sequential Bayesian filtering [7], [8], the estimation of the observation histories of the other controllers is generally intractable. As a result, the conventional DSC cannot be solved except in special cases.
In order to address this problem, we propose an alternative theoretical framework to DSC which can be solved in more general cases. We call it memory-limited DSC (ML-DSC); in ML-DSC, each controller compresses its observation history into a finite-dimensional memory. Because this compression simplifies the estimation among the controllers, ML-DSC is more tractable than the conventional DSC.
ML-DSC can be solved by employing the mathematical techniques of mean-field control theory [9], [10], [11]. We show that the optimal control function of ML-DSC is obtained by jointly solving the Fokker-Planck (FP) equation and the Hamilton-Jacobi-Bellman (HJB) equation. The same system of HJB-FP equations appears in mean-field games and control [9], [10], and numerous numerical algorithms have been developed for it [12]. Therefore, unlike the conventional DSC, ML-DSC can be solved in more general cases by using these algorithms.
ML-DSC is an extension of the finite-state controller [13], [14], [15] from the discrete setting to the continuous setting. However, it is difficult to extend the algorithms for the finite-state controller to our setting because they strongly depend on discreteness. We resolve this problem by using a technique from mean-field control theory.
ML-DSC is also an extension of memory-limited partially observable stochastic control (ML-POSC) [11] from a single-agent system to a multi-agent system. The conventional POSC approach [7], [8] cannot be extended to the conventional DSC because the estimation among the controllers is much more difficult. In contrast, the ML-POSC approach can be straightforwardly extended to ML-DSC because the compression of the observation histories into finite-dimensional memories simplifies the estimation among the controllers.
We demonstrate how ML-DSC works by applying it to the linear-quadratic-Gaussian (LQG) problem. The conventional DSC can be solved in the special LQG problems where the controllers have no information about the other controllers [5], [6], or where the controllers have a nested structure [16], [17]. In contrast, ML-DSC can be solved in a more general LQG problem involving a non-nested structure. Because estimation and control are not clearly separated in the general LQG problem, the Riccati equation for control is modified to include estimation; we call the modified equation the decentralized Riccati equation in this paper. We demonstrate that the decentralized Riccati equation is superior to the conventional Riccati equation in the general LQG problem.
This paper is organized as follows: In Sec. II, we briefly review the conventional DSC. In Sec. III, we formulate ML-DSC. In Sec. IV, we solve ML-DSC based on mean-field control theory. In Sec. V, we apply ML-DSC to the LQG problem. In Sec. VI, we conclude this paper.

II. REVIEW OF DECENTRALIZED STOCHASTIC CONTROL
In this section, we briefly review the conventional DSC [5], [6]. DSC consists of a system and N controllers. x_t ∈ R^{d_x} is the state of the system at time t ∈ [0, T], which evolves by the following stochastic differential equation (SDE): where x_0 obeys p_0(x_0) and ω_t ∈ R^{d_ω} is the standard Wiener process. In DSC, because controller i cannot completely observe the state x_t and the joint control u_t, controller i obtains the observation y^i_t ∈ R^{d^i_y} instead, which evolves by the following SDE: where y^i_0 obeys p^i_0(y^i_0). Controller i determines the control u^i_t based on the observation history y^i_{0:t} := {y^i_τ | τ ∈ [0, t]} as follows: The objective function of DSC is given by the following expected cumulative cost function: where f is the cost function and g is the terminal cost function. DSC is the problem of finding the optimal control function u* that minimizes the objective function J[u]: In order to obtain the optimal control function u*, controller i needs to estimate the state of the system x_t and the observation histories of the other controllers y^j_{0:t} (j ≠ i) from its own observation history y^i_{0:t}, which is generally intractable. As a result, the conventional DSC cannot be solved except in special cases.
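The setup above can be sketched numerically. The following is a minimal Euler-Maruyama simulation of a toy DSC instance in which each controller sees only its own noisy observation; all coefficients (a, b, c, sigma, gamma) and the naive proportional control rule are illustrative assumptions, not taken from this paper.

```python
import numpy as np

# Toy Euler-Maruyama simulation of the DSC setup: two controllers, each of
# which sees only its own noisy observation y_i, never the state x_t or the
# other controller's actions.  All coefficients and the proportional rule
# u_i = -k * y_i are illustrative assumptions.
rng = np.random.default_rng(0)
dt, T = 1e-3, 1.0
steps = int(T / dt)
a, b, c = -1.0, 1.0, 1.0       # state drift, control gain, observation gain
sigma, gamma = 0.5, 0.2        # state / observation noise intensities
k = 1.0                        # proportional gain of each controller

x = rng.normal()               # x_0 ~ p_0
y = np.zeros(2)                # integrated observations of controllers 1, 2
for _ in range(steps):
    u = -k * y                 # each control uses only its own observation
    x += (a * x + b * u.sum()) * dt + sigma * rng.normal(scale=np.sqrt(dt))
    # observation SDE: dy_i = c * x dt + gamma * dnu_i (independent noises)
    y += c * x * dt + gamma * rng.normal(scale=np.sqrt(dt), size=2)
print(f"final state x_T = {x:.3f}")
```

Even in this toy instance, the key DSC restriction is visible in the loop: each control depends only on that controller's own observation, never on x or on the other observation.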

III. MEMORY-LIMITED DECENTRALIZED STOCHASTIC CONTROL
In order to address this problem, we propose ML-DSC, an alternative theoretical framework to the conventional DSC. In this section, we formulate ML-DSC.

A. Problem formulation
In this subsection, we formulate ML-DSC. In ML-DSC, controller i determines the control u^i_t based on the finite-dimensional memory z^i_t ∈ R^{d^i_z} as follows: d^i_z is determined by the dimension of the memory available to controller i. Comparing (3) and (6), the memory z^i_t can be interpreted as a compression of the observation history y^i_{0:t}. Because this compression simplifies the estimation among the controllers, ML-DSC is more tractable than the conventional DSC.
The memory z^i_t is assumed to evolve by where z^i_0 obeys p^i_0(z^i_0), and v^i_t is the control. Because (7) depends on the observation increment dy^i_t, the observation history y^i_{0:t} can be compressed into the memory z^i_t. Furthermore, because (7) depends on the control v^i_t, the memory z^i_t can be optimized through v^i_t, which can improve the estimation. We note that (7) can be extended to include intrinsic stochasticity [11].
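As a minimal sketch of such a memory SDE, the following simulates a one-dimensional memory of the assumed form dz_t = v(t, z_t) dt + kappa dy_t; the exponential-forgetting memory control v and the coefficient kappa are illustrative choices, under which the memory acts as a crude low-pass-filter estimator of the state.

```python
import numpy as np

# One-dimensional memory SDE sketch:
#   dz_t = v(t, z_t) dt + kappa * dy_t,
# i.e. a controlled drift v plus the incoming observation increment dy_t.
# With the illustrative choice v(t, z) = -kappa * c * z, the update reduces
# to a low-pass filter that tracks the state (a crude estimator); kappa and
# all coefficients are assumptions, not the paper's equations.
rng = np.random.default_rng(1)
dt, steps = 1e-3, 1000
c, gamma, kappa = 1.0, 0.2, 2.0

x, z = 1.0, 0.0                 # true state (held fixed here) and memory
for _ in range(steps):
    dy = c * x * dt + gamma * rng.normal(scale=np.sqrt(dt))
    v = -kappa * c * z          # memory control: forget old information
    z += v * dt + kappa * dy
print(f"memory z after {steps} steps = {z:.2f} (state x = {x})")
```

The memory converges toward the state up to observation noise, illustrating how the observation history is compressed into a single scalar rather than stored in full.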
The objective function of ML-DSC is given by the following expected cumulative cost function: Because the cost function f depends on the memory control v_t as well as the state control u_t, ML-DSC can take into account the memory control cost (estimation cost) as well as the state control cost (control cost) [11]. In light of the dual roles played by estimation and control, it is natural to consider the estimation cost as well as the control cost. ML-DSC optimizes the state control function u and the memory control function v based on the objective function J[u, v] as follows:

B. Problem reformulation
Although the formulation of ML-DSC in the previous subsection clarifies its relationship with the conventional DSC, it is inconvenient for further mathematical investigations.In order to resolve this problem, we reformulate ML-DSC in this subsection.This formulation is simpler and more general than the previous one.
We first define the extended state s_t as follows: where The extended state s_t evolves by the following SDE: where s_0 obeys p_0(s_0), ω̃_t ∈ R^{d_ω̃} is the standard Wiener process, ũ^i_t ∈ R^{d^i_ũ} is the control of controller i, and ũ_t := (ũ^1_t, ũ^2_t, ..., ũ^N_t) is the joint control of the N controllers. In ML-DSC, controller i determines the control ũ^i_t based on the memory z^i_t as follows: The extended state SDE (11) includes the previous state, observation, and memory SDEs (1), (2), (7) as a special case because they can be represented as follows: where p_0(s_0) = p_0(x_0)p_0(z_0). The objective function of ML-DSC is given by the following expected cumulative cost function: where f is the cost function and g is the terminal cost function. This objective function (14) is clearly more general than the previous one (8). ML-DSC is the problem of finding the optimal control function ũ* that minimizes the objective function J[ũ]: In the following sections, we mainly consider the formulation of this subsection rather than that of the previous subsection because it is simpler and more general. Moreover, we omit the tilde for notational simplicity.
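To illustrate the reformulation, the following sketch stacks a scalar state and two scalar memories into one extended state SDE with joint control ũ = (u1, u2, v1, v2); the linear dynamics and all coefficients are assumptions for illustration, not the paper's equations.

```python
import numpy as np

# Illustrative construction of the extended state s_t = (x_t, z^1_t, z^2_t):
# a scalar state plus two scalar memories.  Assuming linear dynamics
#   dx   = (a x + b1 u1 + b2 u2) dt + sx dw,
#   dz_i = v_i dt + k_i (c x dt + g_i dnu_i),
# everything stacks into one SDE ds = (A s + B u~) dt + S dW for the joint
# control u~ = (u1, u2, v1, v2) and noise dW = (dw, dnu1, dnu2).
a, c = -1.0, 1.0
b1 = b2 = 1.0
k1 = k2 = 2.0
sx, g1, g2 = 0.5, 0.2, 0.2

A = np.array([[a,      0.0, 0.0],
              [k1 * c, 0.0, 0.0],
              [k2 * c, 0.0, 0.0]])
B = np.array([[b1,  b2,  0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
S = np.array([[sx,  0.0,     0.0],
              [0.0, k1 * g1, 0.0],
              [0.0, 0.0,     k2 * g2]])
print(A.shape, B.shape, S.shape)
```

Once the state, observation, and memory SDEs are stacked this way, a single SDE for s_t replaces the three separate equations, which is exactly the convenience the reformulation buys.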

IV. MEAN-FIELD CONTROL APPROACH
If the control u^i_t were determined based on the extended state s_t, i.e., u^i_t = u^i(t, s_t), ML-DSC would be the same as the completely observable stochastic control (COSC) of the extended state, and it could be solved by the conventional COSC approach [18]. However, because ML-DSC determines the control u^i_t based solely on the memory z^i_t, i.e., u^i_t = u^i(t, z^i_t), ML-DSC cannot be approached in the same way as COSC. In this section, we propose the mean-field control approach [11] to ML-DSC.

A. Derivation of optimal control function
In this subsection, we solve ML-DSC based on mean-field control theory [11]. We first show that ML-DSC can be converted into a deterministic control problem for the probability density function. The extended state SDE (11) can be converted into the following Fokker-Planck (FP) equation: where the initial condition is given by p_0(s), and L† is the forward diffusion operator, which is defined by where D(t, s, u) := σ(t, s, u)σ^T(t, s, u). The objective function of ML-DSC (14) can be calculated as follows: where f̄(t, p, u) denotes the expectation of the cost f with respect to p. From (16) and (17), ML-DSC is converted into a deterministic control problem for p_t. As a result, ML-DSC can be approached in a similar way to deterministic control.
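For intuition on the density-control viewpoint, note that when the drift is linear and the diffusion is constant, the FP equation preserves Gaussianity, so propagating the density reduces to propagating its mean and covariance. The following sketch Euler-steps these standard moment ODEs for an assumed linear system.

```python
import numpy as np

# For a linear drift A s and constant diffusion S, the FP equation preserves
# Gaussianity, so the density is fully described by its moments:
#   dmu/dt    = A mu,
#   dSigma/dt = A Sigma + Sigma A^T + S S^T.
# Euler stepping these moment ODEs is a minimal stand-in for solving the FP
# equation forward in time; A and S below are illustrative assumptions.
A = np.array([[-1.0, 0.0],
              [1.0, -0.5]])
S = np.diag([0.5, 0.2])
mu = np.array([1.0, 0.0])
Sigma = np.eye(2)
dt = 1e-3
for _ in range(1000):
    mu = mu + A @ mu * dt
    Sigma = Sigma + (A @ Sigma + Sigma @ A.T + S @ S.T) * dt
print(mu, np.diag(Sigma))
```

The first mean component decays like exp(-t), matching the closed-form solution of the linear ODE, which is a quick sanity check on the discretization.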
Theorem 1: The optimal control function of ML-DSC is given by where s^{-i} and (u^{-i*}, u^i) are defined by and H is the Hamiltonian, which is defined by where L is the backward diffusion operator, defined by We note that L is the conjugate of L†. Here p_t(s^{-i}|z^i) = p_t(s)/∫p_t(s)ds^{-i}, p_t(s) is the solution of the FP equation (16), and V(t, p) is the solution of the following Bellman equation: where Proof: The proof is shown in Appendix A.

However, because the Bellman equation (19) is a functional differential equation, it cannot be solved even numerically. We resolve this problem by employing the mathematical technique of mean-field control theory [9], [10], [11]. This technique converts Theorem 1 into the following theorem by defining where p_t is the solution of the FP equation (16).
Theorem 2: The optimal control function of ML-DSC is given by where p_t(s^{-i}|z^i) = p_t(s)/∫p_t(s)ds^{-i}, p_t(s) is the solution of the FP equation (16), and w(t, s) is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation: where w(T, s) = g(s).
Proof: The proof is almost the same as that in [11]. While the Bellman equation (19) is a functional differential equation, the HJB equation (22) is a partial differential equation, which can be solved numerically.
The optimal control function of ML-DSC (21) is obtained by jointly solving the FP equation (16) and the HJB equation (22). The same system of HJB-FP equations appears in mean-field games and control [9], [10], and numerous numerical algorithms have been developed for it [12]. As a result, unlike the conventional DSC, ML-DSC can be solved in more general cases by using these algorithms.
One of the most basic algorithms is the forward-backward sweep method (fixed-point iteration method) [12], [19], which computes the FP equation (16) and the HJB equation (22) alternately. While the convergence of the forward-backward sweep method is not guaranteed in mean-field games and control, it is guaranteed in ML-DSC because the coupling of the HJB-FP equations is limited to the optimal control function in ML-DSC [19].
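The alternating structure of the forward-backward sweep can be sketched on a scalar LQ instance: sweep a backward Riccati-type ODE, then sweep a forward covariance ODE under the resulting gain, and repeat. The concrete ODEs below are standard LQ equations chosen for illustration; they are not the paper's HJB-FP system. In this instance the coupling is one-way, so the iteration converges after a single sweep, whereas in ML-DSC the optimal control couples the two equations.

```python
import numpy as np

# Forward-backward sweep skeleton on a scalar LQ instance (illustrative):
#   backward: -dpsi/dt  = q + 2 a psi - (b^2 / r) psi^2,  psi(T) = p
#   forward:   dsigma/dt = 2 (a - b * gain) sigma + s2,   sigma(0) = 1
# with gain = b * psi / r.  The sweeps alternate until a fixed point.
dt, T = 1e-3, 1.0
n = int(T / dt)
a, b, q, r, p, s2 = -0.5, 1.0, 1.0, 1.0, 1.0, 0.25

psi = np.full(n + 1, p)       # backward variable, terminal condition
sigma = np.full(n + 1, 1.0)   # forward variable, initial condition

for sweep in range(5):
    # backward sweep for the Riccati-type gain
    for k in range(n - 1, -1, -1):
        dpsi = q + 2 * a * psi[k + 1] - (b**2 / r) * psi[k + 1] ** 2
        psi[k] = psi[k + 1] + dpsi * dt
    # forward sweep for the covariance under the current gain
    for k in range(n):
        gain = b * psi[k] / r
        sigma[k + 1] = sigma[k] + (2 * (a - b * gain) * sigma[k] + s2) * dt
print(f"psi(0) = {psi[0]:.3f}, sigma(T) = {sigma[-1]:.3f}")
```

The backward pass relaxes psi toward the stationary Riccati solution and the forward pass propagates the covariance under the resulting feedback; in a genuinely coupled HJB-FP system the backward equation would also read the forward solution, which is what the repeated sweeps are for.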

B. Comparison with completely observable or memory-limited partially observable stochastic control
The COSC of the extended state and ML-POSC can be solved in a similar way to ML-DSC [11].
In the COSC of the extended state, because the control u^i_t is determined based on the extended state s_t, i.e., u^i_t = u^i(t, s_t), the optimal control function is given by In ML-POSC, because the control u^i_t is determined based on the joint memory z_t, i.e., u^i_t = u^i(t, z_t), the optimal control function is given by Although the HJB equation (22) is the same for COSC, ML-POSC, and ML-DSC, the optimal control functions differ. In particular, the optimal control functions of ML-POSC and ML-DSC depend on the FP equation (16) because they need to estimate unobservables from observables.
V. LINEAR-QUADRATIC-GAUSSIAN PROBLEM

In this section, we demonstrate how ML-DSC works by applying it to the general LQG problem involving a non-nested structure.

A. Problem formulation
In this subsection, we formulate the LQG problem [20]. The extended state SDE (11) is given as follows: where the initial condition is given by the Gaussian distribution p_0(s) := N(s|µ_0, Σ_0). The objective function (14) is given as follows: where Q(t) ⪰ O, R(t) ≻ O, and P ⪰ O. The objective of this problem is to find the optimal control function u* that minimizes the objective function J[u].
In this paper, we assume that R(t) is a block-diagonal matrix, as follows: If this assumption does not hold, the optimal control function cannot be derived explicitly. This problem is similar to Witsenhausen's counterexample [21].
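The role of the block-diagonal assumption can be checked numerically: it makes the quadratic control cost separate across controllers, u^T R u = Σ_i (u^i)^T R^i u^i, with no cross-terms between different controllers' controls. The concrete blocks below are illustrative.

```python
import numpy as np

# Block-diagonal R(t) means the quadratic control cost separates across
# controllers: u^T R u = sum_i (u^i)^T R^i u^i.  The blocks are assumptions
# chosen for illustration (controller 1 has a scalar control, controller 2
# a two-dimensional one).
R1 = np.array([[2.0]])
R2 = np.array([[1.0, 0.2],
               [0.2, 1.0]])
R = np.block([[R1, np.zeros((1, 2))],
              [np.zeros((2, 1)), R2]])

u1, u2 = np.array([0.5]), np.array([1.0, -1.0])
u = np.concatenate([u1, u2])
joint = u @ R @ u
separated = u1 @ R1 @ u1 + u2 @ R2 @ u2
assert np.isclose(joint, separated)
print(joint)
```

Cross-terms within a controller's own block (the 0.2 entries in R2) are allowed; only cross-terms between different controllers are excluded by the assumption.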

B. Derivation of optimal control function
In this subsection, we derive the optimal control function of the LQG problem. In the LQG problem, the probability density function of the extended state s at time t is given by the Gaussian distribution p_t(s) := N(s|µ(t), Σ(t)). Defining the stochastic extended state ŝ := s − µ, E_{p_t(s^{-i}|z^i)}[s] is given as follows: where K^i(t) is defined by and is the zero matrix except for the columns corresponding to z^i. By applying Theorem 2 to the LQG problem, we obtain the following theorem:

Theorem 3: In the LQG problem of ML-DSC, the optimal control function is given by where K^i(t) depends on Σ(t), and µ(t) and Σ(t) are the solutions of the following ordinary differential equations: where µ(0) = µ_0 and Σ(0) = Σ_0. Ψ(t) and Φ(t) are the solutions of the following ordinary differential equations: where Ψ(T) = Φ(T) = P.

Proof: The proof is shown in Appendix B.

While (32) is the Riccati equation [5], [6], [20], (33) is a new equation of ML-DSC, which we call the decentralized Riccati equation in this paper. Because estimation and control are not clearly separated in the general LQG problem [11], [16], [17], the Riccati equation (32) for control is modified to include estimation, which yields the decentralized Riccati equation (33). As a result, the decentralized Riccati equation (33) may improve estimation as well as control.
In order to support this interpretation, we analyze the decentralized Riccati equation (33) by comparing it with the Riccati equation (32). Since only the last term of (33) differs from (32), we denote it as follows: We focus on Q^N for the sake of simplicity; similar discussions are possible for Q^i (i ∈ {1, ..., N − 1}). We also write a := s^{-N} and b := z^N for notational simplicity. a is unobservable and b is observable for controller N. Q^N can be calculated as follows: where Q^N ⪰ O. Φ_aa and Φ_bb may be larger than Ψ_aa and Ψ_bb, respectively. Because Φ_aa and Φ_bb are the negative feedback gains of a and b, respectively, Q^N may decrease Σ_aa and Σ_bb. Moreover, when Σ_ab is positive/negative, Φ_ab may be smaller/larger than Ψ_ab, which may increase/decrease Σ_ab. A similar discussion is possible for Σ_ba, Φ_ba, and Ψ_ba because Σ, Φ, and Ψ are symmetric matrices. As a result, Q^N may decrease the following conditional covariance matrix: which corresponds to the estimation error of a given b. Therefore, the decentralized Riccati equation (33) may improve estimation as well as control.

C. Comparison with completely observable or memory-limited partially observable stochastic control
In the COSC of the extended state, the optimal control function is given as follows [20]: where Ψ(t) is the solution of the Riccati equation (32).
In ML-POSC, the optimal control function is given as follows [11]: where Π(t) is the solution of the partially observable Riccati equation, which is given by where Π(T) = P and K(t) is defined by The decentralized Riccati equation (33) is a natural extension of the partially observable Riccati equation (39) from a single-agent system to a multi-agent system.
D. Numerical experiment

We demonstrate ML-DSC in a numerical example given as follows: where the initial conditions are given by standard Gaussian distributions, ω̃_t := (ω_t, ν^1_t, ν^2_t) ∈ R^3 is the standard Wiener process, ũ^1 is the control of controller 1, and ũ^2 is the control of controller 2. Each controller can control the other controller's memory through c^i_t, which can be interpreted as communication. The objective function to be minimized is given as follows: Therefore, the objective of this problem is to minimize the state variance using small controls. This problem corresponds to the LQG problem defined by (25) and (26). From s_t := (x_t, z^1_t, z^2_t) ∈ R^3, the SDEs (41)-(45) can be rewritten as follows: which corresponds to (25). The objective function (46) can be rewritten as follows: which corresponds to (26). In addition, it satisfies the assumption (27) on R(t).
The Riccati equation (32) can be solved backward from the terminal condition. The partially observable Riccati equation (39) and the decentralized Riccati equation (33) can be solved by the forward-backward sweep method (fixed-point iteration method) [12], [19]. Fig. 1 shows the trajectories of Ψ(t), Π(t), and Φ(t), which are the optimal control gains of COSC, ML-POSC, and ML-DSC, respectively. While the memory controls do not appear in COSC, they appear in ML-POSC and ML-DSC (Fig. 1(b-f)), which indicates that the memory controls play an important role in estimation.
We investigate Φ by comparing it with Ψ. Φ_xx and Φ_{z^i z^i} are larger than Ψ_xx and Ψ_{z^i z^i} (Fig. 1(a,d,f)), which may decrease Σ_xx and Σ_{z^i z^i}. Moreover, Φ_{xz^i} is smaller than Ψ_{xz^i} (Fig. 1(b,c)), which may strengthen the positive correlation between x and z^i. Therefore, Φ_xx, Φ_{z^i z^i}, and Φ_{xz^i} may improve estimation, which is consistent with our discussion. However, Φ_{z^1 z^2} is larger than Ψ_{z^1 z^2} (Fig. 1(e)), which may weaken the positive correlation between z^1 and z^2. This seems contrary to our discussion because it may worsen estimation.
We compare Φ with Π to investigate Φ_{z^1 z^2}. The absolute values of the elements of Φ are larger than those of Π except for Φ_{z^1 z^2} (Fig. 1(a,b,c,d,f)). This is reasonable because estimation is more important in ML-DSC than in ML-POSC. The exception is Φ_{z^1 z^2} (Fig. 1(e)). In ML-POSC, because the estimation between z^1 and z^2 is not important, Π_{z^1 z^2} is determined only from the control perspective. Π_{z^1 z^2} is almost the same as Π_{z^i z^i} (Fig. 1(d,e,f)) because borrowing control is more efficient than increasing control. In contrast, because the estimation between z^1 and z^2 is important in ML-DSC, Φ_{z^1 z^2} is smaller than Π_{z^1 z^2} (Fig. 1(e)), which may strengthen the positive correlation between z^1 and z^2. Therefore, Φ_{z^1 z^2} is determined by a trade-off between control and estimation.
In order to clarify the significance of the decentralized Riccati equation (33), we compare the performance of the optimal control function (29) with that of the following control functions, which replace Φ with Ψ and Π, respectively: We note that the second terms are not important because µ(t) = 0 in this problem. The result is shown in Fig. 2. The expected cumulative cost of (47) is larger than that of (29) (Fig. 2(d)) because (47) does not account for the estimation of the state and the other memory. Moreover, the expected cumulative cost of (48) is larger than that of (29) (Fig. 2(d)) because (48) does not account for the estimation of the other memory. These results indicate that the decentralized Riccati equation (33) is significant in ML-DSC.
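Such cost comparisons can be reproduced in miniature by Monte Carlo estimation of the expected cumulative cost under different feedback gains. The scalar model and the two gains below are assumptions for illustration, not the paper's example.

```python
import numpy as np

# Monte Carlo estimate of the expected cumulative cost
#   J = E[ integral of (q x^2 + r u^2) dt + p x_T^2 ]
# under a constant feedback gain u = -gain * x, for an assumed scalar model.
# Comparing two gains mimics the paper's comparison of control laws.
def cost(gain, n_traj=2000, dt=1e-2, T=1.0, a=-0.5, b=1.0,
         q=1.0, r=1.0, p=1.0, s=0.5, seed=2):
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    x = rng.normal(size=n_traj)          # x_0 ~ N(0, 1)
    J = np.zeros(n_traj)
    for _ in range(steps):
        u = -gain * x
        J += (q * x**2 + r * u**2) * dt  # running cost
        x += (a * x + b * u) * dt + s * rng.normal(scale=np.sqrt(dt),
                                                   size=n_traj)
    return (J + p * x**2).mean()         # add terminal cost, average

c_opt, c_bad = cost(0.65), cost(2.5)     # near-optimal vs over-aggressive gain
print(c_opt, c_bad)
```

The near-optimal gain (roughly the stationary Riccati gain of this toy model) yields a lower estimated cost than the over-aggressive one, illustrating how gain choices are ranked by expected cumulative cost.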

VI. CONCLUSION
In this work, we proposed ML-DSC, in which each controller compresses its observation history into a finite-dimensional memory. Because this compression simplifies the estimation among the controllers, ML-DSC can be solved in general cases based on mean-field control theory. We demonstrated ML-DSC in the general LQG problem involving a non-nested structure. Because estimation and control are not clearly separated in the general LQG problem, the Riccati equation is modified into the decentralized Riccati equation, which may improve estimation as well as control. Our numerical experiment showed that the decentralized Riccati equation is superior to the conventional Riccati equations.
ML-DSC can be solved in practice even in non-LQG problems. The optimal control function of ML-DSC is obtained by solving the system of HJB-FP equations. Because the same system of HJB-FP equations appears in mean-field games and control, numerous numerical algorithms have been developed [12]. In particular, neural network-based algorithms have been proposed recently, which can solve high-dimensional problems efficiently [22], [23]. By exploiting these algorithms, we may efficiently solve ML-DSC problems consisting of a large number of agents.