# Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach


## Abstract


## 1. Introduction

## 2. Review of Partially Observable Stochastic Control

#### 2.1. Problem Formulation

#### 2.2. Derivation of Optimal Control Function

**Proposition 1.** The optimal control function of POSC is provided by

## 3. Memory-Limited Partially Observable Stochastic Control

#### 3.1. Problem Formulation

#### 3.2. Problem Reformulation

## 4. Mean-Field Control Approach

#### 4.1. Derivation of Optimal Control Function

**Lemma 1.**

**Proof.**

**Theorem 1.**

**Proof.**

#### 4.2. Comparison with Completely Observable Stochastic Control

**Proposition 2.** The optimal control function of COSC of the extended state is provided by

**Proof.**

#### 4.3. Numerical Algorithm

## 5. Linear-Quadratic-Gaussian Problem without Memory Limitation

#### 5.1. Review of Partially Observable Stochastic Control

**Proposition 3.** In the LQG problem without memory limitation, the optimal control function of POSC (33) is provided by

#### 5.2. Memory-Limited Partially Observable Stochastic Control

**Theorem 2.**

**Proof.**

## 6. Linear-Quadratic-Gaussian Problem with Memory Limitation

#### 6.1. Problem Formulation

#### 6.2. Problem Reformulation

#### 6.3. Derivation of Optimal Control Function

**Theorem 3.**

**Proof.**

#### 6.4. Comparison with Completely Observable Stochastic Control

**Proposition 4.** In the LQG problem, the optimal control function of COSC of the extended state is provided by

#### 6.5. Numerical Algorithm

## 7. Numerical Experiments

#### 7.1. LQG Problem with Memory Limitation

#### 7.2. Non-LQG Problem

## 8. Discussion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Full Name |
| --- | --- |
| COSC | Completely Observable Stochastic Control |
| POSC | Partially Observable Stochastic Control |
| ML-POSC | Memory-Limited Partially Observable Stochastic Control |
| POMDP | Partially Observable Markov Decision Process |
| DSC | Decentralized Stochastic Control |
| LQG | Linear-Quadratic-Gaussian |
| HJB | Hamilton–Jacobi–Bellman |
| FP | Fokker–Planck |
| SDE | Stochastic Differential Equation |

## Appendix A. Proof of Lemma 1

## Appendix B. Proof of Theorem 1

## Appendix C. Proof of Proposition 2

## Appendix D. Proof of Theorem 2

## Appendix E. Proof of Theorem 3


**Figure 1.** Schematic diagram of (**a**) completely observable stochastic control (COSC), (**b**) partially observable stochastic control (POSC), and (**c**) memory-limited partially observable stochastic control (ML-POSC). The top and bottom figures represent the system and controller, respectively; ${x}_{t}\in {\mathbb{R}}^{{d}_{x}}$ is the state of the system; ${y}_{t}\in {\mathbb{R}}^{{d}_{y}}$, ${z}_{t}\in {\mathbb{R}}^{{d}_{z}}$, and ${u}_{t}\in {\mathbb{R}}^{{d}_{u}}$ are the observation, memory, and control of the controller, respectively. (**a**) In COSC, the controller completely observes the state ${x}_{t}$ and determines the control ${u}_{t}$ from it, i.e., ${u}_{t}=u(t,{x}_{t})$. Only a finite-dimensional memory is required to store the state ${x}_{t}$, and the optimal control ${u}_{t}^{*}$ is obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation, a partial differential equation. (**b**) In POSC, the controller cannot observe the state ${x}_{t}$ directly; instead, it obtains a noisy observation ${y}_{t}$ of the state. The control ${u}_{t}$ is determined from the observation history ${y}_{0:t}:=\{{y}_{\tau}\mid \tau \in [0,t]\}$, i.e., ${u}_{t}=u(t,{y}_{0:t})$. An infinite-dimensional memory is implicitly assumed to store the observation history ${y}_{0:t}$. Furthermore, obtaining the optimal control ${u}_{t}^{*}$ requires solving the Bellman equation, a functional differential equation that is generally intractable even numerically. (**c**) In ML-POSC, the controller has access only to the noisy observation ${y}_{t}$ of the state ${x}_{t}$, as in POSC. In addition, it has only a finite-dimensional memory ${z}_{t}$, which cannot completely store the observation history ${y}_{0:t}$. The controller of ML-POSC compresses the observation history ${y}_{0:t}$ into the finite-dimensional memory ${z}_{t}$ and then determines the control ${u}_{t}$ from the memory, i.e., ${u}_{t}=u(t,{z}_{t})$. The optimal control ${u}_{t}^{*}$ is obtained by solving the HJB equation (a partial differential equation), as in COSC.
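The ML-POSC loop in panel (**c**) can be made concrete with a small simulation. The following Euler–Maruyama sketch of a one-dimensional toy model is illustrative only: the drift coefficients and the filter/feedback gains `K` and `F` are hypothetical placeholders, not the paper's optimized controller.

```python
import numpy as np

def simulate_ml_posc(T=10.0, dt=0.01, seed=0):
    """Toy Euler-Maruyama sketch of the ML-POSC loop:
    hidden state x_t, noisy observation increment dy_t,
    finite-dimensional memory z_t, and control u_t = u(t, z_t).
    All coefficients are hypothetical, not the optimal solution."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x, z = 1.0, 0.0              # initial state and memory
    K, F = 1.0, 1.0              # assumed filter and feedback gains
    xs = np.empty(n)
    zs = np.empty(n)
    for i in range(n):
        u = -F * z                                         # control depends on memory only
        dy = x * dt + np.sqrt(dt) * rng.standard_normal()  # noisy observation of the state
        z += K * (dy - z * dt)                             # memory compresses the observation history
        x += (-x + u) * dt + np.sqrt(dt) * rng.standard_normal()  # controlled state dynamics
        xs[i] = x
        zs[i] = z
    return xs, zs
```

Here the memory update plays the role of a finite-dimensional filter; in the paper, the memory dynamics and the feedback law are instead obtained by solving the coupled HJB–FP system rather than being fixed a priori.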

**Figure 2.** Numerical simulation of the LQG problem with memory limitation. (**a**–**c**) Trajectories of the elements of $\Psi(t)\in {\mathbb{R}}^{2\times 2}$ and $\Pi(t)\in {\mathbb{R}}^{2\times 2}$. Because ${\Psi}_{zx}(t)={\Psi}_{xz}(t)$ and ${\Pi}_{zx}(t)={\Pi}_{xz}(t)$, ${\Psi}_{zx}(t)$ and ${\Pi}_{zx}(t)$ are not visualized. (**d**–**f**) Stochastic behaviors of the state ${x}_{t}$ (**d**), the memory ${z}_{t}$ (**e**), and the cumulative cost (**f**) for 100 samples. The expectation of the cumulative cost at $t=10$ corresponds to the objective function (70). Blue and orange curves are controlled by (71) and (58), respectively.
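Matrices such as $\Pi(t)$ in panels (**a**–**c**) are obtained by integrating a matrix ODE backward from a terminal condition. As a minimal sketch of that step, the following backward Euler integration solves the standard finite-horizon LQG Riccati equation with hypothetical coefficients `A`, `B`, `Q`, `R`; it is not the paper's coupled system for $\Psi$ and $\Pi$ in Equations (58) and (71), which are not reproduced here.

```python
import numpy as np

def riccati_backward(A, B, Q, R, QT, T=10.0, dt=0.001):
    """Backward Euler integration of the finite-horizon Riccati ODE
    -dPi/dt = A^T Pi + Pi A - Pi B R^{-1} B^T Pi + Q,  Pi(T) = QT.
    Returns Pi(0). All coefficients are illustrative stand-ins."""
    Pi = QT.copy()
    Rinv = np.linalg.inv(R)
    n = int(T / dt)
    for _ in range(n):
        dPi = A.T @ Pi + Pi @ A - Pi @ B @ Rinv @ B.T @ Pi + Q
        Pi = Pi + dt * dPi   # stepping backward in time from t = T to t = 0
    return Pi

# Feedback gain of the standard LQG control u(t, x) = -R^{-1} B^T Pi(t) x
A = np.array([[0.0]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]]); QT = np.zeros((1, 1))
Pi0 = riccati_backward(A, B, Q, R, QT)
```

For this scalar choice the stationary solution of the Riccati equation is $\Pi = 1$, which $\Pi(0)$ approaches for a long horizon.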

**Figure 3.** Numerical simulation of the non-LQG problem for the local LQG approximation (blue) and ML-POSC (orange). (**a**) Stochastic behaviors of the state ${x}_{t}$ for 100 samples. The black rectangles and cross represent the obstacles and the goal, respectively. (**b**) The objective function (74), computed from 100 samples.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tottori, T.; Kobayashi, T.J.
Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach. *Entropy* **2022**, *24*, 1599.
https://doi.org/10.3390/e24111599
