Article

Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach

by Takehiro Tottori 1,* and Tetsuya J. Kobayashi 1,2,3,4
1 Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan
2 Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
3 Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8654, Japan
4 Universal Biology Institute, The University of Tokyo, Tokyo 113-8654, Japan
* Author to whom correspondence should be addressed.
Entropy 2022, 24(11), 1599; https://doi.org/10.3390/e24111599
Submission received: 22 September 2022 / Revised: 28 October 2022 / Accepted: 28 October 2022 / Published: 3 November 2022
(This article belongs to the Special Issue Information Theory in Control Systems)

Abstract

Control problems with incomplete information and memory limitation appear in many practical situations. Although partially observable stochastic control (POSC) is a conventional theoretical framework that considers the optimal control problem with incomplete information, it cannot consider memory limitation. Furthermore, POSC cannot be solved in practice except in special cases. In order to address these issues, we propose an alternative theoretical framework, memory-limited POSC (ML-POSC). ML-POSC directly considers memory limitation as well as incomplete information, and it can be solved in practice by employing the technique of mean-field control theory. ML-POSC can generalize the linear-quadratic-Gaussian (LQG) problem to include memory limitation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation is modified to the partially observable Riccati equation, which improves estimation as well as control. Furthermore, we demonstrate the effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation.

1. Introduction

Control problems of systems with incomplete information and memory limitation appear in many practical situations. These constraints become especially predominant when designing the control of small devices [1,2], and are important for understanding the control mechanisms of biological systems [3,4,5,6,7,8] because their sensors are extremely noisy and their controllers can only have severely limited memories.
Partially observable stochastic control (POSC) is a conventional theoretical framework that considers the optimal control problem with one of these constraints, namely, the incomplete information of the system state (Figure 1b) [9]. Because the POSC controller cannot completely observe the state of the system, it determines the control based on the noisy observation history of the state. POSC can be solved in principle [10,11,12] by converting it to a completely observable stochastic control (COSC) of the posterior probability of the state, as the posterior probability represents the sufficient statistics of the observation history. The posterior probability and the optimal control are obtained by solving the Zakai equation and the Bellman equation, respectively.
However, POSC has three practical problems with respect to the implementation of the controller, which originate from ignoring the other constraint, namely, the memory limitation of the controller [1,2]. First, a controller designed by POSC should ideally have an infinite-dimensional memory to store and compute the posterior probability from the observation history. Second, the memory of the controller cannot have intrinsic stochasticity other than the observation noise if it is to compute the posterior probability accurately via the Zakai equation. Third, POSC does not consider the cost originating from the memory update, which can be regarded as a cost of estimation. In light of the dualistic roles played by estimation and control, considering only the control cost while ignoring the estimation cost is asymmetric. As a result, POSC is not practical for control problems where the memory size, noise, and cost are non-negligible. Therefore, we need an alternative theoretical framework considering memory limitation to circumvent these three problems.
Furthermore, POSC has another crucial problem in obtaining the optimal state control by solving the Bellman equation [3,4]. Because the posterior probability of the state is infinite-dimensional, POSC corresponds to an infinite-dimensional COSC. In the infinite-dimensional COSC, the Bellman equation becomes a functional differential equation, which needs to be solved in order to obtain the optimal state control. However, solving a functional differential equation is generally intractable, even numerically.
In this work, we propose an alternative theoretical framework to the conventional POSC which can address the above-mentioned two issues. We call it memory-limited POSC (ML-POSC), in which memory limitation as well as incomplete information is directly taken into account (Figure 1c). The conventional POSC derives the Zakai equation without considering memory limitations. Then, the optimal state control is supposed to be derived by solving the Bellman equation, even though we do not have any practical way to do this. In contrast, ML-POSC first postulates the finite-dimensional and stochastic memory dynamics explicitly by taking the memory limitation into account and then jointly optimizes the memory dynamics and state control by considering the memory and control costs. As a result, unlike the conventional POSC, ML-POSC finds both the optimal state control and the optimal memory dynamics under given memory limitations. Furthermore, we show that the Bellman equation of ML-POSC can be reduced to the Hamilton–Jacobi–Bellman (HJB) equation by employing a technique from the mean-field control theory [13,14,15]. While the Bellman equation is a functional differential equation, the HJB equation is a partial differential equation. As a result, ML-POSC can be solved, at least numerically.
The idea behind ML-POSC is closely related to that of the finite-state controller [16,17,18,19,20,21,22]. Finite-state controllers have been studied in the partially observable Markov decision process (POMDP), that is, the discrete-time and discrete-state version of POSC. The finite-dimensional memory of ML-POSC can be regarded as an extension of the finite-state controller of POMDP to the continuous time and state setting. Nonetheless, the algorithms for the finite-state controller cannot be directly extended to the continuous setting, as they strongly depend on the discreteness. Although Fox and Tishby extended the finite-state controller to the continuous setting, their algorithm is restricted to a special case [1,2]. ML-POSC resolves this problem by employing the technique of the mean-field control theory.
In the linear-quadratic-Gaussian (LQG) problem of the conventional POSC, the Zakai equation and the Bellman equation are reduced to the Kalman filter and the Riccati equation, respectively [9,23]. Because the infinite-dimensional Zakai equation is reduced to the finite-dimensional Kalman filter, the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. We show that the Kalman filter corresponds to the optimal memory dynamics of ML-POSC. Moreover, ML-POSC can generalize the LQG problem to include memory limitations such as the memory noise and cost. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation for control is modified to include estimation, which in this paper is called the partially observable Riccati equation. We demonstrate that the partially observable Riccati equation is superior to the conventional Riccati equation as concerns the LQG problem with memory limitation.
Then, we investigate the potential effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation of the conventional POSC [3,4]. In the local LQG approximation, the Zakai equation and the Bellman equation are locally approximated by the Kalman filter and the Riccati equation, respectively. Because the Bellman equation (a functional differential equation) is reduced to the Riccati equation (an ordinary differential equation), the local LQG approximation can be solved numerically. However, the performance of the local LQG approximation may be poor in a highly non-LQG problem, as the local LQG approximation ignores non-LQG information. In contrast, ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information. We demonstrate that ML-POSC can provide a better result than the local LQG approximation in a non-LQG problem.
This paper is organized as follows: In Section 2, we briefly review the conventional POSC. In Section 3, we formulate ML-POSC. In Section 4, we propose the mean-field control approach to ML-POSC. In Section 5, we investigate the LQG problem of the conventional POSC based on ML-POSC. In Section 6, we generalize the LQG problem to include memory limitation. In Section 7, we show numerical experiments involving an LQG problem with memory limitation and a non-LQG problem. Finally, in Section 8, we discuss our work.

2. Review of Partially Observable Stochastic Control

In this section, we briefly review the conventional POSC [11,15].

2.1. Problem Formulation

In this subsection, we formulate the conventional POSC [11,15]. The state $x_t \in \mathbb{R}^{d_x}$ and the observation $y_t \in \mathbb{R}^{d_y}$ at time $t \in [0, T]$ evolve by the following stochastic differential equations (SDEs):
$dx_t = b(t, x_t, u_t)\,dt + \sigma(t, x_t, u_t)\,d\omega_t$, (1)
$dy_t = h(t, x_t)\,dt + \gamma(t)\,d\nu_t$, (2)
where $x_0$ and $y_0$ obey $p_0(x_0)$ and $p_0(y_0)$, respectively, $\omega_t \in \mathbb{R}^{d_\omega}$ and $\nu_t \in \mathbb{R}^{d_\nu}$ are independent standard Wiener processes, and $u_t \in \mathbb{R}^{d_u}$ is the control. Here, $\gamma(t)\gamma^{\top}(t)$ is assumed to be invertible. In POSC, because the controller cannot completely observe the state $x_t$, the control $u_t$ is determined based on the observation history $y_{0:t} := \{y_\tau \mid \tau \in [0, t]\}$, as follows:
$u_t = u(t, y_{0:t})$. (3)
The objective function of POSC is provided by the following expected cumulative cost function:
$J[u] := \mathbb{E}_{p(x_{0:T}, y_{0:T}; u)}\left[\int_0^T f(t, x_t, u_t)\,dt + g(x_T)\right]$, (4)
where $f$ is the cost function, $g$ is the terminal cost function, $p(x_{0:T}, y_{0:T}; u)$ is the probability of $x_{0:T}$ and $y_{0:T}$ given $u$ as a parameter, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to the probability $p$. Throughout this paper, the time horizon $T$ is assumed to be finite.
POSC is the problem of finding the optimal control function $u^*$ that minimizes the objective function $J[u]$ as follows:
$u^* := \mathop{\mathrm{argmin}}_{u} J[u]$. (5)

2.2. Derivation of Optimal Control Function

In this subsection, we briefly review the derivation of the optimal control function of the conventional POSC [11,15]. We first define the unnormalized posterior probability density function $q_t(x) := p(x_t = x, y_{0:t})$. We omit $y_{0:t}$ for notational simplicity. Here, $q_t(x)$ obeys the following Zakai equation:
$dq_t(x) = \mathcal{L}^{\dagger} q_t(x)\,dt + q_t(x)\, h^{\top}(t, x)\,(\gamma(t)\gamma^{\top}(t))^{-1}\,dy_t$, (6)
where $q_0(x) = p_0(x) p_0(y)$ and $\mathcal{L}^{\dagger}$ is the forward diffusion operator, which is defined by
$\mathcal{L}^{\dagger} q(x) := -\sum_{i=1}^{d_x} \frac{\partial \left(b_i(t, x, u)\, q(x)\right)}{\partial x_i} + \frac{1}{2}\sum_{i,j=1}^{d_x} \frac{\partial^2 \left(D_{ij}(t, x, u)\, q(x)\right)}{\partial x_i \partial x_j}$, (7)
where $D(t, x, u) := \sigma(t, x, u)\sigma^{\top}(t, x, u)$. Then, the objective function (4) can be calculated as follows:
$J[u] = \mathbb{E}_{p(q_{0:T}; u)}\left[\int_0^T \bar{f}(t, q_t, u_t)\,dt + \bar{g}(q_T)\right]$, (8)
where $\bar{f}(t, q, u) := \mathbb{E}_{q(x)}[f(t, x, u)]$ and $\bar{g}(q) := \mathbb{E}_{q(x)}[g(x)]$. From (6) and (8), POSC is converted into a COSC of $q_t$. As a result, POSC can be approached in a similar way as COSC, and the optimal control function is provided by the following proposition.
Proposition 1 
([11,15]). The optimal control function of POSC is provided by
$u^*(t, q) = \mathop{\mathrm{argmin}}_{u}\, \mathbb{E}_{q(x)}\left[H\left(t, x, u, \frac{\delta V(t, q)}{\delta q(x)}\right)\right]$, (9)
where $H$ is the Hamiltonian, which is defined by
$H\left(t, x, u, \frac{\delta V(t, q)}{\delta q(x)}\right) := f(t, x, u) + \mathcal{L}\,\frac{\delta V(t, q)}{\delta q(x)}$. (10)
$\mathcal{L}$ is the backward diffusion operator, which is defined by
$\mathcal{L} q(x) := \sum_{i=1}^{d_x} b_i(t, x, u)\, \frac{\partial q(x)}{\partial x_i} + \frac{1}{2}\sum_{i,j=1}^{d_x} D_{ij}(t, x, u)\, \frac{\partial^2 q(x)}{\partial x_i \partial x_j}$. (11)
We note that $\mathcal{L}$ is the conjugate of $\mathcal{L}^{\dagger}$; furthermore, $V(t, q)$ is the value function, which is the solution of the following Bellman equation:
$-\frac{\partial V(t, q)}{\partial t} = \mathbb{E}_{q(x)}\left[H\left(t, x, u^*, \frac{\delta V(t, q)}{\delta q(x)}\right)\right] + \frac{1}{2}\,\mathbb{E}_{q(x)q(x')}\left[\frac{\delta^2 V(t, q)}{\delta q(x)\,\delta q(x')}\, h^{\top}(t, x)\,(\gamma(t)\gamma^{\top}(t))^{-1}\, h(t, x')\right]$, (12)
where $V(T, q) = \mathbb{E}_{q(x)}[g(x)]$.
Proof. 
The proof is shown in [11,15]. □
The optimal control function $u^*(t, q)$ is obtained by solving the Bellman Equation (12). The controller determines the optimal control $u_t^* = u^*(t, q_t)$ based on the posterior probability $q_t$. The posterior probability $q_t$ is obtained by solving the Zakai Equation (6). As a result, POSC can be solved in principle.
However, POSC has three practical problems with respect to the memory of the controller. First, the controller should have an infinite-dimensional memory to store and compute the posterior probability $q_t$ from the observation history $y_{0:t}$. Second, the memory of the controller cannot have intrinsic stochasticity other than the observation $dy_t$ if it is to compute the posterior probability $q_t$ accurately via the Zakai Equation (6). Third, POSC does not consider the cost originating from the memory update, which can be regarded as a cost of estimation. In light of the dualistic roles played by estimation and control, considering only the control cost while ignoring the estimation cost is asymmetric. As a result, POSC is not practical for control problems where the memory size, noise, and cost are non-negligible.
Furthermore, POSC has another crucial problem in obtaining the optimal control function $u^*(t, q)$ by solving the Bellman Equation (12). Because the posterior probability $q$ is infinite-dimensional, the associated Bellman Equation (12) becomes a functional differential equation. However, solving a functional differential equation is generally intractable, even numerically. As a result, POSC cannot be solved in practice.

3. Memory-Limited Partially Observable Stochastic Control

In order to address the above-mentioned problems, we propose an alternative theoretical framework to the conventional POSC called ML-POSC. In this section, we formulate ML-POSC.

3.1. Problem Formulation

In this subsection, we formulate ML-POSC. ML-POSC determines the control $u_t$ based on the finite-dimensional memory $z_t \in \mathbb{R}^{d_z}$ as follows:
$u_t = u(t, z_t)$. (13)
The memory dimension $d_z$ is determined not by the optimization but by the prescribed memory limitation of the controller to be used. Comparing (3) and (13), the memory $z_t$ can be interpreted as the compression of the observation history $y_{0:t}$. While the conventional POSC compresses the observation history $y_{0:t}$ into the infinite-dimensional posterior probability $q_t$, ML-POSC compresses it into the finite-dimensional memory $z_t$.
ML-POSC formulates the memory dynamics with the following SDE:
$dz_t = c(t, z_t, v_t)\,dt + \kappa(t, z_t, v_t)\,dy_t + \eta(t, z_t, v_t)\,d\xi_t$, (14)
where $z_0$ obeys $p_0(z_0)$, $\xi_t \in \mathbb{R}^{d_\xi}$ is the standard Wiener process, and $v_t = v(t, z_t) \in \mathbb{R}^{d_v}$ is the control for the memory dynamics. This memory dynamics has three important properties: (i) because it depends on the observation $dy_t$, the memory $z_t$ can be interpreted as the compression of the observation history $y_{0:t}$; (ii) because it depends on the standard Wiener process $d\xi_t$, ML-POSC can consider the memory noise explicitly; (iii) because it depends on the control $v_t$, it can be optimized through the control $v_t$.
The objective function of ML-POSC is provided by the following expected cumulative cost function:
$J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[\int_0^T f(t, x_t, u_t, v_t)\,dt + g(x_T)\right]$. (15)
Because the cost function $f$ depends on the memory control $v_t$ as well as the state control $u_t$, ML-POSC can consider the memory control cost (state estimation cost) as well as the state control cost explicitly.
ML-POSC optimizes the state control function $u$ and the memory control function $v$ based on the objective function $J[u, v]$, as follows:
$(u^*, v^*) := \mathop{\mathrm{argmin}}_{u, v} J[u, v]$. (16)
ML-POSC first postulates the finite-dimensional and stochastic memory dynamics explicitly, then jointly optimizes the state and memory control function by considering the state and memory control cost. As a result, unlike the conventional POSC, ML-POSC can consider memory limitation as well as incomplete information.
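To make the setup of (1), (2), and (14)–(16) concrete, the following minimal Python sketch simulates the coupled state, observation, and memory SDEs by the Euler–Maruyama scheme for given control functions u(t, z) and v(t, z). The scalar coefficients (drift b = -x + u, sigma = 1, h(x) = x, gamma = 1, kappa = 1, eta = 0.1) and the proportional feedbacks used at the end are illustrative assumptions, not quantities taken from this paper.

import numpy as np

def simulate(u, v, T=1.0, dt=1e-3, rng=np.random.default_rng(0)):
    n = int(T / dt)
    x, z = 1.0, 0.0                           # initial state x0 and memory z0
    traj = np.zeros((n, 2))
    for k in range(n):
        t = k * dt
        ut, vt = u(t, z), v(t, z)             # controls depend only on the memory z
        dw, dnu, dxi = rng.normal(0.0, np.sqrt(dt), 3)
        dy = x * dt + dnu                     # dy = h(x) dt + gamma dnu
        x += (-x + ut) * dt + dw              # dx = b(x, u) dt + sigma domega
        z += vt * dt + 1.0 * dy + 0.1 * dxi   # dz = c dt + kappa dy + eta dxi
        traj[k] = (x, z)
    return traj

traj = simulate(u=lambda t, z: -z, v=lambda t, z: -z)   # hypothetical feedbacks

The expected cumulative cost (15) can then be estimated by averaging the running and terminal costs over many such trajectories.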

3.2. Problem Reformulation

Although the formulation of ML-POSC in the previous subsection clarifies its relationship with that of the conventional POSC, it is inconvenient for further mathematical investigations. In order to resolve this problem, we reformulate ML-POSC in this subsection. The formulation in this subsection is simpler and more general than that in the previous subsection.
We first define the extended state $s_t$ as follows:
$s_t := \begin{pmatrix} x_t \\ z_t \end{pmatrix} \in \mathbb{R}^{d_s}$, (17)
where $d_s = d_x + d_z$. The extended state $s_t$ evolves by the following SDE:
$ds_t = \tilde{b}(t, s_t, \tilde{u}_t)\,dt + \tilde{\sigma}(t, s_t, \tilde{u}_t)\,d\tilde{\omega}_t$, (18)
where $s_0$ obeys $p_0(s_0)$, $\tilde{\omega}_t \in \mathbb{R}^{d_{\tilde{\omega}}}$ is the standard Wiener process, and $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ is the control. ML-POSC determines the control $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ based solely on the memory $z_t$, as follows:
$\tilde{u}_t = \tilde{u}(t, z_t)$. (19)
The extended state SDE (18) includes the previous state, observation, and memory SDEs (1), (2) and (14) as a special case; they can be represented as follows:
$ds_t = \begin{pmatrix} b(t, x_t, u_t) \\ c(t, z_t, v_t) + \kappa(t, z_t, v_t)\, h(t, x_t) \end{pmatrix} dt + \begin{pmatrix} \sigma(t, x_t, u_t) & O & O \\ O & \kappa(t, z_t, v_t)\gamma(t) & \eta(t, z_t, v_t) \end{pmatrix} \begin{pmatrix} d\omega_t \\ d\nu_t \\ d\xi_t \end{pmatrix}$, (20)
where $p_0(s_0) = p_0(x_0)\, p_0(z_0)$.
The objective function of ML-POSC is provided by the following expected cumulative cost function:
$J[\tilde{u}] := \mathbb{E}_{p(s_{0:T}; \tilde{u})}\left[\int_0^T \tilde{f}(t, s_t, \tilde{u}_t)\,dt + \tilde{g}(s_T)\right]$, (21)
where $\tilde{f}$ is the cost function and $\tilde{g}$ is the terminal cost function. It is obvious that this objective function (21) is more general than the previous one (15).
ML-POSC is the problem of finding the optimal control function $\tilde{u}^*$ that minimizes the objective function $J[\tilde{u}]$ as follows:
$\tilde{u}^* := \mathop{\mathrm{argmin}}_{\tilde{u}} J[\tilde{u}]$. (22)
In the following section, we mainly consider the formulation in this subsection rather than that of the previous subsection, as it is simpler and more general. Moreover, we omit $\tilde{\cdot}$ for notational simplicity.

4. Mean-Field Control Approach

If the control u t is determined based on the extended state s t , i.e., u t = u ( t , s t ) , ML-POSC is the same as COSC of the extended state s t , and can be solved by the conventional COSC approach [10]. However, because ML-POSC determines the control u t based solely on the memory z t , i.e., u t = u ( t , z t ) , ML-POSC cannot be solved in a similar way as COSC. In order to solve ML-POSC, we propose the mean-field control approach in this section. Because the mean-field control approach is more general than the COSC approach, it can solve COSC and ML-POSC in a unified way.

4.1. Derivation of Optimal Control Function

In this subsection, we propose the mean-field control approach to ML-POSC. We first show that ML-POSC can be converted into a deterministic control of the probability density function, which is similar to the conventional POSC [11,15]. This approach is used in the mean-field control as well [13,14,24,25]. The extended state SDE (18) can be converted into the following Fokker–Planck (FP) equation:
$\frac{\partial p_t(s)}{\partial t} = \mathcal{L}^{\dagger} p_t(s)$, (23)
where the initial condition is provided by $p_0(s)$ and the forward diffusion operator $\mathcal{L}^{\dagger}$ is defined by (7). The objective function of ML-POSC (21) can be calculated as follows:
$J[u] = \int_0^T \bar{f}(t, p_t, u_t)\,dt + \bar{g}(p_T)$, (24)
where $\bar{f}(t, p, u) := \mathbb{E}_{p(s)}[f(t, s, u)]$ and $\bar{g}(p) := \mathbb{E}_{p(s)}[g(s)]$. From (23) and (24), ML-POSC is converted into a deterministic control of $p_t$. As a result, ML-POSC can be approached in a similar way as the deterministic control, and the optimal control function is provided by the following lemma.
Lemma 1. 
The optimal control function of ML-POSC is provided by
$u^*(t, z) = \mathop{\mathrm{argmin}}_{u}\, \mathbb{E}_{p_t(x|z)}\left[H\left(t, s, u, \frac{\delta V(t, p_t)}{\delta p(s)}\right)\right]$, (25)
where $H$ is the Hamiltonian (10), $p_t(x|z) = p_t(s) / \int p_t(s)\,dx$ is the conditional probability density function of the state $x$ given the memory $z$, $p_t(s)$ is the solution of the FP Equation (23), and $V(t, p)$ is the solution of the following Bellman equation:
$-\frac{\partial V(t, p)}{\partial t} = \mathbb{E}_{p(s)}\left[H\left(t, s, u^*, \frac{\delta V(t, p)}{\delta p(s)}\right)\right]$, (26)
where $V(T, p) = \mathbb{E}_{p(s)}[g(s)]$.
Proof. 
The proof is shown in Appendix A. □
The controller of ML-POSC determines the optimal control u t * = u * ( t , z t ) based on the memory z t , not the posterior probability q t . Therefore, ML-POSC can consider memory limitation as well as incomplete information.
However, because the Bellman Equation (26) is a functional differential equation, it cannot be solved, even numerically, which is the same problem as the conventional POSC. We resolve this problem by employing the technique of the mean-field control theory [13,14] as follows.
Theorem 1. 
The optimal control function of ML-POSC is provided by
$u^*(t, z) = \mathop{\mathrm{argmin}}_{u}\, \mathbb{E}_{p_t(x|z)}\left[H\left(t, s, u, w(t, s)\right)\right]$, (27)
where $H$ is the Hamiltonian (10), $p_t(x|z) = p_t(s) / \int p_t(s)\,dx$ is the conditional probability density function of the state $x$ given the memory $z$, $p_t(s)$ is the solution of the FP Equation (23), and $w(t, s)$ is the solution of the following Hamilton–Jacobi–Bellman (HJB) equation:
$-\frac{\partial w(t, s)}{\partial t} = H\left(t, s, u^*, w(t, s)\right)$, (28)
where $w(T, s) = g(s)$.
Proof. 
The proof is shown in Appendix B. □
While the Bellman Equation (26) is a functional differential equation, the HJB Equation (28) is a partial differential equation. As a result, unlike the conventional POSC, ML-POSC can be solved in practice.
We note that the mean-field control technique is applicable to the conventional POSC as well, and we obtain the HJB equation of the conventional POSC [15]. However, the HJB equation of the conventional POSC is not closed by a partial differential equation due to the last term of the Bellman Equation (12). As a result, the mean-field control technique is not effective with the conventional POSC except in a special case [15].
In the conventional POSC, the state estimation (memory control) and the state control are clearly separated. As a result, the state estimation and the state control are optimized by the Zakai Equation (6) and the Bellman Equation (12), respectively. In contrast, because ML-POSC considers memory limitation as well as incomplete information, the state estimation and the state control are not clearly separated. As a result, ML-POSC jointly optimizes the state estimation and the state control based on the FP Equation (23) and the HJB Equation (28).

4.2. Comparison with Completely Observable Stochastic Control

In this subsection, we show the similarities and differences between ML-POSC and COSC of the extended state. While ML-POSC determines the control u t based solely on the memory z t , i.e., u t = u ( t , z t ) , COSC of the extended state determines the control u t based on the extended state s t , i.e., u t = u ( t , s t ) . The optimal control function of COSC of the extended state is provided by the following proposition.
Proposition 2 
([10]). The optimal control function of COSC of the extended state is provided by
$u^*(t, s) = \mathop{\mathrm{argmin}}_{u}\, H\left(t, s, u, w(t, s)\right)$, (29)
where $H$ is the Hamiltonian (10) and $w(t, s)$ is the solution of the HJB Equation (28).
Proof. 
The conventional proof is shown in [10]. We note that it can be proven in a similar way as ML-POSC, which is shown in Appendix C. □
Although the HJB Equation (28) is the same between ML-POSC and COSC, the optimal control function is different. While the optimal control function of COSC is provided by the minimization of the Hamiltonian (29), that of ML-POSC is provided by the minimization of the conditional expectation of the Hamiltonian (27). This is reasonable, as the controller of ML-POSC needs to estimate the state from the memory.

4.3. Numerical Algorithm

In this subsection, we briefly explain a numerical algorithm to obtain the optimal control function of ML-POSC (27). Because the optimal control function of COSC (29) depends only on the backward HJB Equation (28), it can be obtained by solving the HJB equation backwards from the terminal condition [10,26,27]. In contrast, because the optimal control function of ML-POSC (27) depends on the forward FP Equation (23) as well as the backward HJB Equation (28), it cannot be obtained in a similar way as COSC. Because the backward HJB equation depends on the forward FP equation through the optimal control function of ML-POSC, the HJB equation cannot be solved backwards from the terminal condition. As a result, ML-POSC needs to solve the system of HJB-FP equations.
The system of HJB-FP equations appears in the mean-field game and control [28,29,30], and many numerical algorithms have been developed [31,32,33]. Therefore, unlike the conventional POSC, ML-POSC can be solved in practice using these algorithms. Furthermore, unlike the mean-field game and control, the coupling of HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this property, more efficient algorithms may be proposed for ML-POSC [34].
In this paper, we use the forward–backward sweep method (the fixed-point iteration method) to obtain the optimal control function of ML-POSC [33,34,35,36,37], which is one of the most basic algorithms for the system of HJB-FP equations. The forward–backward sweep method computes the forward FP Equation (23) and the backward HJB Equation (28) alternately. In the mean-field game and control, the convergence of the forward–backward sweep method is not guaranteed. In contrast, it is guaranteed in ML-POSC because the coupling of HJB-FP equations is limited to the optimal control function [34].
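The forward–backward sweep described above can be summarized by the following Python skeleton. It is only a schematic sketch: solve_fp_forward, solve_hjb_backward, and argmin_hamiltonian stand for grid-based solvers of (23), (28), and (27) that the user must supply; they are not functions defined in this paper or in any particular library.

import numpy as np

def forward_backward_sweep(p0, u_init, solve_fp_forward, solve_hjb_backward,
                           argmin_hamiltonian, n_iter=50, tol=1e-6):
    """Alternate the forward FP equation (23) and the backward HJB equation (28)."""
    u, w_prev = u_init, None
    for _ in range(n_iter):
        p = solve_fp_forward(p0, u)          # p_t(s) on [0, T] under the current control
        w = solve_hjb_backward(p, u)         # w(t, s) backwards from w(T, s) = g(s)
        u = argmin_hamiltonian(p, w)         # u*(t, z) = argmin_u E_{p_t(x|z)}[H(t, s, u, w)]
        if w_prev is not None and np.max(np.abs(w - w_prev)) < tol:
            break
        w_prev = w
    return u, w

In ML-POSC, the coupling between the two equations enters only through the optimal control function, which is why the convergence of this iteration is guaranteed [34].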

5. Linear-Quadratic-Gaussian Problem without Memory Limitation

In the LQG problem of the conventional POSC, the Zakai Equation (6) and the Bellman Equation (12) are reduced to the Kalman filter and the Riccati equation, respectively [9,23]. Because the infinite-dimensional Zakai equation is reduced to the finite-dimensional Kalman filter, the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. In this section, we briefly review the LQG problem of the conventional POSC, then reproduce the Kalman filter and the Riccati equation from the viewpoint of ML-POSC. The LQG problem of the conventional POSC corresponds to the LQG problem without memory limitation, as it does not consider the memory noise and cost.

5.1. Review of Partially Observable Stochastic Control

In this subsection, we briefly review the LQG problem of the conventional POSC [9,23]. The state $x_t \in \mathbb{R}^{d_x}$ and the observation $y_t \in \mathbb{R}^{d_y}$ at time $t \in [0, T]$ evolve by the following SDEs:
$dx_t = \left(A(t) x_t + B(t) u_t\right) dt + \sigma(t)\,d\omega_t$, (30)
$dy_t = H(t) x_t\,dt + \gamma(t)\,d\nu_t$, (31)
where $x_0$ obeys the Gaussian distribution $p_0(x_0) = \mathcal{N}(x_0 \mid \mu_{x,0}, \Sigma_{xx,0})$, $y_0$ is an arbitrary real vector, $\omega_t \in \mathbb{R}^{d_\omega}$ and $\nu_t \in \mathbb{R}^{d_\nu}$ are independent standard Wiener processes, and $u_t = u(t, y_{0:t}) \in \mathbb{R}^{d_u}$ is the control. Here, $\gamma(t)\gamma^{\top}(t)$ is assumed to be invertible. The objective function is provided by the following expected cumulative cost function:
$J[u] := \mathbb{E}_{p(x_{0:T}, y_{0:T}; u)}\left[\int_0^T \left(x_t^{\top} Q(t) x_t + u_t^{\top} R(t) u_t\right) dt + x_T^{\top} P x_T\right]$, (32)
where $Q(t) \succeq O$, $R(t) \succ O$, and $P \succeq O$. The LQG problem of the conventional POSC is to find the optimal control function $u^*$ that minimizes the objective function $J[u]$, as follows:
$u^* := \mathop{\mathrm{argmin}}_{u} J[u]$. (33)
In the LQG problem of the conventional POSC, the posterior probability is provided by the Gaussian distribution $p(x_t \mid y_{0:t}) = \mathcal{N}(x_t \mid \check{\mu}(t), \check{\Sigma}(t))$, and $u_t = u(t, y_{0:t})$ is reduced to $u_t = u(t, \check{\mu}_t)$ without loss of performance.
Proposition 3 
([9,23]). In the LQG problem without memory limitation, the optimal control function of POSC (33) is provided by
$u^*(t, \check{\mu}) = -R^{-1} B^{\top} \Psi \check{\mu}$, (34)
where $\check{\mu}(t)$ and $\check{\Sigma}(t)$ are the solutions of the following Kalman filter:
$d\check{\mu} = \left(A - B R^{-1} B^{\top} \Psi\right) \check{\mu}\,dt + \check{\Sigma} H^{\top} (\gamma\gamma^{\top})^{-1} \left(dy_t - H\check{\mu}\,dt\right)$, (35)
$\frac{d\check{\Sigma}}{dt} = \sigma\sigma^{\top} + A\check{\Sigma} + \check{\Sigma}A^{\top} - \check{\Sigma} H^{\top} (\gamma\gamma^{\top})^{-1} H \check{\Sigma}$, (36)
and where $\check{\mu}(0) = \mu_{x,0}$ and $\check{\Sigma}(0) = \Sigma_{xx,0}$. $\Psi(t)$ is the solution of the following Riccati equation:
$-\frac{d\Psi}{dt} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi$, (37)
where $\Psi(T) = P$.
Proof. 
The proof is shown in [9,23]. □
In the LQG problem of the conventional POSC, the Zakai Equation (6) and the Bellman Equation (12) are reduced to the Kalman filter (35) and (36) and the Riccati Equation (37), respectively.
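As a concrete illustration of Proposition 3, the following Python sketch integrates the Riccati equation (37) backwards and the filter covariance (36) forwards with a simple Euler scheme for a scalar system. The coefficient values (A = -1, B = sigma = H = gamma = 1, Q = R = P = 1) are arbitrary assumptions chosen only for illustration.

import numpy as np

A, B, sigma, H, gamma = -1.0, 1.0, 1.0, 1.0, 1.0
Q, R, P = 1.0, 1.0, 1.0
T, dt = 10.0, 1e-3
n = int(T / dt)

Psi = np.zeros(n + 1); Psi[n] = P               # Riccati equation (37), Psi(T) = P
for k in range(n, 0, -1):
    dPsi = Q + 2.0 * A * Psi[k] - Psi[k] * B * B * Psi[k] / R
    Psi[k - 1] = Psi[k] + dPsi * dt             # -dPsi/dt = ..., integrated backwards

Sig = np.zeros(n + 1); Sig[0] = 1.0             # filter covariance (36), Sigma(0) = Sigma_xx,0
for k in range(n):
    dSig = sigma**2 + 2.0 * A * Sig[k] - Sig[k] * H * H * Sig[k] / gamma**2
    Sig[k + 1] = Sig[k] + dSig * dt

gain = -B * Psi / R                             # feedback gain of (34): u* = -R^{-1} B Psi mu

The filter mean (35) can then be propagated online from the observations using this gain and Sig.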

5.2. Memory-Limited Partially Observable Stochastic Control

Because the infinite-dimensional Zakai Equation (6) is reduced to the finite-dimensional Kalman filter (35) and (36), the LQG problem of the conventional POSC can be discussed in terms of ML-POSC. In this subsection, we reproduce the Kalman filter (35) and (36) and the Riccati Equation (37) from the viewpoint of ML-POSC.
ML-POSC defines the finite-dimensional memory $z_t \in \mathbb{R}^{d_z}$. In the LQG problem of the conventional POSC, the memory dimension $d_z$ is the same as the state dimension $d_x$. The controller of ML-POSC determines the control $u_t$ based on the memory $z_t$, i.e., $u_t = u(t, z_t)$. The memory $z_t$ is assumed to evolve by the following SDE:
$dz_t = v_t\,dt + \kappa_t\,dy_t$, (38)
where $z_0 = \mu_{x,0}$, while $v_t = v(t, z_t) \in \mathbb{R}^{d_z}$ and $\kappa_t = \kappa(t, z_t) \in \mathbb{R}^{d_z \times d_y}$ are the memory controls. We note that the LQG problem of the conventional POSC does not consider the memory noise. The objective function of ML-POSC is provided by the following expected cumulative cost function:
$J[u, v, \kappa] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v, \kappa)}\left[\int_0^T \left(x_t^{\top} Q(t) x_t + u_t^{\top} R(t) u_t\right) dt + x_T^{\top} P x_T\right]$. (39)
We note that the LQG problem of the conventional POSC does not consider the memory control cost. ML-POSC optimizes $u$, $v$, and $\kappa$ based on $J[u, v, \kappa]$, as follows:
$(u^*, v^*, \kappa^*) := \mathop{\mathrm{argmin}}_{u, v, \kappa} J[u, v, \kappa]$. (40)
In the LQG problem of the conventional POSC, the probability of the extended state $s_t$ (17) is provided by the Gaussian distribution $p_t(s_t) = \mathcal{N}(s_t \mid \mu(t), \Sigma(t))$. The posterior probability of the state $x_t$ given the memory $z_t$ is provided by the Gaussian distribution $p_t(x_t \mid z_t) = \mathcal{N}(x_t \mid \mu_{x|z}(t, z_t), \Sigma_{x|z}(t))$, where $\mu_{x|z}(t, z_t)$ and $\Sigma_{x|z}(t)$ are provided as follows:
$\mu_{x|z}(t, z_t) = \mu_x(t) + \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t)\left(z_t - \mu_z(t)\right)$, (41)
$\Sigma_{x|z}(t) = \Sigma_{xx}(t) - \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t)\Sigma_{zx}(t)$. (42)
Theorem 2. 
In the LQG problem without memory limitation, the optimal control functions of ML-POSC (40) are provided by
$u^*(t, z) = -R^{-1} B^{\top} \Psi z$, (43)
$v^*(t, z) = \left(A - B R^{-1} B^{\top} \Psi - \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1} H\right) z$, (44)
$\kappa^*(t, z) = \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1}$. (45)
From $v^*(t, z)$ and $\kappa^*(t, z)$, $z_t$ and $\Sigma_{x|z}(t)$ obey the following equations:
$dz_t = \left(A - B R^{-1} B^{\top} \Psi\right) z_t\,dt + \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1} \left(dy_t - H z_t\,dt\right)$, (46)
$\frac{d\Sigma_{x|z}}{dt} = \sigma\sigma^{\top} + A\Sigma_{x|z} + \Sigma_{x|z}A^{\top} - \Sigma_{x|z} H^{\top} (\gamma\gamma^{\top})^{-1} H \Sigma_{x|z}$, (47)
where $z_0 = \mu_{x,0}$ and $\Sigma_{x|z}(0) = \Sigma_{xx,0}$. Furthermore, $\mu_{x|z}(t, z_t) = z_t$ holds in this problem. $\Psi(t)$ is the solution of the Riccati Equation (37).
Proof. 
The proof is shown in Appendix D. □
In the LQG problem of the conventional POSC, the optimal memory dynamics of ML-POSC (46) and (47) corresponds to the Kalman filter (35) and (36). Furthermore, ML-POSC reproduces the Riccati Equation (37).

6. Linear-Quadratic-Gaussian Problem with Memory Limitation

The LQG problem of the conventional POSC does not consider memory limitation because it does not consider the memory noise and cost. Furthermore, because the memory dimension is restricted to the state dimension, the memory dimension cannot be determined according to a given controller. ML-POSC can generalize the LQG problem to include the memory limitation. In this section, we discuss the LQG problem with memory limitation based on ML-POSC.

6.1. Problem Formulation

In this subsection, we formulate the LQG problem with memory limitation. The state and observation SDEs are the same as in the previous section, which are provided by (30) and (31), respectively. The controller of ML-POSC determines the control u t R d u based on the memory z t R d z , i.e., u t = u ( t , z t ) . Unlike the LQG problem of the conventional POSC, the memory dimension d z is not necessarily the same as the state dimension d x .
The memory z t is assumed to evolve according to the following SDE:
$dz_t = v_t\,dt + \kappa(t)\,dy_t + \eta(t)\,d\xi_t$, (48)
where $z_0$ obeys the Gaussian distribution $p_0(z_0) = \mathcal{N}(z_0 \mid \mu_{z,0}, \Sigma_{zz,0})$, $\xi_t \in \mathbb{R}^{d_\xi}$ is the standard Wiener process, and $v_t = v(t, z_t) \in \mathbb{R}^{d_v}$ is the control. Because the initial condition $z_0$ is stochastic and the memory SDE (48) includes the intrinsic stochasticity $d\xi_t$, the LQG problem of ML-POSC can consider the memory noise explicitly. We note that $\kappa(t)$ is independent of the memory $z_t$. If $\kappa(t)$ depended on the memory $z_t$, the memory SDE (48) would become non-linear and non-Gaussian, and the optimal control functions could not be derived explicitly. In order to keep the memory SDE (48) linear and Gaussian so that the optimal control functions can be obtained explicitly, we restrict $\kappa(t)$ to be independent of the memory $z_t$ in the LQG problem with memory limitation. The LQG problem without memory limitation is the special case in which the optimal control $\kappa_t^* = \kappa^*(t, z_t)$ in (45) does not depend on the memory $z_t$.
The objective function is provided by the following expected cumulative cost function:
$J[u, v] := \mathbb{E}_{p(x_{0:T}, y_{0:T}, z_{0:T}; u, v)}\left[\int_0^T \left(x_t^{\top} Q(t) x_t + u_t^{\top} R(t) u_t + v_t^{\top} M(t) v_t\right) dt + x_T^{\top} P x_T\right]$, (49)
where $Q(t) \succeq O$, $R(t) \succ O$, $M(t) \succ O$, and $P \succeq O$. Because the cost function includes $v_t^{\top} M(t) v_t$, the LQG problem of ML-POSC can consider the memory control cost explicitly. ML-POSC optimizes the state control function $u$ and the memory control function $v$ based on the objective function $J[u, v]$, as follows:
$(u^*, v^*) := \mathop{\mathrm{argmin}}_{u, v} J[u, v]$. (50)
For the sake of simplicity, we do not optimize κ ( t ) , although this can be accomplished by considering unobservable stochastic control.

6.2. Problem Reformulation

Although the formulation of the LQG problem with memory limitation in the previous subsection clarifies its relationship with that of the LQG problem without memory limitation, it is inconvenient for further mathematical investigations. In order to resolve this problem, we reformulate the LQG problem with memory limitation based on the extended state s t (17). The formulation in this subsection is simpler and more general than that in the previous subsection.
In the LQG problem with memory limitation, the extended state SDE (18) is provided as follows:
d s t = A ˜ ( t ) s t + B ˜ ( t ) u ˜ t d t + σ ˜ ( t ) d ω ˜ t ,
where s 0 obeys the Gaussian distribution p 0 ( s 0 ) : = N s 0 μ 0 , Σ 0 , ω ˜ t R d ω ˜ is the standard Wiener process, and u ˜ t = u ˜ ( t , z t ) R d u ˜ is the control. The extended state SDE (51) includes the previous state, observation, and memory SDEs (30), (31) and (48) as a special case because they can be represented as follows:
d s t = A O κ H O s t + B O O I u ˜ t d t + σ O O O κ γ η d ω t d ν t d ξ t ,
where p 0 ( s 0 ) = p 0 ( x 0 ) p 0 ( z 0 ) .
The objective function (21) is provided by the following expected cumulative cost function:
J [ u ˜ ] : = E p ( s 0 : T ; u ˜ ) 0 T s t Q ˜ ( t ) s t + u ˜ t R ˜ ( t ) u ˜ t d t + s T P ˜ s T ,
where Q ˜ ( t ) O , R ˜ ( t ) O , and P ˜ O . This objective function (53) includes the previous objective function (49) as a special case because it can be represented as follows:
J [ u ˜ ] = E p ( s 0 : T ; u ˜ ) 0 T s t Q O O O s t + u ˜ t R O O M u ˜ t d t + s T P O O O s T .
The objective of the LQG problem with memory limitation is to find the optimal control function u ˜ * that minimizes the objective function J [ u ˜ ] , as follows:
u ˜ * : = argmin u ˜ J [ u ˜ ] .
In the following subsection, we mainly consider the formulation of this subsection rather than that of the previous subsection because it is simpler and more general. Moreover, we omit · ˜ for notational simplicity.

6.3. Derivation of Optimal Control Function

In this subsection, we derive the optimal control function of the LQG problem with memory limitation by applying Theorem 1. In the LQG problem with memory limitation, the probability of the extended state $s$ at time $t$ is provided by the Gaussian distribution $p_t(s) = \mathcal{N}(s \mid \mu(t), \Sigma(t))$. By defining the stochastic extended state $\hat{s} := s - \mu$, $\mathbb{E}_{p_t(x|z)}[s]$ is provided as follows:
$\mathbb{E}_{p_t(x|z)}[s] = K(t)\hat{s} + \mu(t)$, (56)
where $K(t)$ is defined by
$K(t) := \begin{pmatrix} O & \Sigma_{xz}(t)\Sigma_{zz}^{-1}(t) \\ O & I \end{pmatrix}$. (57)
By applying Theorem 1 to the LQG problem with memory limitation, we obtain the following theorem:
Theorem 3. 
In the LQG problem with memory limitation, the optimal control function of ML-POSC is provided by
$u^*(t, z) = -R^{-1} B^{\top} \left(\Pi K \hat{s} + \Psi \mu\right)$, (58)
where $K(t)$ (57) depends on $\Sigma(t)$, and $\mu(t)$ and $\Sigma(t)$ are the solutions of the following ordinary differential equations:
$\frac{d\mu}{dt} = \left(A - B R^{-1} B^{\top} \Psi\right)\mu$, (59)
$\frac{d\Sigma}{dt} = \sigma\sigma^{\top} + \left(A - B R^{-1} B^{\top} \Pi K\right)\Sigma + \Sigma\left(A - B R^{-1} B^{\top} \Pi K\right)^{\top}$, (60)
where $\mu(0) = \mu_0$ and $\Sigma(0) = \Sigma_0$, while $\Psi(t)$ and $\Pi(t)$ are the solutions of the following ordinary differential equations:
$-\frac{d\Psi}{dt} = Q + A^{\top}\Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi$, (61)
$-\frac{d\Pi}{dt} = Q + A^{\top}\Pi + \Pi A - \Pi B R^{-1} B^{\top} \Pi + (I - K)^{\top} \Pi B R^{-1} B^{\top} \Pi (I - K)$, (62)
where $\Psi(T) = \Pi(T) = P$.
Proof. 
The proof is shown in Appendix E. □
Here, (61) is the Riccati equation [9,10,23], which appears in the LQG problem without memory limitation as well (37). In contrast, (62) is a new equation of the LQG problem with memory limitation, which in this paper we call the partially observable Riccati equation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati Equation (61) for control is modified to include estimation, which corresponds to the partially observable Riccati Equation (62). As a result, the partially observable Riccati Equation (62) is able to improve estimation as well as control.
In order to support this interpretation, we analyze the partially observable Riccati Equation (62) by comparing it with the Riccati Equation (61). Because only the last term of (62) differs from (61), we denote it as follows:
$Q' := (I - K)^{\top} \Pi B R^{-1} B^{\top} \Pi (I - K)$. (63)
$Q'$ can be calculated as follows:
$Q' = \begin{pmatrix} P_{xx} & -P_{xx}\Sigma_{xz}\Sigma_{zz}^{-1} \\ -\Sigma_{zz}^{-1}\Sigma_{zx}P_{xx} & \Sigma_{zz}^{-1}\Sigma_{zx}P_{xx}\Sigma_{xz}\Sigma_{zz}^{-1} \end{pmatrix}$, (64)
where $P_{xx} := (\Pi B R^{-1} B^{\top} \Pi)_{xx}$. Because $P_{xx} \succeq O$ and $\Sigma_{zz}^{-1}\Sigma_{zx}P_{xx}\Sigma_{xz}\Sigma_{zz}^{-1} \succeq O$, $\Pi_{xx}$ and $\Pi_{zz}$ may be larger than $\Psi_{xx}$ and $\Psi_{zz}$, respectively. Because $\Pi_{xx}$ and $\Pi_{zz}$ are the negative feedback gains of the state $x$ and the memory $z$, respectively, $Q'$ may decrease $\Sigma_{xx}$ and $\Sigma_{zz}$. Moreover, when $\Sigma_{xz}$ is positive/negative, $\Pi_{xz}$ may be smaller/larger than $\Psi_{xz}$, which may increase/decrease $\Sigma_{xz}$. A similar discussion is possible for $\Sigma_{zx}$, $\Pi_{zx}$, and $\Psi_{zx}$, as $\Sigma$, $\Pi$, and $\Psi$ are symmetric matrices. As a result, $Q'$ may decrease the following conditional covariance matrix:
$\Sigma_{x|z} := \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$, (65)
which corresponds to the estimation error of the state from the memory. Therefore, the partially observable Riccati Equation (62) may improve estimation as well as control, which is different from the Riccati Equation (61).
Because the problem in Section 6.1 is more specialized than that in Section 6.2, we can carry out a more specific discussion. In the problem in Section 6.1, $\Psi_{xx}$ is the same as the solution of the Riccati equation of the conventional POSC (37), and $\Psi_{xz} = O$, $\Psi_{zx} = O$, and $\Psi_{zz} = O$ are satisfied. As a result, the memory control does not appear in the Riccati equation of ML-POSC (61). In contrast, because of the last term of the partially observable Riccati Equation (62), $\Pi_{xx}$ is not the solution of the Riccati Equation (37), and $\Pi_{xz} \neq O$, $\Pi_{zx} \neq O$, and $\Pi_{zz} \neq O$ are satisfied. As a result, the memory control appears in the partially observable Riccati Equation (62), which may improve the state estimation.
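A small numerical sketch may help fix the block structure of these quantities. The Python snippet below builds the projection K of (57), the conditional covariance Sigma_{x|z} of (65), and the additional term Q' of (63) from a toy covariance and a stand-in value of Pi; all numbers are arbitrary assumptions chosen only for illustration.

import numpy as np

dx = dz = 1
Sxx, Sxz, Szz = np.array([[2.0]]), np.array([[0.5]]), np.array([[1.0]])

K = np.block([[np.zeros((dx, dx)), Sxz @ np.linalg.inv(Szz)],
              [np.zeros((dz, dx)), np.eye(dz)]])            # K(t) of (57)
Sx_given_z = Sxx - Sxz @ np.linalg.inv(Szz) @ Sxz.T          # estimation error (65)

Pi = np.array([[1.0, 0.2], [0.2, 0.5]])                      # stand-in for Pi(t)
B = np.array([[1.0], [0.0]])                                 # only the state is controlled
Rinv = np.array([[1.0]])
M = Pi @ B @ Rinv @ B.T @ Pi
Qp = (np.eye(dx + dz) - K).T @ M @ (np.eye(dx + dz) - K)     # Q' of (63)

Evaluating Qp and comparing it with the block formula (64) confirms the structure discussed above.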

6.4. Comparison with Completely Observable Stochastic Control

In this subsection, we compare ML-POSC with COSC of the extended state. By applying Proposition 2 in the LQG problem, the optimal control function of COSC of the extended state can be obtained as follows:
Proposition 4 
([10,23]). In the LQG problem, the optimal control function of COSC of the extended state is provided by
$u^*(t, s) = -R^{-1} B^{\top} \Psi s = -R^{-1} B^{\top} \left(\Psi \hat{s} + \Psi \mu\right)$, (66)
where Ψ ( t ) is the solution of the Riccati Equation (61).
Proof. 
The proof is shown in [10,23]. □
The optimal control function of COSC of the extended state (66) can be derived intuitively from that of ML-POSC (58). In ML-POSC, $K\hat{s} = \mathbb{E}_{p_t(x|z)}[\hat{s}]$ is the estimator of the stochastic extended state. In COSC of the extended state, because the stochastic extended state is completely observable, its estimator is provided by $\hat{s}$ itself, which corresponds to $K = I$. By changing the definition of $K$ from (57) to $K = I$, the partially observable Riccati Equation (62) is reduced to the Riccati Equation (61), and the optimal control function of ML-POSC (58) is reduced to that of COSC (66). As a result, the optimal control function of ML-POSC (58) can be interpreted as a generalization of that of COSC (66).
While the second term is the same between (58) and (66), the first term is different. The second term is the control of the expected extended state $\mu$, which does not depend on the realization. In contrast, the first term is the control of the stochastic extended state $\hat{s}$, which depends on the realization. The first term differs in two ways: (i) the estimators of the stochastic extended state in COSC and ML-POSC are provided by $\hat{s}$ and $K\hat{s} = \mathbb{E}_{p_t(x|z)}[\hat{s}]$, respectively, which is reasonable because ML-POSC needs to estimate the state from the memory; and (ii) the control gains of the stochastic extended state in COSC and ML-POSC are provided by $\Psi$ and $\Pi$, respectively. While $\Psi$ improves only control, $\Pi$ improves estimation as well as control.

6.5. Numerical Algorithm

In the LQG problem, the partial differential equations are reduced to the ordinary differential equations. The FP Equation (23) is reduced to (59) and (60), and the HJB Equation (28) is reduced to (61) and (62). As a result, the optimal control function (58) can be obtained more easily in the LQG problem.
The Riccati Equation (61) can be solved backwards from the terminal condition. In contrast, the partially observable Riccati Equation (62) cannot be solved in the same way as the Riccati Equation (61), as it depends on the forward equation of Σ (60) through K (57). Because the forward equation of Σ (60) depends on the backward equation of Π (62) as well, they must be solved simultaneously.
A similar problem appears in the mean-field game and control, and numerous numerical methods have been developed to deal with it [33]. In this paper, we solve the system of (60) and (62) using the forward–backward sweep method, which computes (60) and (62) alternately [33,34]. In ML-POSC, the convergence of the forward–backward sweep method is guaranteed [34].
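The following Python sketch shows one way this forward–backward sweep could be organized for the ODE system (59)–(62). It uses a plain Euler discretization and assumes that the block sizes dx, dz and the matrices A, B, sigma, Q, R^{-1}, P, Sigma_0 are supplied by the user; it is an illustrative sketch rather than the implementation used for the experiments in this paper.

import numpy as np

def K_of(Sigma, dx, dz):
    """Build K(t) of (57) from the current covariance Sigma(t)."""
    K = np.zeros_like(Sigma)
    K[:dx, dx:] = Sigma[:dx, dx:] @ np.linalg.inv(Sigma[dx:, dx:])
    K[dx:, dx:] = np.eye(dz)
    return K

def sweep(A, B, sig, Q, Rinv, P, Sigma0, dx, dz, T, dt, n_iter=30):
    n = int(T / dt)
    ds = dx + dz
    Pi = np.repeat(P[None], n + 1, axis=0)            # initial guess for Pi(t)
    for _ in range(n_iter):
        # forward sweep of Sigma(t), Equation (60), under the current Pi(t)
        Sigma = np.zeros((n + 1, ds, ds)); Sigma[0] = Sigma0
        for k in range(n):
            F = A - B @ Rinv @ B.T @ Pi[k] @ K_of(Sigma[k], dx, dz)
            Sigma[k + 1] = Sigma[k] + (sig @ sig.T + F @ Sigma[k] + Sigma[k] @ F.T) * dt
        # backward sweep of Pi(t), Equation (62), under the updated Sigma(t)
        Pi_new = np.zeros_like(Pi); Pi_new[n] = P
        for k in range(n, 0, -1):
            IK = np.eye(ds) - K_of(Sigma[k], dx, dz)
            G = Pi_new[k] @ B @ Rinv @ B.T @ Pi_new[k]
            dPi = Q + A.T @ Pi_new[k] + Pi_new[k] @ A - G + IK.T @ G @ IK
            Pi_new[k - 1] = Pi_new[k] + dPi * dt
        Pi = Pi_new
    return Sigma, Pi

The Riccati equation (61) for Psi(t) and the mean equation (59) for mu(t) decouple from this iteration and can be integrated separately.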

7. Numerical Experiments

In this section, we demonstrate the effectiveness of ML-POSC using numerical experiments on the LQG problem with memory limitation as well as on the non-LQG problem.

7.1. LQG Problem with Memory Limitation

In this subsection, we show the significance of the partially observable Riccati Equation (62) by a numerical experiment on the LQG problem with memory limitation. We consider the state $x_t \in \mathbb{R}$, the observation $y_t \in \mathbb{R}$, and the memory $z_t \in \mathbb{R}$, which evolve by the following SDEs:
$dx_t = \left(x_t + u_t\right) dt + d\omega_t$, (67)
$dy_t = x_t\,dt + d\nu_t$, (68)
$dz_t = v_t\,dt + dy_t$, (69)
where $x_0$ and $z_0$ obey standard Gaussian distributions, $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t, z_t) \in \mathbb{R}$ and $v_t = v(t, z_t) \in \mathbb{R}$ are the controls. The objective function to be minimized is provided as follows:
$J[u, v] := \mathbb{E}\left[\int_0^{10} \left(x_t^2 + u_t^2 + v_t^2\right) dt\right]$. (70)
Therefore, the objective of this problem is to minimize the state variance with small state and memory controls. Because this problem includes the memory control cost, it corresponds to the LQG problem with memory limitation.
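Before turning to the results, it may be helpful to see how this example maps onto the extended-state form of Section 6.2. In the Python sketch below, the matrices of (52) and (54) are assembled from the coefficients read off from (67)–(70) as given above (A = 1, B = sigma = H = gamma = kappa = 1, eta = 0, Q = R = M = 1, no terminal cost); the variable names are ours, not the paper's.

import numpy as np

A, B, sig, H, gam, kap, eta = 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0
Q, R, M = 1.0, 1.0, 1.0

A_t = np.array([[A,       0.0],
                [kap * H, 0.0]])             # extended drift matrix in (52)
B_t = np.array([[B,   0.0],
                [0.0, 1.0]])                 # extended control matrix for (u, v)
sig_t = np.array([[sig, 0.0,       0.0],
                  [0.0, kap * gam, eta]])    # extended noise matrix in (52)
Q_t = np.diag([Q, 0.0])                      # extended state cost in (54)
R_t = np.diag([R, M])                        # extended control cost in (54)
P_t = np.zeros((2, 2))                       # no terminal cost in (70)
Sigma0 = np.eye(2)                           # x0 and z0 obey standard Gaussians

These matrices can then be passed to a sweep such as the one sketched in Section 6.5 to obtain Psi, Pi, and Sigma.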
Figure 2a–c shows the trajectories of $\Psi$ and $\Pi$; $\Pi_{xx}$ and $\Pi_{zz}$ are larger than $\Psi_{xx}$ and $\Psi_{zz}$, respectively, and $\Pi_{xz}$ is smaller than $\Psi_{xz}$, which is consistent with our discussion in Section 6.3. Therefore, the partially observable Riccati equation may reduce the estimation error of the state from the memory. Moreover, while the memory control does not appear in the Riccati equation ($\Psi_{xz} = \Psi_{zz} = 0$), it appears in the partially observable Riccati equation ($\Pi_{xz} \neq 0$, $\Pi_{zz} \neq 0$), which is consistent with our discussion in Section 6.3. As a result, the memory control plays an important role in estimating the state from the memory.
In order to clarify the significance of the partially observable Riccati Equation (62), we compare the performance of the optimal control function (58) with that of the following control function:
$u_{\Psi}(t, z) = -R^{-1} B^{\top} \left(\Psi K \hat{s} + \Psi \mu\right)$, (71)
in which Π is replaced with Ψ . This result is shown in Figure 2d–f. In the control function (71), the distributions of the state and the memory are unstable, and the cumulative cost diverges. By contrast, in the optimal control function (58), the distributions of the state and memory are stable, and the cumulative cost is smaller. This result indicates that the partially observable Riccati Equation (62) plays an important role in the LQG problem with memory limitation.

7.2. Non-LQG Problem

In this subsection, we investigate the potential effectiveness of ML-POSC for a non-LQG problem by comparing it with the local LQG approximation of the conventional POSC [3,4]. We consider the state $x_t \in \mathbb{R}$ and the observation $y_t \in \mathbb{R}$, which evolve according to the following SDEs:
$dx_t = u_t\,dt + d\omega_t$, (72)
$dy_t = x_t\,dt + d\nu_t$, (73)
where $x_0$ obeys the Gaussian distribution $p_0(x_0) = \mathcal{N}(x_0 \mid 0, 0.01)$, $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t, y_{0:t}) \in \mathbb{R}$ is the control. The objective function to be minimized is provided as follows:
$J[u] := \mathbb{E}\left[\int_0^1 \left(Q(t, x_t) + u_t^2\right) dt + 10 x_1^2\right]$, (74)
where
$Q(t, x) := \begin{cases} 1000 & (0.3 \leq t \leq 0.6,\ 0.1 \leq |x| \leq 2.0), \\ 0 & (\mathrm{otherwise}). \end{cases}$ (75)
The cost function is high on the black rectangles in Figure 3a, which represent the obstacles. In addition, the terminal cost function is the lowest on the black cross in Figure 3a, which represents the desirable goal. Therefore, the system should avoid the obstacles and reach the goal with the small control. Because the cost function is non-quadratic, it is a non-LQG problem, which cannot be solved exactly by the conventional POSC.
In the local LQG approximation of the conventional POSC [3,4], the Zakai equation and the Bellman equation are locally approximated by the Kalman filter and the Riccati equation, respectively. Because the Bellman equation is reduced to the Riccati equation, the local LQG approximation can be solved numerically even in the non-LQG problem.
ML-POSC determines the control $u_t \in \mathbb{R}$ based on the memory $z_t \in \mathbb{R}$, i.e., $u_t = u(t, z_t)$. The memory dynamics is formulated with the following SDE:
$dz_t = dy_t$, (76)
where $p_0(z_0) = \mathcal{N}(z_0 \mid 0, 0.01)$. For the sake of simplicity, the memory control is not considered.
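As a minimal sketch of this setup, the following Python snippet implements the obstacle cost (75) and estimates the objective (74) by Monte Carlo simulation of (72), (73), and (76) for a given memory feedback u(t, z). The proportional feedback used at the end is a hypothetical placeholder, not one of the controllers compared in Figure 3.

import numpy as np

def Q_cost(t, x):
    return 1000.0 if (0.3 <= t <= 0.6) and (0.1 <= abs(x) <= 2.0) else 0.0

def objective(u, n_paths=200, dt=1e-3, rng=np.random.default_rng(0)):
    n = int(1.0 / dt)
    total = 0.0
    for _ in range(n_paths):
        x = rng.normal(0.0, 0.1)              # p0(x0) = N(0, 0.01)
        z = rng.normal(0.0, 0.1)              # p0(z0) = N(0, 0.01)
        J = 0.0
        for k in range(n):
            t = k * dt
            ut = u(t, z)
            J += (Q_cost(t, x) + ut**2) * dt
            dw, dnu = rng.normal(0.0, np.sqrt(dt), 2)
            dy = x * dt + dnu                 # observation increment (73)
            x += ut * dt + dw                 # state increment (72)
            z += dy                           # memory increment (76)
        total += J + 10.0 * x**2              # terminal cost of (74)
    return total / n_paths

J_hat = objective(lambda t, z: -z)            # hypothetical proportional feedback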
Figure 3 is the numerical result comparing the local LQG approximation and ML-POSC. Because the local LQG approximation reduces the Bellman equation to the Riccati equation by ignoring non-LQG information, it cannot avoid the obstacles, which results in a higher objective function. In contrast, because ML-POSC reduces the Bellman equation to the HJB equation while maintaining non-LQG information, it can avoid the obstacles, which results in a lower objective function. Therefore, our numerical experiment shows that ML-POSC can be superior to local LQG approximation.

8. Discussion

In this work, we propose ML-POSC, which is an alternative theoretical framework to the conventional POSC. ML-POSC first formulates the finite-dimensional and stochastic memory dynamics explicitly, then optimizes the memory dynamics considering the memory cost. As a result, unlike the conventional POSC, ML-POSC can consider memory limitation as well as incomplete information. Furthermore, because the optimal control function of ML-POSC is obtained by solving the system of HJB-FP equations, ML-POSC can be solved in practice even in non-LQG problems. ML-POSC can generalize the LQG problem to include memory limitation. Because estimation and control are not clearly separated in the LQG problem with memory limitation, the Riccati equation can be modified to the partially observable Riccati equation, which improves estimation as well as control. Furthermore, ML-POSC can provide a better result than the local LQG approximation in a non-LQG problem, as ML-POSC reduces the Bellman equation while maintaining non-LQG information.
ML-POSC is effective for the state estimation problem as well, which is a part of the POSC problem. Although the state estimation problem can be solved in principle by the Zakai equation [38,39,40], it cannot be solved directly, as the Zakai equation is infinite-dimensional. In order to resolve this problem, a particle filter is often used to approximate the infinite-dimensional Zakai equation as a finite number of particles [38,39,40]. However, because the performance of the particle filter is guaranteed only in the limit of a large number of particles, a particle filter may not be practical in cases where the available memory size is severely limited. Furthermore, a particle filter cannot take the memory noise and cost into account. ML-POSC resolves these problems, as it can optimize the state estimation under memory limitation.
ML-POSC may be extended from a single-agent system to a multi-agent system. POSC of a multi-agent system is called decentralized stochastic control (DSC) [41,42,43], which consists of a system and multiple controllers. In DSC, each controller needs to estimate the controls of the other controllers as well as the state of the system, which is essentially different from the conventional POSC. Because the estimation among the controllers is generally intractable, the conventional POSC approach cannot be straightforwardly extended to DSC. In contrast, ML-POSC compresses the observation history into the finite-dimensional memory, which simplifies the estimation among the controllers. Therefore, ML-POSC may provide an effective approach to DSC. Indeed, the finite-state controller, whose idea is similar to that of ML-POSC, plays a key role in extending POMDP from a single-agent system to a multi-agent system [22,44,45,46,47,48]. ML-POSC may be extended to a multi-agent system in a similar way as the finite-state controller.
ML-POSC can be naturally extended to the mean-field control setting [28,29,30] because ML-POSC is solved based on the mean-field control theory. Therefore, ML-POSC can be applied to an infinite number of homogeneous agents. Furthermore, ML-POSC can be extended to a risk-sensitive setting, as this is a special case of the mean-field control setting [28,29,30]. Therefore, ML-POSC can consider the variance of the cost as well as its expectation.
Nonetheless, more efficient algorithms are needed in order to solve ML-POSC with a high-dimensional state and memory. In the mean-field game and control, neural network-based algorithms have recently been proposed which can solve high-dimensional problems efficiently [49,50]. By extending these algorithms, it might be possible to solve high-dimensional ML-POSC efficiently. Furthermore, unlike the mean-field game and control, the coupling of HJB-FP equations is limited to the optimal control function in ML-POSC. By exploiting this property, more efficient algorithms for ML-POSC may be proposed [34].

Author Contributions

Conceptualization, Formal analysis, Funding acquisition, Writing—original draft: T.T. and T.J.K.; Software, Visualization: T.T. All authors have read and agreed to the published version of the manuscript.

Funding

The first author received a JSPS Research Fellowship (Grant No. 21J20436). This work was supported by JSPS KAKENHI (Grant No. 19H05799) and JST CREST (Grant No. JPMJCR2011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Kenji Kashima and Kaito Ito for useful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
COSC	Completely Observable Stochastic Control
POSC	Partially Observable Stochastic Control
ML-POSC	Memory-Limited Partially Observable Stochastic Control
POMDP	Partially Observable Markov Decision Process
DSC	Decentralized Stochastic Control
LQG	Linear-Quadratic-Gaussian
HJB	Hamilton–Jacobi–Bellman
FP	Fokker–Planck
SDE	Stochastic Differential Equation

Appendix A. Proof of Lemma 1

We define the value function $V(t, p)$ as follows:
$V(t, p) := \min_{u_{t:T}} \left[\int_t^T \bar{f}(\tau, p_\tau, u_\tau)\,d\tau + \bar{g}(p_T)\right]$, (A1)
where $\{p_\tau \mid \tau \in [t, T]\}$ is the solution of the FP Equation (23) with $p_t = p$. Then, $V(t, p)$ can be calculated as follows:
$V(t, p) = \min_{u} \left[\bar{f}(t, p, u)\,dt + V\left(t + dt, p + \mathcal{L}^{\dagger} p\,dt\right)\right] = \min_{u} \left[\bar{f}(t, p, u)\,dt + V(t, p) + \frac{\partial V(t, p)}{\partial t}\,dt + \int \frac{\delta V(t, p)}{\delta p(s)}\, \mathcal{L}^{\dagger} p(s)\,ds\,dt\right]$. (A2)
By rearranging the above equation, the following equation is obtained:
$-\frac{\partial V(t, p)}{\partial t} = \min_{u} \left[\bar{f}(t, p, u) + \int \frac{\delta V(t, p)}{\delta p(s)}\, \mathcal{L}^{\dagger} p(s)\,ds\right]$. (A3)
Because
$\int \frac{\delta V(t, p)}{\delta p(s)}\, \mathcal{L}^{\dagger} p(s)\,ds = \int p(s)\, \mathcal{L}\, \frac{\delta V(t, p)}{\delta p(s)}\,ds$, (A4)
the following equation is obtained:
$-\frac{\partial V(t, p)}{\partial t} = \min_{u} \int p(s) \left[f(t, s, u) + \mathcal{L}\, \frac{\delta V(t, p)}{\delta p(s)}\right] ds$. (A5)
From the definition of the Hamiltonian $H$ (10), the following Bellman equation is obtained:
$-\frac{\partial V(t, p)}{\partial t} = \min_{u}\, \mathbb{E}_{p(s)}\left[H\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right)\right]$. (A6)
Because the control $u$ is a function of the memory $z$ in ML-POSC, the minimization with respect to $u$ can be exchanged with the expectation with respect to $p(z)$ as follows:
$-\frac{\partial V(t, p)}{\partial t} = \mathbb{E}_{p(z)}\left[\min_{u}\, \mathbb{E}_{p(x|z)}\left[H\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right)\right]\right]$. (A7)
Because the optimal control function is provided by the right-hand side of the Bellman Equation (A7) [10], the optimal control function is provided by
$u^*(t, z, p) = \mathop{\mathrm{argmin}}_{u}\, \mathbb{E}_{p(x|z)}\left[H\left(t, s, u, \frac{\delta V(t, p)}{\delta p(s)}\right)\right]$. (A8)
Because the FP Equation (23) is deterministic, the optimal control function is provided by $u^*(t, z) = u^*(t, z, p_t)$.

Appendix B. Proof of Theorem 1

We first define
$W(t, p, s) := \frac{\delta V(t, p)}{\delta p(s)}$, (A9)
which satisfies $W(T, p, s) = g(s)$. Differentiating the Bellman Equation (26) with respect to $p$, the following equation is obtained:
$-\frac{\partial W(t, p, s)}{\partial t} = H\left(t, s, u^*, W\right) + \mathbb{E}_{p(s')}\left[\mathcal{L}\, \frac{\delta W(t, p, s')}{\delta p(s)}\right]$. (A10)
Because
$\int p(s')\, \mathcal{L}\, \frac{\delta W(t, p, s')}{\delta p(s)}\,ds' = \int \frac{\delta W(t, p, s')}{\delta p(s)}\, \mathcal{L}^{\dagger} p(s')\,ds'$, (A11)
the following equation is obtained:
$-\frac{\partial W(t, p, s)}{\partial t} = H\left(t, s, u^*, W\right) + \int \frac{\delta W(t, p, s')}{\delta p(s)}\, \mathcal{L}^{\dagger} p(s')\,ds'$. (A12)
We then define
$w(t, s) := W(t, p_t, s)$, (A13)
where $p_t$ is the solution of the FP Equation (23). The time derivative of $w(t, s)$ can be calculated as follows:
$\frac{\partial w(t, s)}{\partial t} = \frac{\partial W(t, p_t, s)}{\partial t} + \int \frac{\delta W(t, p_t, s)}{\delta p(s')}\, \frac{\partial p_t(s')}{\partial t}\,ds'$. (A14)
By substituting (A12) into (A14), the following equation is obtained:
$-\frac{\partial w(t, s)}{\partial t} = H\left(t, s, u^*, w\right) - \int \frac{\delta W(t, p_t, s)}{\delta p(s')} \underbrace{\left(\frac{\partial p_t(s')}{\partial t} - \mathcal{L}^{\dagger} p_t(s')\right)}_{(*)}\,ds'$. (A15)
From the FP Equation (23), $(*) = 0$ holds. Therefore, the HJB Equation (28) is obtained.

Appendix C. Proof of Proposition 2

From the proof of Lemma 1 (Appendix A), the Bellman Equation (A6) is obtained. Because the control $u$ is a function of the extended state $s$ in COSC of the extended state, the minimization by $u$ can be exchanged with the expectation by $p(s)$ as follows:
$$-\frac{\partial V(t,p)}{\partial t} = \mathbb{E}_{p(s)}\!\left[ \min_u H\!\left(t, s, u, \frac{\delta V(t,p)}{\delta p(s)}\right) \right]. \tag{A16}$$
Because the optimal control function minimizes the right-hand side of the Bellman Equation (A16) [10], it is given by
$$u^{*}(t,s,p) = \operatorname*{argmin}_u\, H\!\left(t, s, u, \frac{\delta V(t,p)}{\delta p(s)}\right). \tag{A17}$$
Because the FP Equation (23) is deterministic, the optimal control function is given by $u^{*}(t,s) = u^{*}(t,s,p_t)$. The rest of the proof is the same as the proof of Theorem 1 (Appendix B).

Appendix D. Proof of Theorem 2

From Theorem 1, the optimal control functions $u^{*}$, $v^{*}$, and $\kappa^{*}$ are provided by the minimization of the conditional expectation of the Hamiltonian, as follows:
$$\left( u^{*}(t,z),\ v^{*}(t,z),\ \kappa^{*}(t,z) \right) = \operatorname*{argmin}_{u,v,\kappa}\, \mathbb{E}_{p_t(x|z)}\!\left[ H(t, s, u, v, \kappa, w) \right]. \tag{A18}$$
In the LQG problem of the conventional POSC, the Hamiltonian (10) is provided by
$$H(t,s,u,v,\kappa,w) = x^{\top} Q x + u^{\top} R u + \left(\frac{\partial w(t,s)}{\partial x}\right)^{\!\top}\!\left(A x + B u\right) + \left(\frac{\partial w(t,s)}{\partial z}\right)^{\!\top}\!\left(v + \kappa H x\right) + \frac{1}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 w(t,s)}{\partial x^2}\, \sigma \sigma^{\top}\right) + \frac{1}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 w(t,s)}{\partial z^2}\, \kappa \gamma \gamma^{\top} \kappa^{\top}\right). \tag{A19}$$
From
$$\frac{\partial\, \mathbb{E}_{p_t(x|z)}[H]}{\partial u} = 2 R u + B^{\top}\, \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial x}\right], \tag{A20}$$
$$\frac{\partial\, \mathbb{E}_{p_t(x|z)}[H]}{\partial v} = \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial z}\right], \tag{A21}$$
$$\frac{\partial\, \mathbb{E}_{p_t(x|z)}[H]}{\partial \kappa} = \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial z}\, x^{\top}\right] H^{\top} + \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial^2 w(t,s)}{\partial z^2}\right] \kappa \gamma \gamma^{\top}, \tag{A22}$$
the optimal control functions are provided by
$$u^{*}(t,z) = -\frac{1}{2} R^{-1} B^{\top}\, \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial x}\right], \tag{A23}$$
$$v^{*}(t,z) = \begin{cases} +\infty & \mathbb{E}_{p_t(x|z)}\!\left[\dfrac{\partial w(t,s)}{\partial z}\right] < 0, \\ \text{arbitrary} & \mathbb{E}_{p_t(x|z)}\!\left[\dfrac{\partial w(t,s)}{\partial z}\right] = 0, \\ -\infty & \mathbb{E}_{p_t(x|z)}\!\left[\dfrac{\partial w(t,s)}{\partial z}\right] > 0, \end{cases} \tag{A24}$$
$$\kappa^{*}(t,z) = -\left(\mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial^2 w(t,s)}{\partial z^2}\right]\right)^{-1} \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial z}\, x^{\top}\right] H^{\top} (\gamma \gamma^{\top})^{-1}. \tag{A25}$$
We assume that $p_t(s)$ is provided by the Gaussian distribution
$$p_t(s) = \mathcal{N}\!\left(s \mid \mu(t), \Sigma(t)\right), \tag{A26}$$
and $w(t,s)$ is provided by the quadratic function
$$w(t,s) = x^{\top} \Psi(t)\, x + (x - z)^{\top} \Phi(t)\, (x - z) + \beta(t). \tag{A27}$$
From the initial condition of the FP equation,
$$\mu(0) = \begin{pmatrix} \mu_x(0) \\ \mu_z(0) \end{pmatrix} = \begin{pmatrix} \mu_{x,0} \\ \mu_{x,0} \end{pmatrix}, \tag{A28}$$
$$\Sigma(0) = \begin{pmatrix} \Sigma_{xx}(0) & \Sigma_{xz}(0) \\ \Sigma_{zx}(0) & \Sigma_{zz}(0) \end{pmatrix} = \begin{pmatrix} \Sigma_{xx,0} & O \\ O & O \end{pmatrix} \tag{A29}$$
are satisfied. From the terminal condition of the HJB equation, $\Psi(T) = P$, $\Phi(T) = O$, and $\beta(T) = 0$ are satisfied. In this case, $u^{*}(t,z)$, $\mathbb{E}_{p_t(x|z)}[\partial w(t,s)/\partial z]$, and $\kappa^{*}(t,z)$ can be calculated as follows:
$$u^{*}(t,z) = -R^{-1} B^{\top}\!\left[ (\Psi + \Phi)\, \mu_{x|z} - \Phi z \right], \tag{A30}$$
$$\mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial z}\right] = 2 \Phi\!\left(z - \mu_{x|z}\right), \tag{A31}$$
$$\kappa^{*}(t,z) = \left[ \Sigma_{x|z} + \left(\mu_{x|z} - z\right) \mu_{x|z}^{\top} \right] H^{\top} (\gamma \gamma^{\top})^{-1}. \tag{A32}$$
We then assume that the following equations are satisfied:
$$\mu_x = \mu_z, \tag{A33}$$
$$\Sigma_{zz} = \Sigma_{xz}. \tag{A34}$$
In this case, $\mu_{x|z}$, $\Sigma_{x|z}$, $u^{*}(t,z)$, $\mathbb{E}_{p_t(x|z)}[\partial w(t,s)/\partial z]$, and $\kappa^{*}(t,z)$ can be calculated as follows:
$$\mu_{x|z} = z, \tag{A35}$$
$$\Sigma_{x|z} = \Sigma_{xx} - \Sigma_{zz}, \tag{A36}$$
$$u^{*}(t,z) = -R^{-1} B^{\top} \Psi z, \tag{A37}$$
$$\mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial z}\right] = 0, \tag{A38}$$
$$\kappa^{*}(t,z) = \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1}. \tag{A39}$$
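As a quick numerical sanity check of (A35) and (A36), the following sketch evaluates the standard Gaussian conditioning formulas for a joint Gaussian of $(x,z)$ satisfying the assumptions (A33) and (A34). The dimensions and matrices are arbitrary illustrative values, not taken from the paper.

```python
import numpy as np

# Illustrative dimensions and values (assumptions, not taken from the paper):
# a joint Gaussian over (x, z) with mu_x = mu_z (A33) and Sigma_zz = Sigma_xz (A34).
rng = np.random.default_rng(0)
dx = dz = 2
mu_x = rng.standard_normal(dx)
mu_z = mu_x.copy()                       # (A33)

C = rng.standard_normal((dz, dz))
Sigma_zz = C @ C.T + np.eye(dz)          # positive definite
D = rng.standard_normal((dx, dx))
Sigma_xx = Sigma_zz + D @ D.T            # keeps Sigma_xx - Sigma_zz positive semidefinite
Sigma_xz = Sigma_zz.copy()               # (A34)

# Standard Gaussian conditioning of x on z:
z = rng.standard_normal(dz)
mu_x_given_z = mu_x + Sigma_xz @ np.linalg.solve(Sigma_zz, z - mu_z)
Sigma_x_given_z = Sigma_xx - Sigma_xz @ np.linalg.solve(Sigma_zz, Sigma_xz.T)

print(np.allclose(mu_x_given_z, z))                        # (A35): True
print(np.allclose(Sigma_x_given_z, Sigma_xx - Sigma_zz))   # (A36): True
```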
Because $v^{*}(t,z)$ is arbitrary when $\mathbb{E}_{p_t(x|z)}[\partial w(t,s)/\partial z] = 0$, we consider $v^{*}(t,z)$ given by the following equation:
$$v^{*}(t,z) = \left( A - B R^{-1} B^{\top} \Psi - \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H \right) z. \tag{A40}$$
In this case, the extended state SDE is provided by the following equation:
$$ds_t = \tilde{A}(t)\, s_t\, dt + \tilde{\sigma}(t)\, d\tilde{\omega}_t, \tag{A41}$$
where $p_0(s) = \mathcal{N}(s \mid \mu(0), \Sigma(0))$, and
$$\tilde{A} := \begin{pmatrix} A & -B R^{-1} B^{\top} \Psi \\ \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H & A - B R^{-1} B^{\top} \Psi - \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H \end{pmatrix}, \tag{A42}$$
$$\tilde{\sigma} := \begin{pmatrix} \sigma & O \\ O & \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} \gamma \end{pmatrix}, \qquad d\tilde{\omega}_t := \begin{pmatrix} d\omega_t \\ d\nu_t \end{pmatrix}. \tag{A43}$$
Because the drift and diffusion coefficients of (A41) are linear and constant with respect to $s$, respectively, $p_t(s)$ becomes a Gaussian distribution, which is consistent with our assumption (A26), while $\mu(t)$ and $\Sigma(t)$ evolve by the following ordinary differential equations:
$$\frac{d\mu}{dt} = \tilde{A} \mu, \tag{A44}$$
$$\frac{d\Sigma}{dt} = \tilde{\sigma} \tilde{\sigma}^{\top} + \tilde{A} \Sigma + \Sigma \tilde{A}^{\top}. \tag{A45}$$
If $\mu_x = \mu_z$ and $\Sigma_{zz} = \Sigma_{xz}$ are satisfied, $d\mu_x/dt = d\mu_z/dt$ and $d\Sigma_{xz}/dt = d\Sigma_{zz}/dt$ are satisfied as well, which is consistent with our assumptions $\mu_x = \mu_z$ and $\Sigma_{zz} = \Sigma_{xz}$.
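This consistency can also be checked numerically. The sketch below Euler-integrates (A44) and (A45) for scalar $x$ and $z$, rebuilding $\tilde{A}$ and $\tilde{\sigma}\tilde{\sigma}^{\top}$ from the current $\Sigma$ at each step. All coefficients are placeholder values, and $\Psi$ is held at a fixed illustrative constant instead of being solved from the Riccati equation (A48); the preservation of $\mu_x = \mu_z$ and $\Sigma_{xz} = \Sigma_{zz}$ does not depend on these choices.

```python
import numpy as np

# Placeholder scalar coefficients (assumptions, for illustration only).
A, B, R, sig, H, gam = -0.5, 1.0, 1.0, 0.8, 1.0, 0.5
Psi = 2.0                                 # held constant instead of solving (A48)
G = B**2 / R * Psi                        # B R^{-1} B^T Psi

mu = np.array([1.0, 1.0])                 # mu_x(0) = mu_z(0) = mu_{x,0} (A28)
Sigma = np.array([[0.6, 0.0],
                  [0.0, 0.0]])            # Sigma(0) as in (A29)

dt, T = 1e-3, 5.0
for _ in range(int(T / dt)):
    Sxz_cond = Sigma[0, 0] - Sigma[1, 1]          # Sigma_{x|z} = Sigma_xx - Sigma_zz (A36)
    L = Sxz_cond * H / gam**2                     # filter gain Sigma_{x|z} H^T (gamma gamma^T)^{-1}
    At = np.array([[A, -G],
                   [L * H, A - G - L * H]])       # A_tilde as in (A42)
    DD = np.array([[sig**2, 0.0],
                   [0.0, (L * gam)**2]])          # sigma_tilde sigma_tilde^T from (A43)
    mu = mu + At @ mu * dt                        # Euler step of (A44)
    Sigma = Sigma + (DD + At @ Sigma + Sigma @ At.T) * dt   # Euler step of (A45)

# The equalities mu_x = mu_z and Sigma_xz = Sigma_zz are preserved along the flow
# (differences stay at floating-point precision).
print(abs(mu[0] - mu[1]), abs(Sigma[0, 1] - Sigma[1, 1]))
```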
From $v^{*}$ and $\kappa^{*}$, the dynamics of $\mu_{x|z}(t, z_t) = z_t$ is provided by
$$dz_t = \left( A - B R^{-1} B^{\top} \Psi \right) z_t\, dt + \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} \left( dy_t - H z_t\, dt \right), \tag{A46}$$
where $z_0 = \mu_{x,0}$. From $d\Sigma_{xx}/dt$ and $d\Sigma_{zz}/dt$, the dynamics of $\Sigma_{x|z} = \Sigma_{xx} - \Sigma_{zz}$ is provided by
$$\frac{d\Sigma_{x|z}}{dt} = \sigma \sigma^{\top} + A \Sigma_{x|z} + \Sigma_{x|z} A^{\top} - \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H \Sigma_{x|z}, \tag{A47}$$
where $\Sigma_{x|z}(0) = \Sigma_{xx,0}$. We note that (A46) and (A47) correspond to the Kalman filter (35) and (36).
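A minimal sketch of (A47) in the scalar case with placeholder coefficients (assumptions, not the paper's values): forward Euler integration of the filter covariance, compared against the stationary value obtained by setting the right-hand side of (A47) to zero.

```python
import numpy as np

# Placeholder scalar coefficients (assumptions, for illustration only).
A, sig, H, gam = -0.5, 0.8, 1.0, 0.5
S = 0.6                                   # Sigma_{x|z}(0) = Sigma_{xx,0}

dt, T = 1e-3, 10.0
for _ in range(int(T / dt)):
    # Forward Euler step of (A47): dS/dt = sigma^2 + 2 A S - S^2 H^2 / gamma^2
    S += (sig**2 + 2 * A * S - S**2 * H**2 / gam**2) * dt

# Stationary value: positive root of sigma^2 + 2 A S - S^2 H^2 / gamma^2 = 0.
c = H**2 / gam**2
S_inf = (A + np.sqrt(A**2 + sig**2 * c)) / c
print(S, S_inf)    # the integrated covariance approaches the stationary value
```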
By substituting $w(t,s)$, $u^{*}(t,z)$, $v^{*}(t,z)$, and $\kappa^{*}(t,z)$ into the HJB Equation (28), we obtain the following ordinary differential equations:
$$-\frac{d\Psi}{dt} = Q + A^{\top} \Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi, \tag{A48}$$
$$-\frac{d\Phi}{dt} = \left( A - \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H \right)^{\!\top} \Phi + \Phi \left( A - \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H \right) + \Psi B R^{-1} B^{\top} \Psi, \tag{A49}$$
$$-\frac{d\beta}{dt} = \mathrm{tr}\!\left[ (\Psi + \Phi)\, \sigma \sigma^{\top} \right] + \mathrm{tr}\!\left[ \Phi\, \Sigma_{x|z} H^{\top} (\gamma \gamma^{\top})^{-1} H \Sigma_{x|z} \right], \tag{A50}$$
where $\Psi(T) = P$, $\Phi(T) = O$, and $\beta(T) = 0$. If $\Psi(t)$, $\Phi(t)$, and $\beta(t)$ satisfy (A48), (A49), and (A50), respectively, the HJB Equation (28) is satisfied, which is consistent with our assumption (A27). We note that (A48) corresponds to the Riccati Equation (37).
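The construction of the value-function coefficients can be sketched numerically as follows (scalar case, placeholder coefficients as assumptions): $\Sigma_{x|z}(t)$ is integrated forward by (A47), and $\Psi(t)$ and $\Phi(t)$ are then integrated backward by (A48) and (A49) from their terminal conditions; the $\beta$ equation (A50) is omitted for brevity.

```python
import numpy as np

# Placeholder scalar coefficients (assumptions, for illustration only).
A, B, Q, R, sig, H, gam, P = -0.5, 1.0, 1.0, 1.0, 0.8, 1.0, 0.5, 2.0
dt, T = 1e-3, 5.0
N = int(T / dt)

# Forward pass: filter covariance Sigma_{x|z}(t) from (A47).
S = np.empty(N + 1)
S[0] = 0.6                                      # Sigma_{xx,0}
for k in range(N):
    S[k + 1] = S[k] + (sig**2 + 2 * A * S[k] - S[k]**2 * H**2 / gam**2) * dt

# Backward pass: Psi(t) from the Riccati equation (A48) and Phi(t) from (A49).
Psi = np.empty(N + 1)
Phi = np.empty(N + 1)
Psi[N], Phi[N] = P, 0.0                         # terminal conditions Psi(T) = P, Phi(T) = 0
for k in range(N, 0, -1):
    dPsi = Q + 2 * A * Psi[k] - Psi[k]**2 * B**2 / R
    Acl = A - S[k] * H**2 / gam**2              # A - Sigma_{x|z} H^T (gamma gamma^T)^{-1} H
    dPhi = 2 * Acl * Phi[k] + Psi[k]**2 * B**2 / R
    Psi[k - 1] = Psi[k] + dPsi * dt             # backward Euler step of (A48)
    Phi[k - 1] = Phi[k] + dPhi * dt             # backward Euler step of (A49)

# Psi determines the control u* = -R^{-1} B^T Psi z (A37); Phi only enters the value function.
print(Psi[0], Phi[0])
```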

Appendix E. Proof of Theorem 3

From Theorem 1, the optimal control function $u^{*}$ is provided by the minimization of the conditional expectation of the Hamiltonian, as follows:
$$u^{*}(t,z) = \operatorname*{argmin}_u\, \mathbb{E}_{p_t(x|z)}\!\left[ H(t, s, u, w) \right]. \tag{A51}$$
In the LQG problem with memory limitation, the Hamiltonian (10) is provided as follows:
$$H(t,s,u,w) = s^{\top} Q s + u^{\top} R u + \left(\frac{\partial w(t,s)}{\partial s}\right)^{\!\top}\!\left(A s + B u\right) + \frac{1}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 w(t,s)}{\partial s^2}\, \sigma \sigma^{\top}\right). \tag{A52}$$
From
$$\frac{\partial\, \mathbb{E}_{p_t(x|z)}[H(t,s,u,w)]}{\partial u} = 2 R u + B^{\top}\, \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial s}\right], \tag{A53}$$
the optimal control function is provided by
$$u^{*}(t,z) = -\frac{1}{2} R^{-1} B^{\top}\, \mathbb{E}_{p_t(x|z)}\!\left[\frac{\partial w(t,s)}{\partial s}\right]. \tag{A54}$$
We assume that $p_t(s)$ is provided by the Gaussian distribution
$$p_t(s) = \mathcal{N}\!\left(s \mid \mu(t), \Sigma(t)\right), \tag{A55}$$
and $w(t,s)$ is provided by the quadratic function
$$w(t,s) = s^{\top} \Pi(t)\, s + \alpha^{\top}(t)\, s + \beta(t). \tag{A56}$$
From the initial condition of the FP equation, $\mu(0) = \mu_0$ and $\Sigma(0) = \Sigma_0$ are satisfied. From the terminal condition of the HJB equation, $\Pi(T) = P$, $\alpha(T) = 0$, and $\beta(T) = 0$ are satisfied. In this case, the optimal control function (A54) can be calculated as follows:
$$u^{*}(t,z) = -\frac{1}{2} R^{-1} B^{\top}\!\left( 2 \Pi K \hat{s} + 2 \Pi \mu + \alpha \right), \tag{A57}$$
where we use (56). Because the optimal control function (A57) is linear with respect to $\hat{s}$, $p_t(s)$ is a Gaussian distribution, which is consistent with our assumption (A55).
By substituting (A56) and (A57) into the HJB Equation (28), we obtain the following ordinary differential equations:
$$-\frac{d\Pi}{dt} = Q + A^{\top} \Pi + \Pi A - \Pi B R^{-1} B^{\top} \Pi + Q', \tag{A58}$$
$$-\frac{d\alpha}{dt} = \left( A - B R^{-1} B^{\top} \Pi \right)^{\!\top} \alpha - 2 Q' \mu, \tag{A59}$$
$$-\frac{d\beta}{dt} = \mathrm{tr}\!\left( \Pi \sigma \sigma^{\top} \right) - \frac{1}{4}\, \alpha^{\top} B R^{-1} B^{\top} \alpha + \mu^{\top} Q' \mu, \tag{A60}$$
where $Q' := (I - K)^{\top} \Pi B R^{-1} B^{\top} \Pi (I - K)$. If $\Pi(t)$, $\alpha(t)$, and $\beta(t)$ satisfy (A58), (A59), and (A60), respectively, the HJB Equation (28) is satisfied, which is consistent with our assumption (A56).
By defining $Y(t)$ by $\alpha(t) = 2 Y(t) \mu(t)$, the optimal control function (A57) can be calculated as follows:
$$u^{*}(t,z) = -R^{-1} B^{\top}\!\left( \Pi K \hat{s} + (\Pi + Y)\, \mu \right). \tag{A61}$$
In this case, $\mu(t)$ obeys the following ordinary differential equation:
$$\frac{d\mu}{dt} = \left( A - B R^{-1} B^{\top} (\Pi + Y) \right) \mu. \tag{A62}$$
From $\alpha(t) = 2 Y(t) \mu(t)$, (A59), and (A62), $Y(t)$ obeys the following ordinary differential equation:
$$-\frac{dY}{dt} = \left( A - B R^{-1} B^{\top} \Pi \right)^{\!\top} Y + Y \left( A - B R^{-1} B^{\top} \Pi \right) - Y B R^{-1} B^{\top} Y - Q', \tag{A63}$$
where $Y(T) = O$.
By defining $\Psi(t) := \Pi(t) + Y(t)$, the optimal control function (A61) can be calculated as follows:
$$u^{*}(t,z) = -R^{-1} B^{\top}\!\left( \Pi K \hat{s} + \Psi \mu \right). \tag{A64}$$
From $\Psi(t) = \Pi(t) + Y(t)$, (A58), and (A63), $\Psi(t)$ obeys the following ordinary differential equation:
$$-\frac{d\Psi}{dt} = Q + A^{\top} \Psi + \Psi A - \Psi B R^{-1} B^{\top} \Psi, \tag{A65}$$
where $\Psi(T) = P$, which follows from $\Pi(T) = P$ and $Y(T) = O$. Therefore, the optimal control function (58) is obtained.
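As a rough numerical illustration of the structure of (A58) and (A65) (not of the full ML-POSC algorithm), the sketch below integrates both equations backward for a two-dimensional extended state with placeholder matrices. The matrix $K$ is held at an arbitrary fixed value here, whereas in the actual problem $K(t)$ is determined by (56) through the forward FP equation and the coupled forward-backward system is solved iteratively [34]. The correction term $Q'$ makes $\Pi$ differ from $\Psi$.

```python
import numpy as np

# Placeholder matrices for the extended state s = (x, z) (assumptions, for illustration only).
A = np.array([[-0.5, 0.0],
              [1.0, -1.0]])
B = np.array([[1.0],
              [0.0]])
Q = np.diag([1.0, 0.0])
R = np.array([[1.0]])
P = np.diag([2.0, 0.0])
K = np.array([[0.0, 1.0],
              [0.0, 1.0]])         # arbitrary fixed stand-in for K(t) defined by (56)
G = B @ np.linalg.solve(R, B.T)    # B R^{-1} B^T
I2 = np.eye(2)

dt, T = 1e-3, 3.0
Pi, Psi = P.copy(), P.copy()       # terminal conditions Pi(T) = Psi(T) = P
for _ in range(int(T / dt)):
    # (A65): standard Riccati equation for Psi.
    dPsi = Q + A.T @ Psi + Psi @ A - Psi @ G @ Psi
    # (A58): Riccati equation for Pi with the correction Q' = (I-K)^T Pi G Pi (I-K).
    Qp = (I2 - K).T @ Pi @ G @ Pi @ (I2 - K)
    dPi = Q + A.T @ Pi + Pi @ A - Pi @ G @ Pi + Qp
    Psi = Psi + dPsi * dt          # backward Euler step (the equations run backward in time)
    Pi = Pi + dPi * dt

print(np.round(Psi, 3))
print(np.round(Pi, 3))             # Pi differs from Psi because of the Q' term
```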

References

  1. Fox, R.; Tishby, N. Minimum-information LQG control part I: Memoryless controllers. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 5610–5616.
  2. Fox, R.; Tishby, N. Minimum-information LQG control Part II: Retentive controllers. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 5603–5609.
  3. Li, W.; Todorov, E. An Iterative Optimal Control and Estimation Design for Nonlinear Stochastic System. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 3242–3247.
  4. Li, W.; Todorov, E. Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system. Int. J. Control 2007, 80, 1439–1453.
  5. Nakamura, K.; Kobayashi, T.J. Connection between the Bacterial Chemotactic Network and Optimal Filtering. Phys. Rev. Lett. 2021, 126, 128102.
  6. Nakamura, K.; Kobayashi, T.J. Optimal sensing and control of run-and-tumble chemotaxis. Phys. Rev. Res. 2022, 4, 013120.
  7. Pezzotta, A.; Adorisio, M.; Celani, A. Chemotaxis emerges as the optimal solution to cooperative search games. Phys. Rev. E 2018, 98, 042401.
  8. Borra, F.; Cencini, M.; Celani, A. Optimal collision avoidance in swarms of active Brownian particles. J. Stat. Mech. Theory Exp. 2021, 2021, 083401.
  9. Bensoussan, A. Stochastic Control of Partially Observable Systems; Cambridge University Press: Cambridge, UK, 1992.
  10. Yong, J.; Zhou, X.Y. Stochastic Controls; Springer: New York, NY, USA, 1999.
  11. Nisio, M. Stochastic Control Theory. In Probability Theory and Stochastic Modelling; Springer: Tokyo, Japan, 2015; Volume 72.
  12. Fabbri, G.; Gozzi, F.; Święch, A. Stochastic Optimal Control in Infinite Dimension. In Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2017; Volume 82.
  13. Bensoussan, A.; Frehse, J.; Yam, S.C.P. The Master equation in mean field theory. J. de Math. Pures et Appl. 2015, 103, 1441–1474.
  14. Bensoussan, A.; Frehse, J.; Yam, S.C.P. On the interpretation of the Master Equation. Stoch. Process. Their Appl. 2017, 127, 2093–2137.
  15. Bensoussan, A.; Yam, S.C.P. Mean field approach to stochastic control with partial information. ESAIM Control Optim. Calc. Var. 2021, 27, 89.
  16. Hansen, E. An Improved Policy Iteration Algorithm for Partially Observable MDPs. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; Volume 10.
  17. Hansen, E.A. Solving POMDPs by Searching in Policy Space. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, USA, 24–26 July 1998; pp. 211–219.
  18. Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. 1998, 101, 99–134.
  19. Meuleau, N.; Kim, K.E.; Kaelbling, L.P.; Cassandra, A.R. Solving POMDPs by Searching the Space of Finite Policies. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 30 July–1 August 1999; pp. 417–426.
  20. Meuleau, N.; Peshkin, L.; Kim, K.E.; Kaelbling, L.P. Learning Finite-State Controllers for Partially Observable Environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 30 July–1 August 1999; pp. 427–436.
  21. Poupart, P.; Boutilier, C. Bounded Finite State Controllers. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 16.
  22. Amato, C.; Bonet, B.; Zilberstein, S. Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs. Proc. AAAI Conf. Artif. Intell. 2010, 24, 1052–1058.
  23. Bensoussan, A. Estimation and Control of Dynamical Systems. In Interdisciplinary Applied Mathematics; Springer International Publishing: Cham, Switzerland, 2018; Volume 48.
  24. Laurière, M.; Pironneau, O. Dynamic Programming for Mean-Field Type Control. J. Optim. Theory Appl. 2016, 169, 902–924.
  25. Pham, H.; Wei, X. Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM Control Optim. Calc. Var. 2018, 24, 437–461.
  26. Kushner, H.J.; Dupuis, P.G. Numerical Methods for Stochastic Control Problems in Continuous Time; Springer: New York, NY, USA, 1992.
  27. Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions, 2nd ed.; Number 25 in Applications of Mathematics; Springer: New York, NY, USA, 2006.
  28. Bensoussan, A.; Frehse, J.; Yam, P. Mean Field Games and Mean Field Type Control Theory; SpringerBriefs in Mathematics; Springer: New York, NY, USA, 2013.
  29. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications I; Probability Theory and Stochastic Modelling, Volume 83; Springer Nature: Cham, Switzerland, 2018.
  30. Carmona, R.; Delarue, F. Probabilistic Theory of Mean Field Games with Applications II. In Probability Theory and Stochastic Modelling; Springer International Publishing: Cham, Switzerland, 2018; Volume 84.
  31. Achdou, Y. Finite Difference Methods for Mean Field Games. In Hamilton-Jacobi Equations: Approximations, Numerical Analysis and Applications: Cetraro, Italy 2011; Lecture Notes in Mathematics; Achdou, Y., Barles, G., Ishii, H., Litvinov, G.L., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–47.
  32. Achdou, Y.; Laurière, M. Mean Field Games and Applications: Numerical Aspects. In Mean Field Games: Cetraro, Italy 2019; Lecture Notes in Mathematics; Achdou, Y., Cardaliaguet, P., Delarue, F., Porretta, A., Santambrogio, F., Eds.; Cardaliaguet, P., Porretta, A., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 249–307.
  33. Laurière, M. Numerical Methods for Mean Field Games and Mean Field Type Control. Mean Field Games 2021, 78, 221.
  34. Tottori, T.; Kobayashi, T.J. Pontryagin's Minimum Principle and Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control. arXiv 2022, arXiv:2210.13040.
  35. Carlini, E.; Silva, F.J. Semi-Lagrangian schemes for mean field game models. In Proceedings of the 52nd IEEE Conference on Decision and Control, Firenze, Italy, 10–13 December 2013; pp. 3115–3120.
  36. Carlini, E.; Silva, F.J. A Fully Discrete Semi-Lagrangian Scheme for a First Order Mean Field Game Problem. SIAM J. Numer. Anal. 2014, 52, 45–67.
  37. Carlini, E.; Silva, F.J. A semi-Lagrangian scheme for a degenerate second order mean field game system. Discret. Contin. Dyn. Syst. 2015, 35, 4269.
  38. Crisan, D.; Doucet, A. A survey of convergence results on particle filtering methods for practitioners. IEEE Trans. Signal Process. 2002, 50, 736–746.
  39. Budhiraja, A.; Chen, L.; Lee, C. A survey of numerical methods for nonlinear filtering problems. Phys. D Nonlinear Phenom. 2007, 230, 27–36.
  40. Bain, A.; Crisan, D. Fundamentals of Stochastic Filtering. In Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2009; Volume 60.
  41. Nayyar, A.; Mahajan, A.; Teneketzis, D. Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach. IEEE Trans. Autom. Control 2013, 58, 1644–1658.
  42. Charalambous, C.D.; Ahmed, N.U. Centralized Versus Decentralized Optimization of Distributed Stochastic Differential Decision Systems With Different Information Structures-Part I: A General Theory. IEEE Trans. Autom. Control 2017, 62, 1194–1209.
  43. Charalambous, C.D.; Ahmed, N.U. Centralized Versus Decentralized Optimization of Distributed Stochastic Differential Decision Systems With Different Information Structures-Part II: Applications. IEEE Trans. Autom. Control 2018, 63, 1913–1928.
  44. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; SpringerBriefs in Intelligent Systems; Springer International Publishing: Cham, Switzerland, 2016.
  45. Bernstein, D.S. Bounded Policy Iteration for Decentralized POMDPs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, UK, 30 July–5 August 2005; pp. 1287–1292.
  46. Bernstein, D.S.; Amato, C.; Hansen, E.A.; Zilberstein, S. Policy Iteration for Decentralized Control of Markov Decision Processes. J. Artif. Intell. Res. 2009, 34, 89–132.
  47. Amato, C.; Bernstein, D.S.; Zilberstein, S. Optimizing Memory-Bounded Controllers for Decentralized POMDPs. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, 19–22 July 2007; pp. 1–8.
  48. Tottori, T.; Kobayashi, T.J. Forward and Backward Bellman Equations Improve the Efficiency of the EM Algorithm for DEC-POMDP. Entropy 2021, 23, 551.
  49. Ruthotto, L.; Osher, S.J.; Li, W.; Nurbekyan, L.; Fung, S.W. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci. USA 2020, 117, 9183–9193.
  50. Lin, A.T.; Fung, S.W.; Li, W.; Nurbekyan, L.; Osher, S.J. Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games. Proc. Natl. Acad. Sci. USA 2021, 118, e2024713118.
Figure 1. Schematic diagram of (a) completely observable stochastic control (COSC), (b) partially observable stochastic control (POSC), and (c) memory-limited partially observable stochastic control (ML-POSC). The top and bottom figures represent the system and controller, respectively; $x_t \in \mathbb{R}^{d_x}$ is the state of the system; $y_t \in \mathbb{R}^{d_y}$, $z_t \in \mathbb{R}^{d_z}$, and $u_t \in \mathbb{R}^{d_u}$ are the observation, memory, and control of the controller, respectively. (a) In COSC, the controller can completely observe the state $x_t$, and determines the control $u_t$ based on the state $x_t$, i.e., $u_t = u(t, x_t)$. Only finite-dimensional memory is required to store the state $x_t$, and the optimal control $u_t^{*}$ is obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation, which is a partial differential equation. (b) In POSC, the controller cannot completely observe the state $x_t$; instead, it obtains the noisy observation $y_t$ of the state $x_t$. The control $u_t$ is determined based on the observation history $y_{0:t} := \{y_\tau \mid \tau \in [0,t]\}$, i.e., $u_t = u(t, y_{0:t})$. An infinite-dimensional memory is implicitly assumed to store the observation history $y_{0:t}$. Furthermore, to obtain the optimal control $u_t^{*}$, the Bellman equation (a functional differential equation) needs to be solved, which is generally intractable, even numerically. (c) In ML-POSC, the controller has access only to the noisy observation $y_t$ of the state $x_t$, as in POSC. In addition, it has only finite-dimensional memory $z_t$, which cannot completely memorize the observation history $y_{0:t}$. The controller of ML-POSC compresses the observation history $y_{0:t}$ into the finite-dimensional memory $z_t$, then determines the control $u_t$ based on the memory $z_t$, i.e., $u_t = u(t, z_t)$. The optimal control $u_t^{*}$ is obtained by solving the HJB equation (a partial differential equation), as in COSC.
Figure 2. Numerical simulation of the LQG problem with memory limitation. (a–c) Trajectories of the elements of $\Psi(t) \in \mathbb{R}^{2 \times 2}$ and $\Pi(t) \in \mathbb{R}^{2 \times 2}$. Because $\Psi_{zx}(t) = \Psi_{xz}(t)$ and $\Pi_{zx}(t) = \Pi_{xz}(t)$, $\Psi_{zx}(t)$ and $\Pi_{zx}(t)$ are not visualized. (d–f) Stochastic behaviors of the state $x_t$ (d), the memory $z_t$ (e), and the cumulative cost (f) for 100 samples. The expectation of the cumulative cost at $t = 10$ corresponds to the objective function (70). Blue and orange curves are controlled by (71) and (58), respectively.
Figure 3. Numerical simulation of the non-LQG problem for the local LQG approximation (blue) and ML-POSC (orange). (a) Stochastic behaviors of the state $x_t$ for 100 samples. The black rectangles and cross represent the obstacles and goal, respectively. (b) The objective function (74), computed from 100 samples.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
