Article

Learn Quasi-Stationary Distributions of Finite State Markov Chain

Zhiqiang Cai, Ling Lin and Xiang Zhou *
1 School of Data Science, City University of Hong Kong, Tat Chee Ave, Kowloon, Hong Kong, China
2 School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
3 Department of Mathematics, City University of Hong Kong, Tat Chee Ave, Kowloon, Hong Kong, China
* Author to whom correspondence should be addressed.
Entropy 2022, 24(1), 133; https://doi.org/10.3390/e24010133
Submission received: 7 November 2021 / Revised: 9 January 2022 / Accepted: 11 January 2022 / Published: 17 January 2022

Abstract

We propose a reinforcement learning (RL) approach to compute the expression of the quasi-stationary distribution. Based on the fixed-point formulation of the quasi-stationary distribution, we minimize the KL-divergence between two Markovian path distributions induced by a candidate distribution and the true target distribution. To solve this challenging minimization problem by gradient descent, we apply a reinforcement learning technique by introducing reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. Numerical examples of finite state Markov chains are presented to demonstrate the new method.

1. Introduction

The quasi-stationary distribution (QSD) describes the long time statistical behavior of a stochastic process that will almost surely be killed, when this process is conditioned to survive [1]. This concept has been widely used in applications, such as biology and ecology [2,3], chemical kinetics [4,5], epidemics [6,7,8], medicine [9] and neuroscience [10,11]. Many works on rare events in meta-stable systems also focus on the quasi-stationary distribution [12,13]. In addition, some new Monte Carlo sampling methods, for instance, the quasi-stationary Monte Carlo method [14,15], also arise by using the QSD instead of the true stationary distribution.
We are interested in the numerical computation of the QSD and focus on the finite state Markov chain in this paper. Mathematically, the quasi-stationary distribution can be solved as the principal left eigenvector of a sub-Markovian transition matrix. Thus, traditional numerical linear algebra methods can be applied to compute the quasi-stationary distribution on a finite state space, for example, the power method [16], the multi-grid method [17] and Arnoldi's algorithm [18]. These eigenvector methods produce a stochastic vector for the QSD instead of generating samples from the QSD.
In search of efficient algorithms for large state spaces, stochastic approaches aim at either sampling the QSD or computing its expression, and these methods can be applied or extended easily to continuous state spaces. A popular approach for sampling the quasi-stationary distribution is the Fleming–Viot stochastic method [19]. The Fleming–Viot method simulates N particles independently. When any one of the particles falls into the absorbing state and is killed, a new particle is uniformly selected from the remaining N − 1 surviving particles to replace the dead one, and the simulation continues. As time and N tend to infinity, the particles' empirical distribution converges to the quasi-stationary distribution.
In [20,21,22], the authors proposed to recursively update the expression of the QSD at each iteration based on the empirical distribution of a single-particle simulation. It is shown in [21] that the convergence rate can be $O(n^{-1/2})$, where $n$ is the iteration number. This method was later improved in [23,24] by applying the stochastic approximation method [25] and the Polyak–Ruppert averaging technique [26]. These improved algorithms allow a flexible choice of step size but require a projection operator onto the probability simplex, which carries some extra computational overhead that increases with the number of states. Ref. [15] extended the algorithm to the diffusion process.
In this paper, we focus on how to compute the expression of the quasi-stationary distribution, which is denoted by $\alpha(x)$ on a metric space $E$. If $E$ is finite, $\alpha$ is a probability vector, and if $E$ is a domain in $\mathbb{R}^d$, then $\alpha$ is a probability density function on $E$. We assume $\alpha$ can be numerically represented in a parametric form $\alpha_\theta$ with $\theta \in \Theta$. This family $\{\alpha_\theta\}$ can be in tabular form or any neural network. Then, the problem of finding the QSD $\alpha$ becomes the question of how to compute the optimal parameter $\theta$ in $\Theta$. We call this problem the learning problem for QSD. In addition, we want to learn the QSD directly and not use the distribution family $\{\alpha_\theta\}$ to fit simulated samples generated by other traditional simulation methods.
Our minimization problem for the QSD is similar to variational inference (VI) [27], which minimizes an objective functional measuring the distance between the target and candidate distributions. However, unlike mainstream VI methods such as the evidence lower bound (ELBO) technique [28] or particle-based [29] and flow-based methods [30], our approach is based on recent important progress in reinforcement learning (RL) [31], particularly the policy gradient method and the actor-critic algorithm. We regard the learning process of the quasi-stationary distribution as an interaction with an environment, which is constructed from the defining property of the QSD. Reinforcement learning has recently shown tremendous advancements and remarkable successes in applications (e.g., [32,33,34]). The RL framework provides an innovative and powerful modeling and computation approach for many scientific computing problems.
The essential question is how to formulate the QSD problem as an RL problem. Firstly, for the sub-Markovian kernel $K$ of a Markov process, we can define a Markovian kernel $K_\alpha$ on $E$ (see Definition 1); then the QSD is defined by the equation $\alpha = \alpha K_\alpha$, which says that $\alpha$ as the initial distribution equals the distribution after one step. Secondly, we consider an optimal $\alpha$ (in our parametric family of distributions) to minimize the Kullback–Leibler divergence (i.e., relative entropy) of two path distributions, denoted by $\mathbb{P}$ and $\mathbb{Q}$, associated with the two Markovian kernels $K_\alpha$ and $K_\beta$, where $\beta := \alpha K_\alpha$. Thirdly, inspired by the recent work [35] on using RL for rare event sampling problems, we transform the minimization of the KL divergence between $\mathbb{P}$ and $\mathbb{Q}$ into the maximization of a time-averaged reward function and define the corresponding value function $V(x)$ at each state $x$. This completes our modeling of RL for the quasi-stationary distribution problem. Lastly, we derive the policy gradient theorem (Theorem 1) to compute the gradient of the averaged reward with respect to $\theta$, which drives the learning dynamics. This is known as the "actor" part. The "critic" part is to learn the value function $V$ in its parametric form $V_\psi$. The actor-critic algorithm uses stochastic gradient descent to train the parameter $\theta$ for the action $\alpha_\theta$ and the parameter $\psi$ for the value function $V_\psi$ (see Algorithm 1).
Our contribution is that we are the first to devise a method to transform the QSD problem into an RL problem. Similar to [35], our paper also uses the KL-divergence to define the RL problem. However, our paper fully exploits the unique property of the QSD, namely the fixed point problem $\alpha = \alpha K_\alpha$, to define the RL problem.
Our learning method allows flexible parametrization of the distributions and uses the stochastic gradient method to train the optimal distribution. It is easy to implement, and the optimization scales up to large state spaces. The numerical examples we tested show that our method converges faster than other existing methods [22,23].
Finally, we remark that our method works very well for the QSD of a strictly sub-Markovian kernel $K$ but is not applicable to computing the invariant distribution when $K$ is Markovian. This is because we transform the problem into a variational problem between two Markovian kernels $K_\alpha$ and $K_\beta$ (where $\beta = \alpha K_\alpha$). Note that $K_\alpha(x,y) = K(x,y) + (1 - K(x,E))\,\alpha(y)$ (Definition 1), and our method is based on the fact that $\alpha = \beta$ if and only if $K_\alpha = K_\beta$. If $K$ is a Markovian kernel, then $K_\alpha \equiv K$ for any $\alpha$, and our method cannot work. Thus, $K(x,E)$ has to be strictly less than 1 for some $x \in E$.
This paper is organized as follows. Section 2 is a short review of the quasi-stationary distribution and some basic simulation methods for the QSD. In Section 3, we first formulate the reinforcement learning problem via the KL-divergence and derive the policy gradient theorem (Theorem 1). Using this formulation, we then develop the actor-critic algorithm to estimate the quasi-stationary distribution. In Section 4, the efficiency of our algorithm is illustrated by two examples (each with two cases), compared with the simulation methods in [24].
Algorithm 1 (ac-$\alpha$ method): Actor-critic algorithm for the quasi-stationary distribution $\alpha_\theta$
(The pseudocode of Algorithm 1 appears as an image in the published version; it combines the updates (21), (23) and (24) described in Section 3.3.)

2. Problem Setup and Review

2.1. Quasi-Stationary Distribution

We start with an abstract setting. Let $E$ be a finite state space equipped with the Borel $\sigma$-field $\mathcal{B}(E)$, and let $\mathcal{P}(E)$ be the space of probability measures over $E$. A sub-Markovian kernel on $E$ is defined as a map $K: E \times \mathcal{B}(E) \to [0,1]$ such that for all $x \in E$, $A \mapsto K(x,A)$ is a nonzero measure with $K(x,E) \le 1$, and for all $A \in \mathcal{B}(E)$, $x \mapsto K(x,A)$ is measurable. In particular, if $K(x,E) = 1$ for all $x \in E$, then $K$ is called a Markovian kernel. Throughout the paper, we assume that $K$ is strictly sub-Markovian, i.e., $K(x,E) < 1$ for some $x$.
Let $X_t$ be a Markov chain with values in $E \cup \{\partial\}$, where $\partial$ denotes an absorbing state. We define the extinction time
$$\tau := \inf\{\, t > 0 : X_t = \partial \,\}.$$
We define the quasi-stationary distribution (QSD) $\alpha$ as the long time limit of the conditional distribution, if there exists a probability distribution $\nu$ on $E$ such that the following holds:
$$\alpha(A) := \lim_{t \to \infty} \mathbb{P}_\nu\left( X_t \in A \mid \tau > t \right), \quad A \in \mathcal{B}(E), \qquad (1)$$
where $\mathbb{P}_\nu$ refers to the probability distribution of $X_t$ associated with the initial distribution $\nu$ on $E$. Such a conditional distribution describes well the behavior of the process before extinction, and it is easy to see that $\alpha$ satisfies the following fixed point problem:
$$\mathbb{P}_\alpha\left( X_t \in A \mid \tau > t \right) = \alpha(A), \qquad (2)$$
where $\mathbb{P}_\alpha$ refers to the probability distribution of $X_t$ associated with the initial distribution $\alpha$ on $E$. Equation (2) is equivalent to the following stationarity condition:
$$\alpha = \frac{\alpha K}{\alpha K \mathbf{1}}, \quad \text{or} \quad \alpha(y) = \frac{\sum_x \alpha(x) K(x,y)}{\sum_x \alpha(x) K(x,E)}, \qquad (3)$$
where $\alpha$ is a row vector, $\mathbf{1}$ denotes the column vector with all entries equal to one, and
$$K(x,E) = \sum_{x' \in E} K(x, x').$$
For any sub-Markovian kernel $K$, we can associate with $K$ a Markovian kernel $\widetilde{K}$ on $E \cup \{\partial\}$ defined by the following:
$$\widetilde{K}(x, A) = K(x, A), \qquad \widetilde{K}(x, \{\partial\}) = 1 - K(x, E), \qquad \widetilde{K}(\partial, \{\partial\}) = 1,$$
for all $x \in E$, $A \in \mathcal{B}(E)$. The kernel $\widetilde{K}$ can be understood as the Markovian transition kernel of the Markov chain $(X_t)$ on $E \cup \{\partial\}$ whose transitions in $E$ are specified by $K$, but which is "killed" forever once it leaves $E$.
In this paper, we assume $E$ is a finite state space and the process under consideration has a unique QSD. Assuming that $K$ is irreducible, the existence and uniqueness of the quasi-stationary distribution follow from the Perron–Frobenius theorem [36].
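As a concrete illustration of this eigenvector characterization (a minimal sketch of ours, not part of the method developed below), the QSD of a small strictly sub-Markovian matrix can be computed as the normalized principal left eigenvector; the matrix entries here are arbitrary illustrative values.

```python
import numpy as np

# A toy strictly sub-Markovian matrix K on E = {0, 1, 2}:
# every row sums to at most 1, and some row sums are < 1 (killing).
K = np.array([[0.5, 0.3, 0.1],
              [0.2, 0.5, 0.2],
              [0.1, 0.3, 0.5]])

# The QSD is the principal left eigenvector of K, normalized to a probability vector.
eigvals, eigvecs = np.linalg.eig(K.T)            # columns are left eigenvectors of K
alpha = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
alpha /= alpha.sum()

print("QSD alpha:", alpha)
# Consistency with the fixed point (3): alpha K / (alpha K 1) should reproduce alpha.
print("alpha K / (alpha K 1):", (alpha @ K) / (alpha @ K).sum())
```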
An important Markovian kernel is the following K α , which is defined on E only and has a “regenerative probability” α .
Definition 1.
For any given $\alpha \in \mathcal{P}(E)$ and a sub-Markovian kernel $K$ on $E$, we define $K_\alpha$, a Markovian kernel on $E$, as follows:
$$K_\alpha(x, A) := K(x, A) + \big(1 - K(x, E)\big)\,\alpha(A) \qquad (4)$$
for all $x \in E$ and $A \in \mathcal{B}(E)$.
$K_\alpha$ is a Markovian kernel because $K_\alpha(x, E) = 1$. It is easy to sample $X_{t+1} \sim K_\alpha(X_t, \cdot)$ from any state $X_t \in E$: run the transition as usual by using $\widetilde{K}$ to obtain a next state, denoted by $Y$; then $X_{t+1} = Y$ if $Y \in E$; otherwise, sample $X_{t+1}$ from $\alpha$.
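As a small illustration of this sampling rule (our sketch; the function name is ours), one transition of $K_\alpha$ for a finite chain can be drawn as follows, given the sub-Markovian matrix K and a candidate distribution alpha.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_K_alpha(x, K, alpha):
    """One step of the Markovian kernel K_alpha on E = {0, ..., m-1} (Definition 1).

    With probability K(x, E) the chain moves inside E according to K;
    otherwise it is killed and immediately regenerated from alpha.
    """
    m = K.shape[0]
    survive_prob = K[x].sum()                          # K(x, E)
    if rng.random() < survive_prob:
        return rng.choice(m, p=K[x] / survive_prob)    # ordinary transition inside E
    return rng.choice(m, p=alpha)                      # regeneration from alpha
```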
We know that $\alpha$ is the quasi-stationary distribution of $K$ if and only if it is the stationary distribution of $K_\alpha$:
$$\alpha = \alpha K_\alpha. \qquad (5)$$
It is easy to observe that $\alpha = \beta$ if and only if $K_\alpha = K_\beta$ for any two distributions $\alpha$ and $\beta$. Moreover, for every $\alpha$, $K_\alpha$ has a unique invariant probability, denoted by $\Gamma(\alpha)$. The map $\alpha \mapsto \Gamma(\alpha)$ is continuous on $\mathcal{P}(E)$ (with the topology of weak convergence), and there exists $\alpha \in \mathcal{P}(E)$ such that $\alpha = \Gamma(\alpha)$ or, equivalently, $\alpha$ is a QSD for $K$.

2.2. Review of Simulation Methods for Quasi-Stationary Distribution

According to the above subsection, the QSD $\alpha$ satisfies the following fixed point problem:
$$\alpha = \Gamma(\alpha), \qquad (6)$$
where $\Gamma(\alpha)$ is the stationary distribution of $K_\alpha$ on $E$. In general, (6) can be solved recursively by $\alpha_{n+1} \approx \Gamma(\alpha_n)$.
The Fleming–Viot (FV) method [19] evolves $N$ particles, each moving independently of the others as a Markov chain with transition kernel $\widetilde{K}$, until one of them jumps to the absorbing state $\partial$. At that time, the killed particle is immediately reset to a state in $E$ chosen uniformly from one of the remaining $N-1$ particles, and the simulation continues. The QSD $\alpha$ is approximated by the empirical distribution of all $N$ particles, and these particles can be regarded as samples from the quasi-stationary distribution $\alpha$, in the same way as samples produced by an MCMC method.
Ref. [37] proposed a simulation method that uses only one particle at each iteration to update $\alpha$. At iteration $n$, given $\alpha_n \in \mathcal{P}(E)$, one runs a discrete-time Markov chain $X^{(n+1)}$ as usual with initial state $X_0^{(n+1)} \sim \alpha_n$ until it is killed; then, $\alpha_{n+1}$ is computed as the following weighted average of empirical distributions:
$$\alpha_{n+1}(x) := \alpha_n(x) + \frac{\frac{1}{n+1}\left(\sum_{k=0}^{\tau^{(n+1)}-1} I\big\{X_k^{(n+1)} = x \,\big|\, X_0^{(n+1)} \sim \alpha_n\big\} - \tau^{(n+1)}\,\alpha_n(x)\right)}{\frac{1}{n+1}\sum_{j=1}^{n+1} \tau^{(j)}}, \qquad (7)$$
where $n \ge 0$, $I$ is the indicator function, and $\tau^{(j)} = \min\{k \ge 0 : X_k^{(j)} = \partial\}$ is the first extinction time of the process $X^{(j)}$. This iterative scheme has a convergence rate of $O(1/\sqrt{n})$.
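As a reading aid (ours, under the finite-state setting of Section 4), the vanilla scheme (7) is equivalent to a running average of the occupation measure over killed trajectories, which can be coded as follows.

```python
import numpy as np

rng = np.random.default_rng(1)

def vanilla_qsd(K, n_iters=10_000):
    """Vanilla scheme (7): running average of occupation measures of killed trajectories."""
    m = K.shape[0]
    alpha = np.full(m, 1.0 / m)        # current estimate alpha_n
    counts = np.zeros(m)               # total number of visits to each state
    total_time = 0.0                   # running sum of extinction times tau^(j)
    for _ in range(n_iters):
        x = rng.choice(m, p=alpha)     # X_0 ~ alpha_n
        while True:
            counts[x] += 1
            total_time += 1
            p_survive = K[x].sum()                      # K(x, E)
            if rng.random() >= p_survive:               # killed: trajectory ends
                break
            x = rng.choice(m, p=K[x] / p_survive)       # next state inside E
        alpha = counts / total_time    # equivalent recursive form of (7)
    return alpha
```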
In [23,24], the above method is extended to the stochastic approximation framework:
$$\alpha_{n+1}(x) = \Theta_H\!\left[\alpha_n(x) + \epsilon_n\left(\sum_{k=0}^{\tau^{(n+1)}-1} I\big\{X_k^{(n+1)} = x \,\big|\, X_0^{(n+1)} \sim \alpha_n\big\} - \alpha_n(x)\right)\right], \qquad (8)$$
where $\Theta_H$ denotes the $L_2$ projection onto the probability simplex, and $\epsilon_n$ is a step size satisfying $\sum_n \epsilon_n = \infty$ and $\sum_n \epsilon_n^2 < \infty$. Specifically, if $\epsilon_n = O(n^{-r})$ for $0.5 < r < 1$, then under suitable conditions $n^{r/2}(\alpha_n - \alpha) \xrightarrow{d} \mathcal{N}(0, V)$ for some matrix $V$ [23,24]. If the Polyak–Ruppert averaging technique is applied to generate
$$\nu_n := \frac{1}{n}\sum_{k=1}^{n} \alpha_k, \qquad (9)$$
then the convergence rate of $\nu_n - \alpha$ becomes $O(1/\sqrt{n})$ [23,24].
The simulation schemes (7) and (8) need to sample the initial states according to $\alpha_n$ and to add the empirical distribution and $\alpha_n$ pointwise at each state $x$. Thus, they are suitable for finite state spaces where $\alpha$ is a probability vector stored in tabular form. In (8), there is no need to record all exit times $\tau^{(j)}$, $j = 1, \ldots, n$, but the additional projection operation in (8) is computationally expensive since its cost is $O(m \log m)$, where $m = |E|$ [38,39].
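For completeness, a standard $O(m \log m)$ routine for the Euclidean projection onto the probability simplex, in the spirit of [39] (this particular sketch is ours), is the following.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean (L2) projection of a real vector v onto the probability simplex."""
    m = v.size
    u = np.sort(v)[::-1]                          # sort entries in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)            # shift that enforces the constraints
    return np.maximum(v + tau, 0.0)
```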

3. Learn Quasi-Stationary Distribution

We focus on the computation of the expression of the quasi-stationary distribution. In particular, when this distribution is parametrized in a certain manner by $\theta$, we can extend the tabular form for a finite-state Markov chain to any flexible form, even to a neural network for a probability density function in $\mathbb{R}^d$. However, we do not pursue this representation and expressivity issue here and restrict our discussion to finite state spaces only, to illustrate our main idea first. On a finite state space, $\alpha(x)$ for $x \in E = \{1, \ldots, m\}$ can simply be described by a softmax function with $m-1$ parameters $\theta_i$: $\alpha(i) \propto e^{\theta_i}$, $1 \le i \le m-1$ (with $\theta_m = 0$). This introduces no representation error. For the generalization to continuous spaces $E$ in jump and diffusion processes, or even for a huge finite state space, a good representation of $\alpha_\theta(x)$ is important in practice.
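For instance, the softmax parametrization and the gradient of $\ln \alpha_\theta$ (which will be needed for the policy gradient below) can be written as the following small sketch of ours, using the convention $\theta_m = 0$.

```python
import numpy as np

def softmax_alpha(theta):
    """alpha_theta on E = {1, ..., m}, with the last logit fixed to zero (theta_m = 0)."""
    logits = np.append(theta, 0.0)
    logits -= logits.max()                 # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def grad_log_alpha(theta, i):
    """Gradient of ln alpha_theta(i) with respect to the free parameters theta_1, ..., theta_{m-1}.

    Here i is the 0-based index of the state; d ln alpha(i) / d theta_j = 1{j = i} - alpha(j).
    """
    alpha = softmax_alpha(theta)
    g = -alpha[:-1].copy()
    if i < theta.size:                     # no free parameter corresponds to the last state
        g[i] += 1.0
    return g
```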
In this section, we formulate our QSD problem in terms of reinforcement learning (RL) so that the problem of seeking the optimal parameter becomes a policy optimization problem. We derive the policy gradient theorem to construct a gradient-based method for the optimal parameter. We then show how to design an actor-critic algorithm based on stochastic optimization.

3.1. Formulation of RL and Policy Gradient Theorem

Before introducing the RL method for our QSD problem, we develop a general formulation by introducing the KL-divergence between two path distributions.
Let $P_\theta$ and $Q_\theta$ be two families of Markovian kernels on $E$ in parametric form with the same set of parameters $\theta \in \Theta$. Assume both $P_\theta$ and $Q_\theta$ are ergodic for any $\theta$. Let $T > 0$ and denote a path up to time $T$ by $\omega_0^T = (X_0, X_1, \ldots, X_T) \in E^{T+1}$. Define the path distributions under the Markov chain kernels $P_\theta$ and $Q_\theta$, respectively:
$$\mathbb{P}_\theta(\omega_0^T) := \prod_{t=1}^{T} P_\theta(X_t \mid X_{t-1}), \qquad \mathbb{Q}_\theta(\omega_0^T) := \prod_{t=1}^{T} Q_\theta(X_t \mid X_{t-1}).$$
Define the KL divergence from $\mathbb{P}_\theta$ to $\mathbb{Q}_\theta$ on $E^{T+1}$:
$$D_{KL}(\mathbb{P}_\theta \,\|\, \mathbb{Q}_\theta) := \sum_{\omega_0^T} \mathbb{P}_\theta(\omega_0^T) \ln\frac{\mathbb{P}_\theta(\omega_0^T)}{\mathbb{Q}_\theta(\omega_0^T)} = -\,\mathbb{E}_{\mathbb{P}_\theta}\left[\sum_{t=1}^{T} R_\theta(X_{t-1}, X_t)\right], \qquad (11)$$
where the expectation $\mathbb{E}_{\mathbb{P}_\theta}$ is over the path $(X_0, X_1, \ldots, X_T)$ generated by the transition kernel $P_\theta$, and the following is called the (one-step) reward:
$$R_\theta(X_{t-1}, X_t) := -\ln\frac{P_\theta(X_t \mid X_{t-1})}{Q_\theta(X_t \mid X_{t-1})}. \qquad (12)$$
Define the average reward $r(\theta)$ as the time-averaged negative KL divergence in the limit $T \to \infty$:
$$r(\theta) := -\lim_{T\to\infty} \frac{1}{T}\, D_{KL}(\mathbb{P}_\theta \,\|\, \mathbb{Q}_\theta) = \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\mathbb{P}_\theta}\left[\sum_{t=1}^{T} R_\theta(X_{t-1}, X_t)\right]. \qquad (13)$$
Due to the ergodicity of $P_\theta$, $r(\theta) = \sum_{x_0, x_1} R_\theta(x_0, x_1)\, P_\theta(x_1 \mid x_0)\, \mu_\theta(x_0)$, where $\mu_\theta$ is the invariant measure of $P_\theta$; in particular, $r(\theta)$ is independent of the initial state $X_0$. Obviously, $r(\theta) \le 0$ for any $\theta$.
Property 1.
The following are equivalent:
1. $r(\theta)$ reaches its maximal value 0 at $\theta^*$;
2. $\mathbb{P}_{\theta^*} = \mathbb{Q}_{\theta^*}$ in $\mathcal{P}(E^{T+1})$ for any $T > 0$;
3. $P_{\theta^*} = Q_{\theta^*}$;
4. $R_{\theta^*} \equiv 0$.
Proof. 
We only need to show $(1) \Leftrightarrow (3)$. It is easy to see that
$$r(\theta) = -\sum_{x_0} D_{KL}\big(P_\theta(\cdot \mid x_0)\,\big\|\,Q_\theta(\cdot \mid x_0)\big)\, \mu_\theta(x_0).$$
If $r(\theta) = 0$, then since $\mu_\theta > 0$,
$$D_{KL}\big(P_\theta(\cdot \mid x_0)\,\big\|\,Q_\theta(\cdot \mid x_0)\big) = 0 \quad \forall x_0.$$
Thus, we have $P_\theta = Q_\theta$. □
The above property establishes the relationship between the RL problem and QSD problem.
We state our main theoretical result below as the foundation of the algorithm to be developed later. This theorem can be regarded as a version of the policy gradient theorem underlying the policy gradient method in reinforcement learning [31].
Define the following value function ([31], Chapter 13):
$$V(x) := \lim_{T\to\infty} \sum_{t=1}^{T} \mathbb{E}_{\mathbb{P}_\theta}\big[\, R_\theta(X_{t-1}, X_t) - r(\theta) \,\big|\, X_0 = x \,\big]. \qquad (14)$$
Certainly, V also depends on θ , although we do not write θ explicitly.
Theorem 1 (policy gradient theorem).
We have the following two properties:
1. At any $\theta$, for any $x \in E$, the following Bellman-type equation holds for the value function $V$ and the average reward $r(\theta)$:
$$V(x) = \mathbb{E}_{Y \sim P_\theta(\cdot \mid x)}\big[\, V(Y) + R_\theta(x, Y) - r(\theta) \,\big]. \qquad (15)$$
2. The gradient of the average reward $r(\theta)$ is the following:
$$\nabla_\theta r(\theta) = \mathbb{E}\big[\nabla_\theta \ln Q_\theta(Y \mid X)\big] + \mathbb{E}\Big[\big(V(Y) - V(X) + R_\theta(X, Y) - r(\theta)\big)\, \nabla_\theta \ln P_\theta(Y \mid X)\Big], \qquad (16)$$
where the expectations are over the joint distribution $(X, Y) \sim \mu_\theta(x)\, P_\theta(y \mid x)$ and $\mu_\theta$ is the stationary measure of $P_\theta$.
Proof. 
We shall prove the Bellman equation first and then use it to derive the gradient of the average reward $r(\theta)$. For any $x_0 \in E$, writing $\omega_0^T = (x_0, \ldots, x_T)$ and defining
$$\Delta R_\theta(\omega_0^T) = \sum_{t=1}^{T} \big( R_\theta(x_{t-1}, x_t) - r(\theta) \big),$$
we have the following:
$$\begin{aligned}
V(x_0) &= \lim_{T\to\infty} \mathbb{E}_{\mathbb{P}_\theta}\big[\, \Delta R_\theta(\omega_0^T) \,\big|\, X_0 = x_0 \,\big] \\
&= \lim_{T\to\infty} \sum_{x_1} \sum_{x_2, \ldots, x_T} \prod_{t=2}^{T} P_\theta(x_t \mid x_{t-1})\, P_\theta(x_1 \mid x_0)\, \Delta R_\theta(\omega_0^T) \\
&= \lim_{T\to\infty} \sum_{x_1} P_\theta(x_1 \mid x_0) \left[ \sum_{x_2, \ldots, x_T} \prod_{t=2}^{T} P_\theta(x_t \mid x_{t-1})\, \Delta R_\theta(\omega_1^T) + \Delta R_\theta(\omega_0^1) \right] \\
&= \sum_{x_1} P_\theta(x_1 \mid x_0) \left[ \lim_{T\to\infty} \sum_{x_2, \ldots, x_T} \prod_{t=2}^{T} P_\theta(x_t \mid x_{t-1})\, \Delta R_\theta(\omega_1^T) + \Delta R_\theta(\omega_0^1) \right] \\
&= \sum_{x_1} P_\theta(x_1 \mid x_0) \big[ V(x_1) + R_\theta(x_0, x_1) - r(\theta) \big],
\end{aligned}$$
which proves (15); in other words, we have the following:
$$r(\theta) = \mathbb{E}_{Y \sim P_\theta(\cdot \mid x)}\big[\, V(Y) + R_\theta(x, Y) \,\big] - V(x), \quad \forall x \in E.$$
Next, we compute the gradient of $r(\theta)$. By the trivial equality
$$\sum_{x_1} P_\theta(x_1 \mid x_0)\, \nabla_\theta \ln P_\theta(x_1 \mid x_0) = \nabla_\theta \sum_{x_1} P_\theta(x_1 \mid x_0) = 0 \qquad (18)$$
and the definition (12) of the reward, we can write the gradient of $r(\theta)$ as follows:
$$\nabla_\theta r(\theta) = \sum_{y} \nabla_\theta P_\theta(y \mid x)\big[ V(y) + R_\theta(x, y) - V(x) \big] + \sum_{y} P_\theta(y \mid x)\big[ \nabla_\theta V(y) - \nabla_\theta V(x) + \nabla_\theta \ln Q_\theta(y \mid x) \big].$$
We keep the term $V(x)$ in the first sum, even though it makes no contribution there (in fact, adding any constant to $V(x)$ is also fine). Since this equation holds for every state $x$ on the right-hand side, we take the expectation with respect to $\mu_\theta$, the stationary distribution of $P_\theta$. Thus, we have the following:
$$\begin{aligned}
\nabla_\theta r(\theta) &= \sum_{x,y} \mu_\theta(x)\, \nabla_\theta P_\theta(y \mid x)\big[ V(y) + R_\theta(x,y) - V(x) \big] + \sum_{x,y} \mu_\theta(x)\, P_\theta(y \mid x)\big[ \nabla_\theta V(y) - \nabla_\theta V(x) + \nabla_\theta \ln Q_\theta(y \mid x) \big] \\
&= \sum_{x,y} \mu_\theta(x)\, \nabla_\theta P_\theta(y \mid x)\big[ V(y) + R_\theta(x,y) - V(x) \big] + \sum_{y} \mu_\theta(y)\, \nabla_\theta V(y) - \sum_{x} \mu_\theta(x)\, \nabla_\theta V(x) + \sum_{x,y} \mu_\theta(x)\, P_\theta(y \mid x)\, \nabla_\theta \ln Q_\theta(y \mid x) \\
&= \sum_{x,y} \mu_\theta(x)\, P_\theta(y \mid x)\big[ V(y) + R_\theta(x,y) - V(x) \big] \nabla_\theta \ln P_\theta(y \mid x) + \sum_{x,y} \mu_\theta(x)\, P_\theta(y \mid x)\, \nabla_\theta \ln Q_\theta(y \mid x).
\end{aligned}$$
In fact, we can add any constant number $b$ (independent of $x$ and $y$) inside the squared bracket of the last line without changing the equality, due to the following fact, similar to (18): $\sum_{x,y} \mu_\theta(x)\, \nabla_\theta P_\theta(y \mid x) = \sum_{y} \mu_\theta(y)\, \nabla_\theta \sum_{x} P_\theta(x \mid y) = 0$. Equation (16) is the special case $b = -r(\theta)$. □
Remark 1.
As shown in the proof, (16) holds if r ( θ ) at the right-hand side is replaced by any constant number b. b = r ( θ ) is a good choice to reduce the variance since r ( θ ) can be regarded as the expectation of R θ .
Remark 2.
If P θ = Q θ , then the first term of (16) vanishes due to (18) and the second term of (16) vanishes due to (15).
Remark 3.
The name of “policy” here refers to the role of θ as the policy for decision makers to improve reward r ( θ ) .

3.2. Learn QSD

Now, we discuss how to connect the QSD with the results in the previous subsection. In view of Equation (5), we introduce $\beta := \alpha K_\alpha$ as the one-step distribution when starting from the initial distribution $\alpha$; in other words, we have the following:
$$\beta(y) := \sum_{x \in E} \alpha(x)\, K_\alpha(x, y), \quad \forall y \in E. \qquad (19)$$
By (5), $\alpha$ is a QSD if and only if $\beta = \alpha$. However, we do not directly compare these two distributions $\alpha$ and $\beta$. Instead, we consider the Markovian kernels they induce via (4): $K_\alpha$ and $K_\beta$. Our approach is to consider the KL divergence, similar to (11), between the two kernels $K_\alpha$ and $K_\beta$, since $\alpha = \beta$ if and only if $K_\alpha = K_\beta$. In this manner, one can view $K_\alpha$ and $K_\beta$ (note $\beta = \alpha K_\alpha$) as the two transition matrices $P_\theta$ and $Q_\theta$ of the previous section, in which the parameter $\theta$ is in fact the distribution $\alpha$.
To have a further representation of the distribution $\alpha$, which is a (probability mass) function on $E$, we propose a parametrized family for $\alpha$ in the form $\alpha_\theta$, where $\theta$ is a generic parameter. In the simplest case, $\alpha_\theta$ takes the so-called softmax form $\alpha_\theta(i) = e^{\theta_i} / \sum_{j=1}^{N} e^{\theta_j}$ for $E = \{1, \ldots, N\}$ and $\theta = (\theta_1, \ldots, \theta_{N-1}, \theta_N \equiv 0)$. This parametrization represents $\alpha$ without any approximation error for a finite state space, and the effective parameter space of $\theta$ is just $\mathbb{R}^{N-1}$. For certain problems, particularly with a large state space, if one has some prior knowledge about the structure of the function $\alpha$ on $E$, one might propose other parametric forms of $\alpha_\theta$ with the dimension of $\theta$ less than the cardinality $|E|$ to improve the efficiency, although this introduces an extra representation error.
For any given $\alpha_\theta \in \mathcal{P}(E)$, the corresponding Markovian kernel $K_{\alpha_\theta}$ is defined by (4), $\beta_\theta = \alpha_\theta K_{\alpha_\theta}$ is defined by (19), and $K_{\beta_\theta}$ is likewise defined by (4) again. To use the formulation in Section 3.1, we choose $P_\theta = K_{\alpha_\theta}$ and $Q_\theta = K_{\beta_\theta}$. Define the objective function as before:
$$r(\theta) := -\lim_{T\to\infty} \frac{1}{T}\, D_{KL}(\mathbb{P}_\theta \,\|\, \mathbb{Q}_\theta) = \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\mathbb{P}_\theta}\left[\sum_{t=1}^{T} R_\theta(X_{t-1}, X_t)\right],$$
where the reward is now the following:
$$R_\theta(x, y) = -\ln\frac{K_{\alpha_\theta}(x, y)}{K_{\beta_\theta}(x, y)}.$$
The value function $V(x)$ is defined similarly. Theorem 1 now provides the expression of the gradient:
$$\nabla_\theta r(\theta) = \mathbb{E}\Big[\big(R_\theta(X, Y) - r(\theta) + V(Y) - V(X)\big)\, \nabla_\theta \ln K_{\alpha_\theta}(X, Y) + \nabla_\theta \ln K_{\beta_\theta}(X, Y)\Big], \qquad (20)$$
where $(X, Y) \sim \mu_\theta(x)\, K_{\alpha_\theta}(x, y)$ and $\mu_\theta$ is the stationary measure of $K_{\alpha_\theta}$.
The optimal $\theta^*$ for the QSD $\alpha_\theta$ maximizes $r(\theta)$, which can be solved by the gradient ascent iteration
$$\theta_{t+1} = \theta_t + \eta_t^\theta\, \nabla_\theta r(\theta_t),$$
where $\eta_t^\theta > 0$ is the step size. In practice, the stochastic gradient is applied:
$$\nabla_\theta r(\theta_t) \approx \nabla_\theta \ln K_{\alpha_\theta}(X_t, X_{t+1}) \times \delta(X_t, X_{t+1}) + \nabla_\theta \ln K_{\beta_\theta}(X_t, X_{t+1}), \qquad (21)$$
where $X_t, X_{t+1}$ are sampled from the Markovian kernel $K_{\alpha_\theta}$ (see Algorithm 1) and the temporal difference (TD) error $\delta_t$ is as follows:
$$\delta_t = \delta(X_t, X_{t+1}) = R_\theta(X_t, X_{t+1}) - r(\theta_t) + V(X_{t+1}) - V(X_t). \qquad (22)$$
Next, we need to address the remaining issues: how to compute the value function $V$ and the average reward $r(\theta_t)$ appearing in the TD error (22), and how to compute $\nabla_\theta \ln K_{\alpha_\theta}$ and $\nabla_\theta \ln K_{\beta_\theta}$.

3.3. Actor-Critic Algorithm

With the stochastic gradient method (21), we can obtain the optimal policy $\theta^*$. We refer to (21) as the learning dynamics for the policy, generally known as the actor. To calculate the value function $V$ appearing in the gradient $\nabla_\theta r(\theta)$, we need another learning dynamic, which is called the critic. The overall policy-gradient method is then termed the actor-critic method.
We start with the Bellman Equation (15) for the value function and consider the following mean-square-error loss:
$$\mathrm{MSE}[V] = \frac{1}{2} \sum_{x} \nu(x) \left( \sum_{y} K_{\alpha_\theta}(x, y)\big[ V(y) + R_\theta(x, y) - r(\theta) \big] - V(x) \right)^2,$$
where $\nu$ is any distribution supported on $E$. $\mathrm{MSE}[V] = 0$ if and only if $V$ satisfies the Bellman Equation (15), i.e., $V$ is the value function. To learn $V$, we introduce a function approximation of the value function, $V_\psi$, with parameter $\psi$, and consider minimizing
$$\mathrm{MSE}(\psi) = \frac{1}{2} \sum_{x} \nu(x) \left( \sum_{y} K_{\alpha_\theta}(x, y)\big[ V(y) + R_\theta(x, y) - r(\theta) \big] - V_\psi(x) \right)^2$$
by the semi-gradient method ([31], Chapter 9):
$$\begin{aligned}
\nabla_\psi \mathrm{MSE}(\psi) &= -\sum_{x,y} \nu(x)\, K_{\alpha_\theta}(x, y)\big[ V(y) + R_\theta(x, y) - r(\theta) - V_\psi(x) \big]\, \nabla_\psi V_\psi(x) \\
&\approx -\sum_{x,y} \nu(x)\, K_{\alpha_\theta}(x, y)\big[ V_\psi(y) + R_\theta(x, y) - r(\theta) - V_\psi(x) \big]\, \nabla_\psi V_\psi(x).
\end{aligned}$$
Here, the term $V(y)$ is frozen first and then approximated by $V_\psi(y)$, since it can be treated as a prior guess of the value function at the future state.
Then, for the gradient descent iteration $\psi_{t+1} = \psi_t - \eta_t^\psi\, \nabla_\psi \mathrm{MSE}(\psi_t)$, where $\eta_t^\psi$ is the step size, we have the following stochastic gradient iteration:
$$\psi_{t+1} = \psi_t + \eta_t^\psi\, \delta(X_t, X_{t+1})\, \nabla_\psi V_{\psi_t}(X_t), \qquad (23)$$
where the temporal difference (TD) error $\delta$ is as defined in (22):
$$\delta_t = \delta(X_t, X_{t+1}) = R_{\theta_t}(X_t, X_{t+1}) - r(\theta_t) + V_{\psi_t}(X_{t+1}) - V_{\psi_t}(X_t).$$
Here, for the sake of simplicity, $(X_t, X_{t+1})$ are the same samples as in the actor update for $\theta_t$. This means that the distribution $\nu$ above is chosen as the stationary measure $\mu_\theta$ used for the gradient $\nabla_\theta r(\theta)$.
Next, we consider the calculation of the average reward $r(\theta)$ from the following consequence of the Bellman Equation (15):
$$\sum_{x} \mu(x) \sum_{y} K_{\alpha_\theta}(x, y)\big( R_\theta(x, y) - r(\theta) + V(y) - V(x) \big) = 0.$$
Let $r_t$ be the estimate of the average reward $r(\theta_t)$ at time $t$. We update this estimate every time a transition occurs:
$$r_{t+1} = r_t + \eta_t^r\, \delta_t, \qquad (24)$$
where $\delta_t$ is the TD error as before:
$$\delta_t = \delta(X_t, X_{t+1}) = R_{\theta_t}(X_t, X_{t+1}) - r_t + V_{\psi_t}(X_{t+1}) - V_{\psi_t}(X_t).$$
In conclusion, (21), (23) and (24) together constitute the actor-critic algorithm, which is summarized in Algorithm 1. We remark that Algorithm 1 can easily be adapted to the mini-batch gradient method, in which several copies of $(X_t, X_{t+1})$ are sampled and their average is used to update the parameters. The stationary distribution $\mu_\theta$ of $K_{\alpha_\theta}$ is sampled by running the corresponding Markov chain for several steps with a "warm start": the initial state for $\theta_{t+1}$ is set to the final state generated in the previous iteration at $\theta_t$. The length of this "burn-in" period can be set to just one step in practice for efficiency.
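To make the flow of Algorithm 1 concrete, here is a schematic Python sketch of one possible implementation for the finite-state case, using the tabular value function and softmax parametrization of Section 4; the function names are ours, and finite-difference gradients stand in for the closed-form gradients of Appendix A, so this is an illustration rather than the paper's exact pseudocode.

```python
import numpy as np

rng = np.random.default_rng(2)

def alpha_of(theta):
    """Softmax alpha_theta on m states with the last logit fixed to 0."""
    z = np.append(theta, 0.0)
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

def K_gamma(K, gamma):
    """Markovian kernel K_gamma(x, y) = K(x, y) + (1 - K(x, E)) * gamma(y)  (Definition 1)."""
    return K + np.outer(1.0 - K.sum(axis=1), gamma)

def log_K_alpha(theta, K, x, y):
    return np.log(K_gamma(K, alpha_of(theta))[x, y])

def log_K_beta(theta, K, x, y):
    a = alpha_of(theta)
    b = a @ K_gamma(K, a)                  # beta_theta = alpha_theta K_{alpha_theta}, Eq. (19)
    return np.log(K_gamma(K, b)[x, y])

def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient (stand-in for the formulas of Appendix A)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def actor_critic_qsd(K, n_steps=20_000, eta_theta=0.05, eta_psi=1e-3, eta_r=1e-3):
    m = K.shape[0]
    theta = np.zeros(m - 1)                # actor parameters (softmax logits)
    psi = np.zeros(m)                      # tabular critic V_psi
    r = 0.0                                # running estimate of the average reward
    x = rng.integers(m)
    for _ in range(n_steps):
        a = alpha_of(theta)
        Ka = K_gamma(K, a)
        y = rng.choice(m, p=Ka[x])                           # one transition of K_{alpha_theta}
        Kb = K_gamma(K, a @ Ka)
        R = -np.log(Ka[x, y] / Kb[x, y])                     # one-step reward
        delta = R - r + psi[y] - psi[x]                      # TD error (22)
        grad = delta * num_grad(lambda t: log_K_alpha(t, K, x, y), theta) \
             + num_grad(lambda t: log_K_beta(t, K, x, y), theta)          # stochastic gradient (21)
        theta = theta + eta_theta * grad                     # actor update
        psi[x] += eta_psi * delta                            # critic update (23), tabular V
        r += eta_r * delta                                   # average-reward update (24)
        x = y                                                # warm start for the next step
    return alpha_of(theta)
```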
Remark 4.
Finally, we remark on the computation of $\nabla_\theta \ln K_{\alpha_\theta}$ and $\nabla_\theta \ln K_{\beta_\theta}$ in Algorithm 1. The details are shown in Appendix A. We note that the main computational cost comes from the function $K(x, E)$, which has to be pre-computed and stored. If the problem has some special structure, this function could be approximated in parametric form. Another special case is our second example, where $K(x, E) = 1$ for all $x \in \{2, 3, \ldots, N\}$, so that $1 - K(x, E)$ vanishes at those states.

4. Numerical Experiment

In this section, we present two examples to demonstrate Algorithm 1. We refer to the algorithms (7), (8) and (9) of Section 2.2, used in [23,24], as the Vanilla Algorithm, the Projection Algorithm and the Polyak Averaging Algorithm, respectively. Let 0 be the absorbing state and $E = \{1, \ldots, N\}$ the non-absorbing states; the Markov transition matrix on $\{0, \ldots, N\}$ is denoted by the following block form:
$$\widetilde{K} = \begin{pmatrix} 1 & 0 \\ * & K \end{pmatrix},$$
where $K$ is an $N$-by-$N$ sub-Markovian matrix. For Algorithm 1, the distribution $\alpha_\theta$ on $E$ is always parameterized as
$$\alpha_\theta = \frac{1}{e^{\theta_1} + \cdots + e^{\theta_{N-1}} + 1}\,\big( e^{\theta_1}, \ldots, e^{\theta_{N-1}}, 1 \big),$$
and the value function $V_\psi(x)$ is represented in tabular form for simplicity:
$$V_\psi = [\psi_1, \ldots, \psi_N],$$
where $\psi \in \mathbb{R}^N$.

4.1. Loopy Markov Chain

We test a toy example of the three-state loopy Markov chain considered in [23,24]. The transition probability matrix for the four states $\{0, 1, 2, 3\}$ is as follows:
$$\widetilde{K} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \epsilon & \frac{1-\epsilon}{3} & \frac{1-\epsilon}{3} & \frac{1-\epsilon}{3} \\ \epsilon & \frac{1-\epsilon}{3} & \frac{1-\epsilon}{3} & \frac{1-\epsilon}{3} \\ \epsilon & \frac{1-\epsilon}{3} & \frac{1-\epsilon}{3} & \frac{1-\epsilon}{3} \end{pmatrix}, \quad \epsilon \in (0, 1).$$
State 0 is the absorbing state and $E = \{1, 2, 3\}$. $K$ is the sub-matrix of $\widetilde{K}$ corresponding to the states $\{1, 2, 3\}$. With probability $\epsilon$, the process exits $E$ directly from state 1, 2 or 3. The true quasi-stationary distribution of this example is the uniform distribution for any $\epsilon$.
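As a quick check (ours), one can build the sub-matrix K for this chain and verify numerically that its principal left eigenvector is uniform.

```python
import numpy as np

eps = 0.1
row = [(1 - eps) / 3] * 3
K = np.array([row, row, row])          # sub-Markovian part on E = {1, 2, 3}

vals, vecs = np.linalg.eig(K.T)        # left eigenvectors of K
alpha = np.abs(vecs[:, np.argmax(vals.real)].real)
alpha /= alpha.sum()
print(alpha)                           # approximately [1/3, 1/3, 1/3], for any eps in (0, 1)
```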
In order to show the advantage of our algorithm, we consider two cases: (1) $\epsilon = 0.1$ and (2) $\epsilon = 0.9$. For a larger $\epsilon$, the original Markov chain exits very quickly; thus, each iteration takes less time, but the convergence of the Vanilla Algorithm is slower.
In order to quantify the accuracy of the learned quasi-stationary distribution, we compute the L 2 norm of the error between the learned quasi-stationary distribution and the true values.
In Figure 1, we compute the QSD for $\epsilon = 0.1$. We set the initial values $\theta_0 = [1, 1]$, $\psi_0 = [0, 0, 0]$, $r_0 = 0$, the learning rates $\eta_n^\theta = \max\{1/n^{0.1}, 0.2\}$, $\eta_n^\psi = 0.0001$, $\eta_n^r = 0.0001$, and the batch size is 4. The step size for the Projection Algorithm is $\epsilon_n = n^{-0.99}$. Figure 2 shows the case $\epsilon = 0.9$. We set the initial values $\theta_0 = [4, 2]$, $\psi_0 = [0, 0, 0]$, $r_0 = 0$, the learning rates $\eta_n^\theta = 0.04$, $\eta_n^\psi = 0.0001$, $\eta_n^r = 0.0001$, and the batch size is 32. The step size for the Projection Algorithm is $\epsilon_n = n^{-0.99}$.

4.2. M/M/1/N Queue with Finite Capacity and Absorption

Our second example is an M/M/1 queue with finite queue capacity, where state 0 is set to be an absorbing state. The transition probability matrix on $\{0, \ldots, N\}$ takes the following form:
$$\widetilde{K} = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
\mu_1 & 0 & \lambda_1 & 0 & \cdots & 0 & 0 \\
0 & \mu_2 & 0 & \lambda_2 & \cdots & 0 & 0 \\
0 & 0 & \mu_3 & 0 & \ddots & 0 & 0 \\
\vdots & & & \ddots & \ddots & 0 & \lambda_{N-1} \\
0 & 0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix},$$
where $\lambda_i = \frac{\rho_i}{\rho_i + 1}$, $\mu_i = \frac{1}{\rho_i + 1}$, $i \in \{1, 2, \ldots, N-1\}$. $\rho_i > 1$ means a higher chance of jumping to the right than to the left, and a larger $\rho_i$ gives a smaller probability of exiting $E$. Note that $K(x, E) = 1$ for $x \in \{2, \ldots, N\}$. Thus, $K_\alpha(x, y) = K(x, y)$ for any $\alpha$ if $x \ne 1$, and
$$K_\alpha(1, y) = K(1, y) + \mu_1\,\alpha(y) = \begin{cases} \lambda_1 + \mu_1\,\alpha(2), & y = 2, \\ \mu_1\,\alpha(y), & y \ne 2. \end{cases}$$
Then, $R_\theta(x, y) = -\ln\frac{K_{\alpha_\theta}(x, y)}{K_{\beta_\theta}(x, y)} = 0$ if $x \ne 1$, and by (20) the gradient simplifies to
$$\nabla_\theta r(\theta) = \mathbb{E}_{Y}\Big[\big( R_\theta(1, Y) - r(\theta) + V(Y) - V(1) \big)\, \nabla_\theta \ln K_{\alpha_\theta}(1, Y) + \nabla_\theta \ln K_{\beta_\theta}(1, Y)\Big],$$
where $Y$ follows the distribution $K_{\alpha_\theta}(1, \cdot)$.
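For concreteness, a small helper of ours that assembles the sub-Markovian matrix K for this queue from a vector of ratios $\rho_i$ might look as follows.

```python
import numpy as np

def mm1_submatrix(rho):
    """Sub-Markovian matrix K on E = {1, ..., N} for the absorbed M/M/1/N queue.

    rho has length N-1 and holds the ratios rho_1, ..., rho_{N-1};
    lambda_i = rho_i / (rho_i + 1) is the right-jump probability and
    mu_i = 1 / (rho_i + 1) the left-jump probability from state i.
    """
    N = rho.size + 1
    lam = rho / (rho + 1.0)
    mu = 1.0 / (rho + 1.0)
    K = np.zeros((N, N))                 # index j corresponds to state j + 1
    for j in range(N - 1):
        K[j, j + 1] = lam[j]             # state j+1 -> state j+2
        if j >= 1:
            K[j, j - 1] = mu[j]          # state j+1 -> state j (inside E only for j >= 1)
    K[N - 1, N - 2] = 1.0                # state N moves to N-1 with probability 1
    # Row 0 (state 1) sums to lambda_1 < 1: the missing mass mu_1 is the killing probability.
    return K

K = mm1_submatrix(np.full(499, 1.25))        # constant rho_i = 1.25 with N = 500
print(K[0].sum(), K[1].sum(), K[-1].sum())   # ~0.556, 1.0, 1.0
```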
We consider two cases: (1) a constant $\rho_i \equiv 1.25$ and (2) a state-dependent $\rho_i = 2 - \frac{3}{2N-4}(i-1)$. Note that $\rho_i = 1$ gives equal probabilities of jumping to the left and to the right. Thus, in case (1) there is a boundary layer at the right end, and in case (2) we expect to see a peak of the QSD near $i \approx 2N/3$. Figure 3 shows the true QSD in both cases. We set $N = 500$.
In Figure 4, we consider the case $\rho_i = 1.25$ and compute the $L_2$ errors. We set the initial values $\theta_0^i = -35 + \frac{35}{498}(i-1)$ for $i \in \{1, 2, \ldots, 498\}$ and $\theta_0^{499} = 3$, $\psi_0 = [0, 0, \ldots, 0]$, $r_0 = 0$, the learning rates $\eta_n^\theta = 0.0003$, $\eta_n^\psi = 0.0001$, $\eta_n^r = 0.0001$, and the batch size is 64. The step size for the Projection Algorithm is $\epsilon_n = n^{-0.95}$. Figure 5 plots the errors for the state-dependent $\rho_i = 2 - \frac{3}{2N-4}(i-1)$. We set the initial values $\theta_0^i = 8 + \frac{35}{250}(i-1)$ for $i \in \{1, 2, \ldots, 250\}$, $\theta_0^{251} = 44$, $\theta_0^i = 43$ for $i \in \{252, \ldots, 305\}$, $\theta_0^{306} = 48$, $\theta_0^{307} = 42$ and $\theta_0^i = 43 - \frac{38}{293}(i-1)$ for $i \in \{308, 309, \ldots, 499\}$, $\psi_0 = [0, 0, \ldots, 0]$, $r_0 = 0$, the learning rates $\eta_n^\theta = 0.0002$, $\eta_n^\psi = 0.0001$, $\eta_n^r = 0.0001$, and the batch size is 128. The step size for the Projection Algorithm is $\epsilon_n = n^{-0.95}$. Both figures demonstrate that the actor-critic algorithm performs quite well on this example.
In Table 1, we compare the CPU time of each algorithm for the M/M/1/500 queue when it reaches an accuracy of $2 \times 10^{-1}$. We find that our algorithm costs the least time on this example.

5. Summary and Conclusions

In this paper, we propose a reinforcement learning (RL) method for the quasi-stationary distribution (QSD) of discrete time finite-state Markov chains. By minimizing the KL-divergence between two Markovian path distributions induced by the candidate distribution and the true target distribution, we introduce an RL formulation and derive the corresponding policy gradient theorem. We devise an actor-critic algorithm to learn the QSD in its parameterized form $\alpha_\theta$. This RL formulation can benefit from developments in RL methods and optimization theory. We illustrated our actor-critic method on numerical examples using a simple tabular parametrization and gradient descent optimization. We observed that the performance advantage of our method is more prominent for large scale problems.
We have only demonstrated the basic mechanism of the idea here, and there is much room for improving efficiency and for extensions in future work. The generalization from the current setting of finite-state Markov chains to jump Markov processes and diffusions is under consideration. More importantly, for very large or high dimensional state spaces, modern function approximation methods such as kernel methods or neural networks should be used for the distribution $\alpha_\theta$ and the value function $V_\psi$. The recent tremendous advancement of optimization techniques for policy gradients in reinforcement learning could also contribute much to improving the efficiency of our current formulation.

Author Contributions

Conceptualization, Z.C. and L.L.; Investigation, Z.C.; Methodology, Z.C. and X.Z.; Software, Z.C.; Supervision, X.Z.; Writing—original draft, Z.C.; Writing—review & editing, L.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Government of Hong Kong (Grant Number 11305318) and the NSFC (Grant Number 11871486).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

L.L. acknowledges the support of NSFC 11871486. X.Z. acknowledges the support of Hong Kong RGC GRF 11305318.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this appendix, we discuss the computation of the gradients $\nabla_\theta \ln K_{\alpha_\theta}$ and $\nabla_\theta \ln K_{\beta_\theta}$. Note that $\nabla_\theta \alpha_\theta$ is straightforward since we model $\alpha$ directly in its parametric form $\alpha_\theta$. By Definition 1 and (4), we have the following:
$$\nabla_\theta \ln K_{\alpha_\theta}(X_t, X_{t+1}) = \frac{1 - K(X_t, E)}{K_{\alpha_\theta}(X_t, X_{t+1})}\, \nabla_\theta\, \alpha_\theta(X_{t+1}),$$
and similarly:
$$\nabla_\theta \ln K_{\beta_\theta}(X_t, X_{t+1}) = \frac{1 - K(X_t, E)}{K_{\beta_\theta}(X_t, X_{t+1})}\, \nabla_\theta\, \beta_\theta(X_{t+1}),$$
where $K_\beta(x, y) = K(x, y) + (1 - K(x, E))\,\beta(y)$. The vector $K(x, E)$ for all $x$ can be pre-computed and saved in tabular form.
By (19), the one-step distribution $\beta$ is computed as follows:
$$\beta(X_{t+1}) = \sum_{x} \alpha(x)\Big[ K(x, X_{t+1}) + \big(1 - K(x, E)\big)\,\alpha(X_{t+1}) \Big] \approx \frac{1}{n}\sum_{i=1}^{n}\Big[ K(Z_i, X_{t+1}) + \big(1 - K(Z_i, E)\big)\,\alpha(X_{t+1}) \Big],$$
where the samples $Z_i \sim \alpha$. Since $\alpha$ can be approximated by the stationary distribution $\mu$, one may simply use the current sample $X_t$ in place of $Z_i$ with $n = 1$.
To find $\nabla_\theta \beta_\theta$, we use the stochastic approximation again:
$$\begin{aligned}
\nabla_\theta \beta_\theta(X_{t+1}) &= \sum_{x} \nabla_\theta \alpha_\theta(x)\Big[ K(x, X_{t+1}) + \big(1 - K(x, E)\big)\,\alpha_\theta(X_{t+1}) \Big] + \sum_{x} \alpha_\theta(x)\big(1 - K(x, E)\big)\, \nabla_\theta \alpha_\theta(X_{t+1}) \\
&\approx \frac{1}{n}\sum_{i=1}^{n}\Big[ \nabla_\theta \ln \alpha_\theta(Z_i)\Big( K(Z_i, X_{t+1}) + \big(1 - K(Z_i, E)\big)\,\alpha_\theta(X_{t+1}) \Big) + \big(1 - K(Z_i, E)\big)\, \nabla_\theta \alpha_\theta(X_{t+1}) \Big].
\end{aligned}$$
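As a sanity-check sketch (ours), the first formula above combines with the softmax gradient of the tabular parametrization as follows; $\nabla_\theta \ln K_{\beta_\theta}$ can be assembled in the same way from the Monte Carlo estimates of $\beta$ and $\nabla_\theta \beta_\theta$.

```python
import numpy as np

def softmax_alpha(theta):
    z = np.append(theta, 0.0)
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

def grad_alpha(theta, y):
    """Gradient of alpha_theta(y) w.r.t. the free logits: alpha(y) * (1{j = y} - alpha(j))."""
    a = softmax_alpha(theta)
    g = -a[y] * a[:-1]
    if y < theta.size:
        g[y] += a[y]
    return g

def grad_log_K_alpha(theta, K, x, y):
    """Closed form above: (1 - K(x, E)) * grad alpha_theta(y) / K_{alpha_theta}(x, y)."""
    a = softmax_alpha(theta)
    kill = 1.0 - K[x].sum()                    # 1 - K(x, E), pre-computable for all x
    return kill * grad_alpha(theta, y) / (K[x, y] + kill * a[y])
```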

References

  1. Collet, P.; Martínez, S.; Martín, J.S. Quasi-Stationary Distributions: Markov Chains, Diffusions and Dynamical Systems; Springer Science & Business Media: Cham, Switzerland, 2012. [Google Scholar]
  2. Buckley, F.; Pollett, P. Analytical methods for a stochastic mainland–island metapopulation model. Ecol. Model. 2010, 221, 2526–2530. [Google Scholar] [CrossRef]
  3. Lambert, A. Population dynamics and random genealogies. Stoch. Model. 2008, 24, 45–163. [Google Scholar] [CrossRef]
  4. De Oliveira, M.M.; Dickman, R. Quasi-stationary distributions for models of heterogeneous catalysis. Phys. Stat. Mech. Appl. 2004, 343, 525–542. [Google Scholar] [CrossRef] [Green Version]
  5. Dykman, M.I.; Horita, T.; Ross, J. Statistical distribution and stochastic resonance in a periodically driven chemical system. J. Chem. Phys. 1995, 103, 966–972. [Google Scholar] [CrossRef] [Green Version]
  6. Artalejo, J.R.; Economou, A.; Lopez-Herrero, M.J. Stochastic epidemic models with random environment: Quasi-stationarity, extinction and final size. J. Math. Biol. 2013, 67, 799–831. [Google Scholar] [CrossRef] [PubMed]
  7. Clancy, D.; Mendy, S.T. Approximating the quasi-stationary distribution of the sis model for endemic infection. Methodol. Comput. Appl. Probab. 2011, 13, 603–618. [Google Scholar] [CrossRef]
  8. Sani, A.; Kroese, D.; Pollett, P. Stochastic models for the spread of hiv in a mobile heterosexual population. Math. Biosci. 2007, 208, 98–124. [Google Scholar] [CrossRef]
  9. Chan, D.C.; Pollett, P.K.; Weinstein, M.C. Quantitative risk stratification in markov chains with limiting conditional distributions. Med. Decis. Mak. 2009, 29, 532–540. [Google Scholar] [CrossRef]
  10. Berglund, N.; Landon, D. Mixed-mode oscillations and interspike interval statistics in the stochastic fitzhugh–nagumo model. Nonlinearity 2012, 25, 2303. [Google Scholar] [CrossRef] [Green Version]
  11. Landon, D. Perturbation et Excitabilité Dans des Modeles Stochastiques de Transmission de l’Influx Nerveux. Ph.D. Thesis, Université d’Orléans, Orléans, France, 2012. [Google Scholar]
  12. Gesù, G.D.; Lelièvre, T.; Peutrec, D.L.; Nectoux, B. Jump markov models and transition state theory: The quasi-stationary distribution approach. Faraday Discuss. 2017, 195, 469–495. [Google Scholar] [CrossRef] [Green Version]
  13. Lelièvre, T.; Nier, F. Low temperature asymptotics for quasistationary distributions in a bounded domain. Anal. PDE 2015, 8, 561–628. [Google Scholar] [CrossRef]
  14. Pollock, M.; Fearnhead, P.; Johansen, A.M.; Roberts, G.O. The scalable langevin exact algorithm: Bayesian inference for big data. arXiv 2016, arXiv:1609.03436. [Google Scholar]
  15. Wang, A.Q.; Roberts, G.O.; Steinsaltz, D. An approximation scheme for quasi-stationary distributions of killed diffusions. Stoch. Process. Appl. 2020, 130, 3193–3219. [Google Scholar] [CrossRef]
  16. Watkins, D.S. Fundamentals of Matrix Computations; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 64. [Google Scholar]
  17. Bebbington, M. Parallel implementation of an aggregation/disaggregation method for evaluating quasi-stationary behavior in continuous-time markov chains. Parallel Comput. 1997, 23, 1545–1559. [Google Scholar] [CrossRef]
  18. Pollett, P.; Stewart, D. An efficient procedure for computing quasi-stationary distributions of markov chains by sparse transition structure. Adv. Appl. Probab. 1994, 26, 68–79. [Google Scholar] [CrossRef]
  19. Martinez, S.; Martin, J.S. Quasi-stationary distributions for a brownian motion with drift and associated limit laws. J. Appl. Probab. 1994, 31, 911–920. [Google Scholar] [CrossRef] [Green Version]
  20. Aldous, D.; Flannery, B.; Palacios, J.L. Two applications of urn processes the fringe analysis of search trees and the simulation of quasi-stationary distributions of markov chains. Probab. Eng. Inform. Sci. 1988, 2, 293–307. [Google Scholar] [CrossRef] [Green Version]
  21. Benaïm, M.; Cloez, B. A stochastic approximation approach to quasi-stationary distributions on finite spaces. Electron. Commun. Probab. 2015, 20, 1–13. [Google Scholar] [CrossRef]
  22. De Oliveira, M.M.; Dickman, R. How to simulate the quasistationary state. Phys. Rev. E 2005, 71, 016129. [Google Scholar] [CrossRef] [Green Version]
  23. Blanchet, J.; Glynn, P.; Zheng, S. Analysis of a stochastic approximation algorithm for computing quasi-stationary distributions. Adv. Appl. Probab. 2016, 48, 792–811. [Google Scholar] [CrossRef] [Green Version]
  24. Zheng, S. Stochastic Approximation Algorithms in the Estimation of Quasi-Stationary Distribution of Finite and General State Space Markov Chains. Ph.D. Thesis, Columbia University, New York, NY, USA, 2014. [Google Scholar]
  25. Kushner, H.; Yin, G.G. Stochastic Approximation and Recursive Algorithms and Applications; Springer Science & Business Media: Cham, Switzerland, 2003; Volume 35. [Google Scholar]
  26. Polyak, B.T.; Juditsky, A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 1992, 30, 838–855. [Google Scholar] [CrossRef]
  27. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef] [Green Version]
  28. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An Introduction to Variational Methods for Graphical Models. Mach. Learn. 1999, 37, 183–233. [Google Scholar] [CrossRef]
  29. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  30. Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Bach, F., Blei, D., Eds.; Volume 37, pp. 1530–1538. [Google Scholar]
  31. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  32. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  33. Popova, M.; Isayev, O.; Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 2018, 4, eaap7885. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  35. Rose, D.C.; Mair, J.F.; Garrahan, J.P. A reinforcement learning approach to rare trajectory sampling. New J. Phys. 2021, 23, 013013. [Google Scholar] [CrossRef]
  36. Méléard, S.; Villemonais, D. Quasi-stationary distributions and population processes. Probab. Surv. 2012, 9, 340–410. [Google Scholar] [CrossRef]
  37. Blanchet, J.; Glynn, P.; Zheng, S. Empirical analysis of a stochastic approximation approach for computing quasi-stationary distributions. In EVOLVE—A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation II; Schütze, O., Coello, C.A.C., Tantar, A.-A., Tantar, E., Bouvry, P., Moral, P.D., Legrand, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 19–37. [Google Scholar]
  38. Boyd, S.; Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  39. Wang, W.; Carreira-Perpinán, M.A. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv 2013, arXiv:1309.1541. [Google Scholar]
Figure 1. The loopy Markov chain example with $\epsilon = 0.1$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d). One iteration of the actor-critic algorithm is one step of gradient descent ("t" in Algorithm 1).
Figure 2. The loopy Markov chain example with $\epsilon = 0.9$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).
Figure 3. The QSD for the M/M/1/500 queue with $\rho_i \equiv 1.25$ (left) and $\rho_i = 2 - \frac{3}{2N-4}(i-1)$ (right).
Figure 4. The M/M/1/500 queue with $\rho_i = 1.25$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).
Figure 5. The M/M/1/500 queue with $\rho_i = 2 - \frac{3}{2N-4}(i-1)$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).
Table 1. The CPU time of each algorithm for the M/M/1/500 queue when it reaches an accuracy of $2 \times 10^{-1}$.

Algorithm | Vanilla   | Projection | Polyak Averaging | ac-$\alpha$
Time (s)  | 1038.3279 | 429.6304   | 505.2299         | 186.9280
Time (s)  | 753.9503  | 259.0671   | 268.5476         | 251.5370
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
