Learning Quasi-Stationary Distributions of Finite State Markov Chains

We propose a reinforcement learning (RL) approach to compute the expression of the quasi-stationary distribution. Based on the fixed-point formulation of the quasi-stationary distribution, we minimize the KL-divergence between two Markovian path distributions induced by the candidate distribution and the true target distribution. To solve this challenging minimization problem by gradient descent, we apply a reinforcement learning technique by introducing reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. Numerical examples of finite state Markov chains are tested to demonstrate the new method.


Introduction
The quasi-stationary distribution (QSD) describes the long-time statistical behavior of a stochastic process that will surely be killed, when this process is conditioned to survive [13]. This concept has found wide application, for instance in biology and ecology [10,20], chemical kinetics [14,17], epidemics [2,12,33], medicine [11] and neuroscience [5,21]. Many works on rare events in meta-stable systems also focus on this quasi-stationary distribution [16,22]. In addition, some new Monte Carlo sampling methods, for instance the quasi-stationary Monte Carlo method [28,36], arise by using the QSD instead of the true stationary distribution.
We are interested in the numerical computation of the QSD and focus on finite state Markov chains in this paper. Mathematically, the quasi-stationary distribution can be found as the principal left eigenvector of a sub-Markovian transition matrix. Traditional numerical linear algebra methods can therefore be applied to compute the quasi-stationary distribution on a finite state space, for example the power method [38], the multi-grid method [3] and Arnoldi's algorithm [27]. These eigenvector methods produce the probability vector of the QSD, rather than generating samples from the QSD.
In search of efficient algorithms for large state spaces, stochastic approaches favor either sampling the QSD or computing its expression, and these methods can be applied or extended easily to continuous state spaces as well. A popular approach for sampling the quasi-stationary distribution is the Fleming-Viot stochastic method [24]. The Fleming-Viot method simulates N particles independently. When any one of the particles falls into the absorbing state and gets killed, a new particle is uniformly selected from the remaining N − 1 surviving particles to replace the dead one, and the simulation continues. As the time and N tend to infinity, the particles' empirical distribution converges to the quasi-stationary distribution.
In [1,4,15], the authors proposed to recursively update the expression of the QSD at each iteration based on the empirical distribution of a single-particle simulation. It is shown in [4] that the convergence rate can be O(n^{−1/2}), where n is the iteration number. This method is later improved in [7,39] by applying the stochastic approximation method [19] and the Polyak-Ruppert averaging technique [29]. These improved algorithms allow a flexible choice of step size but need a projection operator onto the probability simplex, which carries extra computational overhead that grows with the number of states. [36] extends the algorithm to diffusion processes.
In this paper, we focus on how to compute the expression of the quasi-stationary distribution, which is denoted by α(x) on a metric space E. If E is finite, α is a probability vector; if E is a domain in R^d, then α is a probability density function on E. We assume α can be numerically represented in parametric form α_θ, θ ∈ Θ. This family {α_θ} can be in tabular form or given by any neural network. Then the problem of finding the QSD α becomes how to compute the optimal parameter θ in Θ. We call this problem the learning problem for the QSD. In addition, we want to learn the QSD directly, not to use the distribution family {α_θ} to fit samples generated by other traditional simulation methods.
Our minimization problem for the QSD is similar to variational inference (VI) [8], which minimizes an objective functional measuring the distance between the target and candidate distributions. However, unlike mainstream VI methods such as the evidence lower bound (ELBO) technique [18] or particle-based [23] and flow-based [31] methods, our approach is based on recent progress in reinforcement learning (RL) [35], particularly the policy gradient method and the actor-critic algorithm. We regard the learning process for the quasi-stationary distribution as an interaction with an environment which is constructed from the defining property of the QSD. Reinforcement learning has recently shown tremendous advancements and remarkable successes in applications (e.g. [26,34,30]). The RL framework provides an innovative and powerful modelling and computation approach for many scientific computing problems.
The essential question is how to formulate the QSD problem as an RL problem. Firstly, for the sub-Markovian kernel K of a Markov process, we can define a Markovian kernel K_α on E (see Definition 2.1); the QSD is then characterized by the equation α = αK_α, which says that if α is the initial distribution, it is also the distribution after one step. Secondly, we seek an optimal α (in our parametric family of distributions) to minimize the Kullback-Leibler divergence (i.e., relative entropy) between two path distributions, denoted by P and Q, associated with the two Markovian kernels K_α and K_β where β := αK_α. Thirdly, inspired by the recent work [32] on using RL for rare-event sampling problems, we transform the minimization of the KL divergence between P and Q into the maximization of a time-averaged reward and define the corresponding value function V(x) at each state x. This completes our RL formulation of the quasi-stationary distribution problem. Lastly, we derive the policy gradient theorem (Theorem 2) to compute the gradient of the averaged reward with respect to θ, which drives the learning dynamics; this is known as the "actor" part. The "critic" part is to learn the value function V in its parametric form V_ψ. The actor-critic algorithm uses stochastic gradient descent to train the parameter θ for the action α_θ and the parameter ψ for the value function V_ψ. See Algorithm 1.
Our main contribution is a method that transforms the QSD problem into an RL problem. Similar to [32], our paper also uses the KL-divergence to define the RL problem. However, our paper fully exploits the defining property of the QSD, namely the fixed-point equation α = αK_α, to set up the RL problem.
Our learning method allows flexible parametrizations of the distribution and uses the stochastic gradient method to train the optimal distribution. The optimization is easy to implement and scales up to large state spaces. The numerical examples we tested show that our method converges faster than other existing methods [15,7].
Finally, we remark that our method works very well for the QSD of a strictly sub-Markovian kernel K, but is not applicable to computing the invariant distribution when K is Markovian. This is because we transform the problem into a variational problem between two Markovian kernels K_α and K_β (where β = αK_α). Note that K_α(x, y) = K(x, y) + (1 − K(x, E))α(y) (Definition 2.1) and our method is based on the fact that α = β if and only if K_α = K_β. If K is a Markovian kernel, then K_α ≡ K for any α, and our method cannot work. So K(x, E) has to be strictly less than 1 for some x ∈ E.
This paper is organized as follows. Section 2 is a short review of the quasi-stationary distribution and some basic simulation methods for the QSD. In Section 3, we first formulate the reinforcement learning problem via the KL-divergence and derive the policy gradient theorem (Theorem 2). Using this formulation, we then develop the actor-critic algorithm to estimate the quasi-stationary distribution. In Section 4, the efficiency of our algorithm is illustrated by numerical examples compared with the simulation methods in [39].

Problem Setup and Review
2.1. Quasi-stationary Distribution. We start with an abstract setting. Let E be a finite state space equipped with the Borel σ-field B(E), and let P(E) be the space of probability measures on E. A sub-Markovian kernel on E is a map K : E × B(E) → [0, 1] such that for all x ∈ E, A ↦ K(x, A) is a nonzero measure with K(x, E) ≤ 1, and for all A ∈ B(E), x ↦ K(x, A) is measurable. In particular, if K(x, E) = 1 for all x ∈ E, then K is called a Markovian kernel. Throughout the paper, we assume that K is strictly sub-Markovian, i.e., K(x, E) < 1 for some x.
Let X_t be a Markov chain with values in E ∪ {∂}, where ∂ ∉ E denotes an absorbing state. We define the extinction time
τ_∂ := inf{ t ≥ 0 : X_t = ∂ }.
We define the quasi-stationary distribution (QSD) α as the long-time limit of the conditional distribution: α is a QSD if there exists a probability distribution ν on E such that
lim_{t→∞} P_ν( X_t ∈ A | τ_∂ > t ) = α(A), for all A ∈ B(E),   (1)
where P_ν refers to the probability distribution of X_t associated with the initial distribution ν on E. Such a conditional distribution describes the behavior of the process before extinction, and it is easy to see that α satisfies the following fixed-point problem
α(A) = P_α( X_1 ∈ A | τ_∂ > 1 ), for all A ∈ B(E),   (2)
where P_α refers to the probability distribution of X_t associated with the initial distribution α on E. (2) is equivalent to the stationary condition
αK = (αK1) α,   (3)
where α is a row vector and 1 denotes the column vector with all entries being one. For any sub-Markovian kernel K, we can associate with K a Markovian kernel K̄ on E ∪ {∂} defined by
K̄(x, A) = K(x, A),  K̄(x, {∂}) = 1 − K(x, E),  K̄(∂, {∂}) = 1,
for all x ∈ E, A ∈ B(E). The kernel K̄ can be understood as the Markovian transition kernel of the Markov chain (X_t) on E ∪ {∂} whose transitions within E are specified by K, but which is "killed" forever once it leaves E.
In this paper, we assume E is a finite state space and the process under consideration has a unique QSD. Assuming that K is irreducible, the existence and uniqueness of the quasi-stationary distribution follow from the Perron-Frobenius theorem [25].
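For a concrete illustration of this linear-algebra characterization, the following minimal sketch (with a hypothetical 3-state sub-Markovian matrix, not taken from the paper) computes the QSD as the normalized principal left eigenvector of K and checks the stationary condition (3).

```python
# Sketch: QSD of a small sub-Markovian matrix via its principal left eigenvector.
import numpy as np

# Hypothetical sub-Markovian matrix: only the first row has a deficit (absorption).
K = np.array([[0.25, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.0, 0.5, 0.5]])

eigvals, left_vecs = np.linalg.eig(K.T)     # columns are left eigenvectors of K
i = np.argmax(eigvals.real)                 # Perron-Frobenius (principal) eigenvalue
alpha = np.abs(left_vecs[:, i].real)
alpha /= alpha.sum()                        # normalize to a probability vector

lam = alpha @ K @ np.ones(3)                # survival probability alpha K 1
print("QSD alpha:", alpha)
print("check alpha K = (alpha K 1) alpha:", np.allclose(alpha @ K, lam * alpha))
```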
An important Markovian kernel is the following K_α, which is defined on E only and has a "regenerative probability" α.

Definition 2.1. For any given α ∈ P(E) and a sub-Markovian kernel K on E, we define K_α, a Markovian kernel on E, by
K_α(x, A) = K(x, A) + (1 − K(x, E)) α(A),   (4)
for all x ∈ E and A ∈ B(E).
K_α is a Markovian kernel because K_α(x, E) = 1. It is easy to sample X_{t+1} ∼ K_α(X_t, ·) from any state X_t ∈ E: run the transition as normal by using K̄ to obtain a next state, denoted by Y; if Y ∈ E, set X_{t+1} = Y, and if Y = ∂ (the chain is killed), regenerate by sampling X_{t+1} ∼ α. We know that α is the quasi-stationary distribution of K if and only if it is the stationary distribution of K_α, i.e.,
α = αK_α.   (5)
It is easy to see that α = β if and only if K_α = K_β, for any two distributions α and β. Also, for every α′, K_{α′} has a unique invariant probability, denoted by Γ(α′). The map α′ ↦ Γ(α′) is continuous on P(E) (for the topology of weak convergence) and there exists α ∈ P(E) such that α = Γ(α) or, equivalently, α is a QSD for K.
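The regenerated kernel K_α of Definition 2.1 and the one-step sampling procedure just described can be sketched as follows; the matrix K, the distribution α and the helper names are illustrative assumptions, not part of the paper.

```python
# Sketch: K_alpha(x, y) = K(x, y) + (1 - K(x, E)) * alpha(y), and one-step sampling.
import numpy as np

def K_alpha(K, alpha):
    kill = 1.0 - K.sum(axis=1)            # 1 - K(x, E): killing probability from x
    return K + np.outer(kill, alpha)      # rows now sum to one (Markovian)

def step(K, alpha, x, rng):
    """Sample X_{t+1} ~ K_alpha(x, .) without forming the full matrix."""
    probs = np.append(K[x], 1.0 - K[x].sum())   # last entry: jump to the absorbing state
    y = rng.choice(len(probs), p=probs)
    if y == len(probs) - 1:                     # killed: regenerate from alpha
        return rng.choice(len(alpha), p=alpha)
    return y

rng = np.random.default_rng(0)
K = np.array([[0.25, 0.5, 0.0], [0.25, 0.25, 0.5], [0.0, 0.5, 0.5]])
alpha = np.array([0.2, 0.5, 0.3])
assert np.allclose(K_alpha(K, alpha).sum(axis=1), 1.0)
print(step(K, alpha, x=0, rng=rng))
```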

2.2. Review of simulation methods for the quasi-stationary distribution. According to the above subsection, the QSD α satisfies the fixed-point problem
α = Γ(α),   (6)
where Γ(α) is the stationary distribution of K_α on E. In general, (6) can be solved recursively by α_{n+1} ← Γ(α_n). The Fleming-Viot (FV) method [24] evolves N particles, each moving independently as the absorbed Markov chain (with kernel K̄), until one of them jumps to the absorbing state ∂. At that time, the killed particle is immediately reset to E at a state chosen uniformly from one of the remaining N − 1 particles. The QSD α is approximated by the empirical distribution of the N particles, and these particles can be regarded as samples from the quasi-stationary distribution α, much as in MCMC methods.
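A minimal sketch of the fixed-point iteration α_{n+1} = Γ(α_n) for a small finite state space is given below; here Γ(α) is computed exactly by a linear solve (feasible only for small state spaces), and the example matrix is a hypothetical one.

```python
# Sketch: fixed-point iteration alpha_{n+1} = Gamma(alpha_n) from (6).
import numpy as np

def stationary(P):
    """Stationary distribution of a Markovian matrix P: solve mu(P - I) = 0, sum(mu) = 1."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def fixed_point_qsd(K, n_iter=100):
    m = K.shape[0]
    alpha = np.full(m, 1.0 / m)                          # start from the uniform distribution
    kill = 1.0 - K.sum(axis=1)
    for _ in range(n_iter):
        alpha = stationary(K + np.outer(kill, alpha))    # alpha <- Gamma(alpha)
    return alpha

K = np.array([[0.25, 0.5, 0.0], [0.25, 0.25, 0.5], [0.0, 0.5, 0.5]])
print(fixed_point_qsd(K))
```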
[6] proposed a simulation method that uses only one particle at each iteration to update α. At iteration n, given α_n ∈ P(E), one runs the discrete-time Markov chain X^{(n+1)} as normal on {∂} ∪ E with initial X_0^{(n+1)} ∼ α_n until its extinction time τ^{(n+1)}, and α_{n+1} is computed as the following weighted average of the empirical distributions of all trajectories so far:
α_{n+1}(x) = ( Σ_{j=1}^{n+1} Σ_{t=0}^{τ^{(j)}−1} I(X_t^{(j)} = x) ) / ( Σ_{j=1}^{n+1} τ^{(j)} ),   (7)
where n ≥ 0, I is the indicator function, and τ^{(j)} is the extinction time of the trajectory X^{(j)}. This iterative scheme has the convergence rate O(n^{−1/2}). In [7,39], the above method is extended to the stochastic approximation framework
α_{n+1} = Θ_H( α_n + ǫ_n ( ê_{n+1} − α_n ) ),   (8)
where ê_{n+1} is the normalized empirical occupation distribution of the (n+1)-th trajectory, Θ_H denotes the L^2 projection onto the probability simplex, and ǫ_n is a step size satisfying Σ_n ǫ_n = ∞ and Σ_n ǫ_n^2 < ∞ [7,39]. If the Polyak-Ruppert averaging technique is applied to generate
ν_n = (1/n) Σ_{j=1}^{n} α_j,   (9)
then the convergence rate of ν_n → α becomes 1/√n [7,39]. The simulation schemes (7) and (8) need to sample the initial states according to α_n and to add the empirical distribution and α_n pointwise at each x. So they are suitable for a finite state space where α is a probability vector stored in tabular form. In (8) there is no need to record all exit times τ^{(j)}, j = 1, ..., n, but the additional projection operation in (8) is computationally expensive since its cost is O(m log m), where m = |E| [9,37].
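The single-particle scheme (7) can be sketched as follows, under the reconstruction above: each trajectory is started from the current estimate α_n, run until absorption, and the visit counts are accumulated. The helper names and the test matrix are illustrative, not from [6].

```python
# Sketch: scheme (7), occupation measures weighted by the extinction times.
import numpy as np

def vanilla_qsd(K, n_traj=5000, seed=0):
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    counts = np.zeros(m)                  # total visits to each state before extinction
    alpha = np.full(m, 1.0 / m)
    for _ in range(n_traj):
        x = rng.choice(m, p=alpha)        # X_0 ~ alpha_n
        while True:
            counts[x] += 1.0
            probs = np.append(K[x], 1.0 - K[x].sum())
            y = rng.choice(m + 1, p=probs)
            if y == m:                    # absorbed: the trajectory ends
                break
            x = y
        alpha = counts / counts.sum()     # alpha_{n+1}: weighted average of empirical distributions
    return alpha

K = np.array([[0.25, 0.5, 0.0], [0.25, 0.25, 0.5], [0.0, 0.5, 0.5]])
print(vanilla_qsd(K))
```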

Learn Quasi-stationary Distribution
We focus on computing the expression of the quasi-stationary distribution. In particular, when this distribution is parametrized by θ, we can go beyond the tabular form for finite-state Markov chains to flexible representations, even neural-network representations of probability density functions on R^d. We do not pursue this representation and expressivity issue here, and restrict our discussion to finite state spaces in order to illustrate the main idea first. On a finite state space E = {1, ..., m}, α(x) can simply be described by a softmax function with m − 1 free parameters θ_i: α(i) ∝ e^{θ_i}, 1 ≤ i ≤ m, with θ_m = 0. This introduces no representation error. For the generalization to continuous spaces E in jump and diffusion processes, or even for a huge finite state space, a good representation of α_θ(x) is important in practice.
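A sketch of this tabular softmax parametrization and of its Jacobian (which the gradient formulas of Section 3.2 will need) might look like the following; the function names are ours, not the paper's.

```python
# Sketch: softmax parametrization alpha_theta with the last logit fixed to zero.
import numpy as np

def alpha_from_theta(theta):
    logits = np.append(theta, 0.0)        # theta_m = 0 removes the redundant degree of freedom
    w = np.exp(logits - logits.max())     # subtract the max for numerical stability
    return w / w.sum()

def alpha_jacobian(theta):
    """J[x, k] = d alpha(x) / d theta_k for the softmax parametrization."""
    a = alpha_from_theta(theta)
    J = np.diag(a) - np.outer(a, a)       # full softmax Jacobian (m x m)
    return J[:, :-1]                      # drop the column of the fixed logit theta_m

theta = np.array([0.5, -1.0])
print(alpha_from_theta(theta), alpha_jacobian(theta).shape)
```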
In this section, we formulate the QSD problem in terms of reinforcement learning (RL) so that the problem of seeking the optimal parameter becomes a policy optimization problem. We derive the policy gradient theorem to construct a gradient method for the optimal parameter. We then show how to design the actor-critic algorithm based on stochastic optimization.
3.1. Formulation of RL and the Policy Gradient Theorem. Before introducing the RL method for our QSD problem, we develop a general formulation in terms of RL by introducing the KL-divergence between two path distributions.
Let P_θ and Q_θ be two families of Markovian kernels on E in parametric form with the same set of parameters θ ∈ Θ. Assume both P_θ and Q_θ are ergodic for any θ. Let T > 0 and denote a path up to time T by ω_0^T = (X_0, X_1, ..., X_T) ∈ E^{T+1}. Define the path distributions under the Markov chain kernels P_θ and Q_θ, respectively, for a path started at a fixed X_0 = x_0:
P_θ(ω_0^T) := Π_{t=0}^{T−1} P_θ(X_{t+1} | X_t),  Q_θ(ω_0^T) := Π_{t=0}^{T−1} Q_θ(X_{t+1} | X_t).   (10)
The function
R_θ(x, y) := ln ( Q_θ(y | x) / P_θ(y | x) )
is called the (one-step) reward. Define the average reward r(θ) as the time-averaged negative KL divergence in the limit T → ∞:
r(θ) := lim_{T→∞} (1/T) E_{P_θ}[ Σ_{t=0}^{T−1} R_θ(X_t, X_{t+1}) ] = − lim_{T→∞} (1/T) D_KL( P_θ(ω_0^T) ‖ Q_θ(ω_0^T) ).   (11)
Due to the ergodicity of P_θ, r(θ) = Σ_{x_0, x_1} R_θ(x_0, x_1) P_θ(x_1 | x_0) μ_θ(x_0), where μ_θ is the invariant measure of P_θ; hence r(θ) is independent of the initial state X_0. Obviously r(θ) ≤ 0 for any θ.
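To make these definitions concrete, the following sketch computes the one-step reward R_θ and the average reward r(θ) exactly for two given finite Markovian matrices; the 2-state kernels stand in for P_θ and Q_θ and are hypothetical.

```python
# Sketch: average reward r = sum_{x,y} mu(x) P(y|x) ln(Q(y|x)/P(y|x)).
import numpy as np

def stationary(P):
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    return np.linalg.lstsq(A, np.append(np.zeros(m), 1.0), rcond=None)[0]

def average_reward(P, Q):
    """Assumes P and Q have the same support (strictly positive entries here)."""
    mu = stationary(P)
    R = np.log(Q / P)                       # one-step reward R(x, y)
    return float(np.sum(mu[:, None] * P * R))

P = np.array([[0.2, 0.8], [0.5, 0.5]])      # hypothetical P_theta
Q = np.array([[0.3, 0.7], [0.5, 0.5]])      # hypothetical Q_theta
print(average_reward(P, Q))                 # negative; it would be 0 if Q equaled P
```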

So we have P_θ = Q_θ (μ_θ-a.e.) if and only if r(θ) = 0; that is, the average reward attains its maximal value 0 exactly when the two kernels coincide.
The above property establishes the relationship between the RL problem and QSD problem.
We now present our main theoretical result, which is the foundation of the algorithm developed later. This theorem can be regarded as a version of the policy gradient theorem from reinforcement learning [35].
Define the (differential) value function ([35], Chapter 13):
V(x) := E_{P_θ}[ Σ_{t=0}^{∞} ( R_θ(X_t, X_{t+1}) − r(θ) ) | X_0 = x ].
Certainly, V also depends on θ, though we do not write θ explicitly.
Theorem 2 (policy gradient theorem). We have the following two properties.
(1) At any θ, for any x ∈ E, the following Bellman-type equation holds for the value function V and the average reward r(θ):
V(x) = Σ_y P_θ(y | x) [ R_θ(x, y) − r(θ) + V(y) ].   (15)
(2) The gradient of the average reward r(θ) is
∇_θ r(θ) = E[ ∇_θ R_θ(X, Y) ] + E[ ∇_θ ln P_θ(Y | X) ( R_θ(X, Y) − r(θ) + V(Y) ) ],   (16)
where the expectations are with respect to the joint distribution (X, Y) ∼ μ_θ(x) P_θ(y | x) and μ_θ is the stationary measure of P_θ.
Proof. We prove the Bellman equation first and then use it to derive the gradient of the average reward r(θ). For any x ∈ E, conditioning on the first transition and using the Markov property,
V(x) = Σ_y P_θ(y | x) [ R_θ(x, y) − r(θ) ] + Σ_y P_θ(y | x) E_{P_θ}[ Σ_{t=1}^{∞} ( R_θ(X_t, X_{t+1}) − r(θ) ) | X_1 = y ]
     = Σ_y P_θ(y | x) [ R_θ(x, y) − r(θ) + V(y) ],
which proves (15) for all x ∈ E. To prove (16), differentiate (15) with respect to θ:
∇_θ V(x) = Σ_y ∇_θ P_θ(y | x) [ R_θ(x, y) − r(θ) + V(y) ] + Σ_y P_θ(y | x) [ ∇_θ R_θ(x, y) − ∇_θ r(θ) + ∇_θ V(y) ].
Multiplying by μ_θ(x) and summing over x, the terms Σ_x μ_θ(x) ∇_θ V(x) and Σ_{x,y} μ_θ(x) P_θ(y | x) ∇_θ V(y) cancel by the stationarity of μ_θ, and we obtain
∇_θ r(θ) = Σ_{x,y} μ_θ(x) P_θ(y | x) ∇_θ R_θ(x, y) + Σ_{x,y} μ_θ(x) ∇_θ P_θ(y | x) [ R_θ(x, y) − r(θ) + V(y) ],
which is (16) since ∇_θ P_θ(y | x) = P_θ(y | x) ∇_θ ln P_θ(y | x). In fact, r(θ) inside the square bracket of the last line can be replaced by any constant b (independent of x and y) without changing the equality, due to the fact
Σ_{x,y} μ_θ(x) ∇_θ P_θ(y | x) = Σ_x μ_θ(x) ∇_θ Σ_y P_θ(y | x) = 0.   (18)
(16) is the special case b = r(θ).
Remark 1. As shown in the proof, (16) still holds if r(θ) on the right-hand side is replaced by any constant b. The choice b = r(θ) is a good one for variance reduction, since r(θ) is the expectation of R_θ(X, Y).
Remark 2. If P_θ = Q_θ, then the first term of (16) vanishes due to (18) and the second term of (16) vanishes due to (15).
Remark 3. The name "policy" here refers to the role of θ as the policy through which the decision maker improves the reward r(θ).
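As a sanity check of Theorem 2, one can compare the right-hand side of (16) with a finite-difference derivative of r(θ). The sketch below does this for a toy two-state family; the particular parametrization of P_θ and Q_θ is an assumption made only for this illustration.

```python
# Sketch: numerical verification of the policy gradient identity (16).
import numpy as np

def stationary(P):
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    return np.linalg.lstsq(A, np.append(np.zeros(m), 1.0), rcond=None)[0]

def kernels(theta):
    # Hypothetical 2-state kernels: only the first rows depend on theta.
    p = 1.0 / (1.0 + np.exp(-theta))         # P_theta(1 | 0)
    q = 1.0 / (1.0 + np.exp(-2.0 * theta))   # Q_theta(1 | 0)
    return np.array([[1 - p, p], [0.6, 0.4]]), np.array([[1 - q, q], [0.5, 0.5]])

def avg_reward(theta):
    P, Q = kernels(theta)
    mu = stationary(P)
    return float(np.sum(mu[:, None] * P * np.log(Q / P))), P, Q, mu

def value_function(P, R, r):
    # Solve the Bellman equation (15), pinning V[-1] = 0 to fix the additive constant.
    m = P.shape[0]
    rho = (P * R).sum(axis=1) - r
    A = np.eye(m) - P
    A[-1, :] = 0.0; A[-1, -1] = 1.0; rho[-1] = 0.0
    return np.linalg.solve(A, rho)

def grad_formula(theta, eps=1e-6):
    # Right-hand side of (16); one-step gradients taken by finite differences.
    r, P, Q, mu = avg_reward(theta)
    R = np.log(Q / P)
    V = value_function(P, R, r)
    P2, Q2 = kernels(theta + eps)
    dP, dQ = (P2 - P) / eps, (Q2 - Q) / eps
    dR, dlogP = dQ / Q - dP / P, dP / P
    return float(np.sum(mu[:, None] * P * (dR + dlogP * (R - r + V[None, :]))))

theta = 0.7
r_plus, r_minus = avg_reward(theta + 1e-5)[0], avg_reward(theta - 1e-5)[0]
print(grad_formula(theta), (r_plus - r_minus) / 2e-5)   # the two numbers should agree
```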

3.2. Learn QSD. Now we discuss how to connect the QSD with the results of the previous subsection. In view of equation (5), we introduce β := αK_α as the one-step distribution when starting from the initial distribution α, i.e.,
β(A) = (αK_α)(A) = Σ_x α(x) K_α(x, A), A ∈ B(E).   (19)
By (5), α is a QSD if and only if β = α. However, we do not directly compare the two distributions α and β. Instead, we consider the Markovian kernels they induce via (4): K_α and K_β. Our approach is to consider the KL divergence, as in (11), between the two kernels K_α and K_β, since α = β if and only if K_α = K_β. In this way, one can view K_α and K_β (note β = αK_α) as the two transition matrices P_θ and Q_θ of the previous section, in which the parameter θ is in fact the distribution α.
To obtain a concrete representation of the distribution α, which is a probability mass function on E, we propose a parametrized family α_θ, where θ is a generic parameter. In the simplest case, α_θ takes the softmax form
α_θ(i) = e^{θ_i} / Σ_{j=1}^{N} e^{θ_j}, i ∈ E = {1, ..., N}, with θ = (θ_1, ..., θ_{N−1}, θ_N ≡ 0).
This parametrization represents α without any approximation error for a finite state space, and the effective parameter space of θ is just R^{N−1}. For certain problems, particularly with large state spaces, if one has some prior knowledge about the structure of the function α on E, one may propose other parametric forms of α_θ with the dimension of θ less than the cardinality |E| to improve efficiency, at the cost of introducing an extra representation error.
For any given α_θ ∈ P(E), the corresponding Markovian kernel K_{α_θ} is defined by (4), β_θ = α_θ K_{α_θ} is defined by (19), and K_{β_θ} is likewise defined by (4) again. To use the formulation of Section 3.1, we choose P_θ = K_{α_θ} and Q_θ = K_{β_θ}. Define the objective function as before:
r(θ) := lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} R_θ(X_t, X_{t+1}) ],
where
R_θ(x, y) := ln ( K_{β_θ}(x, y) / K_{α_θ}(x, y) ).
The value function V(x) is defined likewise. Theorem 2 now gives the expression of the gradient
∇_θ r(θ) = E[ ∇_θ R_θ(X, Y) ] + E[ ∇_θ ln K_{α_θ}(X, Y) ( R_θ(X, Y) − r(θ) + V(Y) ) ],   (20)
where (X, Y) ∼ μ_θ(x) K_{α_θ}(x, y), μ_θ is the stationary measure of K_{α_θ}, and ∇_θ R_θ(x, y) = ∇_θ ln K_{β_θ}(x, y) − ∇_θ ln K_{α_θ}(x, y). The optimal θ* for the QSD α_θ maximizes r(θ), and this can be solved by the gradient ascent iteration (equivalently, gradient descent on −r(θ))
θ_{t+1} = θ_t + η_t^θ ∇_θ r(θ_t),   (21)
where η_t^θ > 0 is the step size. In practice, the stochastic gradient update
θ_{t+1} = θ_t + η_t^θ [ ∇_θ R_{θ_t}(X_t, X_{t+1}) + δ_t ∇_θ ln K_{α_{θ_t}}(X_t, X_{t+1}) ]
is applied, where X_t, X_{t+1} are sampled from the Markovian kernel K_{α_{θ_t}} (see Algorithm 1) and the differential temporal-difference (TD) error δ_t is
δ_t = R_{θ_t}(X_t, X_{t+1}) − r(θ_t) + V(X_{t+1}) − V(X_t).   (22)
Next, we need to address a remaining issue: how to compute the value function V and the average reward r(θ_t) appearing in the TD error (22). Besides, we also need to give the details of computing ∇_θ K_{α_θ} and ∇_θ K_{β_θ}. In practice, a mini-batch of transitions is sampled at each iteration and the average of the resulting stochastic gradients is used to update the parameters. The stationary distribution μ_θ of K_{α_θ} is sampled by running the corresponding Markov chain for several steps with a "warm start": the initial state for θ_{t+1} is set to the final state generated in the previous iteration at θ_t. The length of this "burn-in" period can be set to just one step in practice for efficiency.
Algorithm 1 (ac-α method): Actor-critic algorithm for the quasi-stationary distribution α_θ.
Initialization: t = 0; θ = θ_0; ψ = ψ_0; r = r_0; sample X_0 ∼ μ_{θ_0}, the stationary distribution of K_{α_{θ_0}}.
For t = 0, 1, 2, ... do:
  (1) Sample X_{t+1} from the transition kernel K_{α_{θ_t}}(X_t, ·).
  (2) Compute the reward R_{θ_t}(X_t, X_{t+1}) = ln( K_{β_{θ_t}}(X_t, X_{t+1}) / K_{α_{θ_t}}(X_t, X_{t+1}) ) and the TD error δ_t = R_{θ_t}(X_t, X_{t+1}) − r_t + V_ψ(X_{t+1}) − V_ψ(X_t).
  (3) Update the average-reward estimate: r_{t+1} = r_t + η_t^r δ_t.
  (4) Update the critic: ψ_{t+1} = ψ_t + η_t^ψ δ_t ∇_ψ V_ψ(X_t).
  (5) Update the actor: θ_{t+1} = θ_t + η_t^θ [ ∇_θ R_{θ_t}(X_t, X_{t+1}) + δ_t ∇_θ ln K_{α_{θ_t}}(X_t, X_{t+1}) ].
End for.

Remark 4. Finally, we remark on the computation of ∇_θ ln K_{α_θ} and ∇_θ ln K_{β_θ} in Algorithm 1. The details are given in the Appendix. We note that the main computational cost is the function K(x, E), which has to be pre-computed and stored. If the problem has some special structure, this function could be approximated in parametric form. Another special case is our second example, where 1 − K(x, E) = 0 for all x ∈ {2, 3, ..., N}.
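A compact NumPy sketch of Algorithm 1 for the tabular softmax parametrization is given below. The gradients ∇_θ ln K_{α_θ} and ∇_θ ln K_{β_θ} are computed analytically from the softmax Jacobian (our own derivation from Definition 2.1 and (19), not the paper's Appendix); the test matrix, learning rates and step count are illustrative.

```python
# Sketch: one-step actor-critic (Algorithm 1) for the QSD of a sub-Markovian K.
import numpy as np

def softmax_alpha(theta):
    logits = np.append(theta, 0.0)              # fix the last logit to zero
    w = np.exp(logits - logits.max())
    return w / w.sum()

def alpha_and_jacobian(theta):
    a = softmax_alpha(theta)
    J = np.diag(a) - np.outer(a, a)             # d alpha / d logits
    return a, J[:, :-1]                         # drop the fixed logit's column

def ac_qsd(K, n_steps=100_000, eta_theta=0.05, eta_v=0.01, eta_r=0.01, seed=0):
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    kill = 1.0 - K.sum(axis=1)                  # 1 - K(x, E)
    theta, V, r_bar = np.zeros(m - 1), np.zeros(m), 0.0
    x = rng.integers(m)
    for _ in range(n_steps):
        alpha, Ja = alpha_and_jacobian(theta)
        Ka = K + np.outer(kill, alpha)          # K_alpha, Definition 2.1
        beta = alpha @ Ka                       # beta = alpha K_alpha, eq. (19)
        Jb = K.T @ Ja + np.outer(alpha, kill @ Ja) + (alpha @ kill) * Ja   # d beta / d theta
        Kb = K + np.outer(kill, beta)           # K_beta

        y = rng.choice(m, p=Ka[x])              # X_{t+1} ~ K_alpha(X_t, .)
        R = np.log(Kb[x, y] / Ka[x, y])         # one-step reward
        delta = R - r_bar + V[y] - V[x]         # differential TD error (22)
        r_bar += eta_r * delta                  # average-reward estimate
        V[x] += eta_v * delta                   # tabular critic update
        g_log_Ka = kill[x] * Ja[y] / Ka[x, y]   # grad_theta ln K_alpha(x, y)
        g_log_Kb = kill[x] * Jb[y] / Kb[x, y]   # grad_theta ln K_beta(x, y)
        theta += eta_theta * ((g_log_Kb - g_log_Ka) + delta * g_log_Ka)   # actor update
        x = y
    return softmax_alpha(theta)

K = np.array([[0.25, 0.5, 0.0],                 # hypothetical 3-state sub-Markovian matrix
              [0.25, 0.25, 0.5],
              [0.0, 0.5, 0.5]])
print(ac_qsd(K))    # compare with the principal left eigenvector of K
```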

Numerical experiments
In this section, we present two examples to demonstrate Algorithm 1. We refer to the algorithms (7), (8) and (9) of Section 2.2 used in [7,39] as the Vanilla Algorithm, the Projection Algorithm and the Polyak Averaging Algorithm, respectively. Let 0 be the absorbing state and E = {1, ..., N} the non-absorbing states; the Markov transition matrix on {0, ..., N} is denoted by
K̄ = ( 1 , 0 ; 1 − K1 , K ),
written in block form with the first row and column corresponding to the absorbing state 0, where K is the N-by-N sub-Markovian matrix and 1 is the column vector of ones. For Algorithm 1, the distribution α_θ on E is always parameterized as
α_θ = ( e^{θ_1}, ..., e^{θ_{N−1}}, 1 ) / ( e^{θ_1} + ... + e^{θ_{N−1}} + 1 ),
and the value function V_ψ(x) is represented in tabular form for simplicity: V_ψ(x) = ψ_x, where ψ ∈ R^N.
4.1. A loopy Markov chain. The first example is a small loopy Markov chain on three states with a parameter ǫ controlling the transition structure (see Figure 2). In one case, the batch size is 4 and the step size for the Projection Algorithm is ǫ_n = n^{−0.99}. Figure 2 is for the case ǫ = 0.9: we set the initial values θ_0 = [4, −2], ψ_0 = [0, 0, 0], r_0 = 0, the learning rates η_n^θ = 0.04, η_n^ψ = 0.0001, η_n^r = 0.0001, the batch size 32, and the step size for the Projection Algorithm ǫ_n = n^{−0.99}.
4.2. M/M/1/N queue with finite capacity and absorption. Our second example is an M/M/1 queue with finite queue capacity N, where state 0 is set as an absorbing state. The transition probability matrix on {0, ..., N} is a birth-death (tridiagonal) matrix in which, from state i, the ratio ρ_i of the right-jump probability to the left-jump probability being larger than 1 means a higher chance to jump to the right than to the left. A larger ρ_i gives a smaller probability of exiting E. Note that K(x, E) = 1 for x ∈ {2, ..., N}, so K_α(x, ·) = K(x, ·) for any α when x ≠ 1, and the gradients ∇_θ ln K_{α_θ}(x, ·) and ∇_θ ln K_{β_θ}(x, ·) vanish unless x = 1; by (20), the gradient therefore reduces to an expectation over transitions out of state 1, where Y follows the distribution K_{α_θ}(1, ·). We consider two cases: (1) a constant ρ_i = 1.25 and (2) a state-dependent ρ_i = 2 − 3(i − 1)/(2N − 4). Note that ρ_i = 1 gives equal probabilities of jumping to the left and to the right. So in case (1) there is a boundary layer at the right end, and in case (2) we expect to see a peak of the QSD near i ≈ 2N/3. Figure 3 shows the true QSD in both cases. We set N = 500.
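Since the exact transition probabilities of this queue are not fully reproduced here, the following sketch uses one plausible realization (right-jump probability ρ_i/(1 + ρ_i), left-jump probability 1/(1 + ρ_i), reflection at the capacity N, absorption only from state 1) and computes the reference QSD as the principal left eigenvector, as in Section 2.

```python
# Sketch: an assumed M/M/1/N queue matrix with absorption from state 1, and its QSD.
import numpy as np

def queue_matrix(N, rho):
    K = np.zeros((N, N))                  # states 1..N stored at indices 0..N-1
    for i in range(N):
        p = rho[i] / (1.0 + rho[i])       # probability of a right jump
        q = 1.0 - p                       # probability of a left jump (absorption from state 1)
        if i + 1 < N:
            K[i, i + 1] = p
        else:
            K[i, i] += p                  # reflect at the capacity N
        if i - 1 >= 0:
            K[i, i - 1] = q               # the left jump from state 1 (i = 0) leads to absorption
    return K

def true_qsd(K):
    vals, vecs = np.linalg.eig(K.T)       # principal left eigenvector of K
    a = np.abs(vecs[:, np.argmax(vals.real)].real)
    return a / a.sum()

N = 500
rho_const = np.full(N, 1.25)                              # case (1): constant rho
rho_state = 2.0 - 3.0 / (2 * N - 4) * np.arange(N)        # case (2): rho_i = 2 - 3(i-1)/(2N-4)
print(true_qsd(queue_matrix(N, rho_const))[-5:])          # boundary layer at the right end
```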
In Table 1, we compare the CPU time of each algorithm on the M/M/1/500 queue when it reaches an accuracy of 2 × 10^{−1}. We find that our algorithm costs less time on this example.

Figure 2. The loopy Markov chain example with ǫ = 0.9. The figure shows log-log plots of the L^2-norm error of the Vanilla Algorithm (a), the Projection Algorithm (b), the Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

Table 1. The CPU time of each algorithm on the M/M/1/500 queue when it reaches an accuracy of 2 × 10^{−1}.