1. Introduction
Consider the following problem. Let $P$ be the unknown transition matrix of an ergodic Markov chain on a finite, discrete state-space of size $d$. Let $\pi$ be the unknown stationary distribution vector associated with $P$. Assume that we are interested in approximating $\pi$ based on a single sample path characterized by a sequence of observed states $X_1, X_2, \ldots, X_N$, where $N$ is a positive integer. Assume that $\hat{P}$ is the maximum likelihood estimate of $P$ based on this sample path and that $\hat{P}$ admits the unique stationary distribution vector $\hat{\pi}$.
A question that naturally arises is: how close is $\hat{\pi}$ to $\pi$? This question is likely to occur in many applications where a process is assumed to follow a Markov chain and we are interested in the steady-state value of a quantity of interest, which can be expressed as a function $g$ that maps each state to a real number. More specifically, in this context we are interested in $\sum_{k=1}^{d} \pi_k \, g(k)$, where $\pi$ is unknown.
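As a concrete illustration of this setup (a minimal sketch of our own, not part of the paper's method; the three-state chain, the per-state quantity `g`, and all variable names are hypothetical), the following snippet builds the maximum likelihood estimate $\hat{P}$ from a single simulated sample path, computes its stationary distribution $\hat{\pi}$, and forms the plug-in estimate of $\sum_k \pi_k \, g(k)$:

```python
import numpy as np

def mle_transition_matrix(path, d):
    """Maximum likelihood estimate of P: row-normalized transition counts."""
    counts = np.zeros((d, d))
    for s, t in zip(path[:-1], path[1:]):
        counts[s, t] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def stationary_distribution(P):
    """Stationary distribution as the left eigenvector of P for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

# Hypothetical 3-state ergodic chain, used here only to generate a sample path.
rng = np.random.default_rng(0)
P_true = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.3, 0.4]])
N = 5000
path = [0]
for _ in range(N - 1):
    path.append(rng.choice(3, p=P_true[path[-1]]))

P_hat = mle_transition_matrix(path, d=3)
pi_hat = stationary_distribution(P_hat)
g = np.array([1.0, 0.0, -2.0])      # hypothetical per-state quantity of interest
print(pi_hat, pi_hat @ g)           # estimate of pi and of the steady-state value of g
```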
This paradigm is likely to arise in areas as diverse as computational physics [1,2], social and environmental sciences [3], and Markov decision processes (MDPs), particularly in the case of infinite-horizon undiscounted MDPs, also called average reward processes [4].
Depending on the exact context, it may be useful to derive confidence regions for $\pi$ and/or confidence intervals for $\sum_{k=1}^{d} \pi_k \, g(k)$. In this paper, we propose an approach to this effect that, to the best of our knowledge, has not been previously suggested in the literature.
There is a vast statistical literature on Markov chain estimation. The research relevant to the context described above can be divided into two main categories.
The first category comprises asymptotic results (e.g., [5,6]) that are useful to understand the limiting behavior of $\hat{\pi}$ but are not suitable for the finite-time, non-asymptotic analysis required in the context we described. For instance, it is known that the sample mean $\frac{1}{N}\sum_{t=1}^{N} g(X_t)$ converges almost surely to $\sum_{k=1}^{d} \pi_k \, g(k)$ as $N \to \infty$ [6]. Moreover, it is also known that, under certain conditions, as $N \to \infty$ the distribution of $\sqrt{N}\big(\frac{1}{N}\sum_{t=1}^{N} g(X_t) - \sum_{k=1}^{d} \pi_k \, g(k)\big)$ converges toward a normal distribution with mean 0 and asymptotic variance $\sigma^2$ [5].
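As a rough empirical check of these two asymptotic statements (again a sketch of our own with a hypothetical chain, not an experiment from the paper), one can simulate independent sample paths, compute the sample mean of $g$ along each path, and inspect the scaled deviations from the true steady-state value:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])   # hypothetical chain, known here only because we simulate it
g = np.array([1.0, 0.0, -2.0])

# True steady-state value, available in this toy example because P is known.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
target = pi @ g

N, reps = 2000, 200
scaled_devs = []
for _ in range(reps):
    x, total = 0, 0.0
    for _ in range(N):
        total += g[x]
        x = rng.choice(3, p=P[x])
    scaled_devs.append(np.sqrt(N) * (total / N - target))

# The sample means cluster around the target (law of large numbers), and the
# sqrt(N)-scaled deviations behave approximately like a centered normal (CLT).
print(np.mean(scaled_devs), np.std(scaled_devs))
```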
The second category includes results that are appropriate for non-asymptotic analysis (e.g., [7,8,9,10,11]) but assume that some of the key properties of the Markov chain are known, such as the mixing time (see [12] for more details on this concept), the spectral gap, or even $P$ itself.
Our work adds to the existing literature in that our main result in Section 2 is suitable for non-asymptotic analysis whilst making no assumptions about the properties of $P$ beyond what can be inferred from the observed sample path. To achieve this, we expand on the results in [11].
Despite not strictly falling within the statistical domain, we deem it relevant to mention a third category of results due to the wealth of available literature. This category revolves around deriving so-called perturbation bounds for $\pi$ and analyzing the sensitivity of $\pi$ to perturbations of $P$ (see [13,14,15,16,17,18,19]). By and large, the existing literature revolves around absolute bounds of the following form:
$$\|\tilde{\pi} - \pi\| \le k \, \|\tilde{P} - P\|, \qquad (1)$$
where $k$ is derived in a variety of ways; $\tilde{P}$ is a perturbation of $P$ with associated stationary distribution vector $\tilde{\pi}$; and $\|\cdot\|$ denotes suitable vector and matrix norms, for which different choices appear in the literature. Much of the focus of the existing literature is on ways to derive $k$, whereby $k$ is commonly expressed in terms of either the fundamental matrix of the underlying Markov chain; the mean first passage times; or the group inverse of $I - P$, where $I$ is the identity matrix. While inequalities of the form of (1) are useful when $P$ is known, in the setting that we described in the introduction, $P$ is unknown.
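To illustrate the structure of inequality (1), the following sketch of our own uses a hypothetical chain and perturbation; the constant below, based on the largest absolute entry of the group inverse of $I - P$, is only one simple choice of $k$ and not necessarily the sharpest constant discussed in [13,14,15,16,17,18,19]:

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def group_inverse_of_I_minus(P):
    """Group inverse of A = I - P via the identity A# = (A + 1 pi^T)^{-1} - 1 pi^T."""
    d = P.shape[0]
    pi = stationary(P)
    W = np.outer(np.ones(d), pi)
    return np.linalg.inv(np.eye(d) - P + W) - W

# Hypothetical chain and a small hypothetical perturbation (rows of E sum to zero).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
E = np.array([[ 0.02, -0.01, -0.01],
              [-0.02,  0.01,  0.01],
              [ 0.00,  0.01, -0.01]])
P_tilde = P + E

pi, pi_tilde = stationary(P), stationary(P_tilde)
A_sharp = group_inverse_of_I_minus(P)

k = np.max(np.abs(A_sharp))                   # one possible (crude) choice of k
lhs = np.max(np.abs(pi_tilde - pi))           # infinity norm of the change in pi
rhs = k * np.max(np.abs(E).sum(axis=1))       # k times the infinity norm of the perturbation
print(lhs, "<=", rhs, lhs <= rhs)
```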
Given the rich available literature on bounds of the form (1), a natural place to start for our purposes is to derive a confidence bound on $\|\hat{P} - P\|$, which can then be plugged into a suitable variation of inequality (1). However, numerical experiments revealed that the realized coverage ratios obtained with this approach far exceeded the nominal level of confidence. A major challenge with this kind of approach is that worst-case assumptions need to be made in order to derive inequalities of the form (1), which are absolute bounds. We found that confidence regions based on such inequalities therefore tend to be exceedingly conservative. By contrast, our proposed approach relies on asymptotically exact relationships and is therefore not affected by the same shortcoming.
The rest of this paper proceeds as follows.
Section 2 presents the results underpinning the proposed approach as well as the proofs of these results.
Section 3 presents the proposed approach.
Section 4 illustrates the proposed approach with examples and presents numerical experiments to analyze the coverage characteristics of the approach.
Section 5 concludes this work and discusses potential future directions of research.
2. Results Underpinning the Proposed Approach and Proofs
2.1. Results Underpinning the Proposed Approach
Let $\{X_t\}_{t \ge 1}$ be an irreducible Markov chain with finite and discrete state-space of size $d$. Throughout this paper, for any integer $n$ we denote $[n]$ as the set of indices from 1 to $n$. Let $P$ be the transition matrix, and assume that $P$ admits the unique stationary distribution vector $\pi$.
Let $X_1, \ldots, X_N$ be a sample path of this Markov chain, and let $\hat{P}$ be the maximum likelihood estimate of $P$ based on this sample path. Assume that $\hat{P}$ admits the unique stationary distribution vector $\hat{\pi}$. Furthermore, for every $k \in [d]$, let $n_k$ denote the number of observed transitions out of state $k$ in the sample path. Assume that $n_k > 0$ for every $k \in [d]$.
Theorem 1. Equations (2) and (3) hold, where the symmetric matrix appearing therein depends on the kth row of $\hat{P}$, and the column vector appearing therein depends on a column of a one-condition g-inverse of $I - \hat{P}$.

The definition of a one-condition g-inverse is given in Section 2.2. We present the proof of Theorem 1 in Section 2.3. The proof of Equation (3) is similar to that of Theorem 3.1 in [11]. The key difference between Equation (3) and Theorem 3.1 in [11] is that Equation (3) only involves inputs that can be inferred from the observed sample path.
2.2. Core Concepts Underlying the Proof of Theorem 1
The following definition and Equation (4) are drawn from [19].
Definition 1 (One-condition g-inverse). A one-condition g-inverse (or one-condition generalized inverse) of a matrix $A$ is any matrix $A^{-}$ such that $A A^{-} A = A$.
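For instance (a small check of our own), the Moore–Penrose pseudoinverse of $I - P$ satisfies this single condition and can therefore serve as one concrete example of a one-condition g-inverse:

```python
import numpy as np

# A = I - P for a hypothetical 3-state transition matrix; A is singular.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
A = np.eye(3) - P

G = np.linalg.pinv(A)                # Moore-Penrose pseudoinverse of A
print(np.allclose(A @ G @ A, A))     # the defining one-condition property A G A = A
```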
Let $P$ be the transition matrix of a finite irreducible Markov chain, which is assumed to have the associated steady-state distribution vector $\pi$. Let $\tilde{P} = P + E$ be the transition matrix of the perturbed Markov chain, where $E$ is the matrix of perturbations. Notice that $E = \tilde{P} - P$, and that the rows of $E$ sum to zero. $\tilde{P}$ is assumed to admit the unique steady-state distribution vector $\tilde{\pi}$. Then, according to Theorem 2.1 in [19],
$$\tilde{\pi}^{\top} - \pi^{\top} = \tilde{\pi}^{\top} E H, \qquad (4)$$
where $H = G(I - \mathbf{1}\pi^{\top})$, $\mathbf{1}$ is a column vector of ones, $I$ is the identity matrix, and $G$ is a one-condition g-inverse of $I - P$.
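The relationship (4) can be checked numerically. The sketch below (ours, reusing the hypothetical chains from the earlier snippets) takes $G$ to be the group inverse of $I - P$, which is one valid one-condition g-inverse, builds $H$, and verifies both the g-inverse property and Equation (4):

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

# Hypothetical unperturbed and perturbed chains (rows of E sum to zero).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
E = np.array([[ 0.02, -0.01, -0.01],
              [-0.02,  0.01,  0.01],
              [ 0.00,  0.01, -0.01]])
P_tilde = P + E

d = P.shape[0]
pi, pi_tilde = stationary(P), stationary(P_tilde)
A = np.eye(d) - P
W = np.outer(np.ones(d), pi)               # W = 1 pi^T
G = np.linalg.inv(A + W) - W               # group inverse of A, one valid g-inverse
H = G @ (np.eye(d) - W)                    # H as in (4); for the group inverse G W = 0, so H = G

print(np.allclose(A @ G @ A, A))                      # G is indeed a one-condition g-inverse
print(np.allclose(pi_tilde - pi, pi_tilde @ E @ H))   # Equation (4) holds to numerical precision
```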
An exhaustive review of the different ways to compute matrix $G$ (and therefore matrix $H$) is beyond the scope of this work. For the purposes of the examples in Section 4, we computed $G$ as the so-called group inverse (see [20]) of $I - \hat{P}$. This choice is mainly motivated by ease of computation for the transition matrix of an ergodic Markov chain (Theorem 5.2 in [20]). Indeed, the most computationally intensive step is the calculation of the inverse of a $(d-1) \times (d-1)$ principal submatrix of $I - \hat{P}$.

It bears mentioning that $H$ can also be derived from the matrix of mean first passage times and the steady-state distribution vector (see [19]), which can enhance interpretability.
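The remark about mean first passage times can be made concrete. The sketch below (our own; it uses the classical fundamental-matrix relations rather than the specific formulas of [19] or [20]) first computes the mean first passage time matrix of a hypothetical chain, then reconstructs the group inverse of $I - P$, and hence $H$, from the mean first passage times and the steady-state vector alone:

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

# Hypothetical 3-state ergodic chain.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
d = P.shape[0]
pi = stationary(P)
W = np.outer(np.ones(d), pi)

Z = np.linalg.inv(np.eye(d) - P + W)       # fundamental matrix
A_sharp = Z - W                            # group inverse of I - P

# Mean first passage times m[i, j] (with m[j, j] = 1 / pi[j], the mean return time).
m = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        m[i, j] = 1.0 / pi[j] if i == j else (Z[j, j] - Z[i, j]) / pi[j]

# Reconstruct Z (and hence the group inverse) from m and pi only.
Z_rec = np.zeros((d, d))
for j in range(d):
    zjj = pi[j] * (1.0 + sum(pi[i] * m[i, j] for i in range(d) if i != j))
    for i in range(d):
        Z_rec[i, j] = zjj if i == j else zjj - pi[j] * m[i, j]

A_sharp_rec = Z_rec - W
print(np.allclose(A_sharp_rec, A_sharp))   # the group inverse is recovered from m and pi
```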
For a sample path of length $N$ characterized by the sequence of observed states $X_1, \ldots, X_N$ of a discrete ergodic Markov chain with transition matrix $P$, let $M$ be the $d \times d$ matrix defined as follows. For any pair of states $(i, j)$, let $M_{ij}$ be the number of transitions from $i$ to $j$ in the sample path, or more formally $M_{ij} = \sum_{t=1}^{N-1} \mathbb{1}\{X_t = i, X_{t+1} = j\}$. We will refer to $M$ as the sequence matrix, and we write $n_i = \sum_{j} M_{ij}$ for the number of observed transitions out of state $i$. Also, throughout this work, we will assume that $n_i > 0$ for every state $i$.
Lemma 1. Conditional on the row sums $n_1, \ldots, n_d$, the rows of $M$ are mutually independent and respectively follow multinomial distributions with $n_i$ trials and unknown event probabilities given by the corresponding rows of $P$.

Proof. Conditional on $n_i$, the joint outcome of the $i$th row of $M$ can be equated to $n_i$ independent trials (independence follows from the Markov property), where the outcome of each trial has a categorical distribution with fixed success probabilities $(P_{i1}, \ldots, P_{id})$. In other words, conditional on $n_i$, the entries $(M_{i1}, \ldots, M_{id})$ jointly follow a multinomial distribution with $n_i$ trials and event probabilities $(P_{i1}, \ldots, P_{id})$.
Also, for $i \in [d]$, let $m^{(i)} = (M_{i1}, \ldots, M_{id})$ be a vector corresponding to row $i$ of the sequence matrix $M$, and let $p^{(i)} = (P_{i1}, \ldots, P_{id})$ be a vector corresponding to row $i$ of the unknown transition matrix $P$. Given known $n_i$ and unknown $p^{(i)}$, for $i \in [d]$:
$$m^{(i)} \mid n_i \sim \text{Multinomial}\big(n_i, p^{(i)}\big). \qquad \square$$
Importantly, independence between the rows only holds conditional on the knowledge of $n_1, \ldots, n_d$. Otherwise, the rows of $M$ are of course not independent, since each $n_i$ crucially depends on the transitions recorded in the other rows of $M$.
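Lemma 1 can be illustrated empirically (a sketch of our own with hypothetical numbers; we do not claim this is exactly the covariance object used in Theorem 1). Conditional on the number $n$ of observed departures from a given state, the corresponding row of the estimated transition matrix behaves like a scaled multinomial draw, so its covariance is $(\operatorname{diag}(p) - pp^{\top})/n$, where $p$ is the corresponding row of $P$:

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.6, 0.3, 0.1])    # hypothetical row of P (transition probabilities out of one state)
n = 200                          # conditioning value: number of observed departures from that state
reps = 20000

# Replicated row estimates p-hat = counts / n, each conditional on n trials.
rows = rng.multinomial(n, p, size=reps) / n
emp_cov = np.cov(rows, rowvar=False)
theory = (np.diag(p) - np.outer(p, p)) / n

print(np.max(np.abs(emp_cov - theory)))   # small: empirical covariance matches the multinomial one
```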
2.3. Proof of Theorem 1
Notice that for a sample path of an irreducible Markov chain with discrete and finite state-space, transition matrix $P$, and unique steady-state distribution vector $\pi$, we can build the maximum likelihood estimate $\hat{P}$ of $P$ based on the sequence matrix, namely $\hat{P}_{ij} = M_{ij}/n_i$. Moreover, $\hat{P}$ can be viewed as a perturbation of $P$ (and vice versa).
Assume that $\hat{P}$ admits the unique steady-state distribution vector $\hat{\pi}$. Let $E = P - \hat{P}$ and $H = G(I - \mathbf{1}\hat{\pi}^{\top})$, where $\mathbf{1}$ is a column vector of ones, $I$ is the identity matrix, and $G$ is a one-condition g-inverse of $I - \hat{P}$. In this context, the following lemmas apply.
Proof. The first displayed equality follows by matrix multiplication and Equation (4). The second equality follows from the conditional independence between the rows of the sequence matrix (Lemma 1). □
The following three lemmas are corollaries of Lemmas 3.4–3.6 in [11], except that here $E = P - \hat{P}$ and $H = G(I - \mathbf{1}\hat{\pi}^{\top})$, whereby $\mathbf{1}$ is a column vector of ones, $I$ is the identity matrix, and $G$ is a one-condition g-inverse of $I - \hat{P}$. For the sake of brevity, we refer the reader to [11] for the respective proofs.
Lemma 5. Without loss of generality, assume that the relevant row vector has only nonzero elements; then the equalities stated in the lemma hold.

Notice that, from these equalities and Lemma 5, two further identities follow. Additionally, given Lemmas 2 and 4, respectively, these identities yield Equations (2) and (3), which proves Theorem 1.
5. Conclusions
In this paper, we proposed a method to numerically determine confidence regions for the steady-state probabilities and confidence intervals for additive functionals of an ergodic finite-state Markov chain. The novelty of our method stems from the fact that it simultaneously fulfills three conditions. First, our method is asymptotically exact. Second, our method requires only a single sample path as an input. Third, our method does not require any inputs beyond what can be inferred from the observed sample path. To the best of our knowledge, no method that satisfies all three conditions has previously been proposed in the literature.
Since our method relies on asymptotically exact results, we expect it to work well for large enough values of $N$, as exemplified by the numerical experiment in Section 4.2. However, since the premise of this paper is that nothing is known about the Markov chain beyond what can be inferred from the observed sample path, what constitutes a large enough value of $N$ cannot be known with certainty. If $N$ is too small, the variance estimate may become numerically unstable, as illustrated in Section 4.2. In this case, the confidence interval's realized coverage level could undershoot the desired confidence level. One way to mitigate this risk could be to analyze the convergence of the variance estimate as $N$ is increased to include the entire sample path (see Section 4.2 and in particular Figure 2).
Another aspect to consider is bias, which could be a potential concern for small values of $N$. In order for the confidence intervals discussed in Section 3 to be valid, we need to be able to assume that the estimator is unbiased (or at least that its bias is negligible). This is not a strong assumption for large $N$, since the vector of empirical visit frequencies is an unbiased estimator of $\pi$, assuming the initial state is sampled from $\pi$. However, given the premise that nothing is known beyond what can be inferred from the observed sample path, it may be difficult to determine with certainty whether $N$ is large enough to make this assumption. One way to address this issue could be through the lens of mixing time (see [12] for more details). If $N$ exceeds the mixing time, then the bias is likely close to 0. An approach to estimating the mixing time based on a single sample path was proposed in [24].
Lastly, an open question is how one might extend our proposed approach to infinite state-space Markov chains.