3.1. Markov Chains
Markov chains were introduced by Andrey Markov in 1906 in order to prove that the weak law of large numbers does not require strict independence of random variables. In particular, he conducted an experiment in which he recorded all of the letter bigrams in Pushkin's Eugene Onegin to demonstrate this property [19]. Whether he intended it or not, Markov chains have since become a powerful modeling tool used across machine learning and Bayesian statistics. In this section, we cover the basic properties of Markov chains and develop the theory necessary to understand MCMC simulation.
The following definitions describe Markov chains on discrete state spaces. The definitions here come from Chapter 11 (Markov chains) of Introduction to Probability by Blitzstein and Hwang [20], in accordance with other works [21,22].
Definition 1 (Markov Chain, [20]). A sequence of random variables $X_0, X_1, X_2, \ldots$ defined on a finite state space $\{1, 2, \ldots, M\}$ or countable state space $\{1, 2, \ldots\}$ is called a Markov chain if, for all $n \geq 0$,
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i).$$
This condition is called the Markov property.
The Markov property states that the probability of transitioning from state $i$ to state $j$ when moving from $X_n$ to $X_{n+1}$ depends only on the fact that $X_n = i$. If we consider the variables in the chain to be steps in time, with $n$ representing the step number, then, given $X_n$, the steps before $n$ tell us nothing further about the distribution of $X_{n+1}$.
Definition 2 (Transition matrix, [20]). If $X_0, X_1, X_2, \ldots$ is a Markov chain on a countable state space $S$, then the matrix $Q$ with cells $q_{ij} = P(X_{n+1} = j \mid X_n = i)$ is called the transition matrix of the chain.
Note that if the state space is finite with $M$ states, then the dimensions of $Q$ are $M \times M$. If the state space is countably infinite, then $Q$ is not a matrix in the traditional sense but still follows the same rules of addition and multiplication [19].
Proposition 1 (Properties of transition matrices). There are three immediate properties of a transition matrix $Q$:
1. $Q$ is non-negative, i.e., $q_{ij} \geq 0$ for all $i, j$.
2. $\sum_j q_{ij} = 1$ for every $i$, as the rows of $Q$ are conditional probability distributions for $X_{n+1}$ given $X_n = i$.
3. The probability of transitioning from $i$ to $j$ after $n$ steps, denoted as $q^{(n)}_{ij}$, can be represented by the $(i,j)$ cell of the power $Q^n$.
To see part (3), note that the probability of going from $i$ to $j$ after two steps is $\sum_k q_{ik} q_{kj}$, which is the $(i,j)$ cell of $Q^2$. The case of $n$ steps follows by induction [20] (p. 462). Thus, we can say that the conditional distribution of $X_{n+k}$ given $X_k = i$, for any $k$, is given by the $i$th row of $Q^n$.
Example 1 (Weather chain). Any day is either sunny, cloudy, or rainy. We can construct a transition matrix $Q$ to describe a Markov chain on these three states. Each cell of $Q$ tells us the probability of transitioning from one state to another in one step. For example, if day $n$ is sunny, then the distribution of possible weather for day $n+1$ is given by the first row of $Q$. To see the probability of going from one state to another over two steps, we can simply compute $Q^2$.

Let $v$ be a row vector that describes the starting probability distribution over the state space of a chain, i.e., $v_i = P(X_0 = i)$. Then, Proposition 1 (3) implies that the marginal distribution of $X_n$ is given by $vQ^n$. Note that this is different from the conditional distribution of $X_{n+k}$ given $X_k = i$, which is simply given by the $i$th row of $Q^n$, as noted above in Proposition 1.
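As a concrete illustration, the following sketch computes the two-step matrix $Q^2$ and the marginal distribution $vQ^n$ for a weather chain. The numerical transition probabilities below are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Hypothetical weather chain: states 0 = sunny, 1 = cloudy, 2 = rainy.
# The probabilities below are assumptions chosen for illustration only.
Q = np.array([
    [0.6, 0.3, 0.1],   # transitions out of "sunny"
    [0.3, 0.4, 0.3],   # transitions out of "cloudy"
    [0.2, 0.4, 0.4],   # transitions out of "rainy"
])

# Each row is a conditional probability distribution, so every row must sum to 1.
assert np.allclose(Q.sum(axis=1), 1.0)

# Two-step transition probabilities: the (i, j) cell of Q^2.
Q2 = np.linalg.matrix_power(Q, 2)
print("P(sunny -> rainy in 2 steps):", Q2[0, 2])

# Marginal distribution of X_n for a starting distribution v: v Q^n.
v = np.array([1.0, 0.0, 0.0])          # start at "sunny" with probability 1
print("distribution of X_5:", v @ np.linalg.matrix_power(Q, 5))
```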
Now, we introduce some vocabulary for describing Markov chains.
Definition 3 (Recurrent and transient states, [20]). A state $i$ in the state space of a Markov chain is recurrent if, starting from $X_0 = i$, the probability of returning to $i$ at some later step is 1. The opposite of recurrent is transient, which means that, starting from $X_0 = i$, there is a positive probability that the future steps will never return to $i$.

In Example 1, we can see that all of the states of the weather chain are recurrent. From any state, it is possible to eventually transition to any other state. Therefore, from any state, the probability of leaving forever is 0, which implies the probability of returning infinitely many times is 1.
From Definition 3, it should be apparent that if a state $i$ in a finite state space is transient, then the Markov chain will leave it forever at some point with probability 1.
Definition 4 (Reducibility, [20]). A chain is said to be irreducible if for all states $i$ and $j$, there is a positive probability of transitioning from $i$ to $j$ in a finite number of steps. Otherwise, the chain is said to be reducible.
Note that recurrence is a property of a state, while reducibility is a property of a Markov chain.
Example 2 (Reducible chain). Consider a four-state Markov chain with transition matrix $Q$ in which state 1 transitions to state 3 with probability 1/2, state 2 transitions to state 1 with probability 1, and states 3 and 4 each transition to the other with probability 1. We can say that states 1 and 2 are transient, as there is a positive probability of never returning to them when starting from them. For example, the probability of transitioning from state 1 to state 3 is 1/2. Since from state 3 it is impossible to return to state 1, we know that if the chain starts at state 1, then the probability of never returning after one step is at least 1/2. State 2 is transient because it has probability 1 of transitioning to state 1 and then probability 1/2 of transitioning to state 3, from which it is impossible to transition back to state 2. Starting from either state 3 or state 4, the probability is 1 that the chain will bounce back and forth between states 3 and 4, so we can say that those two states are recurrent. This chain, described by $Q$, is reducible, since it is impossible to ever transition from state 3 or 4 to state 1 or 2.
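A short sketch of Example 2 follows. The entries $q_{13} = 1/2$, $q_{21} = 1$, and $q_{34} = q_{43} = 1$ are fixed by the description above; sending the remaining mass of state 1 to state 2 is our own assumption for illustration. The sketch then checks which states can reach which by inspecting powers of $Q$.

```python
import numpy as np

# Assumed transition matrix consistent with Example 2 (states 1-4 -> indices 0-3).
# q_12 = 1/2 is an assumption; the other entries follow the description in the text.
Q = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

# State j is reachable from i if (Q^n)_{ij} > 0 for some n; in a finite chain
# it suffices to check n up to the number of states.
n_states = Q.shape[0]
reach = np.zeros_like(Q, dtype=bool)
P = np.eye(n_states)
for _ in range(n_states):
    P = P @ Q
    reach |= P > 0

print(reach)  # rows 2 and 3 cannot reach columns 0 and 1, so the chain is reducible
```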
Note that a chain is irreducible if for any states $i, j$, there exists an $n$ so that $q^{(n)}_{ij} > 0$, where $q^{(n)}_{ij}$ is the $(i,j)$ cell of $Q^n$.
Proposition 2. In an irreducible Markov chain on a finite state space, all of the states are recurrent.

Proof ([20]). Let $Q$ be the transition matrix of a finite-state Markov chain. We know that at least one state must be recurrent; otherwise, if all the states were transient, the chain would eventually leave every state forever and run out of states to visit. Call this recurrent state 1. By the definition of irreducibility, for any state $j$, there exists a finite $n$ so that the probability $q^{(n)}_{1j}$ of transitioning from state 1 to $j$ after $n$ steps is positive. Since 1 is recurrent, the chain will visit state 1 infinitely many times with probability 1. Therefore, the chain will eventually transition to $j$. From $j$, the chain will eventually transition back to 1, since 1 is recurrent. This is the same situation we began with, so we know that the chain will eventually transition to $j$ again, and indeed infinitely many times. So, $j$ is recurrent. Since this is true for arbitrary $j$, we conclude that all the states are recurrent. □
Definition 5 (Period, [20]). The period of a state $i$ is the greatest common divisor of all the possible numbers of steps it can take for a chain starting at state $i$ to return to $i$. If the period of a state is 1, the state is called aperiodic, and periodic otherwise. If all states are aperiodic, then the chain is called aperiodic. Otherwise, the chain is called periodic.

Example 3 (Random walk). Consider a Markov chain on the state space $\{1, 2, \ldots, N\}$. Starting from state $i$ with $1 < i < N$, the probabilities of transitioning to $i+1$ and $i-1$ are $p$ and $1-p$, respectively. Once the chain reaches state 1 or $N$, the chain stays there with probability 1. The transition matrix describing this random walk can be written as
$$Q = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ 1-p & 0 & p & \cdots & 0 & 0 \\ 0 & 1-p & 0 & \ddots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1-p & 0 & p \\ 0 & 0 & \cdots & 0 & 0 & 1 \end{pmatrix}.$$
We can see that only states 1 and $N$ are recurrent, as for all other states, there is a positive probability of reaching an end state and never returning to the middle states. We can therefore conclude that the chain is reducible. The states between 1 and $N$ (not inclusive) all have period 2, as it will take an even number of steps to transition away from a state and then move back to it. Therefore, we can say that this chain is periodic.
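The following sketch simulates this random walk for assumed illustrative values $N = 10$ and $p = 0.5$, showing that the chain is eventually absorbed at state 1 or state $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10, 0.5          # assumed illustrative values, not from the text

def run_walk(start, max_steps=10_000):
    """Simulate the random walk until absorption at state 1 or N (or max_steps)."""
    state = start
    for step in range(max_steps):
        if state in (1, N):          # absorbing boundary states
            return state, step
        state += 1 if rng.random() < p else -1
    return state, max_steps

# Run many walks from the middle and record where each one is absorbed.
absorbed = [run_walk(start=5)[0] for _ in range(1_000)]
print("fraction absorbed at N:", np.mean(np.array(absorbed) == N))
```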
We now introduce the concept of a stationary distribution, which is a fundamental concept in MCMC. The stationary distribution describes the limiting behavior of the Markov chain.
Definition 6 (Stationarity, [19]). Let $s$ be a row vector describing a discrete probability distribution, meaning that $s_i \geq 0$ for all $i$ and $\sum_i s_i = 1$. We say that $s$ is a stationary distribution for a Markov chain with transition matrix $Q$ if $sQ = s$.

Let $S$ be the finite state space of a Markov chain described by transition matrix $Q$. By the definition of row vector–matrix multiplication, we can say that $sQ = s$ means
$$\sum_{i \in S} s_i q_{ij} = s_j \quad \text{for all } j \in S.$$
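The condition $sQ = s$ says that $s$ is a left eigenvector of $Q$ with eigenvalue 1. A minimal sketch, reusing the assumed weather-chain matrix from the earlier example, computes $s$ this way and checks the condition numerically.

```python
import numpy as np

# Assumed illustrative transition matrix (the hypothetical weather chain above).
Q = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.4, 0.4],
])

# s Q = s means s is a left eigenvector of Q with eigenvalue 1,
# i.e., a (right) eigenvector of Q transpose.
eigvals, eigvecs = np.linalg.eig(Q.T)
s = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
s = s / s.sum()                       # normalize so the entries sum to 1

print("stationary distribution s:", s)
print("s Q == s ?", np.allclose(s @ Q, s))
```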
Proposition 3 ([20]). For any irreducible Markov chain on a finite state space, there exists a unique stationary distribution.

This proposition is a corollary of the Perron–Frobenius theorem, which states that for any non-negative matrix $Q$ whose rows sum to 1, if for any $i, j$ there exists a $k$ so that the $(i,j)$ cell of $Q^k$ is positive, then 1 is the largest eigenvalue of $Q$, and the corresponding eigenvector has all positive entries. Knowing when the stationary distribution for a Markov chain exists is important for designing proper Markov chain simulations. We would also like to know whether we can asymptotically approach the stationary distribution. The following theorem formalizes this.
Theorem 1 (Convergence of Markov chains, [20]). Let $X_0, X_1, X_2, \ldots$ be a Markov chain with stationary distribution $s$ and transition matrix $Q$. If the Markov chain is both irreducible and aperiodic, then $Q^n$ converges to a matrix in which every row is $s$.

This tells us that the marginal distribution of $X_n$ converges to $s$ as $n \to \infty$. This is important for MCMC simulation. As we will see later, the point of MCMC simulation is to construct a Markov chain that has a desired stationary distribution so that we can sample from it by observing the long-run samples of the chain. We would like to know that the Markov chains we are designing actually approach our desired stationary distributions.
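A sketch illustrating Theorem 1 numerically, again with the assumed weather-chain matrix: as $n$ grows, every row of $Q^n$ approaches the same vector $s$, so the marginal distribution $vQ^n$ forgets the starting distribution $v$.

```python
import numpy as np

# Same assumed, irreducible and aperiodic transition matrix as before.
Q = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.4, 0.4],
])

# Print powers of Q; all rows converge to the stationary distribution s,
# so v @ Q^n tends to s regardless of the starting distribution v.
for n in (1, 5, 20, 50):
    Qn = np.linalg.matrix_power(Q, n)
    print(f"n = {n:2d}, rows of Q^n:\n{np.round(Qn, 4)}")
```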
Now, we introduce the concept of reversibility. Reversibility has the nice benefit of making the check for stationarity much simpler.
Definition 7 (Reversibility, [23]). A Markov chain is called reversible if the joint distributions of $(X_0, X_1, \ldots, X_n)$ and $(X_n, X_{n-1}, \ldots, X_0)$ are the same. That is, the joint probability of $(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n)$ is equal to the probability of $(X_0 = x_n, X_1 = x_{n-1}, \ldots, X_n = x_0)$.
In the case where the state space is countable, reversibility is defined as follows [20]: given a row vector probability distribution $s$ over the states in a state space $S$ and a transition matrix $Q$ describing a Markov chain on those states, we say that the chain is reversible with respect to $s$ if the reversibility or detailed balance equation holds:
$$s_i q_{ij} = s_j q_{ji} \quad \text{for all } i, j \in S.$$
Corollary 1 (Detailed balance test, [19]). If a Markov chain is reversible with respect to a distribution row vector $s$, then $s$ is a stationary distribution of that chain.

Formally, summing the detailed balance equation over $i$ gives $\sum_i s_i q_{ij} = \sum_i s_j q_{ji} = s_j \sum_i q_{ji} = s_j$, which is exactly $sQ = s$. To give some intuition for this corollary, imagine conducting an empirical experiment on the behavior of a Markov chain. Imagine running many copies of the chain, with each copy represented as a dot on a map of states laid out on a table. We can imagine the density of dots at each state as a probability distribution, and imagine that a vector $s$ describes these counts of dots. If the detailed balance equation held for our experiment, that would mean that for all $i$ and $j$, the number of dots moving from state $i$ to $j$ is the same as the number of dots moving from $j$ to $i$. Thus, the relative densities at states $i$ and $j$ do not change at each step, so we expect $s$ to be the same at the next step.
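To make the detailed balance test concrete, the sketch below builds a small birth–death style chain (the specific probabilities are illustrative assumptions), verifies $s_i q_{ij} = s_j q_{ji}$ for all pairs of states, and confirms that $s$ is then stationary.

```python
import numpy as np

# Assumed illustrative birth-death chain on states {0, 1, 2, 3}:
# from state i, move up with probability 0.3, down with probability 0.2,
# and stay put otherwise (the boundary states keep the leftover mass).
up, down = 0.3, 0.2
n = 4
Q = np.zeros((n, n))
for i in range(n):
    if i + 1 < n:
        Q[i, i + 1] = up
    if i - 1 >= 0:
        Q[i, i - 1] = down
    Q[i, i] = 1.0 - Q[i].sum()        # remaining mass stays at i

# For a birth-death chain, detailed balance forces s_{i+1} / s_i = up / down.
s = np.array([(up / down) ** i for i in range(n)])
s = s / s.sum()

# Check the detailed balance equation s_i q_ij = s_j q_ji for all pairs ...
balanced = np.allclose(s[:, None] * Q, (s[:, None] * Q).T)
# ... and that s is therefore stationary: s Q = s.
print("detailed balance holds:", balanced)
print("s is stationary:", np.allclose(s @ Q, s))
```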