1. Introduction
Implementation of Markov chain Monte Carlo (MCMC) algorithms often requires sampling from nonstandard probability distributions. In this article, we describe a method for (1) approximating the conditional distribution of independent binomial random variables given their sum, or (2) simulating a value from this conditional distribution. We develop and compare two MCMC algorithms for accomplishing this.
This study was motivated by an imputation problem using the data augmentation algorithm [1]. Ayres et al. [2] proposed use of the data augmentation algorithm to impute county-level vaccination frequencies in the United States. Obtaining these imputed values requires simulating from the distribution of independent binomial random variables conditioned on their sum, which is known. The method proposed in this paper has direct application to the imputation step of the data augmentation algorithm for the vaccine imputation problem. The problem of approximating the reproduction number of an infectious disease also requires the simulation of independent binomials conditioned on the sum.
Let $X_i \sim \text{Binomial}(n_i, p_i)$, $i = 1, \dots, k$, be independent random variables and let $n = \sum_{i=1}^{k} X_i$ denote the sum. The goal is to sample from the conditional distribution of $\mathbf{X} = (X_1, \dots, X_k)$ given $\sum_{i=1}^{k} X_i = n$. The Metropolis–Hastings (MH) algorithm [3,4] can be used to sample from the conditional probability mass function (PMF) $f(\mathbf{x} \mid n)$. The algorithm creates a Markov chain whose steady state distribution is the desired distribution. General methods that include the MH algorithm are called Markov chain Monte Carlo (MCMC) methods. The MH algorithm went largely unnoticed in the statistical community until Geman and Geman [5] applied it to a Bayesian analysis of image restoration. While most applications of the MH algorithm, or MCMC in general, involve the approximation of a posterior distribution in a Bayesian analysis, the method can be used to sample from almost any probability distribution. The MCMC revolution of the late 1980s opened the door to almost any Bayesian analysis, provided we could accept a collection of simulated observations from the posterior distribution in lieu of an analytic expression for the posterior. Gilks et al. [6] was influential in explaining the theory of MCMC and giving a number of applications. See [7] for a modern treatment of the various MCMC algorithms available today. For our situation, we apply the MH algorithm in two ways to simulate from the conditional distribution of binomial random variables given the sum.
The MH algorithm begins with a plausible value for the vector $\mathbf{x} = (x_1, \dots, x_k)$, in the sense that $0 \le x_i \le n_i$ for all $i$ and $\sum_{i=1}^{k} x_i = n$. We then propose a move to another state by adding 1 to one randomly selected component of $\mathbf{x}$ and subtracting 1 from another. Given the current state of the Markov chain, we select one ordered pair of components with the condition that we can subtract 1 from the first and add 1 to the second and reach another valid state. Note that for some vectors, there may be some components that are 0, in which case we cannot subtract 1, and there may be others that equal $n_i$, in which case we cannot add 1.
A related problem is that if $X_i \sim \text{Poisson}(\lambda_i)$ independently, then conditioned on $\sum_{i=1}^{k} X_i = n$ the vector $(X_1, \dots, X_k)$ is multinomial with size $n$ and probabilities $\pi_i = \lambda_i / \lambda$, $i = 1, \dots, k$, where $\lambda = \sum_{i=1}^{k} \lambda_i$. To see this, write the conditional as the ratio of the joint PMF and the marginal of the given, $\sum_{i=1}^{k} X_i \sim \text{Poisson}(\lambda)$:
$$P\left(X_1 = x_1, \dots, X_k = x_k \,\middle|\, \sum_{i=1}^{k} X_i = n\right) = \frac{\prod_{i=1}^{k} e^{-\lambda_i} \lambda_i^{x_i} / x_i!}{e^{-\lambda} \lambda^{n} / n!} = \binom{n}{x_1, \dots, x_k} \prod_{i=1}^{k} \pi_i^{x_i}, \tag{1}$$
which is the PMF for the multinomial distribution and has constraints $x_i \ge 0$ for all $i$ and $\sum_{i=1}^{k} x_i = n$.
If the successes in the various Bernoulli trials that give rise to the $X_i$ have small probabilities $p_i$ and the numbers of trials $n_i$ are large, then the binomial distribution is well approximated by the Poisson; that is, if $X_i \sim \text{Binomial}(n_i, p_i)$, then $X_i \mathrel{\dot\sim} \text{Poisson}(n_i p_i)$. Thus, as an approximation, we can say that
$$(X_1, \dots, X_k) \,\Big|\, \sum_{i=1}^{k} X_i = n \;\mathrel{\dot\sim}\; \text{Multinomial}(n, \boldsymbol{\pi}), \qquad \pi_i = \frac{n_i p_i}{\sum_{j=1}^{k} n_j p_j}. \tag{2}$$
We propose two applications of the MH algorithm with a starting value obtained using the Poisson approximation in (2).
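As a concrete illustration of (2), the following sketch draws a starting vector from the Poisson-approximation multinomial. All parameter values (`ns`, `ps`, and the sum `n`) are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
# A minimal sketch of drawing a starting value from the multinomial
# approximation in (2). All parameter values are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

ns = np.array([10, 20, 15, 30])          # binomial sizes n_i (assumed)
ps = np.array([0.10, 0.05, 0.20, 0.15])  # success probabilities p_i (assumed)
n = 8                                    # the observed sum we condition on

lam = ns * ps                  # Poisson means n_i * p_i
pi = lam / lam.sum()           # cell probabilities pi_i in (2)

x0 = rng.multinomial(n, pi)    # approximate draw from X given the sum
print(x0, x0.sum())            # components always sum to n
```

Note that a multinomial draw can occasionally exceed a component's upper bound $n_i$; Section 5 discusses how such infeasible vectors are handled.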
The details of the random walk MH algorithm require a careful enumeration of the number of neighbors of each possible state in the finite Markov chain. In the next section, we address this combinatorics problem for the random walk MH algorithm.
The MH algorithm begins with an initial plausible value from the target distribution, which in our case is the conditional distribution of the vector $\mathbf{X} = (X_1, \dots, X_k)$ conditioned on the sum $\sum_{i=1}^{k} X_i = n$. Call this initial value $\mathbf{x}^{(0)}$. We then propose a move to a new value, called $\mathbf{x}^{*}$. This proposal distribution, which can depend on the current state $\mathbf{x}^{(t)}$ of the Markov chain, is denoted $q(\mathbf{x}^{*} \mid \mathbf{x}^{(t)})$. This move is accepted with probability
$$\alpha = \min\left\{ 1, \; \frac{f(\mathbf{x}^{*} \mid n)\, q(\mathbf{x}^{(t)} \mid \mathbf{x}^{*})}{f(\mathbf{x}^{(t)} \mid n)\, q(\mathbf{x}^{*} \mid \mathbf{x}^{(t)})} \right\}.$$
The next value of $\mathbf{X}$, called $\mathbf{x}^{(t+1)}$, is
$$\mathbf{x}^{(t+1)} = \begin{cases} \mathbf{x}^{*} & \text{with probability } \alpha, \\ \mathbf{x}^{(t)} & \text{with probability } 1 - \alpha. \end{cases}$$
Thus, we either take the proposal $\mathbf{x}^{*}$ (with probability $\alpha$) or we stay at the current state. This process then continues successively until the desired number of iterations, typically in the thousands to hundreds of thousands, has been achieved. The theory says that the steady state distribution of the Markov chain is the target distribution $f(\mathbf{x} \mid n)$. We should continue this iterative method until we believe we have reached the steady state distribution; this initial period is called the burn-in. If we want a single sample from the distribution of $\mathbf{X}$ given $\sum_{i=1}^{k} X_i = n$, we can simulate one additional step. If we are interested in the target distribution in full, we can simulate an additional large number of steps and use the resulting proportions as our estimate for the target distribution.
We describe two types of the MH algorithm. One is a random walk, where a move is proposed from the current state by increasing the count by one in one component and simultaneously decreasing it by one in another. The second type is an independence sampler, whereby the proposal made at each step is independent of the current state vector. In most cases where the MH algorithm is applied, the random walk works better, but in this circumstance the independence sampler seems to reach the steady state more quickly and is the preferred method.
Section 2 describes the random walk MH algorithm, and Section 3 gives a concrete example. Section 4 describes the independence sampler MH algorithm and Section 5 presents an example. Conclusions are given in Section 6.
2. Random Walk MH Algorithm
In our implementation, we propose to select one component of $\mathbf{x}$ and subtract 1 from it, and another component and add 1 to it. We must be careful, though, because some components of $\mathbf{x}$ may be 0, in which case we cannot select one of them to subtract 1. Also, some components may achieve the maximum possible value, $n_i$, in which case we cannot add 1. A careful counting of the possibilities is needed.
For example, consider the simple problem of $k = 3$ and $n = 5$. We want to sample from the conditional distribution of $(X_1, X_2, X_3)$ conditioned on $X_1 + X_2 + X_3 = 5$. Here each $n_i \ge 5$, so the upper bounds impose no additional restriction and every nonnegative integer solution is feasible. The number of nonnegative integer solutions of $x_1 + x_2 + x_3 = 5$ is equal to $\binom{7}{2} = 21$. This uses the familiar “stars and bars” method of counting the number of nonnegative integer solutions to such problems; see Section 6.5 of [8]. The 21 solutions are
005 104 014 203 113 023 302 212 122 032 401
311 221 131 041 500 410 320 230 140 050
We must be careful when counting the number of possible moves from a given state, because no component can go below 0 or above $n_i$. For example, if our current state is 023, we cannot choose the first component to subtract 1, because that would drop us below 0. We could add 1 to the first component and subtract 1 from either of the other components. A little analysis shows that the following states are reachable from 023 in one step:
014 113 122 032
By contrast, if our current state is 122, there are no restrictions on which components we could choose; in this case we could choose any two components and subtract 1 from one and add 1 to the other. The states
113 212 221 131 032 023
are reachable from 122. Even though 023 and 122 are reachable from one another, the probability of going from 023 to 122 is 1/4, while the probability of going from 122 to 023 is 1/6. This asymmetry must be accounted for in the MH algorithm. The graph of all 21 states with edges indicating reachability in one step is shown in the left panel of Figure 1.
The middle and right panels of Figure 1 are the graphs for other scenarios where some of the components have upper restrictions. These nodes lie on a subset of the $(k-1)$-dimensional simplex.
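The reachability claims above are easy to check by brute force. The following sketch (our own illustration, not code from the paper) enumerates the neighbors of the states 023 and 122 for the $k = 3$, $n = 5$ example.

```python
# A brute-force check (our own illustration, not code from the paper) of
# the neighbor lists quoted above for the k = 3, n = 5 example.
from itertools import permutations

def neighbors(x, sizes):
    """All states reachable by subtracting 1 from one component and
    adding 1 to another while keeping every component in [0, n_i]."""
    out = []
    for i, j in permutations(range(len(x)), 2):  # (subtract from i, add to j)
        if x[i] > 0 and x[j] < sizes[j]:
            y = list(x)
            y[i] -= 1
            y[j] += 1
            out.append(tuple(y))
    return out

sizes = (5, 5, 5)                      # n_i >= 5, so upper bounds never bind
print(neighbors((0, 2, 3), sizes))     # 4 neighbors -> each proposed w.p. 1/4
print(neighbors((1, 2, 2), sizes))     # 6 neighbors -> each proposed w.p. 1/6
```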
Define the following:
$$k = \text{the number of components of } \mathbf{x}, \qquad \ell = \#\{\, i : 0 < x_i < n_i \,\}, \qquad m = \#\{\, i : x_i = 0 \,\}. \tag{5}$$
The number of components of $\mathbf{x}$ that are “maxed out”, i.e., equal to $n_i$, is equal to $k - \ell - m$.
To illustrate the counting methods required to determine the number of possible moves from a given state, i.e., the number of neighbors in the network, suppose the current state is a vector $\mathbf{x}$ with $k$ components, of which $\ell$ satisfy $0 < x_i < n_i$ and $m$ are equal to 0. The number of possible moves by selecting two components satisfying $0 < x_i < n_i$, and then selecting one of the two components to subtract 1, is
$$N_1 = 2\binom{\ell}{2} = \ell(\ell - 1).$$
The number of possible moves obtained by selecting one component that is equal to 0 and one that is greater than 0 is
$$N_2 = m(k - m).$$
Note that once the two components are selected, there is only one way to move to a valid state: add 1 to the component that was 0, and subtract 1 from the other component.
The last way to select a move is to select one of the components that is maxed out (i.e., equal to $n_i$), and one that is not maxed out and also not equal to 0. (We must disallow selecting a 0 for the second choice; otherwise this would be a move covered by case 2.) Again, we have no choice but to subtract 1 from the maxed out component and add 1 to the component that is not maxed out. This can be done in
$$N_3 = (k - \ell - m)\,\ell$$
ways. Thus, the number of possible moves from $\mathbf{x}$ is
$$N = N_1 + N_2 + N_3 = \ell(\ell - 1) + m(k - m) + (k - \ell - m)\,\ell.$$
The next theorem gives an enumeration of the number of possible moves in general.
Theorem 1. The number of possible moves given the current state $\mathbf{x}$ is
$$N(\mathbf{x}) = \ell(\ell - 1) + m(k - m) + (k - \ell - m)\,\ell.$$
Proof. Suppose the current state is $\mathbf{x}$ with $k$, $\ell$, and $m$ as defined in (5). The states that can be reached in one step from $\mathbf{x}$ are those obtained by exactly one of the following rules:
1. Select two among the $\ell$ components that are strictly between 0 and $n_i$. Within the two selected, choose one to subtract 1; then add 1 to the other. (Note: there are no restrictions on subtracting or adding 1 to any of these components because they all satisfy $0 < x_i < n_i$.)
2. Select one component among the $m$ zeros (i.e., values of $x_i$ that are equal to 0). Add 1 to this component, and then select one among the $k - m$ nonzero components to subtract 1.
3. Select one component among the $k - \ell - m$ that are maxed out (i.e., values of $x_i$ that are equal to $n_i$). Subtract 1 from this component and then add 1 to one of the $\ell$ components that are not maxed out and also not 0, thus avoiding double counting here and in case 2.
Other ways of selecting two components from among the $k$ do not lead to any reachable states. For example, we cannot select two components that are equal to 0. If we did this, we could not subtract 1 from either component. Similarly, we cannot select two components that are maxed out, since we would be unable to add 1 to either.
Once we recognize the three ways to select two components that lead to another reachable state, we can count them directly since they all involve two steps. The number of ways of applying the first rule is
$$2\binom{\ell}{2} = \ell(\ell - 1).$$
The number of ways to apply the second rule is
$$m(k - m).$$
Finally, the number of ways to apply rule three is
$$(k - \ell - m)\,\ell.$$
To get to a reachable state from $\mathbf{x}$ we must apply one of the three rules, so we conclude that the number of reachable states is
$$N(\mathbf{x}) = \ell(\ell - 1) + m(k - m) + (k - \ell - m)\,\ell.$$
□
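As a sanity check on Theorem 1, the following sketch compares the closed-form count with a brute-force enumeration over every feasible state; the size vector used here is a hypothetical test case of our own.

```python
# A numerical sanity check of Theorem 1 (our own test, with hypothetical
# sizes): the brute-force neighbor count must match the closed form
# l(l-1) + m(k-m) + (k-l-m)l at every feasible state.
from itertools import permutations, product

def neighbor_count(x, sizes):
    """Count moves that subtract 1 from one component and add 1 to another."""
    return sum(1 for i, j in permutations(range(len(x)), 2)
               if x[i] > 0 and x[j] < sizes[j])

def theorem_count(x, sizes):
    """The closed-form count N(x) from Theorem 1."""
    k = len(x)
    l = sum(0 < xi < ni for xi, ni in zip(x, sizes))
    m = sum(xi == 0 for xi in x)
    return l * (l - 1) + m * (k - m) + (k - l - m) * l

sizes, n = (2, 3, 1, 4), 6   # hypothetical upper bounds n_i and sum n
states = [x for x in product(*(range(ni + 1) for ni in sizes))
          if sum(x) == n]
assert all(neighbor_count(x, sizes) == theorem_count(x, sizes) for x in states)
print(f"Theorem 1 verified on all {len(states)} feasible states")
```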
The algorithm for selecting one possible move from the state $\mathbf{x}$, with all possible moves having equal probability, is given as Algorithm 1 below.
Algorithm 1: Propose a Move to a Neighbor
Input: A state vector $\mathbf{x}$ with the properties that $0 \le x_i \le n_i$ for all $i$ and $\sum_{i=1}^{k} x_i = n$. Output: One feasible state vector $\mathbf{x}^{*}$ selected at random, i.e., with equal probability, from the set of neighbors of $\mathbf{x}$.
1: Compute $\ell$, $m$, and $k - \ell - m$, defined in Theorem 1. Let $N = N_1 + N_2 + N_3$, where $N_1 = \ell(\ell - 1)$, $N_2 = m(k - m)$, and $N_3 = (k - \ell - m)\,\ell$.
2: Simulate $u \sim \text{Uniform}(0, 1)$.
2a: If $u < N_1/N$, then select two components at random from among the members of the set $\{\, i : 0 < x_i < n_i \,\}$. Select one of these two with probability 0.5 and subtract 1; add 1 to the other. Call the result $\mathbf{x}^{*}$.
2b: If $N_1/N \le u < (N_1 + N_2)/N$, then randomly select one component of $\mathbf{x}$ from among the $m$ components in the set $\{\, i : x_i = 0 \,\}$. Add 1 to this component. Select at random one component from among those in $\{\, i : x_i > 0 \,\}$; this set contains $k - m$ elements. Subtract 1 from this component. Call the result $\mathbf{x}^{*}$.
2c: If $u \ge (N_1 + N_2)/N$, then select one component at random from among the elements of the set $\{\, i : x_i = n_i \,\}$, i.e., the set of those components that are maxed out; subtract 1 from this component. Then select one component from among the $\ell$ that are not maxed out and not 0; add 1 to this component. Call the result $\mathbf{x}^{*}$.
3: Return $\mathbf{x}^{*}$.
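A Python rendering of Algorithm 1 might look as follows. This is a sketch under our own naming conventions (`propose_move`, `sizes`), not the authors' code.

```python
# A sketch of Algorithm 1 in Python. The function and variable names
# (propose_move, sizes) are our own, not the authors' code.
import numpy as np

def propose_move(x, sizes, rng):
    """Return a neighbor of the state x chosen uniformly at random."""
    x = np.asarray(x)
    sizes = np.asarray(sizes)
    k = len(x)
    between = np.flatnonzero((x > 0) & (x < sizes))  # 0 < x_i < n_i
    zeros = np.flatnonzero(x == 0)
    maxed = np.flatnonzero(x == sizes)
    l, m = len(between), len(zeros)
    n1 = l * (l - 1)          # rule 1: ordered pairs of in-between components
    n2 = m * (k - m)          # rule 2: a zero paired with a positive component
    n3 = (k - l - m) * l      # rule 3: a maxed-out with an in-between component
    u = rng.uniform() * (n1 + n2 + n3)
    if u < n1:                                   # rule 1 (step 2a)
        i, j = rng.choice(between, size=2, replace=False)
    elif u < n1 + n2:                            # rule 2 (step 2b)
        j = rng.choice(zeros)                    # add 1 to a zero component
        i = rng.choice(np.flatnonzero(x > 0))    # subtract 1 from a positive one
    else:                                        # rule 3 (step 2c)
        i = rng.choice(maxed)                    # subtract 1 from a maxed-out one
        j = rng.choice(between)                  # add 1 to an in-between one
    y = x.copy()
    y[i] -= 1
    y[j] += 1
    return y
```

Because each branch is entered with probability proportional to the number of moves it covers, and each move within a branch is chosen uniformly, every neighbor is proposed with probability $1/N(\mathbf{x})$.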
3. Example of the Random Walk MH Algorithm
Consider the case of $k$ independent binomial random variables with size parameters $n_1, \dots, n_k$ and success probabilities $p_1, \dots, p_k$. The constraint is that $\sum_{i=1}^{k} x_i = n$. Suppose that the starting vector is $\mathbf{x}^{(0)}$. For this vector, the number of components for which $0 < x_i < n_i$ is $\ell$. The number of zeros is $m$, and the number of maxed out components is $k - \ell - m = 2$ (the third and fifth components). Direct calculation of $N_1$, $N_2$, and $N_3$ then gives the number $N = N_1 + N_2 + N_3$ of neighbors of $\mathbf{x}^{(0)}$, each of which is equally likely to be selected as the proposal in the MH algorithm.
The cutoff values for selecting which kind of move to take are $N_1/N$ and $(N_1 + N_2)/N$. The first uniform random variate $u$ fell between these cutoffs, which leads to the rule for case 2b: select one 0, to which we add 1, and then select one nonzero component to subtract 1. The algorithm selected one of the zero components as the component to add 1 and one of the nonzero components as the component to subtract 1; the proposed state $\mathbf{x}^{*}$ therefore differs from $\mathbf{x}^{(0)}$ in exactly those two components. This proposed state has 60 neighbors.
The log-likelihoods at $\mathbf{x}^{(0)}$ and at the proposed vector $\mathbf{x}^{*}$ are then computed, and the acceptance probability is
$$\alpha = \min\left\{ 1, \; \frac{f(\mathbf{x}^{*} \mid n)\, q(\mathbf{x}^{(0)} \mid \mathbf{x}^{*})}{f(\mathbf{x}^{(0)} \mid n)\, q(\mathbf{x}^{*} \mid \mathbf{x}^{(0)})} \right\} = 1.$$
Therefore, this move is accepted with probability 1. We set $\mathbf{x}^{(1)} = \mathbf{x}^{*}$. The steps of the MH algorithm are then repeated with $t = 1, 2, \dots$, according to Algorithm 2.
Figure 2 shows the first 5, 50, 500, 5000, and 50,000 iterations of the random walk Markov chain. Convergence to the steady state is slow, but seems to occur by about 20,000 iterations. The slow convergence is likely due to the small changes in the proposed move from $\mathbf{x}^{(t)}$ to $\mathbf{x}^{(t+1)}$ in the MH algorithm.
Algorithm 2: Random Walk MH
Input: Vectors $\mathbf{n} = (n_1, \dots, n_k)$ and $\mathbf{p} = (p_1, \dots, p_k)$, the constraint on the sum, $n$, and the desired number $N$ of samples. Output: A sequence of vectors $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$ whose steady state distribution is the conditional distribution of $\mathbf{X}$ given $\sum_{i=1}^{k} X_i = n$.
1: Let $\mathbf{x}^{(0)}$ be any nonnegative integer solution to $\sum_{i=1}^{k} x_i = n$ with $x_i \le n_i$ for all $i$. Set $t = 0$. A good starting value is a sample from the multinomial distribution given in (2).
2: Sample a proposal $\mathbf{x}^{*}$ using Algorithm 1.
3: Compute the acceptance probability
$$\alpha = \min\left\{ 1, \; \frac{f(\mathbf{x}^{*} \mid n)\, N(\mathbf{x}^{(t)})}{f(\mathbf{x}^{(t)} \mid n)\, N(\mathbf{x}^{*})} \right\},$$
where $N(\cdot)$ is the neighbor count from Theorem 1.
4: Simulate $u \sim \text{Uniform}(0, 1)$. If $u \le \alpha$, accept the proposed move and set $\mathbf{x}^{(t+1)} = \mathbf{x}^{*}$; otherwise set $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}$.
5: Set $t = t + 1$. If $t < N$, go to step 2. Otherwise stop.
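The following sketch implements Algorithm 2, reusing the `propose_move` function from the sketch after Algorithm 1. The target log-likelihood is the sum of the component binomial log-PMFs (the conditioning constant cancels in the ratio), and the neighbor counts from Theorem 1 supply the proposal ratio. Parameter names and the feasibility guard on the starting value are our own assumptions.

```python
# A sketch of Algorithm 2 (random walk MH). It assumes propose_move from
# the previous sketch is in scope; all parameter names are our own.
import numpy as np
from scipy.stats import binom

def num_neighbors(x, sizes):
    """N(x) from Theorem 1."""
    k = len(x)
    l = int(np.sum((x > 0) & (x < sizes)))
    m = int(np.sum(x == 0))
    return l * (l - 1) + m * (k - m) + (k - l - m) * l

def random_walk_mh(sizes, probs, n, n_iter, rng):
    sizes, probs = np.asarray(sizes), np.asarray(probs)

    def loglik(v):
        # Joint binomial log-PMF; the conditioning constant cancels in ratios.
        return binom.logpmf(v, sizes, probs).sum()

    lam = sizes * probs
    x = rng.multinomial(n, lam / lam.sum())   # starting value from (2)
    while np.any(x > sizes):                  # guard: redraw if infeasible
        x = rng.multinomial(n, lam / lam.sum())
    chain = np.empty((n_iter, len(sizes)), dtype=int)
    for t in range(n_iter):
        prop = propose_move(x, sizes, rng)
        # alpha = min{1, f(x*) N(x) / (f(x) N(x*))}, computed on the log scale
        log_alpha = (loglik(prop) - loglik(x)
                     + np.log(num_neighbors(x, sizes))
                     - np.log(num_neighbors(prop, sizes)))
        if np.log(rng.uniform()) <= log_alpha:
            x = prop
        chain[t] = x
    return chain
```

Calling `random_walk_mh(ns, ps, n, 50_000, rng)` with parameters like those in the earlier sketch yields a chain whose long-run state frequencies approximate the conditional PMF.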
5. Example of the Independence Sampler MH Algorithm
Consider the same example given in Section 3, where we have $k$ binomial random variables with parameters $n_i$ and $p_i$, and the constraint is $\sum_{i=1}^{k} x_i = n$. Suppose, as before, that the starting vector is $\mathbf{x}^{(0)}$. The multinomial distribution from (2) has probabilities
$$\pi_i = \frac{n_i p_i}{\sum_{j=1}^{k} n_j p_j}, \qquad i = 1, \dots, k. \tag{7}$$
The proposed vector $\mathbf{x}^{*}$ is sampled from this multinomial distribution. The acceptance probability is then
$$\alpha = \min\left\{ 1, \; \frac{f(\mathbf{x}^{*} \mid n)\, g(\mathbf{x}^{(0)})}{f(\mathbf{x}^{(0)} \mid n)\, g(\mathbf{x}^{*})} \right\},$$
where $g$ is the multinomial proposal PMF. In this instance the move was accepted, and we set $\mathbf{x}^{(1)} = \mathbf{x}^{*}$.
It is possible for the proposed vector to be infeasible. This can occur when the proposal satisfies $x_i^{*} > n_i$ for some $i$. With the proposal from (7), the marginal distribution of the $i$th component is $\text{Binomial}(n, \pi_i)$, so a value exceeding $n_i$ is an unlikely but possible outcome whenever $n > n_i$, even though the unconditional distribution of $X_i$ is $\text{Binomial}(n_i, p_i)$, from which a value above $n_i$ is impossible. The algorithm handles an infeasible proposal by giving zero likelihood to it. In other words, if a move is proposed to an infeasible vector, the move is never accepted because the numerator in the acceptance probability is 0. For the example discussed previously in this section, we ran 1,000,000 simulations using the independence sampler and obtained zero infeasible $\mathbf{x}^{*}$ values. It is important to recognize that this possibility exists and that the algorithm does not crash when it is encountered.
If we continue this algorithm, we obtain a sequence of vectors $\mathbf{x}^{(t)}$ whose steady state distribution is the target. The first 5, 50, 500, and 5000 simulations are shown in Figure 3. Convergence to the steady state is faster with the independence sampler than with the random walk.
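A self-contained sketch of the independence sampler follows. It uses the multinomial in (2) as the proposal; infeasible proposals receive zero target probability and are therefore automatically rejected, as described above. All parameter values are hypothetical.

```python
# A self-contained sketch of the independence sampler MH. The multinomial
# in (2) serves as both starting distribution and proposal; infeasible
# proposals (some x_i > n_i) get zero target probability and are always
# rejected. All parameter values are hypothetical.
import numpy as np
from scipy.stats import binom, multinomial

def independence_mh(sizes, probs, n, n_iter, rng):
    sizes, probs = np.asarray(sizes), np.asarray(probs)
    lam = sizes * probs
    pi = lam / lam.sum()                  # proposal probabilities from (2)

    def log_f(v):
        # Unnormalized target log-PMF; -inf encodes an infeasible vector.
        if np.any(v > sizes):
            return -np.inf
        return binom.logpmf(v, sizes, probs).sum()

    x = rng.multinomial(n, pi)
    while np.any(x > sizes):              # feasible starting value
        x = rng.multinomial(n, pi)
    chain = np.empty((n_iter, len(sizes)), dtype=int)
    for t in range(n_iter):
        prop = rng.multinomial(n, pi)
        # alpha = min{1, f(x*) g(x) / (f(x) g(x*))}, g the multinomial PMF
        log_alpha = (log_f(prop) - log_f(x)
                     + multinomial.logpmf(x, n, pi)
                     - multinomial.logpmf(prop, n, pi))
        if np.log(rng.uniform()) <= log_alpha:
            x = prop
        chain[t] = x
    return chain

rng = np.random.default_rng(1)
chain = independence_mh([10, 20, 15, 30], [0.10, 0.05, 0.20, 0.15],
                        n=8, n_iter=5000, rng=rng)
print(chain[-1])   # one (approximate) draw from X given the sum
```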
6. Summary and Conclusions
This study was motivated by a problem in allocating county-level vaccine counts. Many states in the United States were diligent about recording the county of residence for each vaccine recipient, while other states had a high proportion of recipients whose county of residence was unknown. The problem, described in [2], was to allocate the state-wide unknowns to the various counties. Assigning these proportionally to the counties based on population is problematic, because a number of demographic and political variables are known to affect the willingness to get vaccinated. To address this, we applied a version of the data augmentation algorithm of Tanner and Wong [1] to simultaneously impute the missing county-level frequencies and estimate parameters in a logistic regression model for vaccine willingness. As part of the data augmentation algorithm, we are required to simulate from the distribution of independent binomials conditioned on their sum.
Users of the algorithm described in this article may have one of two goals: (1) approximating the conditional probability mass function of the vector $\mathbf{X}$ given $\sum_{i=1}^{k} X_i = n$, and (2) simulating one observation from the conditional distribution. For the first goal, the algorithm would have to be run for quite some time even after the steady state is (essentially) reached. In other applications, e.g., [2], the problem is to simulate one observation from this conditional distribution. The need arises while applying the imputation step of the data augmentation algorithm. In this latter case, the user would return a single observation, say the very last simulated value.
Users who want to approximate the joint PMF of the conditional distribution will be limited to small values of $n$ and $k$. The number of states in the Markov chain is equal to the number of points in the support of the conditional PMF of $\mathbf{X}$ given $\sum_{i=1}^{k} X_i = n$. For large $n$ and $k$, the number of states is too large to obtain a reasonable number of “visits” to each state, which is needed to approximate the conditional PMF. On the other hand, if the goal is to simulate a single observation from the conditional PMF of $\mathbf{X}$ given $\sum_{i=1}^{k} X_i = n$, the algorithm will scale well because each step of the MCMC involves simulation of just $k$ random variables. The problems that motivated this study involved the latter scenario, where we wanted a single simulation from the conditional PMF.
Algorithms similar to those described here could be used to approximate the joint distribution of any set of discrete random variables whose support is a subset of the integers. The speed of convergence of the independence sampler will depend on how close the true (unknown) distribution is to the multinomial distribution.