Greedy Algorithms for Optimal Distribution Approximation

The approximation of a discrete probability distribution $\mathbf{t}$ by an $M$-type distribution $\mathbf{p}$ is considered. The approximation error is measured by the informational divergence $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, which is an appropriate measure, e.g., in the context of data compression. Properties of the optimal approximation are derived and bounds on the approximation error are presented, which are asymptotically tight. It is shown that $M$-type approximations that minimize either $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, or $\mathbb{D}(\mathbf{p}\Vert\mathbf{t})$, or the variational distance $\Vert\mathbf{p}-\mathbf{t}\Vert_1$ can all be found by using specific instances of the same general greedy algorithm.


I. INTRODUCTION
In this work, we consider finite precision representations of probabilistic models. More precisely, if the original model, or target distribution, has n non-zero mass points and is given by t := (t 1 , . . . , t n ), we wish to approximate it by a distribution p := (p 1 , . . . , p n ), where for every i, p i = c i /M for some non-negative integer c i ≤ M . The distribution p is called an M -type distribution, and the positive integer M ≥ n is the precision of the approximation. The problem is non-trivial, since computing the numerator c i by rounding M t i to the nearest integer in general fails to yield a distribution.
M -type approximations have many practical applications, e.g., in political apportionments, M seats in a parliament need to be distributed to n parties according to the result of some vote t. This problem led, e.g., to the development of multiplier methods [1]. In communications engineering, example applications are finite precision implementations of probabilistic data compression [2], distribution matching [3], and finite-precision implementations of Bayesian networks [4], [5]. In all of these applications, the M -type approximation p should be close to the target distribution t in the sense of an appropriate error measure. Common choices for this approximation error are the variational distance and the informational divergences: where log denotes the natural logarithm. Variational distance and informational divergence (1b) have been considered by Reznik [6] and Böcherer [7], respectively, who presented algorithms for optimal M -type approximation and developed bounds on the approximation error. In a recent manuscript [8], we extended the existing works on (1a) and (1b) to target distributions with infinite support (n = ∞) and refined the bounds from [6], [7].
In this work, we focus on the approximation error (1c). It is an appropriate cost function for data compression [9,Thm. 5.4.3] and seems apropriate for the approximation of parameters in Bayesian networks (see Sec. IV). Nevertheless, to the best of the authors' knowledge, the characterization of M -type approximations minimizing D(t p) has not received much attention in literature so far.
Our contributions are as follows. In Sec. II, we present an efficient greedy algorithm to find M -type distributions minimizing (1c). We then discuss in Sec. III the properties of the optimal M -type approximation and bound the approximation error (1c). Our bound incorporates a reverse Pinsker inequality recently suggested in [10,Thm. 7]. The algorithm we present is an instance of a greedy algorithm similar to steepest ascent hill climbing [11,Ch. 2.6]. As a byproduct, we unify this work with [6]- [8] by showing that also the algorithms optimal w.r.t. variational distance (1a) and informational divergence (1b) are instances of the same general greedy algorithm, see Sec. II.

II. GREEDY OPTIMIZATION
In this section, we define a class of problems that can be optimally solved by a greedy algorithm. Consider the following example: Example 1. Suppose there are n queues with jobs, and you have to select M jobs minimizing the total time spent. A greedy algorithm suggests to select always the job with the shortest duration, among the jobs that are at the front of their queues. If the jobs in each queue are ordered by increasing duration, then this greedy algorithm is optimal.
We now make this precise: Let M be a positive integer, e.g., the number of jobs that have to be completed, and let δ i : N → R, i = 1, . . . , n, be a set of functions, e.g., δ i (k) is the duration of the k-th job in the i-th queue. Let furthermore c 0 := (c 1,0 , . . . , c n,0 ) ∈ N n 0 be a pre-allocation, representing a constraint that has to be fulfilled (e.g., in each queue at least one job has to be completed) or a chosen initialization. Then, the goal is to minimize i.e., to find a final allocation c := (c 1 , · · · , c n ) satisfying, for all i, c i ≥ c i,0 and c 1 = M . A greedy method to obtain such a final allocation is presented in Algorithm 1. We show in Appendix A that this algorithm is optimal if the functions δ i satisfy certain conditions: Remark 2 connects Algorithm 1 to steepest ascent hill climbing [11, Ch. 2.6] with fixed step size and a constrained number of M steps. Hill climbing is optimal for convex problems, suggesting an interesting connection with Proposition 1.
We now show that instances of Algorithm 1 can find Mtype approximations p minimizing each of the cost functions in (1). Noting that p i = c i /M for some non-negative integer c i , we can rewrite the cost functions as follows: Ignoring constant terms, these cost functions are all instances of Remark 2 for convex functions f i : R → R (see Table I).
Hence, the three different M -type approximation problems set up by (1) can all be solved by instances of Algorithm 1, for a trivial pre-allocation c 0 = 0 and after taking M steps: The final allocation c simply defines the M -type approximation by p i = c i /M . For variational distance optimal approximation, we showed in [8,Lem. 3] that every optimal M -type approximation satisfies p i ≥ ⌊M t i ⌋/M , hence one may speed up the algorithm by pre-allocating c i,0 = ⌊M t i ⌋. We furthermore show in Lemma 1 below that the support of the optimal Mtype approximation in terms of (1c) equals the support of t (if M is large enough). Assuming that t is positive, one can pre-allocate the algorithm with c i,0 = 1. We summarize these instantiations of Algorithm 1 in Table I. This list of instances of Algorithm 1 minimizing information-theoretic or probabilistic cost functions can be extended. For example, the χ 2 -divergences χ 2 (t||p) and χ 2 (p||t) can also be minimized, since the functions inside the respective sums are convex. However, Rényi divergences of orders α = 1 cannot be minimized by applying Algorithm 1.

III. M -TYPE APPROXIMATION MINIMIZING D(t p):
PROPERTIES AND BOUND As shown in the previous section, Algorithm 1 presents a minimizer of the problem min p D(t p) if instantiated according to Table I. Let us call this minimizer t a . Recall that t is positive and that M ≥ n. The support of t a must be at least as large as the support of t, since otherwise D(t t a ) = ∞. Note further that the costs δ i (k) are negative if t i > 0 and zero if t i = 0; hence, if t i = 0, the index i cannot be chosen by Algorithm 1, thus also t a i = 0. This proves Lemma 1. If M ≥ n, the supports of t and t a coincide, i.e., t i = 0 ⇔ t a i = 0. The assumption that t is positive and that M ≥ n hence comes without loss of generality. In contrast, neither variational distance nor informational divergence (1b) require M ≥ n: As we show in [8], the M -type approximation problem remains interesting even if M < n.
To show that informational divergence vanishes for M → ∞, assume that M > 1/t i for all i. Hence, the variational distance optimal approximation t vd has the same support as t, which ensures that D(t t vd ) < ∞. By similar arguments as in the proof of [8,Prop. 4.1)], we obtain We now develop an upper bound on D(t t a ) that holds for every M . To this end, we first approximate t by a distribution t * in P M := {p : ∀i: p i ≥ 1/M, p 1 = 1} that minimizes D(t t * ). If t * is unique, then it is called the reverse Iprojection [12, Sec. I.A] of t onto P M . Since t * ∈ P M , its variational distance optimal approximation t vd has the same support as t, which allows us to bound D(t t a ) by D(t t vd ).
If, for all i, t i > 1/M , then t ∈ P M , hence t * = t is feasible and ν(M ) = 1. One can show that ν(M ) decreases with M .
Proof: See Appendix C. The first term on the right-hand side of (8) accounts for the error caused by first approximating t by t * (in the sense of Lemma 2). The second term then accounts for the additional error caused by the M -type approximation of t * . The bound incorporates the reverse Pinsker inequality [10,Thm. 7], see Appendix C. If M > t i for every i, hence t ∈ P M , then ν(M ) = 1 and only the second term remains. For large M , (8) yields better results than (5), justifying this bound on the whole range of M . We illustrate the bounds for an example in Fig. 1. IV. OUTLOOK A possible application for our algorithms is the M -type approximation of Markov models, i.e., approximating the transition matrix T of an n-state, irreducible Markov chain with invariant distribution vectors µ by a transition matrix P containing only M -type probabilities. Generalizing (1c), the approximation error can be measured by the informational divergence rate [13] D(T P) := n i,j=1 The optimal M -type approximation is found by applying the instance of Algorithm 1 to each row separately, and Lemma 1 ensures that the transition graph of P equals that of T, i.e., the approximating Markov chain is irreducible. Future work shall extend this analysis to hidden Markov models and should investigate the performance of these algorithms in practical scenarios, e.g., in speech processing.
Another possible application is the approximation of Bayesian network parameters. The authors of [4] approximated the true parameters using a stationary multiplier method from [14]. Since rounding probabilities to zero led to bad classification performance, they replaced zeros in the approximating distribution afterwards by small values; thus, they also approximated true zero probabilities by positive ones. We believe that these problems can be removed by instantiating Algorithm 1 for cost (1c). This automatically prevents approximating non-zero probabilities with zeros and vice-versa, see Lemma 1.
Recent work suggested rounding log-probabilities, i.e., to approximate log t i by log p i = −c i /M for a non-negative integer c i [5]. Finding an optimal approximation that corresponds to a true distribution is equivalent to solving min d(t, p) where d(·, ·) denotes any of the considered cost functions (1). If M = 1 and d(t, p) = D(t p) using the binary logarithm, the constraint translates to the requirement that t is approximated by a complete binary tree. Then, the optimal approximation is the Huffman code for t.
Clearly, U (c) cannot be smaller than the sum over all elements in D M . Since the δ i are non-decreasing, there exists at least one final allocation c that takes successively the first c i values from each queue i, i.e., D M = {δ 1 (1), . . . , δ 1 (c 1 ), . . . , δ n (1), . . . , δ n (c n )} satisfies (10). This shows that the lower bound induced by (10) can actually be achieved.
We prove the optimality of Algorithm 1 by contradiction: Assume that Algorithm 1 finishes with a final allocationc such that U (c) is strictly larger than the (unique) sum over all elements in (non-unique) D M . Hence,c must exchange at least one of the elements in D M for an element that is strictly larger. Thus, by the properties of the functions δ i and Algorithm 1, there must be indices ℓ and m such thatc ℓ > c ℓ , c m < c m , and δ ℓ (c ℓ ) ≥ δ ℓ (c ℓ + 1) > δ m (c m ) ≥ δ m (c m ). At each iteration of the algorithm, the current allocation at index m satisfies k m ≤c m < c m . Since δ m (c m ) < δ ℓ (c ℓ + 1), δ ℓ (c ℓ + 1) can never be a minimal element, and hence is not chosen by Algorithm 1. This contradicts the assumption that Algorithm 1 finishes with ac such that U (c) is strictly larger than the sum of D's M smallest values.

B. Proof of Lemma 2
The problem finding a t * ∈ P M minimizing D(t t * ) is equivalent to finding an optimal point of the problem: