Article

Greedy Algorithms for Optimal Distribution Approximation

by Bernhard C. Geiger * and Georg Böcherer

Institute for Communications Engineering, Technical University of Munich, Munich 80290, Germany

* Author to whom correspondence should be addressed.
Entropy 2016, 18(7), 262; https://doi.org/10.3390/e18070262
Submission received: 14 June 2016 / Revised: 1 July 2016 / Accepted: 11 July 2016 / Published: 18 July 2016
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The approximation of a discrete probability distribution $\mathbf{t}$ by an $M$-type distribution $\mathbf{p}$ is considered. The approximation error is measured by the informational divergence $D(\mathbf{t} \,\|\, \mathbf{p})$, which is an appropriate measure, e.g., in the context of data compression. Properties of the optimal approximation are derived and bounds on the approximation error are presented, which are asymptotically tight. A greedy algorithm is proposed that solves this $M$-type approximation problem optimally. Finally, it is shown that different instantiations of this algorithm minimize the informational divergence $D(\mathbf{p} \,\|\, \mathbf{t})$ or the variational distance $\|\mathbf{p} - \mathbf{t}\|_1$.


1. Introduction

In this work, we consider finite precision representations of probabilistic models. Suppose the original model, or target distribution, has $n$ non-zero mass points and is given by $\mathbf{t} := (t_1, \dots, t_n)$. We wish to approximate it by a distribution $\mathbf{p} := (p_1, \dots, p_n)$ of which each entry is a rational number with a fixed denominator. In other words, for every $i$, $p_i = c_i/M$ for some non-negative integer $c_i \le M$. The distribution $\mathbf{p}$ is called an $M$-type distribution, and the positive integer $M \ge n$ is the precision of the approximation. The problem is non-trivial, since computing the numerator $c_i$ by rounding $M t_i$ to the nearest integer in general fails to yield a distribution.
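For illustration, a short Python sketch (with a made-up target distribution, not taken from the text) shows why naive rounding fails:

```python
# Naive rounding of M*t_i need not yield a distribution: the rounded
# numerators c_i may fail to sum to M. (Illustrative values only.)
t = [0.4, 0.2, 0.2, 0.2]  # target distribution
M = 7                     # desired precision
c = [round(M * ti) for ti in t]
print(c, sum(c))          # numerators sum to 6, not 7, so (c_i / M) is not a distribution
```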
M-type approximations have many practical applications. In political apportionment, for example, $M$ seats in a parliament need to be distributed to $n$ parties according to the result of some vote $\mathbf{t}$; this problem led, among others, to the development of multiplier methods [1]. In communications engineering, example applications are finite precision implementations of probabilistic data compression [2], distribution matching [3], and finite-precision implementations of Bayesian networks [4,5]. In all of these applications, the $M$-type approximation $\mathbf{p}$ should be close to the target distribution $\mathbf{t}$ in the sense of an appropriate error measure. Common choices for this approximation error are the variational distance and the informational divergences:
$$\|\mathbf{p} - \mathbf{t}\|_1 := \sum_{i=1}^n |p_i - t_i| \tag{1a}$$
$$D(\mathbf{p} \,\|\, \mathbf{t}) := \sum_{i\colon p_i > 0} p_i \log \frac{p_i}{t_i} \tag{1b}$$
$$D(\mathbf{t} \,\|\, \mathbf{p}) := \sum_{i\colon t_i > 0} t_i \log \frac{t_i}{p_i} \tag{1c}$$
where log denotes the natural logarithm.
Variational distance Equation (1a) and informational divergence Equation (1b) have been considered by Reznik [6] and Böcherer [7], respectively, who presented algorithms for optimal $M$-type approximation and developed bounds on the approximation error. In a recent manuscript [8], we extended the existing works on Equation (1a,b) to target distributions with infinite support ($n = \infty$) and refined the bounds from [6,7].
In this work, we focus on the approximation error Equation (1c). It is an appropriate cost function for data compression [9] (Theorem 5.4.3) and seems appropriate for the approximation of parameters in Bayesian networks (see Section 4). Nevertheless, to the best of the authors’ knowledge, the characterization of $M$-type approximations minimizing $D(\mathbf{t} \,\|\, \mathbf{p})$ has not received much attention in the literature so far.
Our contributions are as follows. In Section 2, we present an efficient greedy algorithm that finds $M$-type distributions minimizing Equation (1c). We then discuss in Section 3 the properties of the optimal $M$-type approximation and bound the approximation error Equation (1c). Our bound incorporates a reverse Pinsker inequality recently suggested in [10] (Theorem 7). The algorithm we present is an instance of a greedy algorithm similar to steepest ascent hill climbing [11] (Chapter 2.6). As a byproduct, we unify this work with [6,7,8] by showing that the algorithms optimal w.r.t. variational distance Equation (1a) and informational divergence Equation (1b) are also instances of the same general greedy algorithm; see Section 2.

2. Greedy Optimization

In this section, we define a class of problems that can be optimally solved by a greedy algorithm. Consider the following example:
Example 1. 
Suppose there are $n$ queues with jobs, and you have to select $M$ jobs minimizing the total time spent. A greedy algorithm successively selects, among the jobs at the front of their queues, the job with the shortest duration. If the jobs in each queue are ordered by increasing duration, then this greedy algorithm is optimal.
We now make this precise: Let $M$ be a positive integer, e.g., the number of jobs that have to be completed, and let $\delta_i\colon \mathbb{N} \to \mathbb{R}$, $i = 1, \dots, n$, be a set of functions, e.g., $\delta_i(k)$ is the duration of the $k$-th job in the $i$-th queue. Let furthermore $\mathbf{c}_0 := (c_{1,0}, \dots, c_{n,0}) \in \mathbb{N}_0^n$ be a pre-allocation, representing a constraint that has to be fulfilled (e.g., in the $i$-th queue at least $c_{i,0}$ jobs have to be completed) or a chosen initialization. Then, the goal is to minimize
$$U(\mathbf{c}) := \sum_{i=1}^n \sum_{k_i = c_{i,0}+1}^{c_i} \delta_i(k_i) \tag{2}$$
i.e., to find a final allocation $\mathbf{c} := (c_1, \dots, c_n)$ satisfying $\|\mathbf{c}\|_1 = M$ and, for every $i$, $c_i \ge c_{i,0}$. A greedy method to obtain such a final allocation is presented in Algorithm 1. We show in Appendix A.1 that this algorithm is optimal if the functions $\delta_i$ satisfy certain conditions:
Algorithm 1: Greedy Algorithm
  Initialize $k_i = c_{i,0}$, $i = 1, \dots, n$.
  repeat $M - \|\mathbf{c}_0\|_1$ times
    Compute $\delta_i(k_i + 1)$, $i = 1, \dots, n$.
    Compute $j = \min\, \operatorname{argmin}_i\, \delta_i(k_i + 1)$.  // choose one minimal element
    Update $k_j \leftarrow k_j + 1$.
  end repeat
  Return $\mathbf{c} = (k_1, \dots, k_n)$.
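For concreteness, a direct Python transcription of Algorithm 1 (a sketch; the function name and the queue data below are ours, not part of the paper):

```python
def greedy_allocate(deltas, M, c0):
    """Algorithm 1: repeatedly increment the allocation at the index whose
    next incremental cost delta_i(k_i + 1) is smallest."""
    k = list(c0)
    for _ in range(M - sum(c0)):
        # min over argmin: ties are broken towards the smallest index
        j = min(range(len(k)), key=lambda i: (deltas[i](k[i] + 1), i))
        k[j] += 1
    return k

# Example 1 with illustrative job queues, each sorted by increasing duration:
queues = [[1, 4, 9], [2, 2, 5], [3, 3, 3]]
deltas = [lambda k, q=q: q[k - 1] for q in queues]
print(greedy_allocate(deltas, 5, [0, 0, 0]))  # [1, 2, 2]: picks jobs 1, 2, 2, 3, 3
```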
Proposition 1. 
If the functions $\delta_i(k)$ are non-decreasing in $k$, Algorithm 1 achieves a global minimum of $U(\mathbf{c})$ for a given pre-allocation $\mathbf{c}_0$ and a given $M$.
Remark 1. 
The minimizing allocation $\mathbf{c}$ may not be unique.
Remark 2. 
If a function $f_i\colon \mathbb{R} \to \mathbb{R}$ is convex, the difference $\delta_i(k) = f_i(k) - f_i(k-1)$ is non-decreasing in $k$. Hence, Algorithm 1 also minimizes
$$U(\mathbf{c}) = \sum_{i=1}^n f_i(c_i). \tag{3}$$
Remark 3. 
Note that the functions $\delta_i(k)$ need not be non-negative, i.e., in view of Example 1, jobs may have negative duration. The functions $\delta_i(k)$ are non-negative, though, if $f_i\colon \mathbb{R} \to \mathbb{R}$ in Remark 2 is convex and non-decreasing.

Remark 2 connects Algorithm 1 to steepest ascent hill climbing [11] (Chapter 2.6) with fixed step size and a constrained number $M$ of steps.
We now show that instances of Algorithm 1 can find $M$-type approximations $\mathbf{p}$ minimizing each of the cost functions in Equation (1). Noting that $p_i = c_i/M$ for some non-negative integer $c_i$, we can rewrite the cost functions as follows:
$$\|\mathbf{p} - \mathbf{t}\|_1 = \frac{1}{M} \sum_{i=1}^n |c_i - M t_i| \tag{4a}$$
$$D(\mathbf{p} \,\|\, \mathbf{t}) = \frac{1}{M} \sum_{i\colon c_i > 0} c_i \log \frac{c_i}{t_i} - \log M \tag{4b}$$
$$D(\mathbf{t} \,\|\, \mathbf{p}) = \log M - H(\mathbf{t}) - \sum_{i\colon t_i > 0} t_i \log c_i \tag{4c}$$
where $H(\cdot)$ denotes entropy in nats.
Ignoring constant terms, these cost functions are all instances of Remark 2 for convex functions $f_i\colon \mathbb{R} \to \mathbb{R}$ (see Table 1). Hence, the three different $M$-type approximation problems set up by Equation (1) can all be solved by instances of Algorithm 1, for a trivial pre-allocation $\mathbf{c}_0 = \mathbf{0}$ and after taking $M$ steps. The final allocation $\mathbf{c}$ simply defines the $M$-type approximation by $p_i = c_i/M$.
For variational distance optimal approximation, we showed in [8] (Lemma 3) that every optimal $M$-type approximation satisfies $p_i \ge \lfloor M t_i \rfloor / M$; hence, one may speed up the algorithm by pre-allocating $c_{i,0} = \lfloor M t_i \rfloor$. We furthermore show in Lemma 1 below that the support of the optimal $M$-type approximation in terms of Equation (1c) equals the support of $\mathbf{t}$ (if $M \ge n$). Assuming that $\mathbf{t}$ is positive, one can pre-allocate the algorithm with $c_{i,0} = 1$. We summarize these instantiations of Algorithm 1 in Table 1.
This list of instances of Algorithm 1 minimizing information-theoretic or probabilistic cost functions can be extended. For example, the $\chi^2$-divergences $\chi^2(\mathbf{t} \,\|\, \mathbf{p})$ and $\chi^2(\mathbf{p} \,\|\, \mathbf{t})$ can also be minimized, since the functions inside the respective sums are convex. However, Rényi divergences of orders $\alpha \neq 1$ cannot be minimized by applying Algorithm 1.
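As a sketch of the instantiation for Equation (1c) (last row of Table 1), with incremental cost $\delta_i(k) = t_i \log\frac{k-1}{k}$ and pre-allocation $c_{i,0} = 1$ for positive $\mathbf{t}$; the function name is ours:

```python
import math

def m_type_approx(t, M):
    """Greedy M-type approximation minimizing D(t || p), assuming t > 0
    and M >= n: pre-allocate c_i = 1, then take M - n greedy steps with
    incremental cost t_i * log(c_i / (c_i + 1))."""
    c = [1] * len(t)
    for _ in range(M - len(t)):
        j = min(range(len(t)), key=lambda i: t[i] * math.log(c[i] / (c[i] + 1)))
        c[j] += 1
    return [ci / M for ci in c]

# Example 3's target: the optimum is p = (8/10, 1/10, 1/10)
print(m_type_approx([17/20, 3/40, 3/40], 20))  # [0.8, 0.1, 0.1]
```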

3. $M$-Type Approximation Minimizing $D(\mathbf{t} \,\|\, \mathbf{p})$

As shown in the previous section, Algorithm 1 computes a minimizer of the problem $\min_{\mathbf{p}} D(\mathbf{t} \,\|\, \mathbf{p})$ if instantiated according to Table 1. Let us call this minimizer $\mathbf{t}^a$. Recall that $\mathbf{t}$ is positive and that $M \ge n$. The support of $\mathbf{t}^a$ must contain the support of $\mathbf{t}$, since otherwise $D(\mathbf{t} \,\|\, \mathbf{t}^a) = \infty$. Note further that the costs $\delta_i(k)$ are negative if $t_i > 0$ and zero if $t_i = 0$; hence, if $t_i = 0$, the index $i$ cannot be chosen by Algorithm 1, thus also $t_i^a = 0$. This proves:
Lemma 1. 
If $M \ge n$, the supports of $\mathbf{t}$ and $\mathbf{t}^a$ coincide, i.e., $t_i = 0 \Leftrightarrow t_i^a = 0$.
The assumption that $\mathbf{t}$ is positive and that $M \ge n$ hence comes without loss of generality. In contrast, neither variational distance nor informational divergence Equation (1b) requires $M \ge n$: As we show in [8], the $M$-type approximation problem remains interesting even if $M < n$.
Based on Lemma 1, the following example explains why the optimal M-type approximation does not necessarily result in a “small” approximation error:
Example 2. 
Let $\mathbf{t} = \left(1-\varepsilon, \frac{\varepsilon}{n-1}, \dots, \frac{\varepsilon}{n-1}\right)$ and $M = n$; hence, by Lemma 1, $\mathbf{t}^a = \frac{1}{n}(1, 1, \dots, 1)$. It follows that $D(\mathbf{t} \,\|\, \mathbf{t}^a) = \log n - H(\mathbf{t})$, which can be made arbitrarily close to $\log n$ by choosing a small positive $\varepsilon$.
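A quick numeric check of Example 2 (illustrative code; the helper `kl` is ours):

```python
import math

def kl(t, p):
    # informational divergence D(t || p) in nats
    return sum(ti * math.log(ti / pi) for ti, pi in zip(t, p) if ti > 0)

n, eps = 4, 1e-6
t = [1 - eps] + [eps / (n - 1)] * (n - 1)
ta = [1 / n] * n               # the optimal M-type approximation for M = n
print(kl(t, ta), math.log(n))  # D(t || ta) = log n - H(t), here nearly log 4
```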
In Table 1 we made use of [8] (Lemma 3), which says that every $\mathbf{p}$ minimizing the variational distance $\|\mathbf{p} - \mathbf{t}\|_1$ satisfies $p_i \ge \lfloor M t_i \rfloor / M$, to speed up the corresponding instance of Algorithm 1 by proper pre-allocation. Initialization by rounding is not possible when minimizing $D(\mathbf{t} \,\|\, \mathbf{p})$, as shown in the following two examples:
Example 3. 
Let $\mathbf{t} = (17/20, 3/40, 3/40)$ and $M = 20$. The optimal $M$-type approximation is $\mathbf{p} = (8/10, 1/10, 1/10)$, hence $p_1 < \lfloor M t_1 \rfloor / M$. Initialization via rounding off fails.
Example 4. 
Let $\mathbf{t} = (0.719, 0.145, 0.088, 0.048)$ and $M = 50$. The optimal $M$-type approximation is $\mathbf{p} = (0.74, 0.14, 0.08, 0.04)$, hence $p_1 > \lceil M t_1 \rceil / M$. Initialization via rounding up fails.
To show that the informational divergence vanishes for $M \to \infty$, assume that $M > 1/t_i$ for all $i$. Since the variational distance optimal approximation $\mathbf{t}^{vd}$ satisfies $t_i^{vd} \ge \lfloor M t_i \rfloor / M$ for every $i$, $\mathbf{t}^{vd}$ has the same support as $\mathbf{t}$, which ensures that $D(\mathbf{t} \,\|\, \mathbf{t}^{vd}) < \infty$. By similar arguments as in the proof of [8] (Proposition 4), we obtain
$$D(\mathbf{t} \,\|\, \mathbf{t}^a) \le D(\mathbf{t} \,\|\, \mathbf{t}^{vd}) \le \log\left(1 + \frac{n}{2M}\right) \xrightarrow{M \to \infty} 0. \tag{5}$$
Note that this bound is universal, i.e., it prescribes the same convergence rate for every target distribution with n mass points.
We now develop an upper bound on $D(\mathbf{t} \,\|\, \mathbf{t}^a)$ that holds for every $M$. To this end, we first approximate $\mathbf{t}$ by a distribution $\mathbf{t}^*$ in $\mathcal{P}_M := \{\mathbf{p}\colon \forall i\colon p_i \ge 1/M,\ \|\mathbf{p}\|_1 = 1\}$ that minimizes $D(\mathbf{t} \,\|\, \mathbf{t}^*)$. If $\mathbf{t}^*$ is unique, then it is called the reverse I-projection [12] (Section I.A) of $\mathbf{t}$ onto $\mathcal{P}_M$. Since $\mathbf{t}^* \in \mathcal{P}_M$, its variational distance optimal approximation $\mathbf{t}^{vd}$ has the same support as $\mathbf{t}$, which allows us to bound $D(\mathbf{t} \,\|\, \mathbf{t}^a)$ by $D(\mathbf{t} \,\|\, \mathbf{t}^{vd})$.
Lemma 2. 
Let $\mathbf{t}^* \in \mathcal{P}_M$ minimize $D(\mathbf{t} \,\|\, \mathbf{t}^*)$. Then,
$$t_i^* := \frac{t_i}{\nu(M)} + \left(\frac{1}{M} - \frac{t_i}{\nu(M)}\right)^+ \tag{6}$$
where $\nu(M)$ is such that $\|\mathbf{t}^*\|_1 = 1$, and where $(x)^+ := \max\{0, x\}$.
Proof. 
See Appendix A.2.  ☐
Let $K := \{i\colon t_i < \nu(M)/M\}$, $k := |K|$, and $T_K := \sum_{i \in K} t_i$. The parameter $\nu(M)$ must scale the mass $(1 - T_K)$ such that it equals $(M - k)/M$, i.e., we have
$$\nu(M) = \frac{1 - T_K}{1 - \frac{k}{M}}. \tag{7}$$
If, for all $i$, $t_i \ge 1/M$, then $\mathbf{t} \in \mathcal{P}_M$, hence $\mathbf{t}^* = \mathbf{t}$ is feasible and $\nu(M) = 1$. One can show that $\nu(M)$ decreases with $M$.
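Lemma 2 together with Equation (7) suggests a simple fixed-point computation of $\nu(M)$ and $\mathbf{t}^*$ (a sketch; iterating until the set $K$ stabilizes is our implementation choice, and the input values are illustrative):

```python
def reverse_i_projection(t, M):
    """Compute nu(M) and t* of Lemma 2: t_i* = max(t_i / nu, 1/M),
    iterating Equation (7), nu = (1 - T_K) / (1 - k/M), until
    K = {i : t_i < nu/M} stabilizes."""
    nu, K_prev = 1.0, None
    while True:
        K = {i for i, ti in enumerate(t) if ti < nu / M}
        if K == K_prev:
            break
        K_prev = K
        nu = (1 - sum(t[i] for i in K)) / (1 - len(K) / M)
    return nu, [max(ti / nu, 1 / M) for ti in t]

nu, t_star = reverse_i_projection([0.48, 0.48, 0.02, 0.02], 20)
print(nu, t_star)  # nu = 0.96/0.9; t* approximately (0.45, 0.45, 0.05, 0.05)
```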
Proposition 2 (Approximation Bounds).
$$D(\mathbf{t} \,\|\, \mathbf{t}^a) \le \log \nu(M) + \frac{\log(2)}{2}\left(1 - \nu(M)\left(1 - \frac{n}{M}\right)\right) \tag{8}$$
Proof. 
See Appendix A.3.  ☐
The first term on the right-hand side of Equation (8) accounts for the error caused by first approximating $\mathbf{t}$ by $\mathbf{t}^*$ (in the sense of Lemma 2). The second term accounts for the additional error caused by the $M$-type approximation of $\mathbf{t}^*$ and incorporates the reverse Pinsker inequality [10] (Theorem 7). If $M > 1/t_i$ for every $i$, hence $\mathbf{t} \in \mathcal{P}_M$, then $\nu(M) = 1$ and the bound simplifies to
$$D(\mathbf{t} \,\|\, \mathbf{t}^a) \le \log(2)\, \frac{n}{2M}. \tag{9}$$
For M sufficiently large, Equation (8) thus yields better results than Equation (5), which approximates to n / ( 2 M ) . Moreover, for M sufficiently large, our bound Equation (8) is uniform, i.e., it prescribes the same convergence rate for every target distribution with n mass points. We illustrate the bounds for an example in Figure 1.

4. Applications and Outlook

Arithmetic coding uses a probabilistic model to compress a source sequence. Applying Algorithm 1 with cost Equation (1c) to the empirical distribution of the source sequence provides an $M$-type distribution as a probabilistic model. The parameter $M$ can be chosen small for reduced complexity. Another application of Algorithm 1 can be found in [3], which considers the problem of generating length-$M$ sequences according to a desired distribution. Since a length-$M$ sequence has an $M$-type empirical distribution, Reference [3] applies Algorithm 1 with cost Equation (1b) to pre-calculate the $M$-type approximation of the desired distribution.
Algorithm 1 can also be used to calculate the $M$-type approximation of Markov models, i.e., approximating the transition matrix $\mathbf{T}$ of an $n$-state, irreducible Markov chain with invariant distribution $\boldsymbol{\mu}$ by a transition matrix $\mathbf{P}$ containing only $M$-type probabilities. Generalizing Equation (1c), the approximation error can be measured by the informational divergence rate [13]
$$\bar{D}(\mathbf{T} \,\|\, \mathbf{P}) := \sum_{i,j=1}^n \mu_i T_{ij} \log \frac{T_{ij}}{P_{ij}} = \sum_{i=1}^n \mu_i D(\mathbf{t}_i \,\|\, \mathbf{p}_i) \tag{10}$$
where $\mathbf{t}_i$ and $\mathbf{p}_i$ denote the $i$-th rows of $\mathbf{T}$ and $\mathbf{P}$, respectively.
The optimal M-type approximation is found by applying the instance of Algorithm 1 to each row separately, and Lemma 1 ensures that the transition graph of P equals that of T , i.e., the approximating Markov chain is irreducible. Future work shall extend this analysis to hidden Markov models and should investigate the performance of these algorithms in practical scenarios, e.g., speech processing with finite-precision arithmetic.
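The row-wise procedure can be sketched as follows (function names and the transition matrix are ours, purely for illustration). Zero entries are never incremented, so, in line with Lemma 1, the transition graph is preserved:

```python
import math

def approx_row(t_row, M):
    # Greedy M-type approximation of one row minimizing D(t_row || p):
    # pre-allocate 1 on the support, then take the remaining steps greedily.
    c = [1 if ti > 0 else 0 for ti in t_row]
    support = [i for i, ti in enumerate(t_row) if ti > 0]
    for _ in range(M - sum(c)):
        j = min(support, key=lambda i: t_row[i] * math.log(c[i] / (c[i] + 1)))
        c[j] += 1
    return [ci / M for ci in c]

def approx_transition_matrix(T, M):
    # Row-wise approximation; zeros in T stay zero in P (Lemma 1).
    return [approx_row(row, M) for row in T]

T = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]
print(approx_transition_matrix(T, 10))
# Every row of this T is already 10-type, so the greedy optimum recovers T exactly.
```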
Another possible application is the approximation of Bayesian network parameters. The authors of [4] approximated the true parameters using a stationary multiplier method from [14]. Since rounding probabilities to zero led to bad classification performance, they afterwards replaced zeros in the approximating distribution by small values. This in turn led to the problem that probabilities that are in fact zero were approximated by non-zero probabilities. We believe that these problems can be removed by instantiating Algorithm 1 for cost Equation (1c): this automatically prevents approximating non-zero probabilities with zeros and vice versa; see Lemma 1.
Finally, for approximating Bayesian network parameters, recent work suggests rounding log-probabilities, i.e., approximating $\log t_i$ by $\log p_i = -c_i/M$ for a non-negative integer $c_i$ [5]. Finding an optimal approximation that corresponds to a true distribution is equivalent to solving
$$\min\ d(\mathbf{t}, \mathbf{p}) \quad \text{s.t.} \quad \left\| e^{-\mathbf{c}/M} \right\|_1 = 1 \tag{11}$$
where $d(\cdot, \cdot)$ denotes any of the considered cost functions Equation (1). If $M = 1$ and $d(\mathbf{t}, \mathbf{p}) = D(\mathbf{t} \,\|\, \mathbf{p})$ using the binary logarithm, the constraint translates to the requirement that $\mathbf{t}$ is approximated by a complete binary tree. Then, the optimal approximation is the Huffman code for $\mathbf{t}$.

Acknowledgments

The work of Bernhard C. Geiger was partially funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund. The work of Georg Böcherer was partly supported by the German Ministry of Education and Research in the framework of an Alexander von Humboldt Professorship. This work was supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program.

Author Contributions

Bernhard C. Geiger and Georg Böcherer conceived this study, derived the results, and wrote the manuscript. Specifically, Bernhard C. Geiger proved Proposition 1 and Lemmas 3, 5 and 6, and Georg Böcherer proved Lemmas 2 and 4. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix. Proofs

Appendix A.1. Proof of Proposition 1

Since a pre-allocation only fixes a lower bound for $U(\mathbf{c})$, w.l.o.g. we assume that $\mathbf{c}_0 = \mathbf{0}$ and thus $\mathbf{c} \in \mathbb{N}_0^n$ with $\|\mathbf{c}\|_1 = M$. Consider the set $\mathcal{D} := \{\delta_i(k_i)\colon k_i \in \mathbb{N},\ i = 1, \dots, n\}$ and assume that the (not necessarily unique) set $\mathcal{D}_M$ consists of $M$ smallest values in $\mathcal{D}$, i.e., $|\mathcal{D}_M| = M$ and
$$\forall d \in \mathcal{D}_M,\ \forall d' \in \mathcal{D} \setminus \mathcal{D}_M\colon\ d \le d'. \tag{A1}$$
Clearly, $U(\mathbf{c})$ cannot be smaller than the sum over all elements in $\mathcal{D}_M$. Since the $\delta_i$ are non-decreasing, there exists at least one final allocation $\mathbf{c}$ that takes successively the first $c_i$ values from each queue $i$, i.e., $\mathcal{D}_M = \{\delta_1(1), \dots, \delta_1(c_1), \dots, \delta_n(1), \dots, \delta_n(c_n)\}$ satisfies Equation (A1). This shows that the lower bound induced by Equation (A1) can actually be achieved.
We prove the optimality of Algorithm 1 by contradiction: Assume that Algorithm 1 finishes with a final allocation $\tilde{\mathbf{c}}$ such that $U(\tilde{\mathbf{c}})$ is strictly larger than the (unique) sum over all elements in (non-unique) $\mathcal{D}_M$. Hence, $\tilde{\mathbf{c}}$ must exchange at least one of the elements in $\mathcal{D}_M$ for an element that is strictly larger. Thus, by the properties of the functions $\delta_i$ and Algorithm 1, there must be indices $\ell$ and $m$ such that $\tilde{c}_\ell > c_\ell$, $\tilde{c}_m < c_m$, and $\delta_\ell(\tilde{c}_\ell) \ge \delta_\ell(c_\ell + 1) > \delta_m(c_m) \ge \delta_m(\tilde{c}_m)$. At each iteration of the algorithm, the current allocation at index $m$ satisfies $k_m \le \tilde{c}_m < c_m$. Since $\delta_m(k_m + 1) \le \delta_m(c_m) < \delta_\ell(c_\ell + 1)$, the value $\delta_\ell(c_\ell + 1)$ can never be a minimal element, and hence is not chosen by Algorithm 1. This contradicts the assumption that Algorithm 1 finishes with a $\tilde{\mathbf{c}}$ such that $U(\tilde{\mathbf{c}})$ is strictly larger than the sum of $\mathcal{D}$'s $M$ smallest values. ☐

Appendix A.2. Proof of Lemma 2

The problem of finding a $\mathbf{t}^* \in \mathcal{P}_M$ minimizing $D(\mathbf{t} \,\|\, \mathbf{t}^*)$ is equivalent to finding an optimal point of the problem:
$$\underset{\mathbf{p} \in \mathbb{R}_{>0}^n}{\text{minimize}}\quad -\sum_{i=1}^n t_i \log p_i \tag{A2a}$$
$$\text{subject to}\quad \frac{1}{M} - p_i \le 0,\quad i = 1, 2, \dots, n \tag{A2b}$$
$$\qquad\qquad\quad -1 + \sum_{i=1}^n p_i = 0. \tag{A2c}$$
The Lagrangian of the problem is
$$L(\mathbf{p}, \boldsymbol{\lambda}, \nu) = -\sum_{i=1}^n t_i \log p_i + \sum_{i=1}^n \lambda_i \left(\frac{1}{M} - p_i\right) + \nu \left(-1 + \sum_{i=1}^n p_i\right). \tag{A3}$$
By the Karush–Kuhn–Tucker (KKT) conditions [15] (Chapter 5.5.3), a feasible point $\mathbf{t}^*$ is optimal if, for every $i = 1, \dots, n$,
$$\lambda_i \ge 0 \tag{A4a}$$
$$\lambda_i \left(\frac{1}{M} - t_i^*\right) = 0 \tag{A4b}$$
$$\frac{\partial}{\partial p_i} L(\mathbf{p}, \boldsymbol{\lambda}, \nu)\Big|_{\mathbf{p} = \mathbf{t}^*} = -\frac{t_i}{t_i^*} - \lambda_i + \nu = 0. \tag{A4c}$$
By Equation (A2b), we have $t_i^* \ge 1/M$. If $t_i^* > 1/M$, then $\lambda_i = 0$ by Equation (A4b) and $t_i^* = t_i/\nu$ by Equation (A4c). Thus
$$t_i^* = \frac{t_i}{\nu} + \left(\frac{1}{M} - \frac{t_i}{\nu}\right)^+ \tag{A5}$$
where $\nu$ is such that $\sum_{i=1}^n t_i^* = 1$. ☐

Appendix A.3. Proof of Proposition 2

Reverse I-projections admit a Pythagorean inequality [12] (Theorem 1). In other words, if $\mathbf{p}$ is a distribution, $\mathbf{p}^*$ its reverse I-projection onto a set $\mathcal{S}$, and $\mathbf{q}$ any distribution in $\mathcal{S}$, then
$$D(\mathbf{p} \,\|\, \mathbf{q}) \ge D(\mathbf{p} \,\|\, \mathbf{p}^*) + D(\mathbf{p}^* \,\|\, \mathbf{q}). \tag{A6}$$
For the present scenario, we can show an even stronger result:
Lemma 3. 
Let $\mathbf{t}$ be the target distribution, let $\mathbf{t}^*$ be as in Lemma 2, and let $\mathbf{t}^{vd}$ be the variational distance optimal $M$-type approximation of $\mathbf{t}^*$. Then,
$$D(\mathbf{t} \,\|\, \mathbf{t}^{vd}) = D(\mathbf{t} \,\|\, \mathbf{t}^*) + \nu D(\mathbf{t}^* \,\|\, \mathbf{t}^{vd}). \tag{A7}$$
Proof. 
$$D(\mathbf{t} \,\|\, \mathbf{t}^{vd}) = \sum_{i=1}^n t_i \log \frac{t_i}{t_i^{vd}} \tag{A8}$$
$$= \sum_{i=1}^n t_i \log \left( \frac{t_i}{t_i^*} \cdot \frac{t_i^*}{t_i^{vd}} \right) \tag{A9}$$
$$= \sum_{i=1}^n t_i \log \frac{t_i}{t_i^*} + \sum_{i=1}^n t_i \log \frac{t_i^*}{t_i^{vd}} \tag{A10}$$
$$\stackrel{(a)}{=} D(\mathbf{t} \,\|\, \mathbf{t}^*) + \nu \sum_{i \notin K} \frac{t_i}{\nu} \log \frac{t_i^*}{t_i^{vd}} \tag{A11}$$
$$\stackrel{(b)}{=} D(\mathbf{t} \,\|\, \mathbf{t}^*) + \nu D(\mathbf{t}^* \,\|\, \mathbf{t}^{vd}) \tag{A12}$$
Here, $(a)$ follows because for $i \in K$, $t_i^* = 1/M$ and thus the $M$-type approximation minimizing the variational distance satisfies $t_i^{vd} = 1/M$; furthermore, $(b)$ is because for $i \notin K$, $t_i^* = t_i/\nu$. ☐
We now bound the summands in Lemma 3.
Lemma 4. 
In the setting of Lemma 3,
$$D(\mathbf{t}^* \,\|\, \mathbf{t}^{vd}) \le \log(2)\, \|\mathbf{t}^* - \mathbf{t}^{vd}\|_1. \tag{A13}$$
Proof. 
We first employ a reverse Pinsker inequality from [10] (Theorem 7), stating that
$$D(\mathbf{t}^* \,\|\, \mathbf{t}^{vd}) \le \frac{1}{2} \cdot \frac{r \log r}{r - 1}\, \|\mathbf{t}^* - \mathbf{t}^{vd}\|_1 \tag{A14}$$
where $r := \sup_{i\colon t_i^* > 0} \frac{t_i^*}{t_i^{vd}}$. Furthermore, since for variational distance optimal approximations we always have $|t_i^* - t_i^{vd}| < 1/M$ [8] (Lemma 3), we can bound
$$r < \frac{t_i^{vd} + \frac{1}{M}}{t_i^{vd}} \le 2 \tag{A15}$$
since $t_i^{vd} \ge \lfloor M t_i^* \rfloor / M \ge 1/M$. Since the factor $\frac{r \log r}{r - 1}$ increases in $r$, the bound Equation (A13) follows by substituting $r$ in Equation (A14) by 2. ☐
Lemma 5. 
In the setting of Lemma 3,
$$D(\mathbf{t} \,\|\, \mathbf{t}^*) \le \log \nu. \tag{A16}$$
Proof. 
$$D(\mathbf{t} \,\|\, \mathbf{t}^*) = \sum_{i=1}^n t_i \log \frac{t_i}{t_i^*} \tag{A17}$$
$$= \sum_{i \notin K} t_i \log \frac{\nu t_i}{t_i} + \sum_{i \in K} t_i \log (M t_i) \tag{A18}$$
$$\stackrel{(a)}{\le} (1 - T_K) \log \nu + \sum_{i \in K} t_i \log \nu \tag{A19}$$
$$= \log \nu \tag{A20}$$
where $(a)$ is because for $i \in K$, $M t_i < \nu$. ☐
To bound t * t vd 1 , we present
Lemma 6. 
Let $\mathbf{p}^*$ be a sub-probability distribution with $m \le M$ masses and total weight $1 - T$, and let $\mathbf{p}^{*vd}$ be its variational distance optimal $M$-type approximation using $J \le M$ masses. Then,
$$\|\mathbf{p}^* - \mathbf{p}^{*vd}\|_1 \le \frac{m}{2M} + \frac{(M - MT - J)^2}{2mM}. \tag{A21}$$
Note that for J = M we recover [8] (Lemma 4).
Proof. 
Assume first that either $\forall i\colon p_i^* \ge p_i^{*vd}$ or $\forall i\colon p_i^* \le p_i^{*vd}$. Note that this is possible since $\mathbf{p}^*$ and $\mathbf{p}^{*vd}$ are sub-probability distributions, summing to $1 - T$ and $J/M$, respectively. Then, $\|\mathbf{p}^* - \mathbf{p}^{*vd}\|_1 = |1 - T - J/M|$, which satisfies the bound Equation (A21). This can be seen by rearranging Equation (A21) such that $J$ only appears on the left-hand side; the maximizing $J$ (not necessarily integer) then satisfies Equation (A21) with equality.
It thus remains to treat the case where, after rounding off all indices, $1 \le L \le M - 1$ masses remain to be distributed and we have
$$\sum_{i=1}^m \left( p_i^* - \frac{\lfloor M p_i^* \rfloor}{M} \right) =: \sum_{i=1}^m e_i = 1 - T - \frac{J - L}{M} =: g(L). \tag{A22}$$
The variational distance is minimized by distributing the $L$ masses to a set $\mathcal{L}$ of $L$ indices with the largest errors $e_i$, hence
$$\|\mathbf{p}^* - \mathbf{p}^{*vd}\|_1 = \sum_{i \in \mathcal{L}} \left( \frac{1}{M} - e_i \right) + \sum_{i \notin \mathcal{L}} e_i \tag{A23}$$
$$\stackrel{(a)}{\le} \frac{L}{M} - \frac{L}{m}\, g(L) + \frac{m - L}{m}\, g(L) \tag{A24}$$
where $(a)$ follows because for $i \in \mathcal{L}$, $j \notin \mathcal{L}$, $e_i \ge e_j$. This is maximized for $L = \frac{m - (M - MT - J)}{2}$ (not necessarily an integer), which after insertion yields the upper bound. ☐
Proof of Bound in Proposition 2. 
We start by bounding the informational divergence $D(\mathbf{t} \,\|\, \mathbf{t}^a)$ by the informational divergence between $\mathbf{t}$ and the variational distance optimal approximation $\mathbf{t}^{vd}$ of its reverse I-projection $\mathbf{t}^*$ onto $\mathcal{P}_M$:
$$D(\mathbf{t} \,\|\, \mathbf{t}^a) \le D(\mathbf{t} \,\|\, \mathbf{t}^{vd}) \tag{A25}$$
$$\stackrel{(a)}{=} D(\mathbf{t} \,\|\, \mathbf{t}^*) + \nu D(\mathbf{t}^* \,\|\, \mathbf{t}^{vd}) \tag{A26}$$
$$\stackrel{(b)}{\le} \log \nu + \nu \log(2)\, \|\mathbf{t}^* - \mathbf{t}^{vd}\|_1 \tag{A27}$$
$$\stackrel{(c)}{\le} \log \nu + \nu \log(2)\, \frac{n - k}{2M} \tag{A28}$$
$$\stackrel{(d)}{\le} \log \nu + \nu \log(2)\, \frac{n - M + \frac{M}{\nu}}{2M} \tag{A29}$$
$$= \log \nu + \frac{\log(2)}{2} \left( 1 - \nu \left( 1 - \frac{n}{M} \right) \right) \tag{A30}$$
where
  • $(a)$ is due to Lemma 3,
  • $(b)$ is due to Lemmas 4 and 5,
  • $(c)$ is due to Lemma 6 with $m = n - k$, $1 - T = 1 - \frac{k}{M}$, and $J = M - k$, and
  • $(d)$ follows by bounding $k$ from below via Equation (7):
$$k = \frac{M}{\nu}\left(\nu - 1 + T_K\right) \ge \frac{M}{\nu}(\nu - 1) = M - \frac{M}{\nu}.$$

References

  1. Dorfleitner, G.; Klein, T. Rounding with multiplier methods: An efficient algorithm and applications in statistics. Stat. Pap. 1999, 40, 143–157. [Google Scholar] [CrossRef]
  2. Rissanen, J.; Langdon, G.G. Arithmetic coding. IBM J. Res. Dev. 1979, 23, 149–162. [Google Scholar] [CrossRef]
  3. Schulte, P.; Böcherer, G. Constant Composition Distribution Matching. IEEE Trans. Inf. Theory 2016, 62, 430–434. [Google Scholar] [CrossRef]
  4. Drużdżel, M.J.; Oniśko, A. Are Bayesian Networks Sensitive to Precision of Their Parameters? In Proceedings of the International IIS’08 Conference, Intelligent Information Systems XVI, Zakopane, Poland, 16–18 June 2008; pp. 35–44.
  5. Tschiatschek, S.; Pernkopf, F. On Bayesian Network Classifiers with Reduced Precision Parameters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 774–785. [Google Scholar] [CrossRef] [PubMed]
  6. Reznik, Y. An Algorithm for Quantization of Discrete Probability Distributions. In Proceedings of the 2011 Data Compression Conference (DCC), Snowbird, UT, USA, 29–31 March 2011; pp. 333–342.
  7. Böcherer, G. Optimal Non-Uniform Mapping for Probabilistic Shaping. In Proceedings of the 9th International ITG Conference on Systems, Communications and Coding (SCC), Munich, Germany, 21–24 January 2013; pp. 1–6.
  8. Böcherer, G.; Geiger, B.C. Optimal Quantization for Distribution Synthesis. 2016; arXiv:1307.6843v4. [Google Scholar]
  9. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  10. Verdú, S. Total variation distance and the distribution of relative information. In Proceedings of the Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 9–14 February 2014; pp. 499–501.
  11. Michalewicz, Z.; Fogel, D.B. How to Solve It: Modern Heuristics, 2nd ed.; Springer: Berlin, Germany, 2004. [Google Scholar]
  12. Csiszár, I.; Matúš, F. Information Projections Revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490. [Google Scholar] [CrossRef]
  13. Rached, Z.; Alajaji, F.; Campbell, L.L. The Kullback–Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 2004, 50, 917–921. [Google Scholar] [CrossRef]
  14. Heinrich, L.; Pukelsheim, F.; Schwingenschlögl, U. On stationary multiplier methods for the rounding of probabilities and the limiting law of the Sainte-Laguë divergence. Stat. Decis. 2005, 23, 117–129. [Google Scholar] [CrossRef]
  15. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
Figure 1. Evaluating the bounds Equations (5) and (8) for $\mathbf{t} = (0.48, 0.48, 0.02, 0.02)$. Note that Equation (5) is a valid bound only for $M \ge 50$, i.e., where the curve is dashed.
Table 1. Instances of Algorithm 1 Optimizing Equation (1).

Cost | $f_i(x)$ | $\delta_i(k)$ | $c_{i,0}$ | References
$\|\mathbf{p} - \mathbf{t}\|_1$ | $|x - M t_i|$ | $|k - M t_i| - |k - 1 - M t_i|$ | $\lfloor M t_i \rfloor$ | [6,8]
$D(\mathbf{p} \,\|\, \mathbf{t})$ | $x \log(x/t_i)$ | $k \log \frac{k}{k-1} + \log(k-1) - \log t_i$ | $0$ | [7,8]
$D(\mathbf{t} \,\|\, \mathbf{p})$ | $-t_i \log x$ | $t_i \log \frac{k-1}{k}$ | $1$ | This work
