Abstract
Let $X^n$ be a memoryless uniform Bernoulli source and $Y^n$ be the output of it through a binary symmetric channel. Courtade and Kumar conjectured that the Boolean function $f:\{0,1\}^n \to \{0,1\}$ that maximizes the mutual information $I(f(X^n); Y^n)$ is a dictator function, i.e., $f(x^n) = x_i$ for some $i$. We propose a clustering problem that is equivalent to the above problem and emphasizes the information geometric aspect of the equivalent formulation. Moreover, we define a normalized geometric mean of measures and derive some interesting properties of it. We also show that the conjecture is true when the arithmetic and geometric means coincide for a specific set of measures.
1. Introduction
Let $X^n$ be an independent and identically distributed (i.i.d.) uniform Bernoulli source and $Y^n$ be an output of it through a memoryless binary symmetric channel with crossover probability $\alpha$. Recently, Courtade and Kumar conjectured that the most informative Boolean function is a dictator function.
Conjecture 1
([1]). For any Boolean function $f:\{0,1\}^n \to \{0,1\}$, we have:
$$I(f(X^n); Y^n) \le 1 - H(\alpha),$$
where the maximum is achieved by a dictator function, i.e., $f(x^n) = x_i$ for some $i$. Note that $H(\cdot)$ is the binary entropy function.
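As a quick sanity check (our own brute-force script, not part of the original argument; the helper names are ours), for $n = 3$ and $\alpha = 0.1$ the dictator $f(x^n) = x_1$ attains exactly $1 - H(\alpha) \approx 0.531$ bits, while the majority function falls strictly below:

```python
# Brute-force evaluation of I(f(X^n); Y^n) for a uniform source through a BSC(alpha).
import itertools
import numpy as np

n, alpha = 3, 0.1

def p_y_given_x(x):
    """Product BSC(alpha) law of Y^n given X^n = x, as a vector indexed by y in {0,1}^n."""
    return np.array([np.prod([1 - alpha if yi == xi else alpha for xi, yi in zip(x, y)])
                     for y in itertools.product([0, 1], repeat=n)])

def mutual_information(f):
    """I(f(X^n); Y^n) in bits for a uniform input X^n."""
    xs = list(itertools.product([0, 1], repeat=n))
    p_y = np.mean([p_y_given_x(x) for x in xs], axis=0)          # marginal of Y^n (uniform)
    mi = 0.0
    for b in (0, 1):
        cell = [x for x in xs if f(x) == b]
        p_y_b = np.mean([p_y_given_x(x) for x in cell], axis=0)  # P_{Y^n | f(X^n) = b}
        mi += (len(cell) / len(xs)) * np.sum(p_y_b * np.log2(p_y_b / p_y))
    return mi

H = lambda a: -a * np.log2(a) - (1 - a) * np.log2(1 - a)
print(mutual_information(lambda x: x[0]), 1 - H(alpha))   # dictator: both are about 0.531
print(mutual_information(lambda x: int(sum(x) >= 2)))     # majority: strictly smaller
```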
Although there has been some progress in this line of work [2,3], this simple conjecture still remains open. There are also a number of variations of this conjecture. Weinberger and Shayevitz [4] considered the optimal Boolean function under quadratic loss. Huleihel and Ordentlich [5] considered the complementary case of quantizing the $n$ channel outputs to $n-1$ bits. Nazer et al. focused on information-distilling quantizers [6], which can be seen as a generalized version of the above problem.
Many of these works, including the original paper [1], are based on Fourier-analytic techniques. In this paper, we suggest an alternative approach, namely an information geometric approach. The mutual information can naturally be expressed in terms of Kullback–Leibler (KL) divergences. Thus, it can be shown that maximizing the mutual information is equivalent to clustering probability measures under the KL divergence.
In the equivalent clustering problem, the center of a cluster is an arithmetic mean of measures. We also describe the role of the geometric mean of measures (with appropriate normalization) in this setting. To the best of our knowledge, the geometric mean of measures has received little attention in the literature. We propose an equivalent formulation of the conjecture using the geometric mean of measures. Note that the geometric mean also allows us to connect Conjecture 1 to another well-known clustering problem.
The rest of the paper is organized as follows. In Section 2, we briefly review the Jensen–Shannon divergence and $\mathcal{S}$-compressedness. In Section 3, we provide an equivalent clustering problem of probability measures. We introduce the geometric mean of measures in Section 4. We conclude this paper in Section 5.
Notations
$\mathcal{X}$ denotes the alphabet of a random variable $X$, and $\mathcal{P}(\mathcal{X})$ denotes the set of measures on $\mathcal{X}$. $X^n$ denotes a random vector $(X_1, X_2, \ldots, X_n)$, while $x^n$ denotes a specific realization of it. If it is clear from the context, $P_{x^n}$ denotes the conditional distribution of $Y^n$ given $X^n = x^n$, i.e., $P_{Y^n \mid X^n = x^n}$. Similarly, $P_A$ denotes the conditional distribution of $Y^n$ given $X^n \in A$, i.e., $P_{Y^n \mid X^n \in A}$. Let $\{0,1\}^n$ be the set of all binary sequences of length $n$. For $z^n \in \{0,1\}^n$, the shifted version of $A$ is denoted by $A \oplus z^n = \{x^n \oplus z^n : x^n \in A\}$, where $\oplus$ is an element-wise XOR operator. The arithmetic mean of measures in the set $\{P_{x^n} : x^n \in A\}$ is denoted by $\overline{P}_A$. For $1 \le i \le n$, let $A_{i,0}$ be the set of elements in $A$ that satisfy $x_i = 0$, i.e., $A_{i,0} = \{x^n \in A : x_i = 0\}$; $A_{i,1}$ is defined in a similar manner. A length $n$ binary vector $e_i$ denotes a vector with a one in the $i$-th position and zeros elsewhere.
2. Preliminaries
2.1. Jensen–Shannon Divergence
For $w_1, w_2 \ge 0$ such that $w_1 + w_2 = 1$, the Jensen–Shannon (JS) divergence of two measures $P_1$ and $P_2$ is defined as:
$$\mathrm{JS}_{\mathbf{w}}(P_1, P_2) = H(w_1 P_1 + w_2 P_2) - w_1 H(P_1) - w_2 H(P_2).$$
It is not hard to show that the following definition is equivalent:
$$\mathrm{JS}_{\mathbf{w}}(P_1, P_2) = w_1 D(P_1 \| \overline{P}) + w_2 D(P_2 \| \overline{P}), \qquad \overline{P} = w_1 P_1 + w_2 P_2.$$
Lin proposed a generalized JS divergence [7]:
$$\mathrm{JS}_{\mathbf{w}}(P_1, \ldots, P_k) = H\!\left(\sum_{i=1}^k w_i P_i\right) - \sum_{i=1}^k w_i H(P_i),$$
where $\mathbf{w} = (w_1, \ldots, w_k)$ is a weight vector such that $w_i \ge 0$ and $\sum_{i=1}^k w_i = 1$. Similar to Equation (3), it has an equivalent definition:
$$\mathrm{JS}_{\mathbf{w}}(P_1, \ldots, P_k) = \sum_{i=1}^k w_i D(P_i \| \overline{P}),$$
where $\overline{P} = \sum_{i=1}^k w_i P_i$. Topsøe [8] pointed out an interesting property, the so-called compensation identity. It states that for any distribution $Q$,
$$\sum_{i=1}^k w_i D(P_i \| Q) = \sum_{i=1}^k w_i D(P_i \| \overline{P}) + D(\overline{P} \| Q). \qquad (6)$$
Throughout the paper, we often use Equation (6) directly without invoking the notion of the JS divergence.
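A quick numerical check of the compensation identity on randomly drawn measures (our own script; helper names are ours):

```python
# Numerical check of the compensation identity (6): for any Q,
#   sum_i w_i D(P_i || Q) = sum_i w_i D(P_i || Pbar) + D(Pbar || Q),
# where Pbar = sum_i w_i P_i is the weighted arithmetic mean.
import numpy as np
rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

k, m = 4, 6                              # number of measures, alphabet size
P = rng.dirichlet(np.ones(m), size=k)    # k random measures on an alphabet of size m
w = rng.dirichlet(np.ones(k))            # weights summing to one
Q = rng.dirichlet(np.ones(m))            # arbitrary reference measure
Pbar = w @ P                             # arithmetic mean of the measures

lhs = sum(wi * kl(Pi, Q) for wi, Pi in zip(w, P))
rhs = sum(wi * kl(Pi, Pbar) for wi, Pi in zip(w, P)) + kl(Pbar, Q)
print(np.isclose(lhs, rhs))              # True
```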
Remark 1.
The generalized JS divergence is the mutual information between the mixture random variable and its label. Let $Z$ be a random variable that takes the value $i \in \{1, \ldots, k\}$ with probability $w_i$, and let $X$ be distributed according to $P_i$ given $Z = i$. Then, it is not hard to show that:
$$\mathrm{JS}_{\mathbf{w}}(P_1, \ldots, P_k) = I(Z; X).$$
However, we introduced the generalized JS divergence to emphasize the information geometric perspective of our problem.
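A numerical check of this remark (our own script; helper names are ours):

```python
# Check that the generalized JS divergence sum_i w_i D(P_i || Pbar) equals the
# mutual information I(Z; X) when Z ~ w and X | Z = i ~ P_i.
import numpy as np
rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

k, m = 3, 5
P = rng.dirichlet(np.ones(m), size=k)
w = rng.dirichlet(np.ones(k))
Pbar = w @ P

jsd = sum(wi * kl(Pi, Pbar) for wi, Pi in zip(w, P))

joint = w[:, None] * P                                        # joint law of (Z, X)
mi = np.sum(joint * np.log(joint / (w[:, None] * Pbar[None, :])))
print(np.isclose(jsd, mi))                                    # True
```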
2.2. $\mathcal{S}$-Compressed Sets
Let $A$ be a subset of $\{0,1\}^n$ and $\mathcal{S} \subseteq \{1, \ldots, n\}$ be a set of indexes. For each assignment $z$ of the coordinates outside $\mathcal{S}$, the $\mathcal{S}$-section of $A$ is defined as the set of restrictions to the coordinates in $\mathcal{S}$ of those elements of $A$ that agree with $z$ on the coordinates outside $\mathcal{S}$.
The set $A$ is called $\mathcal{S}$-compressed if every $\mathcal{S}$-section of $A$ is an initial segment of the lexicographical ordering. For example, if $A$ is $\{i\}$-compressed for some $i$, then each $\{i\}$-section of $A$ should be one of:
$$\emptyset, \qquad \{0\}, \qquad \{0, 1\}.$$
It simply says that if $x^n \in A$ with $x_i = 1$, then the vector obtained from $x^n$ by setting its $i$-th coordinate to $0$ is also in $A$.
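As a small illustration (our own helper, assuming the lexicographic order in which $0$ precedes $1$), the single-coordinate case can be checked mechanically:

```python
# A minimal sketch of the single-coordinate case: A is {i}-compressed iff every x in A
# with x_i = 1 also has its companion with the i-th bit set to 0 in A.
import itertools

def is_i_compressed(A, i):
    """Check the {i}-compressed property of A, a set of binary tuples."""
    for x in A:
        if x[i] == 1 and x[:i] + (0,) + x[i + 1:] not in A:
            return False
    return True

n = 3
A = {x for x in itertools.product([0, 1], repeat=n) if x[0] == 0}   # dictator set
print(is_i_compressed(A, 0), is_i_compressed(A, 1))                  # True True
```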
Courtade and Kumar showed that it is enough to consider the -compressed sets.
Theorem 1
([1]). Let be the set of functions for which is -compressed for all with . In maximizing , it is sufficient to consider functions .
In this paper, we often restrict our attention to functions in this set.
3. Approach via Clustering
In this section, we provide an interesting approach toward Conjecture 1 via clustering. More precisely, we formulate an equivalent clustering problem.
3.1. Equivalence to Clustering
The following theorem shows the relation between the original conjecture and the clustering problem.
Theorem 2.
Let and be an induced random variable. Then,
The proof of the theorem is provided in Appendix A. Note that:
$$P_{Y^n \mid f(X^n) = b} = \sum_{x^n : f(x^n) = b} \Pr\!\left[X^n = x^n \mid f(X^n) = b\right] P_{Y^n \mid X^n = x^n},$$
which is a weighted mean of the measures $P_{Y^n \mid X^n = x^n}$ for $x^n$ with $f(x^n) = b$. The KL divergence from each measure to this mixture plays the role of the distance from each element to the cluster center. This implies that maximizing the mutual information is equivalent to clustering under KL divergences. Since the KL divergence is a Bregman divergence, all clusters are separated by a hyperplane [9].
In this paper, we focus on the case where $X^n$ is i.i.d. Bern(1/2).
Corollary 1.
Let and be a binary random variable.
The equivalent clustering problem is minimizing:
Let , then we can simplify further.
The cluster center is the arithmetic mean of the measures associated with the elements of $A$. Then, we have:
For simplicity, let:
which is the sum of distances from each element in $A$ to the cluster center. In short, finding the most informative Boolean function $f$ is equivalent to finding the set $A$ that minimizes the resulting clustering cost.
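For concreteness, the brute-force script below (our own; the notation $D(A)$ for the in-cluster cost and the helper names are our convention and need not match the paper's equation numbering) checks the identity
$$I(f(X^n); Y^n) = n\,(1 - H(\alpha)) - 2^{-n}\left[D(A) + D(A^c)\right]$$
for $n = 3$, which underlies the equivalence stated above:

```python
# D(A) = sum_{x in A} D(P_x || Pbar_A), where P_x is the law of Y^n given X^n = x and
# Pbar_A is the arithmetic mean of the measures P_x over x in A (all in bits).
import itertools
import numpy as np

n, alpha = 3, 0.1
xs = list(itertools.product([0, 1], repeat=n))

def p_x(x):
    return np.array([np.prod([1 - alpha if yi == xi else alpha for xi, yi in zip(x, y)])
                     for y in xs])

def kl(p, q):
    return np.sum(p * np.log2(p / q))

def cluster_cost(A):
    pbar = np.mean([p_x(x) for x in A], axis=0)
    return sum(kl(p_x(x), pbar) for x in A)

def mutual_information(A):
    p_y = np.mean([p_x(x) for x in xs], axis=0)
    mi = 0.0
    for cell in (A, [x for x in xs if x not in A]):
        mi += (len(cell) / 2**n) * kl(np.mean([p_x(x) for x in cell], axis=0), p_y)
    return mi

H = lambda a: -a * np.log2(a) - (1 - a) * np.log2(1 - a)
A = [x for x in xs if sum(x) <= 1]                     # some non-dictator set
Ac = [x for x in xs if x not in A]
lhs = mutual_information(A)
rhs = n * (1 - H(alpha)) - (cluster_cost(A) + cluster_cost(Ac)) / 2**n
print(np.isclose(lhs, rhs))                            # True
```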
Remark 2.
Conjecture 1 implies that minimizes (19). Furthermore, Theorem 1 implies that it is enough to consider A such that for all i.
Thus, we have:
Note that does not depend on A, and therefore, we have the following theorem.
Theorem 3.
For any , minimizing is equivalent to maximizing:
The above theorem provides an alternative problem formulation of the original conjecture.
3.2. Connection to Clustering under Hamming Distance
In this section, we consider the duality between the above clustering problem under the KL divergence and clustering on $\{0,1\}^n$ under the Hamming distance. The following theorem shows that the KL divergence between the conditional measures corresponds to the Hamming distance on $\{0,1\}^n$.
Theorem 4.
For all $x^n, \tilde{x}^n \in \{0,1\}^n$, we have:
$$D\!\left(P_{Y^n \mid X^n = x^n} \,\middle\|\, P_{Y^n \mid X^n = \tilde{x}^n}\right) = (1 - 2\alpha)\log\frac{1-\alpha}{\alpha}\; d_H(x^n, \tilde{x}^n),$$
where $d_H(x^n, \tilde{x}^n)$ denotes the Hamming distance between $x^n$ and $\tilde{x}^n$.
This theorem implies that the distance between two measures $P_{Y^n \mid X^n = x^n}$ and $P_{Y^n \mid X^n = \tilde{x}^n}$ is proportional to the Hamming distance between the two binary vectors $x^n$ and $\tilde{x}^n$. The proof of the theorem is provided in Appendix B. Note that the KL divergence is symmetric on this family of measures.
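A quick numerical check of this proportionality (our own script; the explicit constant written in the comment is derived by us):

```python
# On the BSC(alpha) family, D(P_x || P_y) = d_H(x, y) * (1 - 2*alpha) * log((1 - alpha)/alpha),
# so the KL divergence between conditional laws is proportional to the Hamming distance
# and, in particular, symmetric on this family.
import itertools
import numpy as np

n, alpha = 4, 0.2
ys = list(itertools.product([0, 1], repeat=n))

def p_x(x):
    return np.array([np.prod([1 - alpha if yi == xi else alpha for xi, yi in zip(x, y)])
                     for y in ys])

def kl(p, q):
    return np.sum(p * np.log(p / q))

c = (1 - 2 * alpha) * np.log((1 - alpha) / alpha)
x, y = (0, 0, 0, 0), (1, 1, 0, 1)
d_h = sum(xi != yi for xi, yi in zip(x, y))
print(np.isclose(kl(p_x(x), p_x(y)), d_h * c),                   # True
      np.isclose(kl(p_x(x), p_x(y)), kl(p_x(y), p_x(x))))        # True (symmetry)
```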
In the above duality, we have a mapping between the measures $P_{Y^n \mid X^n = x^n}$ and the binary vectors $x^n$. This mapping naturally suggests an equivalent clustering problem of $n$-dimensional binary vectors. However, the cluster center is not one of these measures in general. In order to formulate an equivalent clustering problem, we need to answer the question "Which $n$-dimensional vector corresponds to the cluster center?". A naive approach is to extend the set of binary vectors to real-valued vectors under a suitable distance instead of the Hamming distance. In such a case, the goal is to map the cluster center to the arithmetic mean of the binary vectors in the set $A$. If this were true, we could further simplify the problem into that of clustering a hypercube in Euclidean space. However, the following example shows that this naive extension is not valid.
Example 1.
Let , and , then the arithmetic mean of binary vectors of A and that of B are the same. However, is not equal to .
Furthermore, the conjectured optimal set is not the optimal choice when clustering the hypercube under such a distance. Instead, we need to consider the set of measures directly. The following theorem provides some geometric structure among the measures.
Theorem 5.
For all and ,
where .
The proof of the theorem is provided in Appendix C. Since for all , Theorem 5 immediately implies the following corollary.
Corollary 2.
For all and ,
where is a convex hull of measures in the set A.
This is a triangle inequality that can be useful when we consider the clustering problem of measures.
4. Geometric Mean of Measures
In the previous section, we formulated the clustering problem that is equivalent to the original mutual information maximization problem. In this section, we provide another approach using a geometric mean of measures. We define the geometric mean of measures formally and derive a nontrivial conjecture that is equivalent to Conjecture 1.
4.1. Definition of the Geometric Mean of Measures
For measures $P_1, \ldots, P_k$ and weights $w_1, \ldots, w_k$ such that $\sum_{i=1}^k w_i = 1$, we considered the sum of KL divergences in (6):
$$\sum_{i=1}^k w_i D(P_i \| Q). \qquad (28)$$
We also observed that (28) is minimized when $Q$ is the arithmetic mean of the measures.
Since the KL divergence is asymmetric, it is natural to consider the sum of KL divergences in the other direction:
$$\sum_{i=1}^k w_i D(Q \| P_i).$$
Compared to the arithmetic mean that minimizes (28), the product $\prod_{i=1}^k P_i^{w_i}$ can be considered as a geometric mean of measures. However, $\prod_{i=1}^k P_i^{w_i}$ is not a measure in general, and normalization is required. With a normalizing constant $s$, we can define the geometric mean of measures by:
$$P_g = \frac{1}{s} \prod_{i=1}^k P_i^{w_i},$$
where $s$ is a constant so that $P_g$ is a probability measure, i.e.,
$$s = \sum_{y} \prod_{i=1}^k P_i(y)^{w_i}.$$
Then, we have:
$$\sum_{i=1}^k w_i D(Q \| P_i) = D(Q \| P_g) - \log s,$$
which is minimized when $Q = P_g$. Thus, for all $Q$,
$$\sum_{i=1}^k w_i D(Q \| P_i) = D(Q \| P_g) + \sum_{i=1}^k w_i D(P_g \| P_i).$$
The above result provides a geometric compensation identity.
This also implies that $\sum_{i=1}^k w_i D(P_g \| P_i) = -\log s$.
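A numerical check of the geometric compensation identity on randomly drawn measures (our own script; the symbol $P_g$ follows the normalization above, and the helper names are ours):

```python
# With Pg(y) = prod_i P_i(y)^{w_i} / s and s = sum_y prod_i P_i(y)^{w_i}, for every Q
#   sum_i w_i D(Q || P_i) = D(Q || Pg) + sum_i w_i D(Pg || P_i),
# and the second term on the right equals -log s.
import numpy as np
rng = np.random.default_rng(2)

def kl(p, q):
    return np.sum(p * np.log(p / q))

k, m = 4, 6
P = rng.dirichlet(np.ones(m), size=k)
w = rng.dirichlet(np.ones(k))
Q = rng.dirichlet(np.ones(m))

unnorm = np.prod(P ** w[:, None], axis=0)   # prod_i P_i(y)^{w_i}
s = unnorm.sum()                            # normalizing constant
Pg = unnorm / s                             # normalized geometric mean

lhs = sum(wi * kl(Q, Pi) for wi, Pi in zip(w, P))
rhs = kl(Q, Pg) + sum(wi * kl(Pg, Pi) for wi, Pi in zip(w, P))
print(np.isclose(lhs, rhs),
      np.isclose(sum(wi * kl(Pg, Pi) for wi, Pi in zip(w, P)), -np.log(s)))   # True True
```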
Remark 3.
If $\mathbf{w} = (\alpha, 1-\alpha)$ for two measures, $s$ is called the $\alpha$-Chernoff coefficient, and it is called the Bhattacharyya coefficient when $\alpha = 1/2$. The summation $\sum_{i} w_i D(P_g \| P_i)$ is known as the $\alpha$-Chernoff divergence. For more details, please see [10,11] and the references therein.
Under this definition, we can find the geometric mean of measures in the set with uniform weights by:
where:
Remark 4.
The original conjecture is that a Boolean function $f$ with $f(x^n) = x_i$ maximizes the mutual information $I(f(X^n); Y^n)$. The geometric mean of measures in the corresponding set satisfies the following property.
Note that the geometric mean of measures in the set satisfies:
4.2. Main Results
So far, we have seen two means of measures: the arithmetic mean and the normalized geometric mean. It is natural to ask when they are equal. Our main theorem provides a connection to Conjecture 1.
Theorem 6.
Suppose A is a nontrivial subset of for (i.e., ), and A is -compressed for all . Then, for some i if and only if and .
The proof of the theorem is provided in Appendix D. Theorem 6 implies that the following conjecture is equivalent to Conjecture 1.
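As a small numerical illustration of one ingredient of Theorem 6 (our own script; the helper names and the choice of sets are ours), for $n = 3$ and $\alpha = 0.1$ the arithmetic mean and the normalized geometric mean of the measures associated with a dictator set coincide, whereas they differ for a typical non-dictator set:

```python
# Compare the arithmetic mean and the normalized geometric mean of {P_x : x in A}.
import itertools
import numpy as np

n, alpha = 3, 0.1
ys = list(itertools.product([0, 1], repeat=n))

def p_x(x):
    return np.array([np.prod([1 - alpha if yi == xi else alpha for xi, yi in zip(x, y)])
                     for y in ys])

def arithmetic_mean(A):
    return np.mean([p_x(x) for x in A], axis=0)

def geometric_mean(A):
    g = np.prod([p_x(x) ** (1 / len(A)) for x in A], axis=0)
    return g / g.sum()

dictator = [x for x in ys if x[0] == 0]
majority_zero = [x for x in ys if sum(x) <= 1]
print(np.allclose(arithmetic_mean(dictator), geometric_mean(dictator)))            # True
print(np.allclose(arithmetic_mean(majority_zero), geometric_mean(majority_zero)))  # False
```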
Conjecture 2.
Let and be -compressed for all . Then, is maximized if and only if and .
Remark 5.
One of the main challenges of this problem is that the conjectured optimal sets are extreme, i.e., dictator sets determined by a single coordinate $i$. Our main theorem provides an alternative conjecture that seems more natural in the context of optimization.
Remark 6.
It is clear that holds if . Thus, both conditions and are needed to guarantee for some i.
4.3. Property of the Geometric Mean
We can derive a new identity by combining the original and geometric compensation identity together. For , let be:
Then,
where (47) follows from the compensation identity (6). As we discussed in Section 4.1, the second term on the right-hand side is:
Finally, we have:
More interestingly, we can apply the original and the geometric compensation identities:
From Theorem 4, we have $D(P_{Y^n \mid X^n = x^n} \| P_{Y^n \mid X^n = \tilde{x}^n}) = D(P_{Y^n \mid X^n = \tilde{x}^n} \| P_{Y^n \mid X^n = x^n})$, and therefore, we can switch $A$ and $B$.
If we let , we have:
Note that this quantity is similar to the objective of a known clustering problem. In the clustering literature, the min-sum clustering problem [12] minimizes the sum of all pairwise distances within each cluster. Using the Hamming distance, we can describe the binary min-sum clustering problem on $\{0,1\}^n$ as minimizing the sum of pairwise Hamming distances within $A$ and within $A^c$.
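A short sketch of this objective (our own helper; the example partitions are ours) is given below; by Theorem 4, the analogous sum of pairwise KL divergences equals this quantity multiplied by $(1-2\alpha)\log\frac{1-\alpha}{\alpha}$.

```python
# The binary min-sum objective: the sum of pairwise Hamming distances inside each
# cluster of the partition {A, A^c}.
import itertools

def min_sum_cost(A, Ac):
    def within(S):
        return sum(sum(xi != yi for xi, yi in zip(x, y))
                   for x, y in itertools.combinations(S, 2))
    return within(A) + within(Ac)

n = 3
xs = list(itertools.product([0, 1], repeat=n))
dictator = [x for x in xs if x[0] == 0]
other = [x for x in xs if sum(x) <= 1]
print(min_sum_cost(dictator, [x for x in xs if x not in dictator]))   # 16
print(min_sum_cost(other, [x for x in xs if x not in other]))         # 18
```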
4.4. Another Application of the Geometric Mean
Using the geometric mean of measures, we can rewrite the clustering problem in a different form. Recall that . Then, we have:
Let be the geometric mean of measures in the set , i.e.,
where:
Then, we have:
Needless to say:
The sum of the results is:
This can be considered as a dual of Theorem 3.
Remark 7.
Let $A$ be a dictator set, which is the candidate optimizer. Then,
5. Concluding Remarks
In this paper, we have proposed a number of different formulations of the most informative Boolean function conjecture. Most of them are based on the information geometric approach. Furthermore, we focused on the (normalized) geometric mean of measures, which can simplify the problem formulation. More precisely, we showed that Conjecture 1 is true if and only if the maximum-achieving $f$ satisfies the following property: the arithmetic and geometric means of measures coincide for both sets appearing in Theorem 6.
Funding
This work was supported by the Hongik University new faculty research support fund.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Proof of Theorem 2
By the definition of mutual information, we have:
This concludes the proof.
Appendix B. Proof of Theorem 4
Without loss of generality, we can assume that and where . Then, we have:
Thus,
Since , this concludes the proof.
Appendix C. Proof of Theorem 5
The following lemma bounds the ratio between and , which will be crucial in our argument.
Lemma A1.
For ,
Proof.
Without loss of generality, we can assume that and .
Similarly, we can show that:
This concludes the proof of the lemma. ☐
Without loss of generality, we can assume that and . Then, we have:
where . Thus, we have:
Note that the two expectations in (A20) are under different distributions; on the other hand, the expectations in (A21) and the subsequent equations are under the same distribution.
Finally,
This concludes the proof.
Appendix D. Proof of Theorem 6
From the first assumption , we have:
Clearly, we can get the following result in a similar manner.
The ratio of those two is given by:
On the other hand, we can also marginalize :
Similarly, we have:
Thus, the ratio is given by:
Since , both ratios should be the same. Let , which implies . Then, we have:
If we let , then the above equation can be further simplified to:
Lemma A2.
For fixed , the only solutions of the above equation are .
Proof.
It is clear that comprise the solution of the following equation.
It is enough to show that:
can have up to three solutions. Consider the derivative ,
which is equivalent to:
It is a quadratic equation, and therefore, it can have up to two solutions. Thus, the equation above can have up to three solutions. ☐
This implies that or . It is clear that the above result holds for all i and , i.e., is either or , and is either or . These cardinalities should satisfy the following equations:
for all . Since -compressedness implies , we have . Thus, should be either or for all i. If for some i, then . Since is either or , but , we have . Thus, and , which implies .
On the other hand, assume that for all i. Since A is -compressed, implies . However, we have , and therefore:
and equivalently,
for all i. It can only be true when or , which contradicts our original assumption. This concludes the proof.
References
- Courtade, T.A.; Kumar, G.R. Which Boolean functions maximize mutual information on noisy inputs? IEEE Trans. Inf. Theory 2014, 60, 4515–4525. [Google Scholar] [CrossRef]
- Pichler, G.; Matz, G.; Piantanida, P. A tight upper bound on the mutual information of two Boolean functions. In Proceedings of the 2016 IEEE Information Theory Workshop (ITW), Cambridge, UK, 11–14 September 2016; pp. 16–20. [Google Scholar]
- Ordentlich, O.; Shayevitz, O.; Weinstein, O. An improved upper bound for the most informative Boolean function conjecture. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 500–504. [Google Scholar]
- Weinberger, N.; Shayevitz, O. On the optimal Boolean function for prediction under quadratic loss. IEEE Trans. Inf. Theory 2017, 63, 4202–4217. [Google Scholar] [CrossRef]
- Huleihel, W.; Ordentlich, O. How to quantize n outputs of a binary symmetric channel to n-1 bits? In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 91–95. [Google Scholar]
- Nazer, B.; Ordentlich, O.; Polyanskiy, Y. Information-distilling quantizers. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 96–100. [Google Scholar]
- Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
- Topsøe, F. An information theoretical identity and a problem involving capacity. Stud. Sci. Math. Hung. 1967, 2, 291–292. [Google Scholar]
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
- Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272. [Google Scholar] [CrossRef]
- Nielsen, F.; Boltz, S. The Burbea–Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
- Guttmann-Beck, N.; Hassin, R. Approximation algorithms for min-sum p-clustering. Discrete Appl. Math. 1998, 89, 125–142. [Google Scholar] [CrossRef]
© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).