Abstract
Dimension reduction is a technique used to transform data from a high-dimensional space into a lower-dimensional space, aiming to retain as much of the original information as possible. This approach is crucial in many disciplines, such as engineering, biology, astronomy, and economics. In this paper, we consider the following dimensionality reduction instance: Given an $n$-dimensional probability distribution $p$ and an integer $m < n$, we aim to find the $m$-dimensional probability distribution $q$ that is closest to $p$, using the Kullback–Leibler divergence as the measure of closeness. We prove that the problem is strongly NP-hard, and we present an approximation algorithm for it.
1. Introduction
Dimension reduction [1,2] is a methodology for mapping data from a high-dimensional space to a lower-dimensional space, while approximately preserving the original information content. This process is essential in fields such as engineering, biology, astronomy, and economics, where large datasets with high-dimensional points are common.
It is often the case that the computational complexity of the algorithms employed to extract relevant information from these datasets depends on the dimension of the space where the points lie. Therefore, it is important to find a representation of the data in a lower-dimensional space that still (approximately) preserves the information content of the original data, as per given criteria.
A special case of the general issue illustrated above arises when the elements of the dataset are n-dimensional probability distributions, and the problem is to approximate them by lower-dimensional ones. This question has been extensively studied in different contexts. In [3,4], the authors address the problem of dimensionality reduction on sets of probability distributions with the aim of preserving specific properties, such as pairwise distances. In [5], Gokhale considers the problem of finding the distribution that minimizes, subject to a set of linear constraints on the probabilities, the “discrimination information” with respect to a given probability distribution. Similarly, in [6], Globerson and Tishby address the dimensionality reduction problem by introducing a nonlinear method aimed at minimizing the loss of mutual information from the original data. In [7], Lewis explores dimensionality reduction for reducing storage requirements and proposes an approximation method based on the maximum entropy criterion. Likewise, in [8], Adler et al. apply dimensionality reduction to storage applications, focusing on the efficient representation of large-alphabet probability distributions. More closely related to the kind of dimensionality reduction we deal with in this paper are the works [9,10,11,12]. In [10,11], the authors address task scheduling problems in which the objective is to allocate the tasks of a project so as to maximize the likelihood of completing the project by the deadline. They formalize the problem in terms of the approximation of random variables, using the Kolmogorov distance as the measure of distance, and present an optimal algorithm for the problem. In contrast, in [12], Vidyasagar defines a metric distance between probability distributions on two distinct finite sets of possibly different cardinalities, based on the Minimum Entropy Coupling (MEC) problem. Informally, in the MEC, given two probability distributions p and q, one seeks a joint distribution that has p and q as its marginal distributions and, additionally, has minimum entropy. Unfortunately, computing the MEC is NP-hard, as shown in [13]. However, numerous works in the literature present efficient algorithms for computing couplings with entropy within a constant number of bits from the optimal value [14,15,16,17,18]. We note that computing the coupling of a pair of distributions can be seen as essentially the inverse of dimension reduction. Specifically, given two distributions p and q, one constructs a third, larger distribution r, such that p and q are derived from r or, more formally, are aggregations of r. In contrast, the dimension reduction problem addressed in this paper involves starting with a distribution p and creating another, smaller distribution that is derived from p or, more formally, is an aggregation of p.
Moreover, in [12], the author demonstrates that, according to the defined metric, any optimal reduced-order approximation must be an aggregation of the original distribution. Consequently, the author provides an approximation algorithm based on the total variation distance, using an approach similar to the one we will employ in Section 4. Similarly, in [9], Cicalese et al. examine dimensionality reduction using the same distance metric introduced in [12]. They propose a general criterion for approximating p with a shorter vector q, based on concepts from Majorization theory, and provide an approximation approach to solve the problem.
We also mention that analogous problems arise in scenario reduction [19], where the goal is to best approximate a given discrete distribution with another distribution having fewer atoms, in the compression of probability distributions [20], and elsewhere [21,22,23]. Moreover, we refer the reader to the survey [24] for further application examples.
In this paper, we study the following instantiation of the general problem described above: Given an $n$-dimensional probability distribution $p = (p_1, \ldots, p_n)$ and an integer $m < n$, find the $m$-dimensional probability distribution $q$ that is closest to $p$, where the measure of closeness is the well-known relative entropy [25] (also known as the Kullback–Leibler divergence). In Section 2, we formally state the problem. In Section 3, we prove that the problem is strongly NP-hard, and in Section 4, we provide an approximation algorithm returning a solution whose distance from $p$ is at most 1 plus the minimum possible distance.
2. Statement of the Problem and Mathematical Preliminaries
Let
$$\Delta_n = \left\{ p = (p_1, \ldots, p_n) : \ p_i \geq 0 \ \text{for each } i, \ \sum_{i=1}^{n} p_i = 1 \right\} \qquad (1)$$
be the $(n-1)$-dimensional probability simplex. Given two probability distributions $p \in \Delta_n$ and $q \in \Delta_m$, with $m < n$, we say that $q$ is an aggregation of $p$ if each component of $q$ can be expressed as the sum of distinct components of $p$. More formally, $q$ is an aggregation of $p$ if there exists a partition $\{S_1, \ldots, S_m\}$ of $[n] = \{1, \ldots, n\}$ such that $q_j = \sum_{i \in S_j} p_i$, for each $j \in [m]$. Notice that the aggregation operation corresponds to the following operation on random variables: Given a random variable $X$ that takes values in a finite set $\mathcal{X} = \{x_1, \ldots, x_n\}$, such that $p_i = \Pr\{X = x_i\}$ for $i \in [n]$, any function $f : \mathcal{X} \to \mathcal{Y}$, with $\mathcal{Y} = \{y_1, \ldots, y_m\}$ and $m < n$, induces a random variable $Y = f(X)$ whose probability distribution is an aggregation of $p$. Dimension reduction in random variables through the application of deterministic functions is a common technique in the area (e.g., [10,12,26]). Additionally, the problem arises also in the area of “hard clustering” [27], where one seeks a deterministic mapping $f$ from the data, generated by an r.v. $X$ taking values in a set $\mathcal{X}$, to “labels” in some set $\mathcal{Y}$, where typically $|\mathcal{Y}| \ll |\mathcal{X}|$.
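To make the aggregation operation concrete, the following small example (ours, for illustration; the distribution, the partition, and the map f are arbitrary choices) computes an aggregation both from a partition and from a deterministic function:

```python
from collections import defaultdict

# A distribution p on {0,...,5} and a partition S of the indices into m = 3 blocks.
p = [0.3, 0.25, 0.2, 0.1, 0.08, 0.07]
S = [{0, 3}, {1, 4}, {2, 5}]
q = [sum(p[i] for i in block) for block in S]
print(q)  # approx. [0.4, 0.33, 0.27]: an aggregation of p

# Equivalently, a deterministic map f: X -> Y induces the same aggregation.
f = {0: "a", 3: "a", 1: "b", 4: "b", 2: "c", 5: "c"}
dist = defaultdict(float)
for i, pi in enumerate(p):
    dist[f[i]] += pi
print(dict(dist))  # {'a': 0.4, 'b': 0.33, 'c': 0.27} (up to float rounding)
```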
For any probability distribution $p \in \Delta_n$ and an integer $m < n$, let us denote by $\mathcal{A}_m(p) \subseteq \Delta_m$ the set of all $q \in \Delta_m$ that are aggregations of $p$. Our goal is to solve the following optimization problem:
Problem 1.
Given $p \in \Delta_n$ and $m < n$, find $q^* \in \mathcal{A}_m(p)$ such that
$$D(q^* \| p) \ = \ \min_{q \in \mathcal{A}_m(p)} D(q \| p), \qquad (2)$$
where $D(\cdot\|\cdot)$ is the relative entropy [25], given by
$$D(q \| p) = \sum_{i=1}^{m} q_i \log \frac{q_i}{p_i}$$
(here, $q \in \Delta_m$ is regarded as an element of $\Delta_n$ by padding it with $n - m$ zero components, with the convention $0 \log 0 = 0$), and the logarithm is to base 2. Throughout the paper, we assume, without loss of generality, that the components of $p$ are ordered in a non-increasing fashion, that is, $p_1 \geq p_2 \geq \cdots \geq p_n$.
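Since the aggregations of $p$ range over all partitions of $[n]$ into $m$ blocks, Problem 1 can be solved exactly by exhaustive search when $n$ is tiny. The following sketch (ours, not from the paper; the helper names are hypothetical) does exactly that, adopting the convention of sorting each candidate $q$ non-increasingly before comparing it component-wise with the first $m$ components of $p$:

```python
from math import log2

def kl(q, p):
    """D(q||p) in bits; q (length m) is compared with the first m entries of p."""
    return sum(qj * log2(qj / pj) for qj, pj in zip(q, p) if qj > 0)

def partitions(items, m):
    """Yield all ways to split 'items' into exactly m blocks."""
    if not items:
        if m == 0:
            yield []
        return
    head, rest = items[0], items[1:]
    for blocks in partitions(rest, m):        # head joins an existing block
        for b in range(m):
            yield [blk | {head} if i == b else blk for i, blk in enumerate(blocks)]
    for blocks in partitions(rest, m - 1):    # head opens a new block
        yield blocks + [{head}]

def best_aggregation(p, m):
    """Exhaustive solution of Problem 1; feasible only for very small n."""
    best = None
    for blocks in partitions(list(range(len(p))), m):
        q = sorted((sum(p[i] for i in blk) for blk in blocks), reverse=True)
        d = kl(q, p)
        if best is None or d < best[0]:
            best = (d, q)
    return best

p = [0.4, 0.3, 0.15, 0.1, 0.05]     # sorted non-increasingly
print(best_aggregation(p, 2))       # minimal D(q||p) and the optimal q
```

The number of partitions grows as a Stirling number of the second kind, so this brute force serves only as a correctness baseline for what follows.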
An additional motivation to study Problem 1 comes from the fundamental paper [28], in which the principle of minimum relative entropy (called therein the minimum cross-entropy principle) is derived in an axiomatic manner. The principle states that, of all the distributions q that satisfy given constraints (in our case, that $q \in \mathcal{A}_m(p)$), one should choose the one with the least relative entropy “distance” from the prior p.
Before establishing the computational complexity of Problem 1, we present a simple lower bound on its optimal value.
Lemma 1.
For each $p \in \Delta_n$ and $m$, $m < n$, it holds that
$$\min_{q \in \mathcal{A}_m(p)} D(q\|p) \ \geq \ D(\hat{p}\|p) \ = \ -\log P_m(p), \qquad (3)$$
where
$$\hat{p} = \left(\frac{p_1}{P_m(p)}, \ldots, \frac{p_m}{P_m(p)}\right) \quad \text{and} \quad P_m(p) = \sum_{i=1}^{m} p_i. \qquad (4)$$
Proof.
Given an arbitrary $q \in \mathcal{A}_m(p)$, one can see that
$$D(q\|p) = \sum_{j=1}^{m} q_j \log\frac{q_j}{p_j} = -\sum_{j=1}^{m} q_j \log\frac{p_j}{q_j}.$$
Moreover, for any $p \in \Delta_n$ and $q \in \Delta_m$, the Jensen inequality applied to the (concave) log function gives the following:
$$\sum_{j=1}^{m} q_j \log\frac{p_j}{q_j} \ \leq \ \log\left(\sum_{j=1}^{m} q_j \, \frac{p_j}{q_j}\right) = \log P_m(p),$$
whence $D(q\|p) \geq -\log P_m(p) = D(\hat{p}\|p)$.
□
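For a quick numerical illustration (our example, not from the paper): take $p = (1/2, 1/4, 1/8, 1/8)$ and $m = 2$. Then $P_2(p) = 3/4$, and the bound of Lemma 1 reads
$$\min_{q \in \mathcal{A}_2(p)} D(q\|p) \ \geq \ -\log\frac{3}{4} = \log\frac{4}{3} \approx 0.415.$$
Enumerating all partitions of $\{1,2,3,4\}$ into two blocks shows that the best aggregation is $q = (5/8, 3/8)$, obtained from the blocks $\{1,3\}$ and $\{2,4\}$, with
$$D(q\|p) = \frac{5}{8}\log\frac{5/8}{1/2} + \frac{3}{8}\log\frac{3/8}{1/4} \approx 0.421.$$
The bound is attained exactly only when $\hat{p}$ itself is an aggregation of $p$, which is not the case here since $\hat{p} = (2/3, 1/3)$.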
3. Hardness
In this section, we prove that the optimization problem (2) stated in Section 2 is strongly NP-hard. We accomplish this by exhibiting a reduction from the 3-Partition problem, a well-known strongly NP-hard problem [29], described as follows.
3-Partition: Given a multiset $S = \{s_1, \ldots, s_{3m}\}$ of positive integers for which $\sum_{i=1}^{3m} s_i = mT$, for some $T$, and such that $T/4 < s_i < T/2$ for each $i$, the problem is to decide whether $S$ can be partitioned into $m$ triplets such that the sum of each triplet is exactly $T$. More formally, the problem is to decide whether there exist pairwise disjoint subsets $G_1, \ldots, G_m \subseteq S$ such that the following conditions hold:
$$\bigcup_{i=1}^{m} G_i = S, \qquad |G_i| = 3 \quad \text{and} \quad \sum_{s \in G_i} s = T, \quad \text{for each } i \in [m].$$
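For intuition, here is a brute-force decision procedure for 3-Partition (ours; exponential time and usable only on toy instances, which is consistent with the fact that no polynomial-time algorithm is expected for this problem):

```python
from itertools import combinations

def three_partition(S, T):
    """Decide 3-Partition exhaustively: split S (with |S| = 3m and sum mT)
    into triplets, each summing exactly to T. Demo purposes only."""
    S = list(S)
    if not S:
        return True
    for i, j in combinations(range(1, len(S)), 2):
        if S[0] + S[i] + S[j] == T:
            rest = [S[k] for k in range(1, len(S)) if k not in (i, j)]
            if three_partition(rest, T):
                return True
    return False

print(three_partition([5, 5, 6, 4, 4, 6], 15))   # True:  {5, 4, 6} and {5, 4, 6}
print(three_partition([9, 9, 9, 7, 7, 7], 24))   # False: no triplet sums to 24
```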
Theorem 1.
The 3-Partition problem can be reduced in polynomial time to the problem of finding an aggregation $q \in \mathcal{A}_m(p)$ of some $p \in \Delta_n$ for which
$$D(q\|p) = -\log P_m(p).$$
Proof.
The idea behind the following reduction can be summarized as follows: given an instance of 3-Partition, we transform it into a probability distribution $p$ such that the lower-bound distribution $\hat{p}$ of Lemma 1 is an aggregation of $p$ if and only if the original instance of 3-Partition admits a solution. Let an arbitrary instance of 3-Partition be given; that is, let $S = \{s_1, \ldots, s_{3m}\}$ be a multiset of positive integers with $\sum_{i=1}^{3m} s_i = mT$. Without loss of generality, we assume that the integers are ordered in a non-increasing fashion. We construct a valid instance $p \in \Delta_n$ of our Problem 1, with $n = 4m$, as follows:
Note that $p$ is a probability distribution. In fact, since $\sum_{i=1}^{3m} s_i = mT$, we have
Moreover, from (4) and (5), the probability distribution $\hat{p}$ associated with $p$ is as follows:
To prove the theorem, we show that the starting instance of 3-Partition is a Yes instance if and only if it holds that
$$\min_{q \in \mathcal{A}_m(p)} D(q\|p) = -\log P_m(p), \qquad (7)$$
where p is given in (5).
We begin by assuming that the given instance of 3-Partition is a Yes instance; that is, there is a partition of $S$ into triplets $G_1, \ldots, G_m$ such that
$$\sum_{s \in G_i} s = T, \quad \text{for each } i \in [m], \qquad (8)$$
and we show that (7) holds. By Lemma 1, (5), and equality (6), we have
From (8), we have
Let us define $q = (q_1, \ldots, q_m) \in \Delta_m$ as follows:
where, by (10),
From (12) and from the fact that $G_1, \ldots, G_m$ are a partition of $S$, we obtain $\sum_{j=1}^{m} q_j = 1$; that is, $q$ is a valid aggregation of $p$ (cf. (5)). Moreover,
and thus $q = \hat{p}$. Therefore, by (9) and the fact that $q \in \mathcal{A}_m(p)$, we obtain
as required.
Conversely, assume now that (7) holds. We show that the original instance of 3-Partition is then also a Yes instance; that is, there is a partition of $S$ into triplets $G_1, \ldots, G_m$ such that
$$\sum_{s \in G_i} s = T, \quad \text{for each } i \in [m].$$
Let $\bar{q}$ be the element in $\mathcal{A}_m(p)$ that achieves the minimum in (13). Consequently, we have
where $H(\bar{q}) = -\sum_{j=1}^{m} \bar{q}_j \log \bar{q}_j$ is the Shannon entropy of $\bar{q}$. From (15), we obtain that $H(\bar{q}) = \log m$; hence, $\bar{q}$ is the uniform distribution on $[m]$ (see [30], Thm. 2.6.4). Recalling that $\bar{q} \in \mathcal{A}_m(p)$, we obtain that the uniform distribution
$$u = \left(\frac{1}{m}, \ldots, \frac{1}{m}\right) \qquad (16)$$
is an aggregation of $p$. We note that the first $m$ components of $p$, as defined in (5), cannot be aggregated among themselves to obtain (16), because $p_i + p_k > 1/m$, for all distinct $i, k \in [m]$. Therefore, in order to obtain (16) as an aggregation of $p$, there must exist a partition $\{G_1, \ldots, G_m\}$ of $\{m+1, \ldots, 4m\}$ for which
From (17), we obtain
From this, it follows that
We note that, for (19) to be true, there cannot exist any $G_i$ for which $|G_i| \neq 3$. Indeed, if there were a subset $G_i$ for which $|G_i| < 3$, there would be at least a subset $G_k$ for which $|G_k| > 3$. Thus, for such a $G_k$, since each integer in $S$ is larger than $T/4$, we would have
contradicting (19). Therefore, it holds that
Moreover, from (19) and (20), we obtain
Thus, from (21), it follows that the subsets $G_1, \ldots, G_m$ give a partition of $S$ into triplets such that
$$\sum_{s \in G_i} s = T, \quad \text{for each } i \in [m].$$
Therefore, the starting instance of 3-Partition is a Yes instance. □
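To make the shape of the reduction tangible, the following sketch (ours) instantiates it with one admissible choice of constants: the first $m$ components share total mass $2/3$, and each integer $s_j$ is encoded as $s_j/(3mT)$. These constants are purely illustrative and are not necessarily those used in (5). With this choice, $P_m(p) = 2/3$, so $\hat{p}$ is the uniform distribution $(1/m, \ldots, 1/m)$, and the uniform distribution is an aggregation of $p$ exactly when the 3-Partition instance is solvable:

```python
from fractions import Fraction
from itertools import combinations

def reduction(S, m, T):
    """Map a 3-Partition instance to p (illustrative constants, mass 2/3 + 1/3)."""
    big = [Fraction(2, 3 * m)] * m                 # m components of equal mass
    small = [Fraction(s, 3 * m * T) for s in S]    # one component per integer
    return big + small                             # total mass 2/3 + 1/3 = 1

def uniform_is_aggregation(p, m):
    """Check whether (1/m,...,1/m) is an aggregation of p (exponential search)."""
    target = Fraction(1, m)
    def solve(items):
        if not items:
            return True
        first, rest = items[0], items[1:]
        for k in range(len(rest) + 1):             # try every completion of 'first'
            for combo in combinations(range(len(rest)), k):
                if first + sum(rest[i] for i in combo) == target:
                    remaining = [rest[i] for i in range(len(rest)) if i not in combo]
                    if solve(remaining):
                        return True
        return False
    return solve(sorted(p, reverse=True))

print(uniform_is_aggregation(reduction([5, 5, 6, 4, 4, 6], 2, 15), 2))   # True
print(uniform_is_aggregation(reduction([9, 9, 9, 7, 7, 7], 2, 24), 2))   # False
```

Exact rational arithmetic (Fraction) is used so that the aggregation test is a true equality check rather than a floating-point comparison.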
4. Approximation
Given $p \in \Delta_n$ and $m < n$, let $\mathrm{opt}(p)$ denote the optimal value of the optimization problem (2); that is,
$$\mathrm{opt}(p) = \min_{q \in \mathcal{A}_m(p)} D(q\|p).$$
In this section, we design a greedy algorithm to compute an aggregation $\tilde{q} \in \mathcal{A}_m(p)$ of $p$ such that
$$D(\tilde{q}\|p) \ \leq \ \mathrm{opt}(p) + 1. \qquad (23)$$
The idea behind our algorithm is to view the problem of computing an aggregation as a bin packing problem with “overstuffing” (see [31] and the references quoted therein), that is, a bin packing in which bins may be overfilled. In the classical bin packing problem, one is given a set of items with associated weights, and a set of bins with associated capacities (usually equal for all bins). The objective is to place all the items into the bins while minimizing a given cost function.
In our case, we have $n$ items (corresponding to the components of $p$) with weights $p_1, \ldots, p_n$, respectively, and $m$ bins, corresponding to the components of $\hat{p}$ (as defined in (4)), with capacities $c_j = \hat{p}_j = p_j / P_m(p)$, for $j \in [m]$. Our objective is to place all the $n$ components of $p$ into the $m$ bins without exceeding the capacity of each bin $j$, $c_j$, by more than $c_j$ itself. For such a purpose, the idea behind Algorithm 1 is quite straightforward. It behaves like a classical First-Fit bin packing: to place the $i$th item, it chooses the first bin $j$ whose current content does not yet exceed its capacity $c_j$. In the following, we will show that such a bin always exists and that fulfilling this objective is sufficient to ensure the approximation guarantee (23) we are seeking.
| Algorithm 1: GreedyApprox |
| 1. Compute $c_j = p_j / P_m(p)$, for each $j \in [m]$; 2. Let $L_j^i$ be the content of bin $j$ after the first $i$ components of $p$ have been placed ($L_j^0 = 0$ for each $j \in [m]$); 3. For $i = 1, \ldots, n$: let $j$ be the smallest bin index for which it holds that $L_j^{i-1} \leq c_j$, and place $p_i$ into the $j$-th bin: $L_j^i = L_j^{i-1} + p_i$ and $L_k^i = L_k^{i-1}$ for each $k \neq j$; 4. Output $\tilde{q} = (L_1^n, \ldots, L_m^n)$. |
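In Python, a direct $O(nm)$ transcription of Algorithm 1 reads as follows (our sketch; as in step 3, a bin $j$ is considered available as long as its content does not exceed $c_j$):

```python
from math import log2

def greedy_approx(p, m):
    """GreedyApprox sketch: First-Fit with overstuffing. Assumes p is sorted
    non-increasingly; bin j has capacity c_j = p_j / P_m, and each item goes
    into the smallest-index bin whose content does not exceed its capacity."""
    P_m = sum(p[:m])
    cap = [pj / P_m for pj in p[:m]]
    load = [0.0] * m
    for pi in p:
        j = next(j for j in range(m) if load[j] <= cap[j])  # always exists
        load[j] += pi            # may overshoot c_j, but never beyond 2 * c_j
    return load

def kl(q, p):
    """D(q||p) in bits, with q padded with zeros up to the length of p."""
    return sum(qj * log2(qj / pj) for qj, pj in zip(q, p) if qj > 0)

p = [0.3, 0.25, 0.2, 0.1, 0.08, 0.07]    # sorted non-increasingly
q = greedy_approx(p, 3)
print(q)                                  # approx. [0.55, 0.38, 0.07]
print(kl(q, p), 1 - log2(sum(p[:3])))     # D(q||p) and the bound 1 - log P_m
```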
Step 3 of GreedyApprox operates as in the classical First-Fit bin packing algorithm. Therefore, it can be implemented to run in $O(n \log m)$ time, as discussed in [32]. In fact, each iteration of the loop in step 3 can be implemented in $O(\log m)$ time by using a balanced binary search tree of height $O(\log m)$ that has a leaf for each bin and in which each internal node keeps track of the largest remaining capacity of all the bins in its subtree.
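The following sketch (ours) realizes the same placement rule with an array-based tournament tree over the bins' residual capacities, a simple stand-in for the balanced search tree described in [32]; each placement costs $O(\log m)$:

```python
def greedy_approx_fast(p, m):
    """Same placement rule as GreedyApprox, in O(n log m) time: tree[v] holds
    the maximum residual capacity c_j - L_j over the bins in v's subtree, so
    the smallest-index available bin is found by descending from the root."""
    P_m = sum(p[:m])
    cap = [pj / P_m for pj in p[:m]]
    size = 1
    while size < m:
        size *= 2
    tree = [float("-inf")] * (2 * size)       # padded leaves stay unavailable
    for j in range(m):
        tree[size + j] = cap[j]               # residual of an empty bin is c_j
    for v in range(size - 1, 0, -1):
        tree[v] = max(tree[2 * v], tree[2 * v + 1])

    load = [0.0] * m
    for pi in p:
        assert tree[1] >= 0                   # some bin is available (Lemma 2)
        v = 1
        while v < size:                       # go left whenever possible
            v = 2 * v if tree[2 * v] >= 0 else 2 * v + 1
        j = v - size
        load[j] += pi
        tree[v] = cap[j] - load[j]            # turns negative once j overflows
        while v > 1:                          # propagate the update upwards
            v //= 2
            tree[v] = max(tree[2 * v], tree[2 * v + 1])
    return load
```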
Lemma 2.
GreedyApprox computes a valid aggregation $\tilde{q} \in \mathcal{A}_m(p)$ of $p$. Moreover, it holds that
$$D(\tilde{q} \| \hat{p}) \leq 1.$$
Proof.
We first prove that each component of $p$ is placed in some bin. This implies that the output $\tilde{q}$ is indeed an aggregation of $p$, since then $\sum_{j=1}^{m} \tilde{q}_j = \sum_{i=1}^{n} p_i = 1$.
For each step $i$, with $1 \leq i \leq m$, there is always a bin in which the algorithm places $p_i$. In fact, the capacity of each bin $j$ satisfies the relation
$$c_j = \frac{p_j}{P_m(p)} \geq p_j \geq p_i, \quad \text{for } j \leq i,$$
since $P_m(p) \leq 1$ and the components of $p$ are ordered in a non-increasing fashion.
Let us now consider an arbitrary step $i$, with $m < i \leq n$, in which the algorithm has placed the first $i-1$ components of $p$ and needs to place $p_i$ into some bin. We show that, in this case also, there is always a bin $j$ in which the algorithm places the item $p_i$, without exceeding the capacity of bin $j$ by more than $c_j$.
First, notice that in each step $i$, with $m < i \leq n$, there is at least a bin $k$ whose content does not exceed its capacity $c_k$, that is, for which $L_k^{i-1} \leq c_k$ holds. Were this not the case, for all bins $j$, we would have $L_j^{i-1} > c_j$; then, we would also have
$$\sum_{j=1}^{m} L_j^{i-1} \ > \ \sum_{j=1}^{m} c_j = \sum_{j=1}^{m} \frac{p_j}{P_m(p)} = 1. \qquad (25)$$
However, this is not possible since we have placed only the first $i-1$ components of $p$, and therefore, it holds that
$$\sum_{j=1}^{m} L_j^{i-1} = \sum_{t=1}^{i-1} p_t \ \leq \ 1,$$
contradicting (25). Consequently, let $k$ be the smallest integer for which the content of the $k$-th bin does not exceed its capacity, i.e., for which $L_k^{i-1} \leq c_k$ holds. For such a bin $k$, we obtain
$$L_k^{i-1} + p_i \ \leq \ c_k + p_i \ \leq \ 2 c_k, \qquad (26)$$
where the last inequality follows since $p_i \leq p_k \leq c_k$ (recall that $i > m \geq k$ and that the components of $p$ are ordered in a non-increasing fashion).
Thus, from (26), one derives that the algorithm places $p_i$ into the bin $k$ without exceeding its capacity by more than $c_k$.
The reasoning applies to each $i$, with $m < i \leq n$, thus proving that GreedyApprox correctly assigns each component of $p$ to a bin, effectively computing an aggregation of $p$. Moreover, from the instructions of step 3 of GreedyApprox, the output is an aggregation $\tilde{q} = (L_1^n, \ldots, L_m^n)$, for which the following crucial relation holds:
$$\tilde{q}_j \ \leq \ 2 c_j, \quad \text{for each } j \in [m].$$
Let us now prove that $D(\tilde{q}\|\hat{p}) \leq 1$. We have
$$D(\tilde{q}\|\hat{p}) = \sum_{j=1}^{m} \tilde{q}_j \log\frac{\tilde{q}_j}{c_j} \ \leq \ \sum_{j=1}^{m} \tilde{q}_j \log\frac{2 c_j}{c_j} = \sum_{j=1}^{m} \tilde{q}_j = 1.$$
□
We need the following technical lemma to show the approximation guarantee of GreedyApprox.
Lemma 3.
Let $p \in \Delta_n$ and $q \in \Delta_m$ be two arbitrary probability distributions with $m < n$. It holds that
$$D(q\|p) = D(q\|\hat{p}) - \log P_m(p),$$
where $\hat{p}$ is defined as in (4).
Proof.
We have
$$D(q\|p) = \sum_{j=1}^{m} q_j \log\frac{q_j}{p_j} = \sum_{j=1}^{m} q_j \log\left(\frac{q_j}{\hat{p}_j} \cdot \frac{1}{P_m(p)}\right) = D(q\|\hat{p}) - \log P_m(p),$$
where we used the fact that $p_j = \hat{p}_j \, P_m(p)$ for each $j \in [m]$.
□
The following theorem is the main result of this section.
Theorem 2.
For any $p \in \Delta_n$ and $m < n$, GreedyApprox produces an aggregation $\tilde{q} \in \mathcal{A}_m(p)$ of $p$ such that
$$D(\tilde{q}\|p) \ \leq \ \mathrm{opt}(p) + 1,$$
where
$$\mathrm{opt}(p) = \min_{q \in \mathcal{A}_m(p)} D(q\|p).$$
Proof.
From Lemma 3, we have
$$D(\tilde{q}\|p) = D(\tilde{q}\|\hat{p}) - \log P_m(p),$$
and from Lemma 2, we know that the produced aggregation $\tilde{q}$ of $p$ satisfies the relation
$$D(\tilde{q}\|\hat{p}) \leq 1.$$
Putting it all together, and recalling the lower bound of Lemma 1, we obtain:
$$D(\tilde{q}\|p) \ \leq \ 1 - \log P_m(p) \ \leq \ \mathrm{opt}(p) + 1.$$
□
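Both ingredients of the proof, namely the crucial relation $\tilde{q}_j \leq 2c_j$ and the bound $D(\tilde{q}\|p) \leq 1 - \log P_m(p) \leq \mathrm{opt}(p) + 1$, are easy to spot-check numerically. The following self-contained harness (ours, assuming the First-Fit rule of Algorithm 1) verifies them on random sorted instances:

```python
import random
from math import log2

def greedy_approx(p, m):
    P_m = sum(p[:m])
    cap = [pj / P_m for pj in p[:m]]
    load = [0.0] * m
    for pi in p:
        j = next(j for j in range(m) if load[j] <= cap[j])
        load[j] += pi
    return load, cap

def kl(q, p):
    return sum(qj * log2(qj / pj) for qj, pj in zip(q, p) if qj > 0)

random.seed(0)
for _ in range(1000):
    n = random.randint(3, 12)
    m = random.randint(2, n - 1)
    w = sorted((random.random() for _ in range(n)), reverse=True)
    p = [x / sum(w) for x in w]
    q, cap = greedy_approx(p, m)
    assert all(qj <= 2 * cj + 1e-9 for qj, cj in zip(q, cap))  # crucial relation
    assert kl(q, p) <= 1 - log2(sum(p[:m])) + 1e-9             # bound 1 - log P_m
print("all checks passed")
```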
5. Concluding Remarks
In this paper, we examined the problem of approximating $n$-dimensional probability distributions with $m$-dimensional ones, using the Kullback–Leibler divergence as the measure of closeness. We demonstrated that this problem is strongly NP-hard, and we introduced an approximation algorithm that solves it with an additive error of at most one bit.
We conclude by pointing out that the analysis of GreedyApprox presented in Theorem 2 is tight. Let $p$ be
where . The application of GreedyApprox to $p$ produces the aggregation $\tilde{q}$ given by
whereas one can see that the optimal aggregation $q^*$ is equal to
Hence, in this case, we have
while
Therefore, to improve on our approximation guarantee, one should employ a bin packing heuristic different from the First-Fit strategy used in GreedyApprox. Another interesting open problem is to provide an approximation algorithm with a (small) multiplicative approximation guarantee. Both questions would likely require a different approach, and we leave them to future investigations.
Another interesting line of research would be to extend our findings to different divergence measures (e.g., [33] and references quoted therein).
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Acknowledgments
The author wants to express his gratitude to Ugo Vaccaro for guidance throughout this research, to the anonymous referees, and to the Academic Editor for many useful suggestions that have improved the presentation of the paper.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Burges, C.J. Dimension reduction: A guided tour. Found. Trends Mach. Learn. 2010, 2, 275–365.
- Sorzano, C.O.S.; Vargas, J.; Montano, A.P. A survey of dimensionality reduction techniques. arXiv 2014, arXiv:1403.2877.
- Abdullah, A.; Kumar, R.; McGregor, A.; Vassilvitskii, S.; Venkatasubramanian, S. Sketching, embedding, and dimensionality reduction for information spaces. Artif. Intell. Stat. (PMLR) 2016, 51, 948–956.
- Carter, K.M.; Raich, R.; Finn, W.G.; Hero, A.O., III. Information-geometric dimensionality reduction. IEEE Signal Process. Mag. 2011, 28, 89–99.
- Gokhale, D.V. Approximating discrete distributions, with applications. J. Am. Stat. Assoc. 1973, 68, 1009–1012.
- Globerson, A.; Tishby, N. Sufficient dimensionality reduction. J. Mach. Learn. Res. 2003, 3, 1307–1331.
- Lewis, P.M., II. Approximating probability distributions to reduce storage requirements. Inf. Control 1959, 2, 214–225.
- Adler, A.; Tang, J.; Polyanskiy, Y. Efficient representation of large-alphabet probability distributions. IEEE J. Sel. Areas Inf. Theory 2022, 3, 651–663.
- Cicalese, F.; Gargano, L.; Vaccaro, U. Approximating probability distributions with short vectors, via information theoretic distance measures. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1138–1142.
- Cohen, L.; Grinshpoun, T.; Weiss, G. Efficient optimal Kolmogorov approximation of random variables. Artif. Intell. 2024, 329, 104086.
- Cohen, L.; Weiss, G. Efficient optimal approximation of discrete random variables for estimation of probabilities of missing deadlines. Proc. AAAI Conf. Artif. Intell. 2019, 33, 7809–7815.
- Vidyasagar, M. A metric between probability distributions on finite sets of different cardinalities and applications to order reduction. IEEE Trans. Autom. Control 2012, 57, 2464–2477.
- Kovačević, M.; Stanojević, I.; Šenk, V. On the entropy of couplings. Inf. Comput. 2015, 242, 369–382.
- Cicalese, F.; Gargano, L.; Vaccaro, U. Minimum-entropy couplings and their applications. IEEE Trans. Inf. Theory 2019, 65, 3436–3451.
- Compton, S. A tighter approximation guarantee for greedy minimum entropy coupling. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022; pp. 168–173.
- Compton, S.; Katz, D.; Qi, B.; Greenewald, K.; Kocaoglu, M. Minimum-entropy coupling approximation guarantees beyond the majorization barrier. Int. Conf. Artif. Intell. Stat. 2023, 206, 10445–10469.
- Li, C. Efficient approximate minimum entropy coupling of multiple probability distributions. IEEE Trans. Inf. Theory 2021, 67, 5259–5268.
- Sokota, S.; Sam, D.; Witt, C.; Compton, S.; Foerster, J.; Kolter, J. Computing low-entropy couplings for large-support distributions. arXiv 2024, arXiv:2405.19540.
- Rujeerapaiboon, N.; Schindler, K.; Kuhn, D.; Wiesemann, W. Scenario reduction revisited: Fundamental limits and guarantees. Math. Program. 2018, 191, 207–242.
- Gagie, T. Compressing probability distributions. Inf. Process. Lett. 2006, 97, 133–137.
- Cohen, L.; Fried, D.; Weiss, G. An optimal approximation of discrete random variables with respect to the Kolmogorov distance. arXiv 2018, arXiv:1805.07535.
- Pavlikov, K.; Uryasev, S. CVaR distance between univariate probability distributions and approximation problems. Ann. Oper. Res. 2018, 262, 67–88.
- Pflug, G.C.; Pichler, A. Approximations for probability distributions and stochastic optimization problems. In Stochastic Optimization Methods in Finance and Energy: New Financial Products and Energy Market Strategies; Springer: Berlin/Heidelberg, Germany, 2011; pp. 343–387.
- Melucci, M. A brief survey on probability distribution approximation. Comput. Sci. Rev. 2019, 33, 91–97.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Lamarche-Perrin, R.; Demazeau, Y.; Vincent, J.M. The best-partitions problem: How to build meaningful aggregations. In Proceedings of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Atlanta, GA, USA, 17–20 November 2013; pp. 399–404.
- Kearns, M.; Mansour, Y.; Ng, A.Y. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models; Springer: Dordrecht, The Netherlands, 1998; pp. 495–520.
- Shore, J.; Johnson, R. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37.
- Garey, M.; Johnson, D. Strong NP-completeness results: Motivation, examples, and implications. J. ACM 1978, 25, 499–508.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
- Dell’Olmo, P.; Kellerer, H.; Speranza, M.; Tuza, Z. A 13/12 approximation algorithm for bin packing with extendable bins. Inf. Process. Lett. 1998, 65, 229–233.
- Coffman, E.G.; Garey, M.R.; Johnson, D.S. Approximation algorithms for bin packing: A survey. In Approximation Algorithms for NP-Hard Problems; Hochbaum, D., Ed.; PWS Publishing Co.: Boston, MA, USA, 1996; pp. 46–93.
- Sason, I. Divergence measures: Mathematical foundations and applications in information-theoretic and statistical problems. Entropy 2022, 24, 712.
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).