Abstract
Dimension reduction is a technique used to transform data from a high-dimensional space into a lower-dimensional space, aiming to retain as much of the original information as possible. This approach is crucial in many disciplines, such as engineering, biology, astronomy, and economics. In this paper, we consider the following dimensionality reduction instance: Given an $n$-dimensional probability distribution $p$ and an integer $m < n$, we aim to find the $m$-dimensional probability distribution $q$ that is closest to $p$, using the Kullback–Leibler divergence as the measure of closeness. We prove that the problem is strongly NP-hard, and we present an approximation algorithm for it.
1. Introduction
Dimension reduction [1,2] is a methodology for mapping data from a high-dimensional space to a lower-dimensional space, while approximately preserving the original information content. This process is essential in fields such as engineering, biology, astronomy, and economics, where large datasets with high-dimensional points are common.
It is often the case that the computational complexity of the algorithms employed to extract relevant information from these datasets depends on the dimension of the space where the points lie. Therefore, it is important to find a representation of the data in a lower-dimensional space that still (approximately) preserves the information content of the original data, as per given criteria.
A special case of the general issue illustrated above arises when the elements of the dataset are n-dimensional probability distributions, and the problem is to approximate them by lower-dimensional ones. This question has been extensively studied in different contexts. In [3,4], the authors address the problem of dimensionality reduction on sets of probability distributions with the aim of preserving specific properties, such as pairwise distances. In [5], Gokhale considers the problem of finding the distribution that minimizes, subject to a set of linear constraints on the probabilities, the “discrimination information” with respect to a given probability distribution. Similarly, in [6], Globerson and Tishby address the dimensionality reduction problem by introducing a nonlinear method aimed at minimizing the loss of mutual information from the original data. In [7], Lewis explores dimensionality reduction for reducing storage requirements and proposes an approximation method based on the maximum entropy criterion. Likewise, in [8], Adler et al. apply dimensionality reduction to storage applications, focusing on the efficient representation of large-alphabet probability distributions. More closely related to the kind of dimensionality reduction we deal with in this paper are the works [9,10,11,12]. In [10,11], the authors address task scheduling problems in which the objective is to allocate the tasks of a project so as to maximize the likelihood of completing the project by the deadline. They formalize the problem in terms of the approximation of random variables, using the Kolmogorov distance as the measure of distance, and present an optimal algorithm for the problem. In contrast, in [12], Vidyasagar defines a metric distance between probability distributions on two distinct finite sets of possibly different cardinalities, based on the Minimum Entropy Coupling (MEC) problem. Informally, in the MEC, given two probability distributions p and q, one seeks a joint distribution that has p and q as its marginal distributions and, additionally, has minimum entropy. Unfortunately, computing the MEC is NP-hard, as shown in [13]. However, numerous works in the literature present efficient algorithms for computing couplings with entropy within a constant number of bits from the optimal value [14,15,16,17,18]. We note that computing the coupling of a pair of distributions can be seen as essentially the inverse of dimension reduction. Specifically, given two distributions p and q, one constructs a third, larger distribution r, such that p and q are derived from r or, more formally, are aggregations of r. In contrast, the dimension reduction problem addressed in this paper involves starting with a distribution p and creating another, smaller distribution that is derived from p or, more formally, is an aggregation of p.
Moreover, in [12], the author demonstrates that, according to the defined metric, any optimal reduced-order approximation must be an aggregation of the original distribution. Consequently, the author provides an approximation algorithm based on the total variation distance, using an approach similar to the one we will employ in Section 4. Similarly, in [9], Cicalese et al. examine dimensionality reduction using the same distance metric introduced in [12]. They propose a general criterion for approximating p with a shorter vector q, based on concepts from Majorization theory, and provide an approximation approach to solve the problem.
We also mention that analogous problems arise in scenario reduction [19], where the goal is to best approximate a given discrete distribution with another distribution having fewer atoms, in the compression of probability distributions [20], and elsewhere [21,22,23]. Moreover, we refer the reader to the survey [24] for further application examples.
In this paper, we study the following instantiation of the general problem described above: Given an $n$-dimensional probability distribution $p = (p_1, \ldots, p_n)$ and an integer $m < n$, find the $m$-dimensional probability distribution $q$ that is closest to $p$, where the measure of closeness is the well-known relative entropy [25] (also known as the Kullback–Leibler divergence). In Section 2, we formally state the problem. In Section 3, we prove that the problem is strongly NP-hard, and in Section 4, we provide an approximation algorithm returning a solution whose distance from $p$ is at most 1 plus the minimum possible distance.
2. Statement of the Problem and Mathematical Preliminaries
Let
$$\Delta_n = \left\{ p = (p_1, \ldots, p_n) : \ p_i \geq 0 \ \text{for each } i, \ \sum_{i=1}^{n} p_i = 1 \right\} \qquad (1)$$
be the $(n-1)$-dimensional probability simplex. Given two probability distributions $p \in \Delta_n$ and $q \in \Delta_m$, with $m < n$, we say that $q$ is an aggregation of $p$ if each component of $q$ can be expressed as the sum of distinct components of $p$. More formally, $q$ is an aggregation of $p$ if there exists a partition $\{S_1, \ldots, S_m\}$ of $[n] = \{1, \ldots, n\}$ such that $q_j = \sum_{i \in S_j} p_i$, for each $j \in [m]$. Notice that the aggregation operation corresponds to the following operation on random variables: Given a random variable $X$ that takes values in a finite set $\mathcal{X} = \{x_1, \ldots, x_n\}$, such that $p_i = \Pr\{X = x_i\}$ for $i \in [n]$, any function $f : \mathcal{X} \to \mathcal{Y}$, with $\mathcal{Y} = \{y_1, \ldots, y_m\}$ and $m < n$, induces a random variable $Y = f(X)$ whose probability distribution is an aggregation of $p$. Dimension reduction in random variables through the application of deterministic functions is a common technique in the area (e.g., [10,12,26]). Additionally, the problem arises also in the area of “hard clustering” [27], where one seeks a deterministic mapping $f$ from the data, generated by an r.v. $X$ taking values in a set $\mathcal{X}$, to “labels” in some set $\mathcal{Y}$, where typically $|\mathcal{Y}| \ll |\mathcal{X}|$.
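To make the aggregation operation concrete, the following small example (ours, for illustration; the distribution, the partition, and the map f are arbitrary choices) computes an aggregation both from a partition and from a deterministic function:

```python
from collections import defaultdict

# A distribution p on {0,...,5} and a partition S of the indices into m = 3 blocks.
p = [0.3, 0.25, 0.2, 0.1, 0.08, 0.07]
S = [{0, 3}, {1, 4}, {2, 5}]
q = [sum(p[i] for i in block) for block in S]
print(q)  # approx. [0.4, 0.33, 0.27]: an aggregation of p

# Equivalently, a deterministic map f: X -> Y induces the same aggregation.
f = {0: "a", 3: "a", 1: "b", 4: "b", 2: "c", 5: "c"}
dist = defaultdict(float)
for i, pi in enumerate(p):
    dist[f[i]] += pi
print(dict(dist))  # {'a': 0.4, 'b': 0.33, 'c': 0.27} (up to float rounding)
```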
For any probability distribution $p \in \Delta_n$ and an integer $m < n$, let us denote by $\mathcal{A}_m(p) \subseteq \Delta_m$ the set of all $q \in \Delta_m$ that are aggregations of $p$. Our goal is to solve the following optimization problem:
Problem 1.
Given $p \in \Delta_n$ and $m < n$, find $q^* \in \mathcal{A}_m(p)$ such that
$$D(q^* \| p) \ = \ \min_{q \in \mathcal{A}_m(p)} D(q \| p), \qquad (2)$$
where $D(\cdot\|\cdot)$ is the relative entropy [25], given by
$$D(q \| p) = \sum_{i=1}^{m} q_i \log \frac{q_i}{p_i}$$
(here, $q \in \Delta_m$ is regarded as an element of $\Delta_n$ by padding it with $n - m$ zero components, with the convention $0 \log 0 = 0$), and the logarithm is to base 2. Throughout the paper, we assume, without loss of generality, that the components of $p$ are ordered in a non-increasing fashion, that is, $p_1 \geq p_2 \geq \cdots \geq p_n$.
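Since the aggregations of $p$ range over all partitions of $[n]$ into $m$ blocks, Problem 1 can be solved exactly by exhaustive search when $n$ is tiny. The following sketch (ours, not from the paper; the helper names are hypothetical) does exactly that, adopting the convention of sorting each candidate $q$ non-increasingly before comparing it component-wise with the first $m$ components of $p$:

```python
from math import log2

def kl(q, p):
    """D(q||p) in bits; q (length m) is compared with the first m entries of p."""
    return sum(qj * log2(qj / pj) for qj, pj in zip(q, p) if qj > 0)

def partitions(items, m):
    """Yield all ways to split 'items' into exactly m blocks."""
    if not items:
        if m == 0:
            yield []
        return
    head, rest = items[0], items[1:]
    for blocks in partitions(rest, m):        # head joins an existing block
        for b in range(m):
            yield [blk | {head} if i == b else blk for i, blk in enumerate(blocks)]
    for blocks in partitions(rest, m - 1):    # head opens a new block
        yield blocks + [{head}]

def best_aggregation(p, m):
    """Exhaustive solution of Problem 1; feasible only for very small n."""
    best = None
    for blocks in partitions(list(range(len(p))), m):
        q = sorted((sum(p[i] for i in blk) for blk in blocks), reverse=True)
        d = kl(q, p)
        if best is None or d < best[0]:
            best = (d, q)
    return best

p = [0.4, 0.3, 0.15, 0.1, 0.05]     # sorted non-increasingly
print(best_aggregation(p, 2))       # minimal D(q||p) and the optimal q
```

The number of partitions grows as a Stirling number of the second kind, so this brute force serves only as a correctness baseline for what follows.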
An additional motivation to study Problem 1 comes from the fundamental paper [28], in which the principle of minimum relative entropy (called therein the minimum cross-entropy principle) is derived in an axiomatic manner. The principle states that, of all the distributions q that satisfy given constraints (in our case, that $q \in \mathcal{A}_m(p)$), one should choose the one with the least relative entropy “distance” from the prior p.
Before establishing the computational complexity of Problem 1, we present a simple lower bound on its optimal value.
Lemma 1.
For each $p \in \Delta_n$ and $m$, $m < n$, it holds that
$$\min_{q \in \mathcal{A}_m(p)} D(q\|p) \ \geq \ D(\hat{p}\|p) \ = \ -\log P_m(p), \qquad (3)$$
where
$$\hat{p} = \left(\frac{p_1}{P_m(p)}, \ldots, \frac{p_m}{P_m(p)}\right) \quad \text{and} \quad P_m(p) = \sum_{i=1}^{m} p_i. \qquad (4)$$
Proof.
Given an arbitrary $q \in \mathcal{A}_m(p)$, one can see that
$$D(q\|p) = \sum_{j=1}^{m} q_j \log\frac{q_j}{p_j} = -\sum_{j=1}^{m} q_j \log\frac{p_j}{q_j}.$$
Moreover, for any $p \in \Delta_n$ and $q \in \Delta_m$, the Jensen inequality applied to the (concave) log function gives the following:
$$\sum_{j=1}^{m} q_j \log\frac{p_j}{q_j} \ \leq \ \log\left(\sum_{j=1}^{m} q_j \, \frac{p_j}{q_j}\right) = \log P_m(p),$$
whence $D(q\|p) \geq -\log P_m(p) = D(\hat{p}\|p)$.
□
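For a quick numerical illustration (our example, not from the paper): take $p = (1/2, 1/4, 1/8, 1/8)$ and $m = 2$. Then $P_2(p) = 3/4$, and the bound of Lemma 1 reads
$$\min_{q \in \mathcal{A}_2(p)} D(q\|p) \ \geq \ -\log\frac{3}{4} = \log\frac{4}{3} \approx 0.415.$$
Enumerating all partitions of $\{1,2,3,4\}$ into two blocks shows that the best aggregation is $q = (5/8, 3/8)$, obtained from the blocks $\{1,3\}$ and $\{2,4\}$, with
$$D(q\|p) = \frac{5}{8}\log\frac{5/8}{1/2} + \frac{3}{8}\log\frac{3/8}{1/4} \approx 0.421.$$
The bound is attained exactly only when $\hat{p}$ itself is an aggregation of $p$, which is not the case here since $\hat{p} = (2/3, 1/3)$.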
3. Hardness
In this section, we prove that the optimization problem (2) stated in Section 2 is strongly NP-hard. We accomplish this by exhibiting a reduction from the 3-Partition problem, a well-known strongly NP-hard problem [29], described as follows.
3-Partition: Given a multiset $S = \{s_1, \ldots, s_{3m}\}$ of positive integers for which $\sum_{i=1}^{3m} s_i = mT$, for some $T$, and such that $T/4 < s_i < T/2$ for each $i$, the problem is to decide whether $S$ can be partitioned into $m$ triplets such that the sum of each triplet is exactly $T$. More formally, the problem is to decide whether there exist pairwise disjoint subsets $G_1, \ldots, G_m \subseteq S$ such that the following conditions hold:
$$\bigcup_{i=1}^{m} G_i = S, \qquad |G_i| = 3 \quad \text{and} \quad \sum_{s \in G_i} s = T, \quad \text{for each } i \in [m].$$
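For intuition, here is a brute-force decision procedure for 3-Partition (ours; exponential time and usable only on toy instances, which is consistent with the fact that no polynomial-time algorithm is expected for this problem):

```python
from itertools import combinations

def three_partition(S, T):
    """Decide 3-Partition exhaustively: split S (with |S| = 3m and sum mT)
    into triplets, each summing exactly to T. Demo purposes only."""
    S = list(S)
    if not S:
        return True
    for i, j in combinations(range(1, len(S)), 2):
        if S[0] + S[i] + S[j] == T:
            rest = [S[k] for k in range(1, len(S)) if k not in (i, j)]
            if three_partition(rest, T):
                return True
    return False

print(three_partition([5, 5, 6, 4, 4, 6], 15))   # True:  {5, 4, 6} and {5, 4, 6}
print(three_partition([9, 9, 9, 7, 7, 7], 24))   # False: no triplet sums to 24
```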
Theorem 1.
The 3-Partition problem can be reduced in polynomial time to the problem of finding an aggregation $q \in \mathcal{A}_m(p)$ of some $p \in \Delta_n$ for which
$$D(q\|p) = -\log P_m(p).$$
Proof.
The idea behind the following reduction can be summarized as follows: given an instance of 3-Partition, we transform it into a probability distribution $p$ such that the lower-bound distribution $\hat{p}$ of Lemma 1 is an aggregation of $p$ if and only if the original instance of 3-Partition admits a solution. Let an arbitrary instance of 3-Partition be given; that is, let $S = \{s_1, \ldots, s_{3m}\}$ be a multiset of positive integers with $\sum_{i=1}^{3m} s_i = mT$. Without loss of generality, we assume that the integers are ordered in a non-increasing fashion. We construct a valid instance $p \in \Delta_n$ of our Problem 1, with $n = 4m$, as follows:
Note that $p$ is a probability distribution. In fact, since $\sum_{i=1}^{3m} s_i = mT$, we have
Moreover, from (4) and (5), the probability distribution $\hat{p}$ associated with $p$ is as follows:
To prove the theorem, we show that the starting instance of 3-Partition is a Yes instance if and only if it holds that
$$\min_{q \in \mathcal{A}_m(p)} D(q\|p) = -\log P_m(p), \qquad (7)$$
where p is given in (5).
We begin by assuming that the given instance of 3-Partition is a Yes instance; that is, there is a partition of $S$ into triplets $G_1, \ldots, G_m$ such that
$$\sum_{s \in G_i} s = T, \quad \text{for each } i \in [m], \qquad (8)$$
and we show that (7) holds. By Lemma 1, (5), and equality (6), we have
From (8), we have
Let us define $q = (q_1, \ldots, q_m) \in \Delta_m$ as follows:
where, by (10),
From (12) and from the fact that $G_1, \ldots, G_m$ are a partition of $S$, we obtain $\sum_{j=1}^{m} q_j = 1$; that is, $q$ is a valid aggregation of $p$ (cf. (5)). Moreover,
and thus $q = \hat{p}$. Therefore, by (9) and the fact that $q \in \mathcal{A}_m(p)$, we obtain
as required.
Conversely, assume now that (7) holds. We show that the original instance of 3-Partition is then also a Yes instance; that is, there is a partition of $S$ into triplets $G_1, \ldots, G_m$ such that
$$\sum_{s \in G_i} s = T, \quad \text{for each } i \in [m].$$
Let $\bar{q}$ be the element in $\mathcal{A}_m(p)$ that achieves the minimum in (13). Consequently, we have
where $H(\bar{q}) = -\sum_{j=1}^{m} \bar{q}_j \log \bar{q}_j$ is the Shannon entropy of $\bar{q}$. From (15), we obtain that $H(\bar{q}) = \log m$; hence, $\bar{q}$ is the uniform distribution on $[m]$ (see [30], Thm. 2.6.4). Recalling that $\bar{q} \in \mathcal{A}_m(p)$, we obtain that the uniform distribution
$$u = \left(\frac{1}{m}, \ldots, \frac{1}{m}\right) \qquad (16)$$
is an aggregation of $p$. We note that the first $m$ components of $p$, as defined in (5), cannot be aggregated among themselves to obtain (16), because $p_i + p_k > 1/m$, for all distinct $i, k \in [m]$. Therefore, in order to obtain (16) as an aggregation of $p$, there must exist a partition $\{G_1, \ldots, G_m\}$ of $\{m+1, \ldots, 4m\}$ for which
From (17), we obtain
From this, it follows that
We note that, for (19) to be true, there cannot exist any $G_i$ for which $|G_i| \neq 3$. Indeed, if there were a subset $G_i$ for which $|G_i| < 3$, there would be at least a subset $G_k$ for which $|G_k| > 3$. Thus, for such a $G_k$, since each integer in $S$ is larger than $T/4$, we would have
contradicting (19). Therefore, it holds that
Moreover, from (19) and (20), we obtain
Thus, from (21), it follows that the subsets $G_1, \ldots, G_m$ give a partition of $S$ into triplets such that
$$\sum_{s \in G_i} s = T, \quad \text{for each } i \in [m].$$
Therefore, the starting instance of 3-Partition is a Yes instance. □
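To make the shape of the reduction tangible, the following sketch (ours) instantiates it with one admissible choice of constants: the first $m$ components share total mass $2/3$, and each integer $s_j$ is encoded as $s_j/(3mT)$. These constants are purely illustrative and are not necessarily those used in (5). With this choice, $P_m(p) = 2/3$, so $\hat{p}$ is the uniform distribution $(1/m, \ldots, 1/m)$, and the uniform distribution is an aggregation of $p$ exactly when the 3-Partition instance is solvable:

```python
from fractions import Fraction
from itertools import combinations

def reduction(S, m, T):
    """Map a 3-Partition instance to p (illustrative constants, mass 2/3 + 1/3)."""
    big = [Fraction(2, 3 * m)] * m                 # m components of equal mass
    small = [Fraction(s, 3 * m * T) for s in S]    # one component per integer
    return big + small                             # total mass 2/3 + 1/3 = 1

def uniform_is_aggregation(p, m):
    """Check whether (1/m,...,1/m) is an aggregation of p (exponential search)."""
    target = Fraction(1, m)
    def solve(items):
        if not items:
            return True
        first, rest = items[0], items[1:]
        for k in range(len(rest) + 1):             # try every completion of 'first'
            for combo in combinations(range(len(rest)), k):
                if first + sum(rest[i] for i in combo) == target:
                    remaining = [rest[i] for i in range(len(rest)) if i not in combo]
                    if solve(remaining):
                        return True
        return False
    return solve(sorted(p, reverse=True))

print(uniform_is_aggregation(reduction([5, 5, 6, 4, 4, 6], 2, 15), 2))   # True
print(uniform_is_aggregation(reduction([9, 9, 9, 7, 7, 7], 2, 24), 2))   # False
```

Exact rational arithmetic (Fraction) is used so that the aggregation test is a true equality check rather than a floating-point comparison.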
4. Approximation
Given $p \in \Delta_n$ and $m < n$, let $\mathrm{opt}(p)$ denote the optimal value of the optimization problem (2); that is,
$$\mathrm{opt}(p) = \min_{q \in \mathcal{A}_m(p)} D(q\|p).$$
In this section, we design a greedy algorithm to compute an aggregation $\tilde{q} \in \mathcal{A}_m(p)$ of $p$ such that
$$D(\tilde{q}\|p) \ \leq \ \mathrm{opt}(p) + 1. \qquad (23)$$
The idea behind our algorithm is to view the problem of computing an aggregation as a bin packing problem with “overstuffing” (see [31] and the references quoted therein), that is, a bin packing in which bins may be overfilled. In the classical bin packing problem, one is given a set of items with associated weights, and a set of bins with associated capacities (usually equal for all bins). The objective is to place all the items into the bins while minimizing a given cost function.
In our case, we have $n$ items (corresponding to the components of $p$) with weights $p_1, \ldots, p_n$, respectively, and $m$ bins, corresponding to the components of $\hat{p}$ (as defined in (4)), with capacities $c_j = \hat{p}_j = p_j / P_m(p)$, for $j \in [m]$. Our objective is to place all the $n$ components of $p$ into the $m$ bins without exceeding the capacity of each bin $j$, $c_j$, by more than $c_j$ itself. For such a purpose, the idea behind Algorithm 1 is quite straightforward. It behaves like a classical First-Fit bin packing: to place the $i$th item, it chooses the first bin $j$ whose current content does not yet exceed its capacity $c_j$. In the following, we will show that such a bin always exists and that fulfilling this objective is sufficient to ensure the approximation guarantee (23) we are seeking.
| Algorithm 1: GreedyApprox |
| 1. Compute $c_j = p_j / P_m(p)$, for each $j \in [m]$; 2. Let $L_j^i$ be the content of bin $j$ after the first $i$ components of $p$ have been placed ($L_j^0 = 0$ for each $j \in [m]$); 3. For $i = 1, \ldots, n$: let $j$ be the smallest bin index for which it holds that $L_j^{i-1} \leq c_j$, and place $p_i$ into the $j$-th bin: $L_j^i = L_j^{i-1} + p_i$ and $L_k^i = L_k^{i-1}$ for each $k \neq j$; 4. Output $\tilde{q} = (L_1^n, \ldots, L_m^n)$. |
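In Python, a direct $O(nm)$ transcription of Algorithm 1 reads as follows (our sketch; as in step 3, a bin $j$ is considered available as long as its content does not exceed $c_j$):

```python
from math import log2

def greedy_approx(p, m):
    """GreedyApprox sketch: First-Fit with overstuffing. Assumes p is sorted
    non-increasingly; bin j has capacity c_j = p_j / P_m, and each item goes
    into the smallest-index bin whose content does not exceed its capacity."""
    P_m = sum(p[:m])
    cap = [pj / P_m for pj in p[:m]]
    load = [0.0] * m
    for pi in p:
        j = next(j for j in range(m) if load[j] <= cap[j])  # always exists
        load[j] += pi            # may overshoot c_j, but never beyond 2 * c_j
    return load

def kl(q, p):
    """D(q||p) in bits, with q padded with zeros up to the length of p."""
    return sum(qj * log2(qj / pj) for qj, pj in zip(q, p) if qj > 0)

p = [0.3, 0.25, 0.2, 0.1, 0.08, 0.07]    # sorted non-increasingly
q = greedy_approx(p, 3)
print(q)                                  # approx. [0.55, 0.38, 0.07]
print(kl(q, p), 1 - log2(sum(p[:3])))     # D(q||p) and the bound 1 - log P_m
```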
Step 3 of GreedyApprox operates as in the classical First-Fit bin packing algorithm. Therefore, it can be implemented to run in $O(n \log m)$ time, as discussed in [32]. In fact, each iteration of the loop in step 3 can be implemented in $O(\log m)$ time by using a balanced binary search tree of height $O(\log m)$ that has a leaf for each bin and in which each internal node keeps track of the largest remaining capacity of all the bins in its subtree.
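The following sketch (ours) realizes the same placement rule with an array-based tournament tree over the bins' residual capacities, a simple stand-in for the balanced search tree described in [32]; each placement costs $O(\log m)$:

```python
def greedy_approx_fast(p, m):
    """Same placement rule as GreedyApprox, in O(n log m) time: tree[v] holds
    the maximum residual capacity c_j - L_j over the bins in v's subtree, so
    the smallest-index available bin is found by descending from the root."""
    P_m = sum(p[:m])
    cap = [pj / P_m for pj in p[:m]]
    size = 1
    while size < m:
        size *= 2
    tree = [float("-inf")] * (2 * size)       # padded leaves stay unavailable
    for j in range(m):
        tree[size + j] = cap[j]               # residual of an empty bin is c_j
    for v in range(size - 1, 0, -1):
        tree[v] = max(tree[2 * v], tree[2 * v + 1])

    load = [0.0] * m
    for pi in p:
        assert tree[1] >= 0                   # some bin is available (Lemma 2)
        v = 1
        while v < size:                       # go left whenever possible
            v = 2 * v if tree[2 * v] >= 0 else 2 * v + 1
        j = v - size
        load[j] += pi
        tree[v] = cap[j] - load[j]            # turns negative once j overflows
        while v > 1:                          # propagate the update upwards
            v //= 2
            tree[v] = max(tree[2 * v], tree[2 * v + 1])
    return load
```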
Lemma 2.
GreedyApprox computes a valid aggregation $\tilde{q} \in \mathcal{A}_m(p)$ of $p$. Moreover, it holds that
$$D(\tilde{q} \| \hat{p}) \leq 1.$$
Proof.
We first prove that each component of $p$ is placed in some bin. This implies that the output $\tilde{q}$ is indeed an aggregation of $p$, since then $\sum_{j=1}^{m} \tilde{q}_j = \sum_{i=1}^{n} p_i = 1$.
For each step $i$, with $1 \leq i \leq m$, there is always a bin in which the algorithm places $p_i$. In fact, the capacity of each bin $j$ satisfies the relation
$$c_j = \frac{p_j}{P_m(p)} \geq p_j \geq p_i, \quad \text{for } j \leq i,$$
since $P_m(p) \leq 1$ and the components of $p$ are ordered in a non-increasing fashion.
Let us now consider an arbitrary step $i$, with $m < i \leq n$, in which the algorithm has placed the first $i-1$ components of $p$ and needs to place $p_i$ into some bin. We show that, in this case also, there is always a bin $j$ in which the algorithm places the item $p_i$, without exceeding the capacity of bin $j$ by more than $c_j$.
First, notice that in each step $i$, with $m < i \leq n$, there is at least a bin $k$ whose content does not exceed its capacity $c_k$, that is, for which $L_k^{i-1} \leq c_k$ holds. Were this not the case, for all bins $j$, we would have $L_j^{i-1} > c_j$; then, we would also have
$$\sum_{j=1}^{m} L_j^{i-1} \ > \ \sum_{j=1}^{m} c_j = \sum_{j=1}^{m} \frac{p_j}{P_m(p)} = 1. \qquad (25)$$
However, this is not possible since we have placed only the first $i-1$ components of $p$, and therefore, it holds that
$$\sum_{j=1}^{m} L_j^{i-1} = \sum_{t=1}^{i-1} p_t \ \leq \ 1,$$
contradicting (25). Consequently, let $k$ be the smallest integer for which the content of the $k$-th bin does not exceed its capacity, i.e., for which $L_k^{i-1} \leq c_k$ holds. For such a bin $k$, we obtain
$$L_k^{i-1} + p_i \ \leq \ c_k + p_i \ \leq \ 2 c_k, \qquad (26)$$
where the last inequality follows since $p_i \leq p_k \leq c_k$ (recall that $i > m \geq k$ and that the components of $p$ are ordered in a non-increasing fashion).
Thus, from (26), one derives that the algorithm places $p_i$ into the bin $k$ without exceeding its capacity by more than $c_k$.
The reasoning applies to each $i$, with $m < i \leq n$, thus proving that GreedyApprox correctly assigns each component of $p$ to a bin, effectively computing an aggregation of $p$. Moreover, from the instructions of step 3 of GreedyApprox, the output is an aggregation $\tilde{q} = (L_1^n, \ldots, L_m^n)$, for which the following crucial relation holds:
$$\tilde{q}_j \ \leq \ 2 c_j, \quad \text{for each } j \in [m].$$
Let us now prove that $D(\tilde{q}\|\hat{p}) \leq 1$. We have
$$D(\tilde{q}\|\hat{p}) = \sum_{j=1}^{m} \tilde{q}_j \log\frac{\tilde{q}_j}{c_j} \ \leq \ \sum_{j=1}^{m} \tilde{q}_j \log\frac{2 c_j}{c_j} = \sum_{j=1}^{m} \tilde{q}_j = 1.$$
□
We need the following technical lemma to show the approximation guarantee of GreedyApprox.
Lemma 3.
Let $p \in \Delta_n$ and $q \in \Delta_m$ be two arbitrary probability distributions with $m < n$. It holds that
$$D(q\|p) = D(q\|\hat{p}) - \log P_m(p),$$
where $\hat{p}$ is defined as in (4).
Proof.
We have
$$D(q\|p) = \sum_{j=1}^{m} q_j \log\frac{q_j}{p_j} = \sum_{j=1}^{m} q_j \log\left(\frac{q_j}{\hat{p}_j} \cdot \frac{1}{P_m(p)}\right) = D(q\|\hat{p}) - \log P_m(p),$$
where we used the fact that $p_j = \hat{p}_j \, P_m(p)$ for each $j \in [m]$.
□
The following theorem is the main result of this section.
Theorem 2.
For any $p \in \Delta_n$ and $m < n$, GreedyApprox produces an aggregation $\tilde{q} \in \mathcal{A}_m(p)$ of $p$ such that
$$D(\tilde{q}\|p) \ \leq \ \mathrm{opt}(p) + 1,$$
where
$$\mathrm{opt}(p) = \min_{q \in \mathcal{A}_m(p)} D(q\|p).$$
Proof.
From Lemma 3, we have
$$D(\tilde{q}\|p) = D(\tilde{q}\|\hat{p}) - \log P_m(p),$$
and from Lemma 2, we know that the produced aggregation $\tilde{q}$ of $p$ satisfies the relation
$$D(\tilde{q}\|\hat{p}) \leq 1.$$
Putting it all together, and recalling the lower bound of Lemma 1, we obtain:
$$D(\tilde{q}\|p) \ \leq \ 1 - \log P_m(p) \ \leq \ \mathrm{opt}(p) + 1.$$
□
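Both ingredients of the proof, namely the crucial relation $\tilde{q}_j \leq 2c_j$ and the bound $D(\tilde{q}\|p) \leq 1 - \log P_m(p) \leq \mathrm{opt}(p) + 1$, are easy to spot-check numerically. The following self-contained harness (ours, assuming the First-Fit rule of Algorithm 1) verifies them on random sorted instances:

```python
import random
from math import log2

def greedy_approx(p, m):
    P_m = sum(p[:m])
    cap = [pj / P_m for pj in p[:m]]
    load = [0.0] * m
    for pi in p:
        j = next(j for j in range(m) if load[j] <= cap[j])
        load[j] += pi
    return load, cap

def kl(q, p):
    return sum(qj * log2(qj / pj) for qj, pj in zip(q, p) if qj > 0)

random.seed(0)
for _ in range(1000):
    n = random.randint(3, 12)
    m = random.randint(2, n - 1)
    w = sorted((random.random() for _ in range(n)), reverse=True)
    p = [x / sum(w) for x in w]
    q, cap = greedy_approx(p, m)
    assert all(qj <= 2 * cj + 1e-9 for qj, cj in zip(q, cap))  # crucial relation
    assert kl(q, p) <= 1 - log2(sum(p[:m])) + 1e-9             # bound 1 - log P_m
print("all checks passed")
```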
5. Concluding Remarks
In this paper, we examined the problem of approximating $n$-dimensional probability distributions with $m$-dimensional ones, using the Kullback–Leibler divergence as the measure of closeness. We demonstrated that this problem is strongly NP-hard, and we introduced an approximation algorithm that solves it with an additive error of at most one bit.
We conclude by pointing out that the analysis of GreedyApprox presented in Theorem 2 is tight. Let $p$ be
where . The application of GreedyApprox to $p$ produces the aggregation $\tilde{q}$ given by
whereas one can see that the optimal aggregation $q^*$ is equal to
Hence, in this case, we have
while
Therefore, to improve on our approximation guarantee, one should employ a bin packing heuristic different from the First-Fit strategy used in GreedyApprox. Another interesting open problem is to provide an approximation algorithm with a (small) multiplicative approximation guarantee. Both questions would likely require a different approach, and we leave them to future investigations.
Another interesting line of research would be to extend our findings to different divergence measures (e.g., [33] and references quoted therein).
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Acknowledgments
The author wants to express his gratitude to Ugo Vaccaro for guidance throughout this research, to the anonymous referees, and to the Academic Editor for many useful suggestions that have improved the presentation of the paper.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Burges, C.J. Dimension reduction: A guided tour. Found. Trends Mach. Learn. 2010, 2, 275–365.
- Sorzano, C.O.S.; Vargas, J.; Montano, A.P. A survey of dimensionality reduction techniques. arXiv 2014, arXiv:1403.2877.
- Abdullah, A.; Kumar, R.; McGregor, A.; Vassilvitskii, S.; Venkatasubramanian, S. Sketching, embedding, and dimensionality reduction for information spaces. Artif. Intell. Stat. (PMLR) 2016, 51, 948–956.
- Carter, K.M.; Raich, R.; Finn, W.G.; Hero, A.O., III. Information-geometric dimensionality reduction. IEEE Signal Process. Mag. 2011, 28, 89–99.
- Gokhale, D.V. Approximating discrete distributions, with applications. J. Am. Stat. Assoc. 1973, 68, 1009–1012.
- Globerson, A.; Tishby, N. Sufficient dimensionality reduction. J. Mach. Learn. Res. 2003, 3, 1307–1331.
- Lewis, P.M., II. Approximating probability distributions to reduce storage requirements. Inf. Control 1959, 2, 214–225.
- Adler, A.; Tang, J.; Polyanskiy, Y. Efficient representation of large-alphabet probability distributions. IEEE J. Sel. Areas Inf. Theory 2022, 3, 651–663.
- Cicalese, F.; Gargano, L.; Vaccaro, U. Approximating probability distributions with short vectors, via information theoretic distance measures. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1138–1142.
- Cohen, L.; Grinshpoun, T.; Weiss, G. Efficient optimal Kolmogorov approximation of random variables. Artif. Intell. 2024, 329, 104086.
- Cohen, L.; Weiss, G. Efficient optimal approximation of discrete random variables for estimation of probabilities of missing deadlines. Proc. AAAI Conf. Artif. Intell. 2019, 33, 7809–7815.
- Vidyasagar, M. A metric between probability distributions on finite sets of different cardinalities and applications to order reduction. IEEE Trans. Autom. Control 2012, 57, 2464–2477.
- Kovačević, M.; Stanojević, I.; Šenk, V. On the entropy of couplings. Inf. Comput. 2015, 242, 369–382.
- Cicalese, F.; Gargano, L.; Vaccaro, U. Minimum-entropy couplings and their applications. IEEE Trans. Inf. Theory 2019, 65, 3436–3451.
- Compton, S. A tighter approximation guarantee for greedy minimum entropy coupling. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022; pp. 168–173.
- Compton, S.; Katz, D.; Qi, B.; Greenewald, K.; Kocaoglu, M. Minimum-entropy coupling approximation guarantees beyond the majorization barrier. Int. Conf. Artif. Intell. Stat. 2023, 206, 10445–10469.
- Li, C. Efficient approximate minimum entropy coupling of multiple probability distributions. IEEE Trans. Inf. Theory 2021, 67, 5259–5268.
- Sokota, S.; Sam, D.; Witt, C.; Compton, S.; Foerster, J.; Kolter, J. Computing low-entropy couplings for large-support distributions. arXiv 2024, arXiv:2405.19540.
- Rujeerapaiboon, N.; Schindler, K.; Kuhn, D.; Wiesemann, W. Scenario reduction revisited: Fundamental limits and guarantees. Math. Program. 2018, 191, 207–242.
- Gagie, T. Compressing probability distributions. Inf. Process. Lett. 2006, 97, 133–137.
- Cohen, L.; Fried, D.; Weiss, G. An optimal approximation of discrete random variables with respect to the Kolmogorov distance. arXiv 2018, arXiv:1805.07535.
- Pavlikov, K.; Uryasev, S. CVaR distance between univariate probability distributions and approximation problems. Ann. Oper. Res. 2018, 262, 67–88.
- Pflug, G.C.; Pichler, A. Approximations for probability distributions and stochastic optimization problems. In Stochastic Optimization Methods in Finance and Energy: New Financial Products and Energy Market Strategies; Springer: Berlin/Heidelberg, Germany, 2011; pp. 343–387.
- Melucci, M. A brief survey on probability distribution approximation. Comput. Sci. Rev. 2019, 33, 91–97.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Lamarche-Perrin, R.; Demazeau, Y.; Vincent, J.M. The best-partitions problem: How to build meaningful aggregations. In Proceedings of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Atlanta, GA, USA, 17–20 November 2013; pp. 399–404.
- Kearns, M.; Mansour, Y.; Ng, A.Y. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models; Springer: Dordrecht, The Netherlands, 1998; pp. 495–520.
- Shore, J.; Johnson, R. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37.
- Garey, M.; Johnson, D. Strong NP-completeness results: Motivation, examples, and implications. J. ACM 1978, 25, 499–508.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
- Dell’Olmo, P.; Kellerer, H.; Speranza, M.; Tuza, Z. A 13/12 approximation algorithm for bin packing with extendable bins. Inf. Process. Lett. 1998, 65, 229–233.
- Coffman, E.G.; Garey, M.R.; Johnson, D.S. Approximation algorithms for bin packing: A survey. In Approximation Algorithms for NP-Hard Problems; Hochbaum, D., Ed.; PWS Publishing Co.: Boston, MA, USA, 1996; pp. 46–93.
- Sason, I. Divergence measures: Mathematical foundations and applications in information-theoretic and statistical problems. Entropy 2022, 24, 712.
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).