Communication Efficient Algorithms for Bounding and Approximating the Empirical Entropy in Distributed Systems

The empirical entropy is a key statistical measure of data frequency vectors, enabling one to estimate how diverse the data are. From the computational point of view, it is important to quickly compute, approximate, or bound the entropy. In a distributed system, the representative (“global”) frequency vector is the average of the “local” frequency vectors, each residing in a distinct node. Typically, the trivial solution of aggregating the local vectors and computing their average incurs a huge communication overhead. Hence, the challenge is to approximate, or bound, the entropy of the global vector, while reducing communication overhead. In this paper, we develop algorithms which achieve this goal.


Introduction
Consider the distributed computing model [1][2][3], where the goal is to compute a function over input divided amongst multiple nodes. A local computation, while simple, does not always suffice to reach a conclusion on the aggregated input data, especially when the function is nonlinear. On the other hand, broadcasting the local data to a coordinator node is impractical and undesirable due to communication overhead, energy consumption, and privacy issues. Generally, we seek to approximate or bound the function's value on the aggregated data without broadcasting it in its entirety.
The function we handle in this paper is the empirical Shannon entropy [4], defined as H(X) = ∑_i −x_i ln(x_i) for a frequency vector X = (x_1, . . . , x_n) (i.e., all values are non-negative and sum to 1). For some of the ensuing analysis, it is easier to use the natural logarithm than the base-2 one, which only changes the value by a multiplicative constant. Thus, hereafter, "entropy" will refer exclusively to the empirical Shannon entropy. Specifically, we assume there exists a distributed system, with each node (or "party") holding a "local" frequency vector. The target function is defined as the global system entropy, which is equal to the empirical Shannon entropy of the average of the local vectors. Alas, to compute the exact value, we must first aggregate the local vectors and average them, which often incurs a huge communication overhead. Fortunately, it often suffices to approximate, or bound, this global entropy; for example:
• Often, a sudden change in the entropy indicates a phase change in the underlying system, for example a DDoS (Distributed Denial of Service) attack [5]. To this end, it typically suffices to bound the entropy, since its precise value is not required.
• A good measure of similarity between two datasets is the difference between their aggregated and individual entropies. For example, if two collections of text are of a similar nature, their aggregate entropy will be close to the individual ones, and if they are different, the aggregated entropy will be substantially larger. Here, too, it suffices to approximate the global entropy, or to bound it from above or below, in order to reach a decision.
Guided by such challenges, we develop communication-efficient algorithms for bounding and approximating the global entropy, which are organized as follows: In Section 3, we present algorithms for bounding the global entropy, with low communication. Some results on real and synthetic data are also provided.
In Section 4, a novel algorithm is provided for approximating the global entropy. It is tailored to treat the cases in which the algorithm in Section 3 underperforms.

Previous Work
The problem of reducing communication overhead in distributed systems is very important both from the practical and theoretical points of view. Applications abound, for example distributed graphs [2,6] and distributed machine learning [3]. Research close to ours in spirit [7] deals with the following scenario: a system is given, consisting of
• Distributed computing nodes N_1, . . . , N_t, with N_i holding a "local" data vector X_i. The nodes can communicate, either directly or via a "coordinator" node.
• A scalar-valued function f(X, Y), defined on pairs of real-valued vectors.
Given the above, the challenge is to approximate the values f(X_i, X_j), i, j = 1 . . . t, with a low communication overhead; that is, the trivial solution of sending all the local vectors to some computing node is forbidden.
A sketch for this type of problem is defined as a structure s(), of size smaller than the dimension of X_i, which has the following property: knowledge of s(X), s(Y) allows one to approximate f(X, Y) with very high accuracy. An important example [7] is f(X, Y) = ⟨X, Y⟩ (where ⟨X, Y⟩ stands for the inner product of X and Y).
There are many types of sketches, for example:
• PCA (Principal Component Analysis) sketch: given a large subset S ⊆ R^n, one wishes to quickly estimate the distance of vectors from an underlying structure from which S is sampled (a famous example is images of a certain type [8]). To this end, S is represented by a smaller set, consisting of the dominant eigenvectors of S's scatter matrix, and the distance is estimated by the distance of the vector from the subspace spanned by these eigenvectors.
• In the analysis of streaming data, some important sketches were developed in order to handle large and dynamic streams by preserving only salient properties (such as the number of distinct items, frequencies, and norms). It is beyond the scope of this paper to describe these sketches, so we refer the reader to the literature [9].
Sketches are specifically tailored for the task at hand. In our case, X, Y are frequency (probability) vectors, and f(X, Y) is the empirical Shannon entropy of (X + Y)/2. Similarly, one may look at functions defined on larger subsets of {X_1, . . . , X_t} (Section 3.4). Our task is therefore to define a sketch s() such that:
• s(X) is much smaller than X;
• knowledge of s(X), s(Y) allows one to approximate the empirical Shannon entropy of (X + Y)/2.
We note here that some work addressed entropy approximation in the Streaming Model [1,10,11]. Here, as in [7], we are mainly interested in the static scenario, in which the overall communication overhead is substantially smaller than the overall data volume. The "geometric monitoring" method [10,11], applied to solve the Distributed Monitoring Problem [1], relies on checking local constraints at the nodes; as long as they hold, the value of some global function, defined on the average of the local streams, is guaranteed to lie in some range. Alas, when the local conditions are violated, the nodes undertake a "synchronization stage" [12], which consists of communicating their local vectors in their entirety (which here we avoid). In the future, we plan to extend the techniques developed here to the distributed streaming scenario.

Dynamic Bounds and Communication Reduction
In this section, we present algorithms for bounding the entropy of a centralized vector (that is, the mean of several local vectors) while exchanging only a controlled amount of communication between machines. The proposed algorithms for the upper and lower bounds accept the same input and therefore can be run concurrently.

Problem, Motivation, and an Example
This work addresses the following problem:
• Given nodes N_i, each holding a probability vector v_i (i.e., all values are non-negative and sum to 1), approximate the entropy of the average of the v_i, while maintaining a low communication overhead.
Let us start with the simplest possible scenario, which we shall analyze in detail, in order to prepare the ground for the general treatment.

Example 1.
There are two nodes, N_1, N_2, and the vectors they hold are of length 3. Assume without loss of generality that N_1 sends some of its data to N_2, where "data" consists of a set of pairs (coordinate, value): "coordinate" is the location of a value of v_1, and "value" is its numerical value. Then, N_2 attempts to derive an upper bound on the entropy of (v_1 + v_2)/2. Note that vectors of length 2 are hardly interesting, since sending a single datum allows one to compute the other (as they sum to 1); hence, N_2 would be able to compute the entropy exactly.
Intuition suggests that N_1 should relay its largest value to N_2. While (as we will show later) this is true on the average, it is not always the case. Assume that the vectors held by the nodes are v_1 = (1/2, 1/3, 1/6) and v_2 = (1/6, 1/2, 1/3). Assume that N_1 sends its largest value, 1/2 (and its coordinate), to N_2. Now, N_2 knows that (a) the first value of the average vector is (1/2 + 1/6)/2 = 1/3, and (b) the two unknown values of v_1 sum to 1/2. That leaves open the possibility that the second and third values of the average vector are 1/3 each (e.g., if the unknown values of v_1 are 1/6 and 1/3), which would render the average vector uniform; hence, the upper bound is equal to the maximal entropy possible, ln(3). However, if N_1 sends its second largest value, 1/3, to N_2, N_2 can conclude that the second value of the average vector equals (1/3 + 1/2)/2 = 5/12; hence, the upper bound on the entropy is strictly smaller than ln(3).
We observe here that the key consideration in determining the upper bound is the distribution of the "slack" corresponding to the unknown values at the other node (N_1 in this example). The overall size of this "slack" is one-half of the sum of the unknown values, and it should be distributed amongst the corresponding coordinates of N_2's vector after they have been divided by 2.
In contrast to the above "adversarial" example, on average, it is optimal to send the largest value (i.e., it allows one to achieve a lower upper bound). To prove this, we have (numerically) computed the integral of the upper bound over all triplets, both after the largest value and a random value were sent; sending the highest value, on the average, yielded an upper bound lower by 0.041 than sending a random value. More general experiments, for both real and synthetic data, are reported in Section 3.5.
We now address the general scenario. Let us start with a few definitions:

Definition 1. Let {X_1, . . . , X_t} be a set of local vectors held in t nodes {N_1, . . . , N_t}. Then, X̄ = (1/t) ∑_{i=1}^{t} X_i is the aggregate vector, which in our case is the mean over all local vectors.

Definition 2. Let x ∈ [0, 1]. We define the entropy activation function h(x) by h(x) = −x ln(x), with h(0) = 0. Then, H(X) denotes the Shannon entropy of X [4], given by H(X) = ∑_{i=1}^{n} h(x_i).

We will henceforth assume all vectors are of length n and behave like X̄ in Definition 1, even if it is not explicitly noted. We also assume each value of X̄ can be represented by at most b bits.

Notation 3. Let X_Local, X_Other denote a probability vector held by the local machine and a probability vector held by a remote machine, respectively.
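For concreteness, the entropy activation function and the global entropy of Definitions 1 and 2 can be sketched in a few lines of Python (the function names are ours, not the paper's):

```python
import math

def h(x: float) -> float:
    """Entropy activation function h(x) = -x ln(x), with h(0) = 0."""
    return 0.0 if x == 0 else -x * math.log(x)

def entropy(X) -> float:
    """Shannon entropy H(X) = sum_i h(x_i) of a frequency vector X."""
    return sum(h(x) for x in X)

def global_entropy(vectors) -> float:
    """Entropy of the aggregate vector: the mean of the local vectors."""
    t = len(vectors)
    mean = [sum(col) / t for col in zip(*vectors)]
    return entropy(mean)
```

Note that the natural logarithm is used throughout, matching the convention fixed in the Introduction.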
In this section, we present algorithms for deciding whether the entropy of the average probability vector is greater or lesser than a user-defined threshold. Formally, we will address two problems:

1. Determining whether the inequality H(X̄) ≤ L holds for some user-defined constant L.
2. Determining whether the inequality H(X̄) ≥ U holds for some user-defined constant U.
We begin with a lemma which provides the foundation for both the Local Upper Bound (Section 3.2) and Local Lower Bound (Section 3.3) in the following subsections.
While noting that the lemma and its corollary hold for any vector X ∈ R n , our vectors are always frequency vectors and hence sum to 1; the ∆ below corresponds to the "slack" added after dividing the respective value by 2, as explained in the discussion of Example 1 above; hence, the values still sum to 1.

Lemma 1 (Extrema of Entropy). Let X = (x_1, . . . , x_n) ∈ R^n s.t. ∀i, x_i ≥ 0. Let ∆ be a positive number, and let i, j be two distinct coordinates of X with x_i ≤ x_j. Then, adding ∆ to x_i increases H(X) at least as much as adding ∆ to x_j:

H(x_1, . . . , x_i + ∆, . . . , x_n) ≥ H(x_1, . . . , x_j + ∆, . . . , x_n).

Proof. Using the observation that h′(x) = −ln(x) − 1, which is strictly decreasing, we divide the proof into two cases, depending on the relation between x_i + ∆ and x_j:

1. x_i + ∆ ≤ x_j. By applying the Lagrange Mean Value Theorem, h(x_i + ∆) − h(x_i) = ∆ h′(c_1) for some c_1 ∈ (x_i, x_i + ∆), and h(x_j + ∆) − h(x_j) = ∆ h′(c_2) for some c_2 ∈ (x_j, x_j + ∆). Since h′() is decreasing and c_1 < c_2, we immediately obtain that h′(c_1) > h′(c_2). It follows that the entropy increase at coordinate i is at least that at coordinate j.
2. x_i + ∆ > x_j. Here x_i ≤ x_j ≤ x_i + ∆ ≤ x_j + ∆, and the difference between the two increments can be written as [h(x_j) − h(x_i)] + [h(x_i + ∆) − h(x_j + ∆)] = (x_j − x_i)(h′(c_1) − h′(c_2)) for some c_1 ∈ (x_i, x_j) and c_2 ∈ (x_i + ∆, x_j + ∆); since c_1 < c_2 and h′() is decreasing, as in the case above, this difference is non-negative.

Corollary 1. Given a probability vector X and ∆ > 0, the following properties hold:

1. If ∆ is added to any value of X, the maximal increase of its entropy will occur when ∆ is added to the minimal value of X.
2. If ∆ is added to any value of X, the minimal increase of its entropy will occur when ∆ is added to the maximal value of X.
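Corollary 1 is easy to check numerically; the snippet below uses a hypothetical probability vector and increment of our choosing:

```python
import math

def entropy(X):
    """H(X) = sum of -x ln x over the positive entries of X."""
    return sum(-x * math.log(x) for x in X if x > 0)

X = [0.1, 0.3, 0.6]   # hypothetical probability vector
delta = 0.05

# Entropy gain from adding delta to each coordinate in turn.
gains = [entropy(X[:i] + [X[i] + delta] + X[i + 1:]) - entropy(X)
         for i in range(len(X))]

# Per Corollary 1: the gain is maximal at the minimal coordinate
# (index 0 here) and minimal at the maximal coordinate (index 2).
```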

Upper Bound
While ln(n) is a trivial upper bound on the entropy, requiring no communication to agree upon, we can develop a tighter alternative at the cost of a small communication overhead. Let X be a probability vector, and let S_k(X) denote the ordered set of X's k largest values. Hence, let local nodes broadcast the following two ordered sets:

1. S_k(X) = the ordered set of the k largest values of X;
2. C_k(X) = the coordinates of the values in S_k(X), or formally {i | x_i ∈ S_k(X)}.
Each of these messages costs at most k(b + log_2 n) bits: b for each value and log_2 n for each corresponding coordinate. By receiving these values and coordinates, a machine immediately obtains the following information regarding the remote vector X from which S_k(X), C_k(X) were sent: the sum of all values not in S_k(X), i.e., 1 − ∑_{x∈S_k(X)} x, referred to as the mass of the remote vector that remains available to be distributed among coordinates. It will be denoted by m in the following algorithms.
We next suggest an algorithm by which a machine holding a local probability vector X_Local computes a strict upper bound for X̄, the aggregate of X_Local and X_Other, where X_Other is a probability vector that is not accessible to the machine. The remote machine broadcasts S_k(X_Other) and C_k(X_Other) for some predetermined k.
The algorithm constructs the completion of the remote machine's unknown values that maximizes the centralized entropy, or formally argmax_X H(X_Local + X), while maintaining the feasibility constraints. We view this problem as an instance of constrained optimization, where the target function is the global entropy, and the constraints are given by the broadcast set S_k(X_Other) and its sum. The main tool is Corollary 1, applied to every coordinate of X_Local.
Before we present the algorithm, we note two extreme cases, which instantly produce an upper bound without the need to compute it algorithmically:

1. ∑_{x∈S_k(X_Other)} x ≈ 1. In this case, most (or all) of the information of X_Other is conveyed by the message, and the entropy can be computed accurately, without need for a bound.
2. The remaining mass m is large enough to raise all the unknown coordinates to a common level at least x_max, the maximal value of X_Local. In this case, there is no need to run the proposed algorithm; the constrained maximization will always result in an "optimal" target: the uniform completion, whose entropy we know is maximal w.r.t. its sum.
Theorem 1. Algorithm 1 runs in O(n²) time and returns an upper bound on the entropy of X̄ = (1/2)(X_Local + X_Other).

Algorithm 1: Upper Entropy Bound for Two Nodes
Input: A local vector X_Local, a k-sized largest-value ordered set S_k(X_Other) and a corresponding ordered coordinate set C_k(X_Other)
Output: An upper bound for H(X̄)
Proof. Let n′ be the length of X*, which equals n − k. In each loop iteration, the algorithm increments no more than n′ values of X*, and since there are n′ coordinates of X*, it performs at most n′ steps; hence, the bound of O(n²) runtime follows. Let X* = (x_1, . . . , x_{n′}) be the initial vector as noted in line 3, and let Y denote the same vector by the end of the while loop, i.e., after the condition m = 0 is met. Let the coordinates of Y be arranged in ascending order, which has no effect on its entropy: Y = (y_1, y_2, . . . , y_t, y_{t+1}, . . . , y_{n′}). Since at every loop iteration all minimal coordinates are incremented simultaneously, there exist a coordinate t and a level c such that y_i equals c for all i ≤ t, and y_i is strictly greater than c for all i > t. Hence, we can view Y as a concatenation of two vectors, Y = (Y_L, Y_R), where Y_L = (y_1, . . . , y_t) is uniform with value c and Y_R = (y_{t+1}, . . . , y_{n′}). Let s(X) denote the sum of X. It now suffices to show that any vector Z that sums to s(X*) + m and can be achieved by performing only additions to X* has an entropy less than or equal to that of Y. Let Z denote such a vector, every value z_i of which satisfies z_i ≥ x_i, and let Z = (Z_L, Z_R), where Z_L = (z_1, . . . , z_t), Z_R = (z_{t+1}, . . . , z_{n′}), for the same t as defined above.
Since it holds that s(Z) = s(Z_L) + s(Z_R) = s(Y_L) + s(Y_R) = s(Y), we examine the following cases:
• s(Z_L) = s(Y_L), s(Z_R) = s(Y_R): Note that Z_R = Y_R, since their sums are equal and Y_R has had no further additions. In addition, since s(Z_L) = s(Y_L) = c · |Y_L| and Y_L is the uniform vector (whose entropy is maximal with respect to its sum), H(Z_L) ≤ H(Y_L); therefore, H(Z) ≤ H(Y).
• s(Z_L) < s(Y_L), s(Z_R) > s(Y_R): There exists a subset z_{i_1}, . . . , z_{i_ℓ} of Z_R for which every z_{i_j} is greater than the corresponding value y_{i_j} of Y_R. Let δ_{i_j} = z_{i_j} − y_{i_j}. For every δ_{i_j}, there exists a value z in Z_L s.t. z < y_{i_j} < z_{i_j}, since s(Z_L) < s(Y_L) = c · |Y_L|. Let Z′ be Z after each δ_{i_j} is subtracted from z_{i_j} and added to some such z ∈ Z_L. By Lemma 1, H(Z′) ≥ H(Z). It also holds that Z′_R = Y_R, and that H(Z′_L) ≤ H(Y_L), since they both sum to c · |Y_L| and Y_L is a uniform vector (whose entropy is maximal). Therefore, H(Z) ≤ H(Z′) ≤ H(Y).
• s(Z_L) > s(Y_L), s(Z_R) < s(Y_R): This case can be immediately ruled out; it is impossible to feasibly obtain s(Z_R) < s(Y_R), since Z is obtained from X* by additions only, while Y_R received no additions at all.
Therefore, we have proven that for any such vector Z, it holds that H(Z) ≤ H(Y).
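Since the pseudocode of Algorithm 1 is not reproduced in the text, the following self-contained Python sketch illustrates the greedy "water-filling" it performs (the function names and the event-driven loop are our rendering, not the paper's):

```python
import math

def entropy(X):
    """H(X) = sum of -x ln x over the positive entries of X."""
    return sum(-x * math.log(x) for x in X if x > 0)

def upper_bound_two_nodes(x_local, s_k, c_k):
    """Upper bound on H(Xbar), Xbar = (X_local + X_other)/2, knowing
    only the k largest values s_k of X_other and their coordinates c_k."""
    known = dict(zip(c_k, s_k))
    m = 1.0 - sum(s_k)                       # unknown mass of X_other
    # Coordinates with known X_other values contribute exactly.
    fixed = [(x_local[i] + known[i]) / 2.0 for i in c_k]
    # X*: remaining coordinates of X_local, halved; the slack m/2 is
    # poured onto the minima, which maximizes the entropy (Corollary 1).
    xs = sorted(x_local[i] / 2.0
                for i in range(len(x_local)) if i not in known)
    if not xs:
        return entropy(fixed)                # k = n: exact entropy
    slack, c, i = m / 2.0, xs[0], 0
    while slack > 0:
        while i < len(xs) and xs[i] <= c:    # current minima sit at c
            i += 1
        nxt = xs[i] if i < len(xs) else float("inf")
        cost = (nxt - c) * i                 # raise all minima to nxt
        if cost >= slack:
            c += slack / i                   # spend the rest evenly
            slack = 0.0
        else:
            c, slack = nxt, slack - cost
    return entropy(fixed + [max(x, c) for x in xs])
```

With full knowledge (k = n), the bound collapses to the exact global entropy, as expected.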
Next, we suggest a more time-efficient algorithm than Algorithm 1, which achieves the same bound with a runtime of O(n log n). Suppose c is the maximal threshold to which all values of the local vector can be raised without exceeding m, the sum of the unknown values of the remote vector. Then, if we denote the sorted coordinates of X* by x_1, . . . , x_{n′}, there is some coordinate t such that x_t ≤ c ≤ x_{t+1}.
By performing a binary search on the coordinate t of the local vector as described above, we can efficiently find x_t, as described in the algorithm below.
Theorem 2. Algorithm 2 runs in O(n log n) time and returns an upper bound for the entropy of X̄ = (1/2)(X_Local + X_Other).

Algorithm 2: Binary Search Upper Entropy Bound for Two Nodes
Input: A local vector X_Local, a k-sized largest-value set S_k(X_Other) and a corresponding ordered coordinate set C_k(X_Other)
Output: An upper bound for H(X̄)
Proof. The algorithm begins by sorting X*, which costs O(n log n), and proceeds by performing a binary search over a range of size n′ ≤ n, wherein a single step requires O(n) operations. Therefore, its runtime is O(n log n).
Since the vector X * constructed by this algorithm is equivalent to the vector which Algorithm 1 computes, the proof of correctness is the same as the proof of Theorem 1.
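The water level c of Theorem 2 can also be computed directly from prefix sums over the sorted X*; the helper below is our sketch of that computation (a linear scan over the sorted values, with the same O(n log n) total cost as the binary search):

```python
def water_level(xs, slack):
    """Common level c to which the smallest entries of xs are raised
    when exactly `slack` total mass is poured in: the coordinate t of
    Theorem 2 is the one satisfying xs[t-1] <= c <= xs[t]."""
    xs = sorted(xs)
    n = len(xs)
    prefix = 0.0                      # running sum of xs[:t]
    for t in range(1, n):
        prefix += xs[t - 1]
        # cost of lifting the first t entries up to xs[t]
        if t * xs[t] - prefix >= slack:
            return (prefix + slack) / t
    prefix += xs[n - 1]
    return (prefix + slack) / n       # slack floods every coordinate
```

The bounded vector is then obtained as max(x, c) per coordinate, exactly as in Algorithm 1.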
It will be noted that the upper bound given by Algorithms 1 and 2 can be further improved by using the feasibility constraint upon X*: the algorithms may increase a coordinate of X* by a value larger than min{S_k(X_Other)}, which no feasible completion of X_Other can do, particularly if for some coordinate i the inequality x*_{i+1} − x*_i > min{S_k(X_Other)} holds. In order to keep the core algorithms simple, we address this formally in Appendix A by proposing an improvement to the algorithms above, such that the bound will indeed be tight.

Lower Bound
We now turn to discuss a communication-efficient solution for computing a tight lower bound for the entropy of the global vector. As with the upper bound, this problem is an instance of constrained optimization, except that here our target is to find the minimum. As in the Upper Bound (Section 3.2), we use the same message, containing S_k(X) and C_k(X) for a remote vector X.

Property 1. Let n′ denote |X*|, sup denote min{S_k(X_Other)}, and m denote 1 − ∑_{x∈S_k(X_Other)} x, as used in Algorithm 3. Then, n′ · sup ≥ m.

Algorithm 3: Lower Entropy Bound for Two Nodes
Input: A local vector X_Local, a k-sized largest-value ordered set S_k(X_Other) and a corresponding ordered coordinate set C_k(X_Other)
Output: A lower bound for H(X̄)
Proof (of Property 1). The n′ = n − k unknown values of X_Other are each at most sup, since S_k(X_Other) contains the k largest values, and they sum to m; hence, m ≤ n′ · sup.
Theorem 3. Algorithm 3 runs in O(n log n) time and returns a tight lower bound for the entropy of X̄ = (1/2)(X_Local + X_Other).
Proof. After sorting the vector, we iteratively increment no more than n′ = |X*| ≤ n coordinates, since n′ · sup ≥ m by Property 1; hence, the total runtime is O(n log n).
To prove the correctness of the bound, it suffices to examine the loop step: clearly, we must add a total of m to X*'s coordinates, and we cannot increment a single coordinate by more than sup, since all remaining values of the unknown vector X_Other are less than or equal to sup.
The algorithm increments the maximal values of X* by sup, which by Corollary 1 incurs the minimal entropy gain to X*. Since the entropy is coordinate-wise additive, the "greedy" approach, which minimizes over the coordinates separately, attains the global minimum.
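The greedy step just described can be sketched as follows (our rendering, since the pseudocode of Algorithm 3 is not reproduced here; it assumes k ≥ 1, so that sup is defined):

```python
import math

def entropy(X):
    """H(X) = sum of -x ln x over the positive entries of X."""
    return sum(-x * math.log(x) for x in X if x > 0)

def lower_bound_two_nodes(x_local, s_k, c_k):
    """Lower bound on H(Xbar), Xbar = (X_local + X_other)/2: the
    unknown mass m of X_other is assigned to the *largest* remaining
    coordinates, in chunks of at most sup = min(s_k) (no unknown value
    of X_other can exceed sup), which by Corollary 1 yields the
    minimal possible entropy gain."""
    known = dict(zip(c_k, s_k))
    m = 1.0 - sum(s_k)                  # unknown mass of X_other
    sup = min(s_k)                      # cap on each unknown value
    fixed = [(x_local[i] + known[i]) / 2.0 for i in c_k]
    xs = sorted((x_local[i] / 2.0 for i in range(len(x_local))
                 if i not in known), reverse=True)
    out = []
    for x in xs:                        # largest coordinates first
        add = min(sup, m)
        out.append(x + add / 2.0)       # halved: it is an average
        m -= add
    return entropy(fixed + out)
```

By Property 1, the loop always succeeds in placing the whole mass m.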
In Figure 1, the bounds computed using the algorithms described herein are compared to the bounds derived after sending a random subset of coordinate-value pairs, as well as after sending many random subsets and choosing the minimal resulting bound. In Section 3.5, more extensive experiments are reported.

Multiparty Bounds
When considering the scale and variability of modern distributed systems, an algorithm that supports multiple machines and incurs a low communication overhead is desirable.
We next suggest a few modifications in order to generalize Algorithms 1-3 to upper and lower bounds for the entropy centralized across t + 1 nodes. We denote by X_i the vector of machine i, and, in a manner similar to Section 3.2, S_k(X_i), C_k(X_i) are the ordered sets of the k maximal values and their coordinates, respectively. Typically, the coordinates of C_k(X_i) and C_k(X_j) will be disjoint, in which case each machine will have to broadcast the values at its missing coordinates. This additional communication may cost up to tk(b + log_2 n) bits. We hereby assume that a second round of communication occurs, and that (S_k(X_1), . . . , S_k(X_t)), as well as (C_k(X_1), . . . , C_k(X_t)), cover the same coordinates.
Below, we list the modifications to be made to the previous algorithms for the multiparty case. These changes are similar for all three algorithms.

i. Input: in addition to the local vector X_Local, the k-sized largest-value sets S_k(X_1), . . . , S_k(X_t) and the corresponding ordered coordinate sets C_k(X_1), . . . , C_k(X_t), instead of single sets.
ii. The mass to be distributed over the coordinates of X*, m, becomes t − ∑_{i=1}^{t} ∑_{x∈S_k(X_i)} x, since there are t remote vectors to process, with vector X_i contributing 1 − ∑_{x∈S_k(X_i)} x.
iii. The return value is now computed with a division by t + 1 rather than by 2, since we have summed t additional vectors into X* and X_Known.
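Combining the three modifications, a multiparty upper bound might be sketched as follows (our illustration; it assumes the second communication round has already aligned the coordinate sets of all messages):

```python
import math

def entropy(X):
    """H(X) = sum of -x ln x over the positive entries of X."""
    return sum(-x * math.log(x) for x in X if x > 0)

def multiparty_upper_bound(x_local, messages):
    """Upper bound on the entropy of the average of x_local and t
    remote vectors; `messages` holds one (s_k, c_k) pair per remote
    node, all assumed to cover the same coordinate set."""
    t = len(messages)
    c_k = messages[0][1]
    extra = {i: 0.0 for i in c_k}
    m = float(t)                         # modification (ii)
    for s_k, ck in messages:
        m -= sum(s_k)
        for v, i in zip(s_k, ck):
            extra[i] += v
    # known coordinates contribute exactly; the rest are water-filled
    fixed = [(x_local[i] + extra[i]) / (t + 1) for i in c_k]
    xs = sorted(x_local[i] / (t + 1)
                for i in range(len(x_local)) if i not in extra)
    slack = m / (t + 1)                  # modification (iii)
    if not xs:
        return entropy(fixed)
    prefix, c = 0.0, None
    for j in range(1, len(xs)):          # closed-form water level
        prefix += xs[j - 1]
        if j * xs[j] - prefix >= slack:
            c = (prefix + slack) / j
            break
    if c is None:
        c = (sum(xs) + slack) / len(xs)
    return entropy(fixed + [max(x, c) for x in xs])
```

For t = 1, this reduces to the two-node upper bound.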

Experimental Results
To evaluate our algorithms, we tested them on both real and synthetic probability vectors. We now describe the methods and data used to perform our experiments and analyze the results.
In Figures 2 and 3, we simulated the upper bound algorithm (Algorithm 2) and the lower bound algorithm (Algorithm 3). Figure 2b depicts a simulation of the algorithms on two randomly generated vectors: Node 1 with a uniform distribution and Node 2 with a beta distribution. The probability vectors of Node 1 and Node 2 are shown in Figure 2a. Note that, as depicted in Figure 2b, the algorithms' results at each node are determined by the distribution of the local probability vectors: the more probability mass is transmitted, the tighter the bounds become, and the quicker they converge with respect to k. As illustrated, the bounds at Node 1, which receives the maximal values of Node 2's beta-distributed vector, converge quickly with respect to k. In contrast, the bounds computed at Node 2, which receives the maximal values of Node 1's uniform vector, converge slowly to the real entropy; this is because Node 2 does not gain much information from Node 1. Fortunately, the difference between the bounds of Node 1 and Node 2 is an advantage of the proposed algorithms: we can use the better one simply by comparing the bounds (which requires transmitting only one scalar).

Results on Synthetic Vectors
Another interesting observation can be drawn from Figure 2b: Node 1's lower bound is already quite close to the global entropy for very small k values. The algorithm works well here since the maximal value of Node 2 is not large, which in turn enables the algorithm to reach a tighter bound. Moving to real data, Figure 3b illustrates our experiments on the 20 Newsgroups Dataset [13], which includes about 20,000 newsgroup documents spanning 20 different topics. We measured the entropy of token frequency vectors (vectors whose values correspond to the frequencies of words or tokens in the documents) from the atheism-themed newsgroups and the hockey-themed newsgroups. To do so, we took the top 10,000 occurring tokens and created token frequency vectors from the first 200 articles of the atheism theme and the hockey theme. A visual illustration of the (sorted) token frequencies is given in Figure 3a. As can be observed, the atheism newsgroup is more verbally rich than the hockey newsgroup, having more words which are unique to it.
As demonstrated in Figure 3b, the upper bound computed by both nodes is almost the same. However, for the lower bound, Node 1 (atheism) converges faster to the real entropy as we increase the parameter k of the algorithm. We attribute that to the denser token histogram of the hockey theme; thus, more "probability mass" is transmitted for the same k. Figure 4 presents results for the multiparty case, as discussed in Section 3.4.  To conclude, the distribution of the probability vectors directly affects the tightness of the bounds. The less concentrated the probability vectors are, the less information we can send for every k; hence, the bounds become less tight, as demonstrated in Figure 5. A solution for this case is presented in Section 4.

Entropy Approximation
The algorithms described in Section 3 perform better in terms of communication overhead when there are a few relatively large values in the local frequency vectors, i.e., a substantial percentage of the overall "probability mass" resides in a relatively small percentage of the vectors' values. However, in the case in which the vectors are "flat"-that is, their distribution approaches a uniform one-the nodes will have to exchange many values in order to reach tight bounds on the overall entropy; see Figure 5. In this section, we offer a probabilistic solution to this problem.
Assume that two nodes N_1, N_2 hold vectors X, Y, and the goal is to approximate the entropy of the average vector (X + Y)/2, with a communication overhead that is small relative to n, the length of the vectors.
One solution, which was applied in previous work on monitoring entropy [10], is to use sketches. This popular technique found many applications in computer science, for example, for computations over distributed data [7]. A well-known sketch for entropy, which we describe in Section 4.1, is presented in [14]; see also [15].
Here, we use a different sketch, which, for our purposes, performed better than the sketch presented in [14]. As in Section 3, the two nodes first exchange all values which are greater than or equal to a threshold ε, whose value is determined by a communication/accuracy trade-off; hence, we assume hereafter that all values are smaller than ε. Next, choose a polynomial approximation, of degree at least 2, over the interval [0, ε], to the function h(t) = −t ln(t). Assuming in the meanwhile a degree-2 approximation, denote it by At² + Bt + C. The proposed method is oblivious to the choice of this approximation; we have used the approach of minimizing the L2 distance between h and the polynomial over [0, ε], which allows a closed-form solution. Using this quadratic function, we can approximate the entropy of the average vector (X + Y)/2 by

∑_{i=1}^{n} h((x_i + y_i)/2) ≈ (A/4)(∥X∥² + ∥Y∥²) + (A/2)⟨X, Y⟩ + B + Cn,

where we used ∑_i x_i = ∑_i y_i = 1. Note that, with the exception of the term ⟨X, Y⟩ = ∑_{i=1}^{n} x_i y_i, all terms can be computed locally and require O(1) communication overhead to transmit. Thus, it only remains to approximate the inner product ⟨X, Y⟩. To this end, we can apply an approximation based on the famed Johnson-Lindenstrauss Lemma [16]:

⟨X, Y⟩ ≈ (1/d) ∑_{j=1}^{d} ⟨R_j, X⟩⟨R_j, Y⟩,

where the R_j are independent random vectors with all values i.i.d. standard normal variables, generated by a pre-agreed upon random seed, and thus requiring no communication. A direct calculation yields that this estimate has expectation ⟨X, Y⟩ (i.e., it is unbiased), and its variance equals (∥X∥²∥Y∥² + ⟨X, Y⟩²)/d. Similarly, we can apply higher-order approximations. For a cubic approximation, we obtain a more complicated but identical-in-spirit sketch, which requires an approximation of the expressions ∑_{i=1}^{n} x_i² y_i and ∑_{i=1}^{n} x_i y_i²; this, too, can be achieved by applying the estimate above, since these quantities can also be represented as inner products of "local" vectors (e.g., ∑_i x_i² y_i = ⟨X², Y⟩, where X² is computed locally). Some results for two nodes are presented in Figure 6, in which the proposed sketch is compared to the one in [14] (see Section 4.1).
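The quadratic sketch can be prototyped in a few lines of Python. Two caveats: the three-point interpolation below is an illustrative stand-in for the closed-form L2 fit (any reasonable quadratic fit on [0, ε] will do, as the method is oblivious to this choice), and the dimension d and seed are arbitrary choices of ours:

```python
import math
import random

def quad_coeffs(eps):
    """Quadratic A t^2 + B t + C approximating h(t) = -t ln t on
    [0, eps]; Newton interpolation at three points stands in for the
    paper's closed-form least-squares fit."""
    ts = [eps / 4, eps / 2, eps]
    hs = [-t * math.log(t) for t in ts]
    d1 = (hs[1] - hs[0]) / (ts[1] - ts[0])
    d2 = ((hs[2] - hs[1]) / (ts[2] - ts[1]) - d1) / (ts[2] - ts[0])
    A = d2
    B = d1 - d2 * (ts[0] + ts[1])
    C = hs[0] - d1 * ts[0] + d2 * ts[0] * ts[1]
    return A, B, C

def jl_projections(X, d, seed):
    """<R_j, X> for d shared Gaussian vectors R_j; the pre-agreed seed
    means the R_j themselves are never communicated."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) * x for x in X) for _ in range(d)]

def approx_avg_entropy(X, Y, eps, d=512, seed=7):
    """Approximate H((X + Y)/2), assuming all values are below eps
    (larger values are assumed to have been exchanged beforehand)."""
    A, B, C = quad_coeffs(eps)
    n = len(X)
    # locally computable squared norms
    sq = sum(x * x for x in X) + sum(y * y for y in Y)
    # <X, Y> estimated from the d-dimensional JL sketches
    px = jl_projections(X, d, seed)
    py = jl_projections(Y, d, seed)
    ip = sum(a * b for a, b in zip(px, py)) / d
    # sum_i h(z_i) ~ A sum z_i^2 + B sum z_i + C n, with sum z_i = 1
    return A * (sq + 2 * ip) / 4 + B + C * n
```

Only the d scalars of each node's projection cross the network, instead of the n-dimensional vectors themselves.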
Extending the sketch to the multiparty scenario is straightforward; results are presented in Figure 7b.

The Clifford-Cosma Sketch
We compare our sketch to the entropy sketch proposed in [14]. That sketch is a linear projection of the probability vector, performed by a projection matrix with i.i.d. elements drawn from the maximally skewed stable distribution F(x; 1, −1, π/2, 0). The entropy approximation from the d-dimensional projected vector (y_1, . . . , y_d) is H(y_1, . . . , y_d) = ln(d) − ln(∑_{i=1}^{d} e^{y_i}).

Sketch Evaluation
We now compare the proposed sketch to the one in [14], which is denoted "CC". The proposed quadratic sketch is denoted "Poly2".

Conclusions
We have presented novel communication-efficient algorithms for bounding and approximating the entropy in a distributed setting. The algorithms were tested on real and synthetic data, yielding a substantial reduction in communication overhead. Future work will address both sketch-based techniques and further development of the dynamic bound algorithms presented here. In addition, we intend to address the efficient distributed computation of other functions.

Funding: This research received no external funding.

Data Availability Statement:
The data used is the publicly available 20 Newsgroups Dataset [13].

Conflicts of Interest:
The authors declare no conflict of interest.