Guessing with a Bit of Help

What is the value of just a few bits to a guesser? We study this problem in a setup where Alice wishes to guess an independent and identically distributed (i.i.d.) random vector and can procure a fixed number k of information bits from Bob, who has observed this vector through a memoryless channel. We are interested in the guessing ratio, which we define as the ratio of Alice's guessing-moments with and without observing Bob's bits. For the case of a uniform binary vector observed through a binary symmetric channel, we provide two upper bounds on the guessing ratio by analyzing the performance of the dictator function (for general k ≥ 1) and of the majority function (for k = 1). We further provide a lower bound via maximum entropy (for general k ≥ 1) and a lower bound based on Fourier-analytic/hypercontractivity arguments (for k = 1). We then extend our maximum-entropy argument to give a lower bound on the guessing ratio for a general channel with a binary uniform input, expressed in terms of the strong data-processing inequality constant of the reverse channel. We compute this bound for the binary erasure channel and conjecture that greedy dictator functions achieve the optimal guessing ratio.


I. INTRODUCTION
In the classical problem of guessing, Alice wishes to learn the value of a discrete random variable (r.v.) X as quickly as possible, by sequentially asking yes/no questions of the form "is X = x?", until she gets it right.
Alice's guessing strategy, which is the ordering of the alphabet of X according to which she states her guesses, induces a random guessing time. It is well known and simple to check that the optimal guessing strategy, which simultaneously minimizes all positive moments of the guessing time, is to guess in decreasing order of probability. Formally then, for any s > 0, the minimal sth-order guessing-time moment of X is

G_s(X) := E[ORD_X(X)^s],

where ORD_X(x) returns the index of the symbol x relative to the order induced by sorting the probabilities in descending order, with ties broken arbitrarily. For brevity, we refer to G_s(X) as the guessing-moment of X.
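As a small sanity check of the definitions above, the guessing-moment can be computed directly, and the optimality of the descending-probability order can be verified by exhaustive search. The distribution below is our own toy example, not one from the paper:

```python
from itertools import permutations

def moment_for_order(p, order, s=1):
    # E[T^s] when Alice guesses the symbols in the given order
    return sum((i + 1) ** s * p[sym] for i, sym in enumerate(order))

def guessing_moment(p, s=1):
    # G_s(X): guess in decreasing order of probability
    return sum((i + 1) ** s * q for i, q in enumerate(sorted(p, reverse=True)))

p = [0.5, 0.25, 0.15, 0.10]   # toy distribution (illustration only)
best = min(moment_for_order(p, o) for o in permutations(range(len(p))))
# `best` coincides with guessing_moment(p): descending order is optimal
```

The minimum over all 24 orders equals the guessing-moment obtained by sorting, illustrating that the descending order minimizes every positive moment.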
The guessing problem was first introduced and studied in an information-theoretic framework by Massey [1], who drew a relation between the average guessing time of an r.v. and its entropy, and was later explored more systematically by Arikan [2]. Several motivating problems for studying guesswork are fairness in betting games, computational complexity of lossy source coding and database search algorithms (see the introduction of [3] for a discussion), secrecy systems [4], [5], [6], cryptanalysis (password cracking) [7], [8], and computational complexity of sequential decoding [2]. In [2], Arikan introduced the problem of guessing with side information, where Alice is in possession of another r.v. Y that is jointly distributed with X. In that case, the optimal conditional guessing strategy is to guess by decreasing order of conditional probabilities, and hence the associated minimal conditional sth-order guessing-time moment of X given Y is

G_s(X | Y) := E[ORD_{X|Y}(X | Y)^s],

where ORD_{X|Y}(x | y) returns the index of x relative to the order induced by sorting the conditional probabilities of X, given that Y = y, in descending order. Arikan showed that, as intuition suggests, side information reduces the guessing-moments [2, Corollary 1]. Furthermore, he showed that if {(X_i, Y_i)}_{i=1}^n is an i.i.d. sequence, then [2, Prop. 5]

lim_{n→∞} (1/n) log G_s(X^n | Y^n) = s · H_{1/(1+s)}(X | Y),

where H_α(X | Y) is the Arimoto-Rényi conditional entropy of order α. The information-theoretic analysis of the guessing problem was further extended in multiple directions, such as allowing distortion in the guess [3], guessing under source uncertainty [9], improved bounds at finite blocklength [10] and an information-spectrum analysis [11], to name a few.

(The authors are with the Department of EE-Systems, Tel Aviv University, Tel Aviv, Israel. Emails: {nir.wein@gmail.com, ofersha@eng.tau.ac.il}. This work was supported by ERC grant no. 639573.)
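Arikan's observation that side information reduces the guessing-moment is easy to verify numerically. The joint distribution below is our own toy illustration (a symmetric 4-ary channel), not an example from the paper:

```python
def guessing_moment(p, s=1):
    return sum((i + 1) ** s * q for i, q in enumerate(sorted(p, reverse=True)))

def cond_guessing_moment(joint, s=1):
    # G_s(X | Y) = sum_y P(y) * G_s of the posterior P(. | y);
    # joint[y][x] = P(X = x, Y = y)
    g = 0.0
    for row in joint:
        py = sum(row)
        if py > 0:
            g += py * guessing_moment([pxy / py for pxy in row], s)
    return g

# Toy example: X uniform on {0,...,3}; Y = X w.p. 0.7,
# any other symbol w.p. 0.1 each
joint = [[0.25 * (0.7 if x == y else 0.1) for x in range(4)] for y in range(4)]
px = [sum(joint[y][x] for y in range(4)) for x in range(4)]
# conditioning reduces the moment: G_1(X|Y) = 1.6 < G_1(X) = 2.5
```

Each posterior here is (0.7, 0.1, 0.1, 0.1), so the conditional average guessing time is 0.7 + 0.2 + 0.3 + 0.4 = 1.6, versus 2.5 for a uniform 4-ary source.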
In the conditional setting described above, one may think of Y^n as side information observed by a "helper," say Bob, who then sends his observations to Alice. In this case, as in the problem of source coding with a helper [12], [13], it is more realistic to impose some communication constraints and assume that Bob can only send a compressed description of Y^n to Alice. This question was recently addressed by Graczyk and Lapidoth [14], [15], who considered the case where Bob encodes Y^n at a positive rate using nR bits before sending this description to Alice. They then characterized the best possible guessing-moments attained by Alice for general distributions, as a function of the rate R. In this paper, we take this setting to its extreme and attempt to quantify the value of a single bit in terms of reducing the guessing-moments, by allowing Bob to use only a one-bit description of Y^n.
To that end, we define (in Section II) and study the guessing efficiency, which is the (asymptotically) best possible multiplicative reduction in the guessing-moments of X^n offered by observing a Boolean function f(Y^n), i.e., the smallest possible ratio G_s(X^n | f(Y^n))/G_s(X^n), as a function of s, in the limit of large n.
Characterizing the guessing efficiency appears to be a difficult problem in general. Here we mostly focus on the special case where X^n is uniformly distributed over the Boolean cube {0, 1}^n, and Y^n is obtained by passing X^n through a memoryless binary symmetric channel (BSC) with crossover probability δ. We derive two upper bounds and two lower bounds on the guessing efficiency in this case. The upper bounds, presented in Section III, are derived by analyzing the efficiency attained by two specific functions, Dictator and Majority. We show that neither of these functions is better than the other for all values of the moment order s. The two lower bounds, presented in Section IV, are based on relating the guessing-moment to entropy using maximum-entropy arguments (generalizing [1]), and on Fourier-analytic techniques together with a hypercontractivity argument [16]. Several graphs illustrating the bounds are given in Section V. In Section VI we briefly discuss the more general case where X^n is still uniform over the Boolean cube, but Y^n is obtained from X^n via a general binary-input, arbitrary-output channel. We generalize our entropy lower bound to this case using the strong data-processing inequality (SDPI) applied to the reverse channel (from Y to X). We then discuss the case of the binary erasure channel (BEC), for which we also provide an upper bound by analyzing the Greedy Dictator function, namely where Bob sends the first bit that has not been erased. We conjecture that this function minimizes the guessing efficiency simultaneously at all erasure parameters and all moments s.
Related Work. Graczyk and Lapidoth [14], [15] considered the same guessing question in the case where Bob can communicate with Alice at some positive rate R, i.e., can use nR bits to describe Y^n. This setup facilitates the use of large-deviation-based information-theoretic techniques, which allowed the authors to characterize the optimal reduction in the guessing-moments as a function of R. We note that this type of random-coding argument cannot be applied in our extremal one-bit setup. Characterizing the guessing efficiency in the case of the BSC with a uniform input can also be thought of as a guessing variant of the most informative Boolean function problem introduced by Kumar and Courtade [17], who asked about the maximal reduction in the entropy of X^n obtainable by observing a Boolean function f(Y^n). They conjectured that a Dictator function, e.g. f(y^n) = y_1, is optimal simultaneously at all noise levels; see [18], [19], [20], [21] for some recent progress. We note that, as in the guessing case, allowing Bob to describe Y^n using nR bits renders the problem amenable to an exact information-theoretic characterization [22]. In another related work [23], we asked which Boolean function of Y^n maximizes the reduction in the sequential mean-squared prediction error of X^n, and showed that the Majority function is optimal in the noiseless case, yet that there is no single function that is simultaneously optimal at all noise levels.
Finally, a recent work [24] considered the average guessing time aided by a noisy version of f(X^n). By contrast, in this paper the noise is applied to the inputs of the function, rather than to its output.

II. PROBLEM STATEMENT
Let X^n be an i.i.d. vector drawn from a distribution P_X, which is transmitted over a memoryless channel with conditional distribution P_{Y|X}. Bob observes Y^n at the output of the channel, and can send one bit, computed by a Boolean function f: {0, 1}^n → {0, 1}, to Alice, who in turn needs to guess X^n. Our goal is to characterize the best possible multiplicative reduction in guessing-moments offered by a function f, in the limit of large n. Precisely, we wish to characterize the guessing efficiency, defined as

γ_s := limsup_{n→∞} min_f G_s(X^n | f(Y^n))/G_s(X^n). (5)

In this paper we are mostly interested in the case where P_X = (1/2, 1/2), i.e., X^n is uniformly distributed over {0, 1}^n, and where the channel is a BSC with crossover probability δ ∈ [0, 1/2]. With a slight abuse of notation, we denote the guessing efficiency in this case by γ_s(δ). Before we proceed, we note the following simple facts.
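For very small n, the minimum over f in the definition of the guessing efficiency can be found by brute force over all 2^(2^n) Boolean functions. The sketch below (our own illustration, not code from the paper) does this for the uniform-input BSC:

```python
from itertools import product

def hamming(a, b):
    return bin(a ^ b).count("1")

def guessing_moment(p, s=1):
    return sum((i + 1) ** s * q for i, q in enumerate(sorted(p, reverse=True)))

def best_one_bit_ratio(n, delta, s=1):
    # min over all Boolean f of G_s(X^n | f(Y^n)) / G_s(X^n), BSC(delta)
    N = 2 ** n
    joint = [[(delta ** hamming(x, y)) * ((1 - delta) ** (n - hamming(x, y))) / N
              for y in range(N)] for x in range(N)]
    g_uncond = guessing_moment([1.0 / N] * N, s)
    best = float("inf")
    for f in product((0, 1), repeat=N):      # f as a truth table over y^n
        g = 0.0
        for b in (0, 1):
            px_b = [sum(joint[x][y] for y in range(N) if f[y] == b)
                    for x in range(N)]
            pb = sum(px_b)
            if pb > 0:
                g += pb * guessing_moment([p / pb for p in px_b], s)
        best = min(best, g)
    return best / g_uncond

# e.g. best_one_bit_ratio(3, 0.1): minimum over all 256 Boolean functions
```

At δ = 0 any balanced function attains the minimum (ratio 5/9 for n = 3, s = 1), while at δ = 1/2 the bit is useless and the ratio is 1, matching the extreme cases discussed below.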
Proposition 1. The following claims hold: 1) For γ_s(δ), the limit-supremum in (5) is a regular limit, achieved by a sequence of deterministic functions. 2) γ_s(δ) is non-decreasing in δ. 3) γ_s(1/2) = 1 and γ_s(0) = 2^{-s}.
Proof: See Appendix A.

III. UPPER BOUNDS ON γ s (δ)
In this section we derive two upper bounds on the BSC guessing efficiency γ_s(δ), by analyzing two simple functions: the Dictator function and the Majority function. Let a, b ∈ N, a ≤ b, be given. The following sum will be useful for the derivations in the rest of the paper:

K_s(a, b) := Σ_{i=a}^{b} i^s,

where we will abbreviate K_s(b) := K_s(1, b).

Lemma 2. Let {a_n}_{n=1}^∞ and {b_n}_{n=1}^∞ be non-decreasing integer sequences such that a_n < b_n for all n and lim_{n→∞} (a_n + 1)/b_n = 0. Then,

K_s(a_n, b_n) ≅ b_n^{s+1}/(s+1),

in the sense that the ratio between the two sides tends to 1 as n → ∞.

Proof: See Appendix A.
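The asymptotics of K_s are easy to probe numerically; the following sketch (our own check) shows the ratio K_s(1, b)/(b^{s+1}/(s+1)) approaching 1 from above as b grows:

```python
def K(s, a, b):
    # K_s(a, b) := sum_{i=a}^{b} i^s
    return sum(i ** s for i in range(a, b + 1))

# ratio against the Lemma-2 asymptote b^{s+1}/(s+1), for s = 2
ratios = [K(2.0, 1, b) / (b ** 3 / 3) for b in (10, 100, 1000)]
# ratios decrease toward 1 as b grows
```

Dropping an initial segment (the role of a_n in the lemma) barely changes the sum, since the tail of i^s dominates when (a_n + 1)/b_n → 0.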
Theorem 3. We have

γ_s(δ) ≤ (1 − δ)·2^{−s} + δ·(2 − 2^{−s}),

and guessing efficiency equal to the right-hand side (r.h.s.) can be achieved by a Dictator function.
Proof: Assume without loss of generality that f(y^n) = y_1. As 0 < δ < 1/2, it is easily verified that given y_1, the optimal guessing strategy is to first guess the 2^{n−1} vectors for which x_1 = y_1 (in an arbitrary order), and then the remaining 2^{n−1} vectors (again, in an arbitrary order). From symmetry,

G_s(X^n | f(Y^n)) = [(1 − δ)·K_s(2^{n−1}) + δ·K_s(2^{n−1} + 1, 2^n)]/2^{n−1},

and the result then follows from (5) and Lemma 2.
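The finite-n expression in the proof can be cross-checked by brute force: for f(y^n) = y_1, the identity G_s(X^n | f(Y^n)) = [(1 − δ)K_s(2^{n−1}) + δK_s(2^{n−1}+1, 2^n)]/2^{n−1} holds exactly, which the sketch below (our own illustration) verifies:

```python
def Ks(s, a, b):
    return sum(i ** s for i in range(a, b + 1))

def hamming(a, b):
    return bin(a ^ b).count("1")

def dictator_moment_bruteforce(n, delta, s=1):
    # exact G_s(X^n | f(Y^n)) for uniform X^n through BSC(delta), f(y^n) = y_1
    N = 2 ** n
    g = 0.0
    for bit in (0, 1):
        joint = [sum(delta ** hamming(x, y) * (1 - delta) ** (n - hamming(x, y)) / N
                     for y in range(N) if (y >> (n - 1)) == bit) for x in range(N)]
        pb = sum(joint)
        post = sorted((v / pb for v in joint), reverse=True)
        g += pb * sum((i + 1) ** s * q for i, q in enumerate(post))
    return g

def dictator_moment_closed(n, delta, s=1):
    half = 2 ** (n - 1)
    return ((1 - delta) * Ks(s, 1, half) + delta * Ks(s, half + 1, 2 * half)) / half
```

For n = 3, δ = 0.1, s = 1 both evaluate to (0.9·10 + 0.1·26)/4 = 2.9.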
We next consider the guessing efficiency of the Majority function.
Theorem 4. Let Z ∼ N(0, 1). Then, γ_s(δ) is upper bounded by the expression in (13), where Q(·) is the tail distribution function of the standard normal distribution, and guessing efficiency equal to the r.h.s. of (13) can be achieved by the Majority function.
Proof: We assume for simplicity that n is odd; the analysis for even n is not fundamentally different. In this case, Maj(y^n) = 1(W_H(y^n) > n/2), where 1(·) is the indicator function and W_H(x^n) denotes the Hamming weight of x^n. To evaluate the guessing-moment, we first need to find the optimal guessing strategy. To this end, we note that the posterior probability Pr(X^n = x^n | Maj(Y^n) = 1) depends on x^n only through its Hamming weight w = W_H(x^n); denote it by r_n(w) (16), where (16) follows from symmetry. Evidently, r_n(w) is an increasing function of w ∈ {0, 1, ..., n}. Indeed, let B ∼ Bin(n, δ) be a binomial r.v. of n trials and success probability δ. Then, the required monotonicity holds for any w ≤ n − 1, as δ ≤ 1/2, where in each of the compared probabilities the summations are over independent binomial r.v.'s. Hence, we deduce that whenever Maj(Y^n) = 1 (resp. Maj(Y^n) = 0), the optimal guessing strategy is by decreasing (resp. increasing) Hamming weight (with arbitrary order for inputs of equal Hamming weight).
We can now turn to evaluate the guessing-moment under the optimal strategy given Majority. Let M_{−1} = 0, and define the partial sums M_w accordingly. The guessing-moment can then be expressed in terms of W, W′ ∼ Bin(n, 1/2), independent. We evaluate this expression asymptotically via the central-limit theorem; specifically, the Berry-Esseen theorem [25, Chapter XVI] gives that
for some universal constant a_δ. Using the Berry-Esseen central-limit theorem again, we have that, with Z ∼ N(0, 1), for a given w, the last equality follows from the fact that |Q′(t)| ≤ 1/√(2π). Similarly to (25), a matching upper bound holds, and a similar analysis leads to an expression which asymptotically coincides with the r.h.s. of (32). The result then follows from (5) and Lemma 2.
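The monotonicity of r_n(w) in the Hamming weight, which underlies the weight-ordered guessing strategy, is easy to verify exhaustively for small n. The sketch below (our own check) computes P(Maj(Y^n) = 1 | X^n = x^n), which by Bayes' law is proportional to the posterior of x^n given Maj(Y^n) = 1:

```python
def maj_posterior_by_weight(n, delta):
    # r(w) = P(Maj(Y^n) = 1 | X^n of Hamming weight w), BSC(delta), n odd
    r = []
    for w in range(n + 1):
        x = (1 << w) - 1                        # a representative input of weight w
        p = 0.0
        for z in range(2 ** n):                 # z = BSC error pattern
            k = bin(z).count("1")
            if bin(x ^ z).count("1") > n // 2:  # Maj(y^n) = 1
                p += delta ** k * (1 - delta) ** (n - k)
        r.append(p)
    return r

r = maj_posterior_by_weight(5, 0.2)
# r is strictly increasing in w, and r(w) + r(n - w) = 1 by symmetry
```

The symmetry check r(w) + r(n − w) = 1 follows from complementing both the input and the noise pattern, which flips the Majority bit when n is odd.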
We remark that the guessing efficiency of functions similar to Dictator and Majority, such as a Dictator on k > 1 bits or a threshold function 1(W_H(y^n) > t) for some t, may also be analyzed in a similar way. However, numerical computations indicate that they do not improve the bounds obtained in Theorems 3 and 4, and thus their analysis is omitted.

IV. LOWER BOUNDS ON γ s (δ)
We derive two lower bounds on the BSC guessing efficiency γ_s(δ): one based on maximum-entropy arguments, and the other on Fourier-analytic arguments.

A. A Maximum-Entropy Bound
Theorem 5. We have

γ_s(δ) ≥ 2^{−s(1−2δ)^2}.

Proof: With a standard abuse of notation, let us write the guessing-moment and the entropy as functions of the distribution. Consider the following maximum-entropy problem [26, Ch. 12]:

max_{P: G_s(P)=g} H(P),

where it should be noted that the support of P is only restricted to be countable. Assuming momentarily that the entropy is measured in nats, it is easily verified (using the theory of exponential families [27, Ch. 3] or standard Lagrangian duality [28, Ch. 5]) that the entropy-maximizing distribution is

P_λ(i) = exp(−λ i^s)/Z(λ), for i ∈ N_+,

where Z(λ) := Σ_{i=1}^∞ exp(−λ i^s) is the partition function, and λ > 0 is chosen such that G_s(P_λ) = g. Thus, the resulting maximum entropy is given in parametric form as

H(P_λ) = log Z(λ) + λ·G_s(P_λ).

Evidently, if g = G_s(P_λ) → ∞ then λ → 0, and in this case we may approximate the partition function for λ → 0 by a Riemann integral. Specifically, by the monotonicity of e^{−λ i^s} in i ∈ N,

Z(λ) ≤ ∫_0^∞ e^{−λ t^s} dt = λ^{−1/s} Γ(1 + 1/s),

where we have used the definition of the Gamma function Γ(z) := ∫_0^∞ t^{z−1} e^{−t} dt in the last equality. Further, by the convexity of e^{−λ t^s} in t ∈ R_+, a matching lower bound holds up to a multiplicative factor 1 + a_λ, where a_λ → 0 as λ → 0. In the same spirit, g = G_s(P_λ) = (1 + b_λ)/(sλ), where b_λ → 0 as λ → 0, using the identity Γ(t + 1) = tΓ(t) for t ∈ R_+. Returning to measure entropy in bits, we thus obtain that, for a constant c_s depending only on s,

G_s(P) ≥ 2^{s·H(P) − c_s − o(1)}, (50)

where the o(1) term vanishes as G_s(P) → ∞; in the same spirit, (50) holds whenever H(P) → ∞.

We return to the Boolean helper problem. Using (50) once for the guessing-moment conditioned on f(Y^n) = 0 and once on f(Y^n) = 1, we get (see a detailed justification of (51) in Appendix A)

G_s(X^n | f(Y^n)) ≥ k_n · E[2^{s·H(X^n | f(Y^n) = B) − c_s}] (51)
≥ k_n · 2^{s·H(X^n | f(Y^n)) − c_s}, (52)

where B := f(Y^n), k_n ≅ 1 in (51), and (52) follows from Jensen's inequality. The bound

H(X^n | f(Y^n)) ≥ n − (1 − 2δ)^2 (53)

is directly related to the most informative Boolean function conjecture [17], and may be proved in several ways, e.g., using Mrs. Gerber's Lemma [30, Th. 1]; see [31, Section IV], [22], [18]. Combining (51)-(53), normalizing by G_s(X^n), and letting n → ∞ (Lemma 2) yields the theorem.
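The Riemann-integral approximation of the partition function used in the proof can be checked numerically (our own sketch):

```python
import math

def partition(lam, s, tol=1e-12):
    # Z(lam) = sum_{i >= 1} exp(-lam * i^s), truncated once terms drop below tol
    Z, i = 0.0, 1
    while True:
        t = math.exp(-lam * i ** s)
        if t < tol:
            return Z
        Z += t
        i += 1

# Z(lam) ~ Gamma(1 + 1/s) * lam^{-1/s} as lam -> 0
approx = lambda lam, s: math.gamma(1 + 1 / s) * lam ** (-1 / s)
```

For s = 1 the sum is geometric, Z(λ) = 1/(e^λ − 1) ≈ 1/λ, matching Γ(2)·λ^{−1}; for general s the ratio to the Gamma-function approximation tends to 1 as λ → 0.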
Remark 6. In [1], the maximum-entropy problem was studied for s = 1. In this case, the maximum-entropy distribution is readily identified as the geometric distribution. The proof above generalizes that result to any s > 0.
Remark 7. In [18], the bound used in the proof of Theorem 5 above (see (53)) was improved for balanced functions, under a condition on δ. Using it here leads to an immediate improvement in the bound of Theorem 5. Furthermore, it is known [19] that there exists δ_0 such that the most informative Boolean function conjecture holds for all δ_0 ≤ δ ≤ 1/2. For such crossover probabilities, H(X^n | f(Y^n)) ≥ n − (1 − h(δ)) holds, and then Theorem 5 may be improved to

γ_s(δ) ≥ 2^{−s(1−h(δ))}. (55)

B. A Fourier-Analytic Bound
The second bound is based on Fourier analysis of Boolean functions [16], and so we briefly remind the reader of the basic definitions and results. To that end, it is convenient to assume that the binary alphabet is {−1, 1} instead of {0, 1}. An inner product between two real-valued functions on the Boolean cube f, g: {−1, 1}^n → R is defined as

⟨f, g⟩ := E[f(X^n) g(X^n)], (56)

where X^n ∈ {−1, 1}^n is a uniform Bernoulli vector. A character associated with a set of coordinates S ⊆ [n] := {1, 2, ..., n} is the Boolean function x^S := Π_{i∈S} x_i, where by convention x^∅ = 1. It can be shown [16, Chapter 1] that the set of all characters forms an orthonormal basis with respect to the inner product (56). Furthermore,

f = Σ_{S⊆[n]} f̂_S · x^S,

where {f̂_S}_{S⊆[n]} are the Fourier coefficients of f, given by f̂_S := ⟨f, x^S⟩. Letting the correlation parameter be defined as ρ := 1 − 2δ, the noise operator is defined to be

(T_ρ f)(x^n) := E[f(Y^n) | X^n = x^n], so that T_ρ f = Σ_{S⊆[n]} ρ^{|S|} f̂_S · x^S.

The noise operator has a smoothing effect on the function, which is captured by the so-called hypercontractivity theorems. Specifically, we shall use the following version.

Theorem 8 ([16]). For any f: {−1, 1}^n → R and ρ ∈ [0, 1],

‖T_ρ f‖_2 ≤ ‖f‖_{1+ρ^2}.
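The Fourier machinery above is easy to exercise on a small example. The sketch below (our own illustration) computes the coefficients of Majority on 3 bits, checks Parseval's identity, and evaluates the noise stability ⟨f, T_ρ f⟩ = Σ_S ρ^{|S|} f̂_S²:

```python
from itertools import combinations, product

def chi(S, x):
    # character x^S = prod_{i in S} x_i on {-1, 1}^n
    out = 1
    for i in S:
        out *= x[i]
    return out

def fourier_coeffs(f, n):
    # \hat f_S = <f, x^S> = E[f(X^n) x^S], X^n uniform on {-1, 1}^n
    cube = list(product((-1, 1), repeat=n))
    subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
    return {S: sum(f(x) * chi(S, x) for x in cube) / len(cube) for S in subsets}

maj3 = lambda x: 1 if sum(x) > 0 else -1      # Majority on 3 bits
fh = fourier_coeffs(maj3, 3)
rho = 0.5
stab = sum(rho ** len(S) * c * c for S, c in fh.items())   # <f, T_rho f>
# Maj3 = (x1 + x2 + x3)/2 - x1 x2 x3 / 2, so the singleton
# coefficients are 1/2 and the full-set coefficient is -1/2
```

Parseval gives Σ_S f̂_S² = E[f²] = 1, and the noise operator geometrically attenuates the weight on larger sets S.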
Our Fourier-based bound is as follows.

Theorem 9. We have the bound (59). This bound can be weakened by the possibly sub-optimal choice λ = 1, which leads to the simpler and explicit bound:

Corollary 10. We have the bound (60).

Proof of Theorem 9: From Bayes' law (recalling that X^n is uniform) and from the law of total expectation, the guessing-moment decomposes as in (62). Letting f̂_∅ = E f(X^n), and defining g := f − f̂_∅, the first addend on the r.h.s. of (62) is given by (66), where in the last equality ℓ_n ≅ 1 (Lemma 2). Let λ ∈ [0, 1], and denote ρ_1 := ρ^λ and ρ_2 := ρ^{1−λ}. Then, the inner-product term in (66) may be upper bounded as in (67)-(72), where (67) holds since T_ρ is a self-adjoint operator, and (68) follows from the Cauchy-Schwarz inequality. For (69), note that (74) follows from Plancherel's identity, (75) holds since ĝ_S = f̂_S for all S ≠ ∅ and ĝ_∅ = 0, and (76) follows similarly; (70) follows from Theorem 8, and in (72), k_n ≅ 1. The second addend on the r.h.s. of (62) can be bounded in the same manner. Hence, the theorem follows as n → ∞.

V. GRAPHS
In Fig. 1 (resp. Fig. 2) the bounds on γ_s(δ) are plotted for fixed values of s (resp. δ). As for the upper bounds, it can be found that when s ≲ 3.5, Dictator dominates Majority (for all values of δ), whereas for s ≳ 4.25, Majority dominates Dictator. For 3.5 ≲ s ≲ 4.25 there exists δ′_s such that Majority is better for δ ∈ (0, δ′_s) and Dictator is better for δ ∈ (δ′_s, 1/2). Fig. 2 demonstrates the switch from Dictator to Majority as s increases (depending on δ). As for the lower bounds, first note that the conjectured maximum-entropy bound (55) is also plotted (see Remark 7). It can be observed that the maximum-entropy bound is better for low values of δ, whereas the Fourier-analytic bound is better for high values. As a function of s, it seems that the maximum-entropy bound (resp. Fourier-analytic bound) is better for high (resp. low) values of s. Finally, we mention that for large s the maximizing parameter in the Fourier-based bound (Theorem 9) is λ = 1, and the resulting bound is as in (60). For values of s as low as 10, however, the maximizing λ may be far from 1, and in fact it continuously and monotonically increases from 0 to 1 as δ increases from 0 to 1/2.

VI. GUESSING EFFICIENCY FOR A GENERAL BINARY INPUT CHANNEL
In this section, we consider the guessing efficiency for general channels with a uniform binary input. The lower bound of Theorem 5 can be easily generalized to this case. To that end, consider the SDPI constant [32], [33] of the reverse channel (P_Y, P_{X|Y}), given by

η(P_Y, P_{X|Y}) := sup_{Q_Y ≠ P_Y} D(Q_X ‖ P_X)/D(Q_Y ‖ P_Y),

where Q_X is the X-marginal of Q_Y ∘ P_{X|Y}. As was shown in [34, Th. 2], the SDPI constant of (P_Y, P_{X|Y}) is also given by

η(P_Y, P_{X|Y}) = sup_{U: U−Y−X, I(U;Y)>0} I(U; X)/I(U; Y).

Theorem 11. We have

γ_s ≥ 2^{−s·η(P_Y, P_{X|Y})}.

Proof: The proof follows the same lines as the proof of Theorem 5, up to (52), yielding

G_s(X^n | f(Y^n)) ≥ k_n · 2^{s·(n − I(X^n; f(Y^n))) − c_s}. (83)

Since the SDPI constant tensorizes (see [34] for an argument obtained by relating the SDPI constant to the hypercontractivity parameter, or [35, p. 5] for a direct proof), we have, for all f,

I(X^n; f(Y^n)) ≤ η(P_Y, P_{X|Y}) · I(f(Y^n); Y^n) ≤ η(P_Y, P_{X|Y}). (89)

Inserting (89) into (83), normalizing by G_s(X^n), and using the definition of the guessing efficiency (5) completes the proof.
Remark 12. It is evident from (88) that if the helper is allowed to send k bits, then the associated k-bit guessing efficiency is lower bounded by 2^{−s·k·η(P_Y, P_{X|Y})}.

Remark 13. The bound for the BSC case (Theorem 5) is indeed a special case of Theorem 11, as the reverse channel of a uniform-input BSC is again a BSC with uniform input and the same crossover probability. For BSCs, it is well known that the SDPI constant is (1 − 2δ)^2 [32, Th. 9].
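The value η = (1 − 2δ)² for the BSC can be approached numerically by a grid search over the defining supremum (our own sketch; the supremum is attained in the limit Q_Y → P_Y, so the grid point nearest to P_Y dominates):

```python
import math

def kl(p, q):
    # binary KL divergence in bits
    def t(a, b):
        return 0.0 if a == 0 else a * math.log2(a / b)
    return t(p, q) + t(1 - p, 1 - q)

def bsc_sdpi_numeric(delta, eps=1e-3, grid=499):
    # sup over Q_Y = (q, 1-q) of D(Q_X || P_X) / D(Q_Y || P_Y);
    # the reverse channel of a uniform-input BSC(delta) is again BSC(delta)
    best = 0.0
    for k in range(1, grid):
        q = 0.5 + k * eps
        if q >= 1.0:
            break
        qx = q * (1 - delta) + (1 - q) * delta
        best = max(best, kl(qx, 0.5) / kl(q, 0.5))
    return best

# bsc_sdpi_numeric(0.25) is close to (1 - 2*0.25)**2 = 0.25
```

By the data-processing inequality the ratio never exceeds 1, and it shrinks with the noise, consistent with the BSC lower bound of Theorem 5.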
Next, we consider in more detail the case where the observation channel is a BEC.

A. Binary Erasure Channel
Suppose that Y^n ∈ {0, 1, e}^n is obtained from X^n by erasing each bit independently with probability ε ∈ [0, 1].
As before, Bob observes the channel output Y^n and can send one bit f: {0, 1, e}^n → {0, 1} to Alice, who wishes to guess X^n. With a slight abuse of notation, the guessing efficiency (5) will be denoted by γ_s(ε).
To compute the lower bound of Theorem 11, we need to find the SDPI constant associated with the reverse channel, which is easily verified to be the channel that outputs X = Y whenever Y ∈ {0, 1}, and X ∼ Ber(1/2) whenever Y = e, with input distribution P_Y = ((1−ε)/2, ε, (1−ε)/2). Letting Q_Y(y) = q_y for y ∈ {0, 1, e} yields Q_X(x) = q_x + q_e/2 for x ∈ {0, 1}. The computation of η(P_Y, P_{X|Y}) is now a simple three-dimensional constrained optimization problem.
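This three-dimensional optimization is easy to approximate by a grid search over the simplex. The sketch below is our own illustration; the grid resolution limits its accuracy, so it only brackets the constant:

```python
import math

def kl_vec(q, p):
    # KL divergence in bits between finite distributions q and p
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def bec_reverse_sdpi(eps, steps=60):
    # reverse channel of a uniform-input BEC(eps): x = y for y in {0, 1},
    # x ~ Ber(1/2) for y = e; P_Y = ((1-eps)/2, eps, (1-eps)/2)
    pY = ((1 - eps) / 2, eps, (1 - eps) / 2)
    pX = (0.5, 0.5)
    best = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            q0, qe = i / steps, j / steps
            q1 = 1 - q0 - qe
            dY = kl_vec((q0, qe, q1), pY)
            if dY > 1e-9:
                best = max(best, kl_vec((q0 + qe / 2, q1 + qe / 2), pX) / dY)
    return best
```

As expected, the resulting constant decreases with the erasure probability (heavier erasures mean weaker contraction toward the X-marginal), and it never exceeds 1 by the data-processing inequality.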
We plotted the resulting lower bound for s = 1 in Fig. 3.
Let us now turn to upper bounds, and focus for simplicity on the average guessing time, i.e., the guessing-moment for s = 1. To begin, let S represent the set of indices of the symbols that were not erased, i.e., i ∈ S if and only if Y_i ≠ e. Conditioned on S, the bit sent by Bob is f_S(X_S), where f_S: {0, 1}^{|S|} → {0, 1} designates the operation of the function when S is the set of non-erased symbols.
We also let Pr(S) = (1 − ε)^{|S|} · ε^{|S^c|} be the probability that the non-erased symbols have index set S. Then, the joint probability distribution of X^n and f(Y^n) is given by (95), and the posterior follows similarly (97). Interestingly, for any given f, the optimal guessing order given that f(Y^n) = 0 is the reverse of the optimal guessing order given that f(Y^n) = 1. Also apparent is that the posterior probability is determined by a mixture of the functions {f_S}. This may be contrasted with the BSC case, in which the posterior is determined by a single Boolean function (though with a noisy input).
A seemingly natural choice is the Greedy Dictator function, for which f(Y^n) sends the first non-erased bit.
Concretely, letting I := min{i : Y_i ≠ e} (with I := ∞ if all bits are erased), the Greedy Dictator is defined by

G-Dict(Y^n) := Y_I if I < ∞, and G-Dict(Y^n) ∼ Ber(1/2) otherwise,

where Ber(α) is a Bernoulli r.v. of success probability α. From an analysis of the posterior probability (see Appendix A), it is evident that conditioned on f(Y^n) = 0, an optimal guessing order must satisfy that x^n is guessed before z^n whenever Σ_{i: x_i=0} ε^{i−1} > Σ_{i: z_i=0} ε^{i−1}, which can be loosely thought of as comparing the "base 1/ε expansion" of x^n and z^n. Furthermore, when ε is close to 1, the optimal guessing order tends toward a minimum Hamming weight rule (or maximum Hamming weight in case f(Y^n) = 1).
Proof: See Appendix A.
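The Greedy Dictator and its induced posterior can be examined exactly for small n. The sketch below (our own illustration, using exact rational arithmetic) verifies that the function is balanced and that, for ε < 1/2, the posterior given G-Dict(Y^n) = 0 is non-increasing in lexicographic order:

```python
from fractions import Fraction
from itertools import product

def gdict_posterior_zero(n, eps):
    # joint P(X^n = x, G-Dict(Y^n) = 0) for uniform X^n over {0,1}^n, BEC(eps);
    # eps is a Fraction so that all arithmetic is exact
    post, total = {}, Fraction(0)
    for x in product((0, 1), repeat=n):
        w = Fraction(0)
        for pat in product((0, 1), repeat=n):    # 1 marks an erased position
            py = Fraction(1)
            for p in pat:
                py *= eps if p else 1 - eps
            first = next((x[i] for i in range(n) if pat[i] == 0), None)
            if first is None:
                w += py * Fraction(1, 2)         # all erased: fair coin sends 0
            elif first == 0:
                w += py
        post[x] = Fraction(1, 2 ** n) * w
        total += post[x]
    return post, total

post, p0 = gdict_posterior_zero(3, Fraction(2, 5))   # eps = 0.4 < 1/2
vals = [post[x] for x in sorted(post)]               # lexicographic order of x
# p0 = 1/2 (G-Dict is balanced); vals is non-increasing
```

For ε < 1/2, each weight ε^{i−1} in the posterior expression exceeds the sum of all later ones, so the "base 1/ε expansion" comparison reduces to the plain lexicographic order, as the check confirms.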
The guessing efficiency of the Greedy Dictator for s = 1 can be evaluated, and the analysis leads to an upper bound on γ_1(ε), stated as Theorem 15 and proved in Appendix A. Recall that if the posterior induced by one function majorizes the posterior induced by another, then its guessing-moments are smaller [24, Proposition 1]. Hence, one approach toward identifying the optimal function could be to try to find a function whose induced posterior distributions majorize the corresponding posteriors induced by any other function with the same bias (it is, of course, not clear that such a function even exists). This approach unfortunately fails for the Greedy Dictator. For example, the posterior distributions induced by setting f_S to be Majority functions are not always majorized by those induced by the Greedy Dictator (although they seem to be "almost" majorized), e.g.
for n = 5 and ε = 0.4, even though the average guessing time of the Greedy Dictator is lower. In fact, the guessing-moments of the Greedy Dictator seem to be better than those of Majority irrespective of the value of s.

ACKNOWLEDGMENTS
We are very grateful to Amos Lapidoth and Robert Graczyk for discussing their recent work on guessing with a helper [14], [15] during the second author's visit to ETH, which provided the impetus for this work.
APPENDIX A

PROOFS
Proof of Proposition 1: The claim that random functions do not improve upon deterministic ones follows directly from the property that conditioning reduces the guessing-moment [2, Corollary 1]. Monotonicity follows from the fact that Bob can always simulate a noisier channel. Now, if δ = 1/2 then X^n and Y^n are independent, and G_s(X^n | f(Y^n)) = G_s(X^n) ≅ 2^{sn}/(s+1) for any f (Lemma 2). For δ = 0, let Pr(f(Y^n) = 1) := q, and assume without loss of generality that q ≤ 1/2. Given f(Y^n) = b, the posterior is uniform over f^{−1}(b), and so

G_s(X^n | f(Y^n)) = [K_s(q·2^n) + K_s((1 − q)·2^n)]/2^n,

which, by convexity, is minimized when q = 1/2. Thus, the minimal G_s(X^n | f(Y^n)) is obtained by any balanced function, and equals K_s(2^{n−1})/2^{n−1} ≅ 2^{s(n−1)}/(s+1) (Lemma 2), so that γ_s(0) = 2^{−s}. To prove that the limit in (5) exists, let {f*_n} be a sequence of functions such that f*_n achieves the minimum in (5) at blocklength n. Comparing the normalized guessing-moment at blocklength n + 1 to that at blocklength n introduces a correction sequence ℓ_n. To control ℓ_n, note that we may start the summation in the numerator of (114) from i = 2; using (138) and (140) (proof of Lemma 2 below), we deduce that there exist c, C > 0 bounding the relevant terms, and consequently ℓ_n → 1 as n → ∞. Now, (123) implies that the sequence of normalized minimal guessing-moments is non-increasing and bounded below by 0, and thus has a limit; hence γ_s(δ) is given by a regular limit.
Proof of Lemma 2: Due to the monotonicity of i^s, standard bounds on sums by integrals give

(b_n^{s+1} − (a_n − 1)^{s+1})/(s+1) ≤ K_s(a_n, b_n) ≤ ((b_n + 1)^{s+1} − a_n^{s+1})/(s+1),

and since (a_n + 1)/b_n → 0, both sides equal b_n^{s+1}/(s+1)·(1 + o(1)), which proves the lemma.

Proof of (100): Let us evaluate the posterior probability conditioned on G-Dict(Y^n) = 0. Since G-Dict is balanced, Bayes' law implies that

Pr(X^n = x^n | G-Dict(Y^n) = 0) = 2^{−(n−1)}·[(1 − ε)·Σ_{i: x_i=0} ε^{i−1} + ε^n/2].

This immediately leads to the guessing rule in (100). As stated at the beginning of Section VI, the guessing rule for G-Dict(Y^n) = 1 is the reversed order.

Proof of Proposition 14: We denote the lexicographic order by ORD_lex. Assume that G-Dict(Y^n) = 0 and that ORD_lex(x^n) ≤ ORD_lex(z^n). Then, there exists j ∈ [n] such that x^{j−1} = z^{j−1} (where x^0 is the empty string) and x_j = 0 < z_j = 1. A term-by-term comparison in the posterior expression derived in the proof of (100) then shows that Pr(X^n = x^n | G-Dict(Y^n) = 0) ≥ Pr(X^n = z^n | G-Dict(Y^n) = 0). This proves the first statement of the proposition. Now, let ORD_0 (resp. ORD_1) be the guessing order given that the received bit is 0 (resp. 1), and let f be the Boolean function (neither necessarily optimal). Then, from (97) and (95),

G_1(X^n | f(Y^n)) ≥ Σ_S Pr(S) Σ_{x^n} min{ P_{ORD_0}(x_S ‖ S), P_{ORD_1}(x_S ‖ S) }, (159)

where for b ∈ {0, 1}, the projected orders P_{ORD_b}(· ‖ S) are defined by restricting ORD_b to the non-erased coordinates S. It is easy to verify that if ORD_0 (resp. ORD_1) is the lexicographic (resp. reversed lexicographic) order, then the Greedy Dictator achieves (159) with equality, due to the following simple property: if ORD_lex(x^n) < ORD_lex(z^n), then

ORD_lex(x_S) ≤ ORD_lex(z_S) (161)

for all S ⊆ [n]. This can be proved by induction over n. For n = 1 the claim is easily asserted. Suppose it holds for n − 1, and let us verify it for n. If 1 ∈ S, then whenever ORD_lex(x^n) < ORD_lex(z^n), either x_1 < z_1, or x_1 = z_1 and ORD_lex(x_2^n) < ORD_lex(z_2^n); in both cases (161) follows from the induction assumption and since x_1 ≤ z_1. If 1 ∉ S, then the claim similarly follows by applying the induction assumption to x_2^n and z_2^n.

Proof of Theorem 15: We denote the lexicographic order by ORD_lex. The average guessing time of the Greedy Dictator can then be written recursively, with J_1 := 1/2 and, for n ≥ 2, J_n expressed in terms of J_{n−1} (176). Solving the recursion, and noting that K_1(M)/M = M/2 + 1/2, we obtain the bound of Theorem 15.

(Continuation of the proof of Proposition 1.) Denote the order induced by the posterior Pr(X^n = x^n | f*_n(Y^n) = b) by ORD_{b,n,n}, b ∈ {0, 1}, and the order induced by Pr(X^{n+1} = x^{n+1} | f*_n(Y^n) = b) by ORD_{b,n,n+1}. As before, ties are broken arbitrarily.