On Lower Bounds for Statistical Learning Theory

In recent years, tools from information theory have played an increasingly prevalent role in statistical machine learning. In addition to developing efficient, computationally feasible algorithms for analyzing complex datasets, it is of theoretical importance to determine whether such algorithms are “optimal” in the sense that no other algorithm can lead to smaller statistical error. This paper provides a survey of various techniques used to derive information-theoretic lower bounds for estimation and learning. We focus on the settings of parameter and function estimation, community recovery, and online learning for multi-armed bandits. A common theme is that lower bounds are established by relating the statistical learning problem to a channel decoding problem, for which lower bounds may be derived involving information-theoretic quantities such as the mutual information, total variation distance, and Kullback–Leibler divergence. We close by discussing the use of information-theoretic quantities to measure independence in machine learning applications ranging from causality to medical imaging, and mention techniques for estimating these quantities efficiently in a data-driven manner.


Introduction
Statistical learning theory refers to the rigorous mathematical analysis of machine learning algorithms [1,2]. On one hand, it is desirable to derive error bounds for the performance of particular machine learning algorithms under appropriate assumptions on the probabilistic models used to generate the data. On the other hand, it is important to understand the fundamental limitations of any algorithmic procedure, which may be influenced by quantities such as the sample size, signal-to-noise ratio, or smoothness of an ambient function space. Whereas statistical techniques based on concentration inequalities and empirical process theory may often be employed to derive rates of convergence of specific estimators to the underlying parameters of a data-generating distribution, the somewhat trickier problem of quantifying the best possible performance of any learning procedure requires tools from information theory.
A general approach is to relate the machine learning task at hand to an appropriate channel decoding problem, where the output corresponds to the observed data and the input corresponds to a cleverly constructed subset of the parameter space. For estimation problems, the key observation is that, if the underlying parameters may be estimated closely (i.e., on the level of discretization of the subset of parameter space), decoding may be performed accurately with high probability. The hardness of the decoding problem may in turn be quantified using techniques in information theory [3], leading to a lower bound on the estimation error. This strategy has been applied successfully to a diverse array of statistical estimation problems, including parametric and nonparametric regression, structure estimation for graphical models, covariance matrix estimation, and dimension reduction methods such as principal component analysis [4][5][6][7][8][9]. Section 2 discusses the method and several illustrative examples in greater detail.
Although some classes of machine learning problems may not be analyzed directly using these methods, alternative approaches involving related information-theoretic concepts may be employed. In Sections 3 and 4, we consider the problems of community recovery and online learning, which are both active areas of research in machine learning. Our discussion of weak recovery in the community estimation setting is similar to the framework described in Section 2, but since the loss function used to quantify the estimation error incurred by the algorithm is more complicated, a more careful analysis must be conducted to derive sharp lower bounds. The theory characterizing the regimes in which exact recovery is possible is of a somewhat different flavor, but the emergence of sharp thresholds may again be related to Shannon coding theory. Section 4, concerning online learning for multi-armed bandits, provides a still different setting, where the goal is to bound a quantity known as regret. Although this is a radically different goal from bounding estimation error, the techniques used to obtain lower bounds for multi-armed bandits nonetheless include components of reductions to channel decoding problems: the key is to relate the performance of a learning algorithm to a problem of distinguishing between pairs of parameter assignments corresponding to underlying reward distributions that are close in parameter space.
We include proof sketches for the stated theorems in the main text of the paper, with references to resources where the reader can find more detailed proofs and additional background material. Although the discussion of each problem setting is necessarily brief, given the broad scope of this paper, we hope that our survey will convey the high-level ideas involved in applying information-theoretic tools to derive lower bounds for some statistical machine learning problems in a clear, concise manner. We have intentionally selected a diverse variety of problem settings in order to help the reader compare and contrast different approaches for obtaining lower bounds and identify the common threads underlying all the strategies.

Statistical Estimation
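We begin by discussing an approach based on minimax theory for statistical estimation problems [10]. Our goal is a lower bound on the following quantity, known as the minimax risk:

$$\inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[\ell\big(\hat{\theta}, \theta(P)\big)\right], \qquad (1)$$

where $\ell$ is a symmetric loss function. Here, $\mathcal{P}$ denotes a class of data-generating distributions and $\theta : \mathcal{P} \to \Omega$ is a functional that maps each distribution in $\mathcal{P}$ to a parameter in the metric space $\Omega$.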
The expectation in expression (1) is taken with respect to data from a particular distribution $P \in \mathcal{P}$, and the infimum is then taken over all possible estimators $\hat{\theta} = \hat{\theta}(X)$ computed from the data. In other words, quantity (1) captures the worst-case risk of the best possible estimator. Whereas statistical analysis of a specific estimator can provide an upper bound on the minimax risk, tools from information theory may be used to derive a lower bound on the same quantity. Throughout this section, we will restrict our attention to the setting where $\ell = \Phi \circ \rho$, for a metric $\rho$ and a monotonically increasing function $\Phi$. For instance, Example 2 below will discuss the setting where $\rho$ is the $L_2$-distance in a function space and $\Phi(t) = t^2$, so $\ell$ is the squared $L_2$-distance.

The basic idea is to transform an estimation problem into a decoding problem, in which we wish to infer the correct message from a discrete set of messages, corresponding to a collection of parameters. The estimation problem must be at least as hard as the decoding problem, since, if the parameters in the discrete set are appropriately separated, accurate parameter estimation implies accurate decoding. In Section 2.1, we present a general technique based on Fano's inequality, which expresses the probability of error for the decoding in terms of the mutual information between the input (parameters in the discrete subset) and output (observed data). Sections 2.2 and 2.3 then provide methods for bounding the mutual information and discuss applications to concrete statistical estimation settings. We will follow the convention of Cover and Thomas [3] and take all logarithms with respect to base 2 in our definitions of entropy and mutual information; analogous results hold when logarithms are taken with respect to base e.

Fano's Method
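We begin by describing the general approach for deriving lower bounds. The key idea consists of relating the estimation problem to a decoding problem, and then using Fano's inequality to lower-bound the probability of error for the decoding problem. Recall the definition of the mutual information:

$$I(Y; X) = H(Y) - H(Y \mid X). \qquad (2)$$

The main result relates the minimax risk to the mutual information between observations and the data-generating distribution.

Theorem 1. Suppose $\{\theta(P_1), \dots, \theta(P_M)\}$ is a $2\delta$-separated set in the metric space $(\Omega, \rho)$. Let $Y$ be uniformly distributed on $\{1, \dots, M\}$, and let $X$ be drawn from $P_i$ conditioned on $Y = i$. Then the minimax risk satisfies

$$\inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[\ell\big(\hat{\theta}, \theta(P)\big)\right] \ge \Phi(\delta) \left(1 - \frac{I(Y; X) + 1}{\log M}\right).$$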
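Proof (sketch). Write $\theta_i = \theta(P_i)$. We begin by writing

$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\left[\ell\big(\hat{\theta}, \theta(P)\big)\right] \ge \frac{1}{M} \sum_{i=1}^M \mathbb{E}_{P_i}\left[\ell(\hat{\theta}, \theta_i)\right].$$

If we define the decision rule

$$\psi(X) = \arg\min_{1 \le i \le M} \rho(\hat{\theta}, \theta_i),$$

where we break ties arbitrarily, we may verify that

$$\mathbb{E}_{P_i}\left[\ell(\hat{\theta}, \theta_i)\right] \overset{(a)}{\ge} \Phi(\delta)\, P_i\big(\ell(\hat{\theta}, \theta_i) \ge \Phi(\delta)\big) \overset{(b)}{\ge} \Phi(\delta)\, P_i\big(\psi(X) \ne i\big) \qquad (3)$$

for each $1 \le i \le M$. Inequality (a) is a direct application of Markov's inequality, and inequality (b) follows from the fact that if $\ell(\hat{\theta}, \theta_i) < \Phi(\delta)$, or equivalently, $\rho(\hat{\theta}, \theta_i) < \delta$, then the $2\delta$-separation and the triangle inequality give

$$\rho(\hat{\theta}, \theta_j) \ge \rho(\theta_i, \theta_j) - \rho(\hat{\theta}, \theta_i) > \delta \quad \text{for all } j \ne i,$$

implying that $\psi(X) = i$. Now, recall the statement of Fano's inequality:

Lemma 1 (Fano's inequality [3]). For any estimator $\hat{Y}$ of $Y$ such that $Y \to X \to \hat{Y}$ forms a Markov chain, it holds that

$$P(\hat{Y} \ne Y) \ge \frac{H(Y \mid X) - 1}{\log |\mathcal{Y}|},$$

where $\mathcal{Y}$ is the range of $Y$.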
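Applying Lemma 1 with $\hat{Y} = \psi(X)$ and writing out the error probability explicitly, we obtain

$$\frac{1}{M} \sum_{i=1}^M P_i\big(\psi(X) \ne i\big) \ge \frac{H(Y \mid X) - 1}{\log M} = 1 - \frac{I(Y; X) + 1}{\log M}, \qquad (4)$$

where the equality follows from relation (2) and the fact that $Y$ has a uniform distribution, so $H(Y) = \log M$. Combining inequalities (3) and (4) establishes the desired result.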
In the following subsections, we describe two methods for upper-bounding the mutual information term I(Y; X) appearing in Theorem 1, yielding a lower bound on the minimax risk.
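To make the shape of a Fano-type bound concrete, the following minimal Python sketch evaluates the quantity $\Phi(\delta)\,(1 - (I + 1)/\log_2 M)$ for given inputs. The function name `fano_lower_bound`, its arguments, and the numbers in the example are hypothetical and chosen only for illustration; the bound is clipped at zero because a negative lower bound on a risk carries no information.

```python
import math


def fano_lower_bound(delta, mutual_info_bits, num_hypotheses, phi=lambda t: t):
    """Evaluate Phi(delta) * (1 - (I + 1) / log2(M)), clipped at zero.

    mutual_info_bits is an upper bound on I(Y; X) in bits, and
    num_hypotheses is the size M of the 2*delta-separated set.
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    factor = 1.0 - (mutual_info_bits + 1.0) / math.log2(num_hypotheses)
    return phi(delta) * max(factor, 0.0)


# Hypothetical numbers: M = 2**20 hypotheses, I(Y; X) <= 4 bits, squared loss.
bound = fano_lower_bound(0.1, 4.0, 2**20, phi=lambda t: t * t)
# Too much information relative to log2(M) makes the bound vacuous.
vacuous = fano_lower_bound(1.0, 100.0, 4)
```

The sketch highlights the trade-off exploited in the following subsections: the bound is useful only when the mutual information is small compared with $\log M$.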

Local Packings
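The first method applies the convexity of the Kullback-Leibler (KL) divergence to obtain an upper bound on $I(Y; X)$ in terms of pairwise KL divergences. We have the following lemma:

Lemma 2. Let $X$ and $Y$ be defined as in Theorem 1. Then,

$$I(Y; X) \le \frac{1}{M^2} \sum_{1 \le i, j \le M} D_{KL}(P_i \,\|\, P_j).$$

Proof. We can check that

$$I(Y; X) = \frac{1}{M} \sum_{i=1}^M D_{KL}(P_i \,\|\, \bar{P}),$$

where $\bar{P} = \frac{1}{M} \sum_{j=1}^M P_j$ is a mixture distribution. By the convexity of the KL divergence, we then have

$$\frac{1}{M} \sum_{i=1}^M D_{KL}(P_i \,\|\, \bar{P}) \le \frac{1}{M^2} \sum_{1 \le i, j \le M} D_{KL}(P_i \,\|\, P_j),$$

which is the desired expression.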
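The convexity argument can be checked numerically on a small discrete example. The sketch below is an illustration under stated assumptions: the three distributions in `dists` are made up, $Y$ is taken uniform over them, and the helper names are hypothetical. It computes the mutual information as the average divergence to the mixture and compares it with the average pairwise divergence.

```python
import math


def kl_bits(p, q):
    """KL divergence in bits between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information_uniform(dists):
    """I(Y; X) in bits when Y is uniform on the indices and X | Y = i ~ dists[i]."""
    m = len(dists)
    support = range(len(dists[0]))
    mixture = [sum(d[x] for d in dists) / m for x in support]
    return sum(kl_bits(d, mixture) for d in dists) / m


def average_pairwise_kl(dists):
    """Average of the KL divergences over all ordered pairs (i, j)."""
    m = len(dists)
    return sum(kl_bits(p, q) for p in dists for q in dists) / m**2


# Made-up distributions on a three-letter alphabet.
dists = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
mi = mutual_information_uniform(dists)
pairwise = average_pairwise_kl(dists)
```

For any choice of distributions with common support, `mi` should not exceed `pairwise`, which is what makes pairwise divergences a convenient proxy when the mixture is hard to handle.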
This bounding technique is known as a "local packing", since the trick is to design an appropriate set $\{P_1, \dots, P_M\}$ such that the parameters $\theta(P_i)$ are $2\delta$-separated, while the pairwise KL divergences between the data-generating distributions are relatively small.

Example 1 (High-dimensional linear regression). Suppose we have observation pairs $\{(x_i, y_i)\}_{i=1}^n$ from a linear model:

$$y_i = x_i^T \beta^* + w_i,$$

where $x_i \in \mathbb{R}^p$, $w_i \sim N(0, \sigma^2)$ is i.i.d. noise, and $\beta^* \in \mathbb{R}^p$ is the unknown parameter vector. We assume that $p > n$, but $\beta^*$ is known to have at most $s$ nonzero values, where $s \le n$. More precisely, if $B_q(r)$ denotes the ball of radius $r$ in the $\ell_q$ norm, we are interested in characterizing the minimax risk over the parameter space $B_0(s)$. For any fixed parameter $\delta > 0$, it is possible to construct a subset of parameters $\{\beta_1, \dots, \beta_M\}$ lying in the parameter space such that $\delta \le \|\beta_j - \beta_k\|_2 \le C\delta$ for all $j \ne k$ and a universal constant $C$, essentially by rescaling a packing of the subset of $\{-1, 0, 1\}^p$ of $s$-sparse vectors such that the Hamming distance between any two elements is at least $\frac{s}{2}$ [4,11]. Furthermore, we may compute the pairwise KL divergences in terms of the squared $\ell_2$-norm between parameter vectors, so

$$D_{KL}(P_j \,\|\, P_k) = \frac{\|X(\beta_j - \beta_k)\|_2^2}{2\sigma^2} \le \frac{n \gamma_{2s}^2}{2\sigma^2} \|\beta_j - \beta_k\|_2^2,$$

up to the constant factor arising from the choice of logarithm base, where $X$ is the design matrix with rows $x_i^T$ and $\gamma_{2s} = \sup_{\beta \in B_0(2s)} \frac{\|X\beta\|_2}{\sqrt{n}\,\|\beta\|_2}$. Note that $P_j$ and $P_k$ refer to the conditional distributions of the $y_i$'s given the $x_i$'s for this example, so we are assuming the design matrix is fixed. Applying Theorem 1 and Lemma 2 with $\rho$ equal to the $\ell_2$-distance and $\Phi$ equal to the identity, and assuming that the problem dimensions satisfy $n \ge Cs \log p$, we then obtain a lower bound of the form

$$c\, \frac{\sigma}{\gamma_{2s}} \sqrt{\frac{s \log(p/s)}{n}}.$$

In the case of the $\ell_2$-loss, the Lasso estimator achieves the risk expression in the lower bound (up to constant factors), implying that it is a rate-optimal estimator [4]. Similar bounds on the minimax risk may be derived when the norms appearing in the loss function and/or parameter space are replaced by a general $\ell_q$-norm [4,12].
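The KL computation in Example 1 uses the standard closed form for two Gaussian distributions with a common covariance: for a fixed design, the divergence between the laws of the responses depends only on the squared distance between the fitted values. The pure-Python sketch below illustrates this under the assumption of natural logarithms (divide by $\ln 2$ for bits); the function name and the toy design matrix are hypothetical.

```python
def gaussian_regression_kl(design, beta_j, beta_k, sigma):
    """KL divergence (nats) between N(X beta_j, sigma^2 I) and N(X beta_k, sigma^2 I).

    design is a list of rows x_i; the design is treated as fixed.
    """
    diff = [bj - bk for bj, bk in zip(beta_j, beta_k)]
    # Difference of the fitted values, X (beta_j - beta_k).
    fitted = [sum(x * d for x, d in zip(row, diff)) for row in design]
    return sum(v * v for v in fitted) / (2.0 * sigma**2)


# Toy fixed design with n = 2 observations and p = 3 covariates.
design = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
kl = gaussian_regression_kl(design, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], sigma=1.0)
```

Because the divergence scales with $\|X(\beta_j - \beta_k)\|_2^2$, parameters that are close in $\ell_2$-norm yield nearly indistinguishable distributions, which is the tension a local packing exploits.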

Metric Entropy
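The second method for bounding $I(Y; X)$, due to Yang and Barron [13], is based on the metric entropy of the parameter space. Recall the notion of the $\epsilon$-covering number of a set in a metric space, which is the minimum number of $\epsilon$-balls required to cover the set. The logarithm of the covering number is also known as the metric entropy. In particular, we are interested in the quantity $\log N_{KL}(\epsilon; \mathcal{P})$, defined by

$$N_{KL}(\epsilon; \mathcal{P}) = \min\Big\{N : \exists\, Q_1, \dots, Q_N \text{ such that } \min_{1 \le i \le N} \sqrt{D_{KL}(P \,\|\, Q_i)} \le \epsilon \text{ for all } P \in \mathcal{P}\Big\},$$

which denotes the $\epsilon$-covering number of $\mathcal{P}$, where distances are measured with respect to the square root KL divergence. We have the following bound:

Lemma 3. Let $X$ and $Y$ be defined as in Theorem 1. Then,

$$I(Y; X) \le \inf_{\epsilon > 0} \left\{\epsilon^2 + \log N_{KL}(\epsilon; \mathcal{P})\right\}.$$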

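Proof (sketch). Suppose $\{Q_1, \dots, Q_N\}$ is a minimal $\epsilon$-cover of $\mathcal{P}$ with respect to the square root KL divergence, so $N = N_{KL}(\epsilon; \mathcal{P})$, and let $\bar{Q} = \frac{1}{N} \sum_{j=1}^N Q_j$. Then

$$I(Y; X) = \frac{1}{M} \sum_{i=1}^M D_{KL}(P_i \,\|\, \bar{P}) \le \frac{1}{M} \sum_{i=1}^M D_{KL}(P_i \,\|\, \bar{Q}),$$

where the inequality holds because $\bar{P} = \frac{1}{M} \sum_{i=1}^M P_i$ minimizes the average KL divergence with respect to the second argument. Furthermore, we know that for each $i$ there exists some $Q_{j(i)}$ in the cover such that $D_{KL}(P_i \,\|\, Q_{j(i)}) \le \epsilon^2$. Since $\bar{Q} \ge \frac{1}{N} Q_{j(i)}$, it follows that

$$D_{KL}(P_i \,\|\, \bar{Q}) \le \log N + D_{KL}(P_i \,\|\, Q_{j(i)}) \le \log N + \epsilon^2.$$

Since the above inequality holds for all $\epsilon > 0$, we may take an infimum over $\epsilon$ to obtain the stated bound.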
As an example of the above technique, we consider the problem of nonparametric regression. Note that the following example shows that the general machinery developed above, though described in terms of parameter estimation, may be applied to nonparametric settings involving function estimation, as well.
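Example 2 (Nonparametric regression). Suppose we observe i.i.d. pairs $\{(x_i, y_i)\}_{i=1}^n$, where

$$y_i = f^*(x_i) + w_i,$$

with $x_i$ uniformly distributed on $[0, 1]$, $w_i \sim N(0, \sigma^2)$, and $x_i$ independent of $w_i$. We also assume that $f^*$ belongs to the function class $\mathcal{F}_s$, for a positive integer $s$, defined as the set of all continuous functions $f$ on $[0, 1]$ satisfying suitable boundedness and smoothness conditions of order $s$. We derive lower bounds on the minimax risk of estimating $f^*$ when $\ell$ is the squared $L_2$-distance, defined by

$$\ell(f, g) = \|f - g\|_2^2 = \int_0^1 \big(f(x) - g(x)\big)^2 \, dx.$$

Hence, we will take $\Phi(t) = t^2$ and $\rho$ equal to the $L_2$-distance. Let $\mathcal{P}$ denote the set of joint distributions of $(x, y)$ generated by the class $\mathcal{F}_s$. By standard results on the metric entropy of function classes [14,15], we have the bound

$$\log N_2(\epsilon; \mathcal{F}_s) = \Theta\big(\epsilon^{-1/s}\big),$$

where $\log N_2(\epsilon; \mathcal{F}_s)$ denotes the metric entropy of $\mathcal{F}_s$ with respect to the $L_2$-distance. Furthermore, for any $\delta > 0$, the logarithm of the cardinality of a maximal $2\delta$-separated subset of $\mathcal{F}_s$ is of the same order, $\Theta(\delta^{-1/s})$. For two functions $f, g \in \mathcal{F}_s$, we may compute the KL divergence between the corresponding distributions $P_f, P_g \in \mathcal{P}$ of the $n$ observations (in natural units):

$$D_{KL}(P_f \,\|\, P_g) = \frac{n}{2\sigma^2} \|f - g\|_2^2.$$

Hence, it follows that

$$\log N_{KL}(\epsilon; \mathcal{P}) \le \log N_2\Big(\sigma \epsilon \sqrt{2/n};\, \mathcal{F}_s\Big) \le C' \Big(\frac{\sqrt{n}}{\epsilon}\Big)^{1/s}.$$

Minimizing the bound obtained from Lemma 3 with respect to $\epsilon$, we obtain $\epsilon^* = C n^{\frac{1}{4s+2}}$, and plugging back into Theorem 1, we obtain the lower bound

$$\delta^2 \left(1 - \frac{C'' n^{\frac{1}{2s+1}} + 1}{c\, \delta^{-1/s}}\right).$$

Taking $\delta \asymp n^{-\frac{s}{2s+1}}$ with a sufficiently small constant then yields a lower bound of order $n^{-\frac{2s}{2s+1}}$.
A matching upper bound may be derived using local weighted polynomial regression [16], so the minimax risk for the squared $L_2$-loss is $\Theta\big(n^{-2s/(2s+1)}\big)$.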
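The exponent $\frac{1}{4s+2}$ in Example 2 comes from balancing the two terms in the metric entropy bound. The following worked calculation is a sketch under the assumption that the covering and packing entropies of $\mathcal{F}_s$ both scale as $\epsilon^{-1/s}$ and that the KL divergence between $n$-sample distributions is proportional to $n\|f - g\|_2^2$; constants are suppressed throughout.

```latex
% Balance the two terms of \epsilon^2 + \log N_{KL}(\epsilon; \mathcal{P}),
% assuming \log N_{KL}(\epsilon; \mathcal{P}) \asymp (\sqrt{n}/\epsilon)^{1/s}:
\epsilon^2 \asymp \Big(\frac{\sqrt{n}}{\epsilon}\Big)^{1/s}
\;\Longrightarrow\;
\epsilon^{\frac{2s+1}{s}} \asymp n^{\frac{1}{2s}}
\;\Longrightarrow\;
\epsilon^* \asymp n^{\frac{1}{4s+2}},
\qquad
I(Y; X) \lesssim (\epsilon^*)^2 \asymp n^{\frac{1}{2s+1}}.

% Choose \delta so that the packing entropy dominates the mutual information:
\log M \asymp \delta^{-1/s} \gtrsim n^{\frac{1}{2s+1}}
\;\Longleftrightarrow\;
\delta \lesssim n^{-\frac{s}{2s+1}},
\qquad
\Phi(\delta) = \delta^2 \asymp n^{-\frac{2s}{2s+1}}.
```

In words, the separation $\delta$ is of order $n^{-s/(2s+1)}$ in $L_2$-distance, which corresponds to order $n^{-2s/(2s+1)}$ for the squared $L_2$-loss.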

Community Recovery
Another area of machine learning that has recently received a substantial amount of attention concerns recovering communities based on node connectivity in a network. A popular probabilistic model is known as the stochastic block model (SBM). In the simplest form of the model, parametrized by $(n, K, p, q)$, the graph has nodes $\{1, \dots, n\}$ partitioned into $K$ communities. Let the community label of node $i$ be denoted by $\sigma(i)$. The edge set $E$ of the random graph $G$ is then constructed in the following manner: each edge $(i, j)$ is generated independently from all others, with probability

$$P\big((i, j) \in E\big) = \begin{cases} p, & \text{if } \sigma(i) = \sigma(j), \\ q, & \text{otherwise.} \end{cases}$$

The goal is to partition the $n$ nodes into the underlying communities based on observing the graph $G$.
In order to measure the performance of an algorithm, we consider the loss function

$$r(\hat{\sigma}, \sigma) = \min_{\pi \in S_K} \frac{1}{n}\, d_H\big(\pi(\hat{\sigma}), \sigma\big).$$

Here, the estimator $\hat{\sigma} : \{1, \dots, n\} \to \{1, \dots, K\}$ corresponds to a partitioning of the nodes into $K$ communities, and $d_H$ denotes the Hamming distance between assignments. Furthermore, we take the minimum over all permutations $S_K$ of the community labels. Hence, $r(\hat{\sigma}, \sigma)$ is the proportion of incorrectly labeled nodes (for the optimal labeling of partitions). We will focus our discussion on the setting where $K$ is fixed, but $p$ and $q$ may vary with $n$; generalizations exist in the literature where $K$ is allowed to grow with $n$, as well. We are interested in the behavior of various algorithms as $n \to \infty$.
In the following two subsections, we discuss the popular notions of weak recovery and exact recovery. The algorithm $\hat{\sigma}$ achieves weak recovery if $E[r(\hat{\sigma}, \sigma)] \to 0$ (i.e., the expected fraction of misclassified nodes tends to 0 as $n \to \infty$), and achieves exact recovery if $r(\hat{\sigma}, \sigma) = 0$ with probability tending to 1. For a more complete description of current work on stochastic block models, see the extensive survey paper by Abbe [17].
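The model and loss just described can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, labels are 0-indexed for convenience, and the loss enumerates all $K!$ label permutations, which is feasible only for small $K$.

```python
import itertools
import random


def sample_sbm(labels, p, q, rng):
    """Sample the edge set of a stochastic block model given community labels."""
    n = len(labels)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:
                edges.add((i, j))
    return edges


def misclassification_rate(estimate, truth, num_communities):
    """Fraction of mislabeled nodes, minimized over permutations of the labels."""
    n = len(truth)
    best = n
    for perm in itertools.permutations(range(num_communities)):
        errors = sum(perm[e] != t for e, t in zip(estimate, truth))
        best = min(best, errors)
    return best / n


truth = [0, 0, 1, 1]
# Degenerate parameters (p = 1, q = 0) give exactly the within-community edges.
edges = sample_sbm(truth, p=1.0, q=0.0, rng=random.Random(0))
relabeled = misclassification_rate([1, 1, 0, 0], truth, 2)
one_error = misclassification_rate([0, 0, 0, 1], truth, 2)
```

The relabeled estimate incurs zero loss because the loss is invariant to permuting community labels, which is the reason the minimum over permutations appears in the definition.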

Weak Recovery
Analogous to the setting discussed in Section 2, we may derive bounds on the minimax risk

$$\inf_{\hat{\sigma}} \sup_{\sigma \in \Sigma(n, K)} E\left[r(\hat{\sigma}, \sigma)\right],$$

where $\Sigma(n, K)$ is an appropriate class of underlying community labelings. We state and prove a result for approximately equal-sized communities in the limit as $n \to \infty$, so $\Sigma(n, K)$ is the set of all labelings $\sigma$ such that $|\{i : \sigma(i) = k\}| = (1 + o(1)) \frac{n}{K}$, for all $1 \le k \le K$. The main result is the following [18]:

Theorem 2. Suppose $p = \frac{a}{n}$ and $q = \frac{b}{n}$, and suppose $\frac{nI}{K} \to \infty$, where

$$I = -2 \log\left(\sqrt{\frac{a}{n} \cdot \frac{b}{n}} + \sqrt{\Big(1 - \frac{a}{n}\Big)\Big(1 - \frac{b}{n}\Big)}\right). \qquad (5)$$

A lower bound on the minimax risk of community estimation is given by

$$\inf_{\hat{\sigma}} \sup_{\sigma \in \Sigma(n, K)} E\left[r(\hat{\sigma}, \sigma)\right] \ge \begin{cases} \exp\left(-(1 + o(1)) \frac{nI}{2}\right), & \text{if } K = 2, \\ \exp\left(-(1 + o(1)) \frac{nI}{K}\right), & \text{if } K \ge 3. \end{cases}$$

Proof (sketch). The core of the approach bears similarity to the method for obtaining lower bounds for estimation, in the sense that we construct a subset $\Sigma_L$ of the parameter space corresponding to "messages", which we wish to recover via an appropriate decoding strategy. In the case when $K = 2$ (and $n$ is even), the subset $\Sigma_L$ consists of all partitions of the nodes into equal-sized communities and communities of sizes $\big(\frac{n}{2} + 1, \frac{n}{2} - 1\big)$. We focus on the case $K = 2$ in the present proof sketch to avoid technical complications.
The proof is somewhat more involved than the strategies outlined in Section 2, however, since the unknown quantity to be estimated is a set of discrete labelings and the loss function is defined with respect to an optimal permutation. The first step is to lower-bound the minimax risk by the average risk over the class $\Sigma_L$. Furthermore, a more technical argument shows that we may just examine the average local risk defined with respect to a single node in the graph, namely the average over $\sigma \in \Sigma_L$ of $E[r_1(\hat{\sigma}, \sigma)]$, where $r_1$ is the local loss function defined with respect to node 1, which is the fraction of optimal permutations of community assignments that incorrectly classify node 1. The next step is to lower-bound the local risk (uniformly over all choices of $\sigma \in \Sigma_L$) using the minimum risk of a binary hypothesis testing problem, where the two hypotheses correspond to the possible assignments of node 1 as a member of the first or second community. In particular, for each $\sigma$, the local risk is lower-bounded, up to a constant factor, by the probability

$$P\left(\sum_{i=1}^{n/2} X_i \ge \sum_{i=1}^{n/2} Y_i\right),$$

where $X_i \sim \text{Bernoulli}\big(\frac{b}{n}\big)$ and $Y_i \sim \text{Bernoulli}\big(\frac{a}{n}\big)$ are independent random variables. Standard techniques involving large deviation inequalities allow us to lower-bound the latter probability, thus yielding the overall lower bound appearing in the theorem.
As demonstrated by Zhang and Zhou [18], the lower bound on the risk appearing in Theorem 2 may be achieved using a form of penalized likelihood estimation. A computationally feasible procedure was subsequently provided in Gao et al. [19].
Remark 1. The quantity $I$ appearing in Equation (5) is the Renyi divergence of order $\frac{1}{2}$ between a Bernoulli$\big(\frac{a}{n}\big)$ and Bernoulli$\big(\frac{b}{n}\big)$ distribution. In fact, these results generalize to the case of non-binary edge weights, and the Renyi divergence of order $\frac{1}{2}$ also appears in the minimax rates for estimation in weighted stochastic block models [20]. Furthermore, if the communities are not all of equal size, alternative divergence functions appear in the error exponent [21,22]. Finally, note that the regime where $p = \frac{a}{n}$ and $q = \frac{b}{n}$, with $a, b = \Theta(1)$, corresponds to the threshold at which giant components emerge in the network [23]. Theorem 2 allows $a$ and $b$ to scale arbitrarily with $n$, provided $\frac{nI}{K} \to \infty$; this condition fails, for instance, when $a, b = \Theta(1)$, since $nI$ then remains bounded.
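Since Remark 1 identifies $I$ as the Renyi divergence of order $\frac{1}{2}$ between two Bernoulli distributions, it can be evaluated directly from the standard formula $-2\log\big(\sqrt{pq} + \sqrt{(1-p)(1-q)}\big)$. The sketch below uses natural logarithms and hypothetical values of $a$, $b$, and $n$; it also illustrates the approximation $nI \approx (\sqrt{a} - \sqrt{b})^2$ for sparse graphs, which is a standard expansion and not a statement taken from the text above.

```python
import math


def renyi_half_bernoulli(p, q):
    """Renyi divergence of order 1/2 (nats) between Bernoulli(p) and Bernoulli(q)."""
    bhattacharyya = math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
    return -2.0 * math.log(bhattacharyya)


# Hypothetical sparse regime: p = a / n and q = b / n with a = 9, b = 4.
n, a, b = 10**6, 9.0, 4.0
divergence = renyi_half_bernoulli(a / n, b / n)
scaled = n * divergence  # close to (sqrt(a) - sqrt(b))**2 = 1 for large n
```

The divergence is symmetric in its two arguments and vanishes when $p = q$, matching the intuition that communities are indistinguishable when within- and between-community edge probabilities coincide.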

Exact Recovery
Information-theoretic arguments may also be used to establish lower bounds for exact recovery in stochastic block models, which corresponds to correct classification of every single node (up to a permutation of the community labels). We present a result, due to Abbe et al. [24], that provides lower bounds for exact recovery in the case of two equal-sized communities.
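We have the following result:

Theorem 3. Suppose $p = \frac{a \log n}{n}$ and $q = \frac{b \log n}{n}$, with $a > b > 0$. If

$$\frac{a + b}{2} - \sqrt{ab} < 1,$$

then for sufficiently large $n$, the maximum likelihood estimator fails in recovering the communities with probability bounded away from 0:

$$\liminf_{n \to \infty} P(F) > 0.$$

Proof (sketch). We denote the two communities by $A$ and $B$. Let $F$ be the event that the maximum likelihood estimator fails in performing exact recovery, and let $F_A$ denote the event that some node $i \in A$ is connected to more nodes in $B$ than in $A$, with $F_B$ defined analogously for some node $j \in B$. By symmetry, we have $P(F_A) = P(F_B)$. Furthermore, note that $F_A \cap F_B \subseteq F$, since if both $F_A$ and $F_B$ were to occur simultaneously, swapping the labels of the nodes $i$ and $j$ would lead to a higher value of the likelihood than in the case of correct labeling. In particular, this implies that

$$P(F) \ge P(F_A \cap F_B) \ge 2 P(F_A) - 1. \qquad (6)$$

Let $H \subseteq A$ denote a fixed subset with $|H| = \frac{n}{\log^3 n}$, and define the event $F_H$ that some node $j \in H$ satisfies $E(j, A \setminus H) + \frac{\log n}{\log \log n} \le E(j, B)$, where $E(j, C)$ denotes the number of edges between $j$ and the nodes in $C$. Note that, if event $F_H$ occurs and all nodes in $H$ are connected to at most $\frac{\log n}{\log \log n}$ other nodes in $H$, then event $F_A$ must occur. Furthermore, one can show that the latter condition holds with high probability, so

$$P(F_A) \ge P(F_H) - o(1). \qquad (7)$$

It remains to derive a lower bound on $P(F_H)$. For $j \in H$, let $F_H^{(j)}$ denote the event that node $j$ satisfies the inequality defining $F_H$, and note that the $F_H^{(j)}$'s are independent. Hence,

$$P(F_H) = 1 - \prod_{j \in H} \Big(1 - P\big(F_H^{(j)}\big)\Big).$$

Straightforward techniques for bounding sums of independent Bernoulli random variables provide a lower bound on $P\big(F_H^{(j)}\big)$ for each $j$, from which we can conclude that

$$P(F_H) = 1 - o(1). \qquad (8)$$

Combining inequalities (6)-(8) then yields the desired result.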
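Note that for any other estimator $\hat{\sigma}$, the probability of failing to perform exact recovery is at least as large as that of the maximum likelihood estimator, since the latter minimizes the probability of error when the community assignment is drawn uniformly at random. Hence, Theorem 3 also implies that the probability of failure is bounded away from 0 for every estimator $\hat{\sigma}$. In fact, a converse of Theorem 3 holds, as well:

Theorem 4. Under the same conditions as in Theorem 3, suppose instead that

$$\frac{a + b}{2} - \sqrt{ab} > 1.$$

Then, the maximum likelihood estimator succeeds in recovering the communities with probability tending to 1:

$$P(F) \to 0 \quad \text{as } n \to \infty.$$

Since the focus of this paper is to establish lower bounds, we refer the reader to Abbe et al. [24] for the proof of Theorem 4, which proceeds by direct calculation. An extension of Theorems 3 and 4 for weighted stochastic block models may be found in Jog and Loh [25].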
Remark 2. The threshold behavior described in Theorems 3 and 4 is perhaps not surprising in light of known threshold behavior in Shannon coding theory, and the connections between each of the statistical learning tasks and the problem of decoding on a discrete alphabet after passage through a noisy channel. Indeed, the community recovery problem has been cast in information-theoretic terminology as decoding in a "graphical channel" [26]. On the other hand, the coding scheme is fixed according to the stochastic block model, whereas Shannon theory allows one to design an optimal encoding scheme to achieve channel capacity. See also the paper by Chen et al. [27], and the derivation of similar types of sharp threshold behavior in submatrix localization problems [28,29]. Finally, we note that the scaling $p = \frac{a \log n}{n}$ and $q = \frac{b \log n}{n}$, when $a, b = \Theta(1)$, corresponds to the threshold for the graph to have isolated vertices with probability tending to 1 [23]. Indeed, it would be impossible to perform exact recovery with high probability in the presence of isolated vertices: flipping the community assignments of two isolated vertices belonging to the two different communities would not change the value of the likelihood.

Online Learning
We now shift our focus to sequential allocation problems. The setup we consider involves a series of actions taken by a player, using limited feedback about the environment based on his/her past actions. We study the setting of a multi-armed bandit, where each potential action of the player is associated with a reward distribution, but the player only observes the reward corresponding to his/her action on successive rounds. In the following two subsections, we will consider the cases of stochastic and adversarial bandits and obtain bounds on a quantity known as regret. More details on the setting and results may be found in Bubeck and Cesa-Bianchi [30] or Cesa-Bianchi and Lugosi [31].

Stochastic Bandits
We first analyze the setting of stochastic multi-armed bandits. On each round, the player may choose one of $k$ different arms. Associated to arm $j$ is a reward distribution $P_{\theta_j}$, where $\theta_j \in \Theta$ belongs to some parameter space. Furthermore, we assume that the reward distributions $\{P_{\theta_j}\}_{j=1}^k$ remain fixed across all rounds. We use the notation $\mu(\theta)$ to denote the mean of the distribution $P_\theta$, and let $\mu^* = \max_{1 \le j \le k} \mu(\theta_j)$ denote the maximum expected reward.
Denote the sequence of actions chosen by the player as $(I_1, \dots, I_n)$, where $I_t \in \{1, \dots, k\}$ is the arm played at time $t$, and let $X_{I_t, t} \sim P_{\theta_{I_t}}$ denote the observed reward, which is an i.i.d. draw from the distribution $P_{\theta_{I_t}}$. Note that $I_t$ may be a function of the previously observed reward sequence $(X_{I_1, 1}, \dots, X_{I_{t-1}, t-1})$ and may also involve additional randomization. We are interested in bounding a quantity known as the pseudo-regret, defined as

$$R_n = n \mu^* - E\left[\sum_{t=1}^n X_{I_t, t}\right],$$

where we may also write $R_n(\theta_1, \dots, \theta_k)$ to make the dependence on the reward distributions explicit. If the player employs a random strategy, the expectation is computed with respect to randomness in the sequence of actions $(I_1, \dots, I_n)$, as well as randomness generated by draws from the reward distributions. In other words, the pseudo-regret measures the difference between the expected reward incurred by the player's strategy and the expected reward incurred by playing the arm with maximum expected reward on every round.
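Lai and Robbins [32] prove the following result. We omit some technical regularity conditions on the parameter space, such as denseness of the parameter space and continuity with respect to the KL divergence, in order to avoid cluttering the presentation.

Theorem 5. Suppose the player's strategy satisfies $R_n(\theta_1, \dots, \theta_k) = o(n^\alpha)$ for every $\alpha > 0$ and every choice of reward parameters. Then,

$$\liminf_{n \to \infty} \frac{R_n(\theta_1, \dots, \theta_k)}{\log n} \ge \sum_{j : \Delta_j > 0} \frac{\Delta_j}{D_{KL}(P_{\theta_j} \,\|\, P_{\theta^*})},$$

where $\Delta_j = \mu^* - \mu(\theta_j)$ and $P_{\theta^*}$ denotes the reward distribution of an optimal arm.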
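Proof (sketch). We may write

$$R_n = \sum_{j=1}^k \Delta_j\, E[T_j(n)],$$

where $\Delta_j = \mu^* - \mu_j$, with $\mu_j = \mu(\theta_j)$, and $T_j(n) = \sum_{t=1}^n 1\{I_t = j\}$. The main step is to show that the inequality

$$\liminf_{n \to \infty} \frac{E[T_j(n)]}{\log n} \ge \frac{1}{D_{KL}(P_{\theta_j} \,\|\, P_{\theta^*})} \qquad (9)$$

holds for any strategy and any suboptimal arm $j$. Inequality (9) provides a lower bound on the expected number of pulls to any suboptimal arm (note that, as $P_{\theta_j}$ becomes further from $P_{\theta^*}$, the two arms are easier to distinguish, so the expected number of pulls to the suboptimal arm can be smaller). We focus on proving inequality (9) for $j = 2$, assuming without loss of generality that arm 1 is optimal; the other cases are similar. Consider two parameter vectors $\theta = (\theta_1, \theta_2, \dots, \theta_k)$ and $\theta' = (\theta_1, \theta_2', \dots, \theta_k)$, which differ only in the second coordinate. We further choose the parameters such that

$$\mu(\theta_2) < \mu(\theta_1) < \mu(\theta_2'),$$

so the second arm is suboptimal in the first setting but optimal in the second. We will choose $\theta_2'$ close to $\theta_1$, so $D_{KL}(P_{\theta_2} \,\|\, P_{\theta_2'})$ is close to $D_{KL}(P_{\theta_2} \,\|\, P_{\theta_1})$. (The regularity conditions on the parameter space and reward distributions ensure that such a choice is possible.) The idea is that, since $P_\theta$ and $P_{\theta'}$ are close, any strategy should pick roughly the same sequence of arms in both scenarios, but a strategy that performs well on $\theta$ will behave relatively poorly on $\theta'$ (and vice versa), since the ordering of arms according to optimality is different in the two settings.
In particular, we will derive a bound, referred to as inequality (10), relating the probabilities $P_\theta(T_2(n) < a_n)$ and $P_{\theta'}(T_2(n) < a_n)$ of pulling the second arm fewer than $a_n$ times in each of the parameter settings, up to an additive error term $b_n$, where $a_n$ is of order $\log n$ and we take $\alpha < \frac{1}{3}$. We can show that $b_n = o(1)$ since $P_\theta$ and $P_{\theta'}$ are close, and that the right-hand probability is also $o(1)$, since arm 2 is optimal under $\theta'$.
For a fixed strategy, let $\{X_{j,s}\}_{1 \le j \le k,\, 1 \le s \le n}$ denote the rewards corresponding to various arm pulls.
Finally, we bound $P_{\theta'}(T_2(n) < a_n)$ using Markov's inequality:

$$P_{\theta'}\big(T_2(n) < a_n\big) = P_{\theta'}\big(n - T_2(n) > n - a_n\big) \le \frac{E_{\theta'}[n - T_2(n)]}{n - a_n} = o(n^{\alpha - 1}),$$

where the last equality follows from the fact that $a_n = o(n)$ and the assumption on $R_n(\theta')$. Altogether, we conclude that the right-hand side of inequality (10) is $o(1)$. By another application of Markov's inequality, we conclude that

$$E_\theta[T_2(n)] \ge a_n\, P_\theta\big(T_2(n) \ge a_n\big) = a_n (1 - o(1)).$$

Hence, inequality (9) holds for $j = 2$, as wanted.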
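For Bernoulli arms, the constant multiplying $\log n$ in a Lai-Robbins-type bound can be computed explicitly. The sketch below assumes the classical form of the bound, in which the regret grows at least like $\log n$ times the sum of $\Delta_j / D_{KL}(P_{\theta_j} \| P_{\theta^*})$ over suboptimal arms; this form, the function names, and the arm means are assumptions made for illustration, and natural logarithms are used.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence (nats) between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lai_robbins_constant(means):
    """Sum over suboptimal arms of gap / KL(arm distribution || best distribution)."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)


# Hypothetical two-armed problem with success probabilities 0.9 and 0.5.
constant = lai_robbins_constant([0.9, 0.5])
```

Arms whose distributions are close to the optimal one contribute large terms: they are hard to distinguish from the best arm, so any uniformly good strategy must sample them often.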
Note that the assumption $R_n(\theta_1, \dots, \theta_k) = o(n^\alpha)$ implies that a sufficiently good player strategy exists for all choices of reward parameters. In particular, such a condition may be verified when the reward distributions are Bernoulli (e.g., $P_\theta \sim \text{Bernoulli}(\theta)$). Then, we have

$$D_{KL}(P_{\theta_j} \,\|\, P_{\theta^*}) \le \frac{(\mu_j - \mu^*)^2}{\mu^* (1 - \mu^*)},$$

and combined with Theorem 5, we obtain the lower bound

$$\liminf_{n \to \infty} \frac{R_n}{\log n} \ge \sum_{j : \Delta_j > 0} \frac{\mu^* (1 - \mu^*)}{\Delta_j}.$$

A player strategy known as the Upper Confidence Bound (UCB) strategy may be shown to achieve this lower bound, up to constant factors [32,33]. Finally, we mention a non-asymptotic lower bound on the pseudo-regret that comes from the probably approximately correct (PAC) literature on bandits [34][35][36]:

Theorem 6. In the case of Bernoulli reward distributions, there exist positive constants $\{c_i\}_{i=1}^5$ such that for all $k \ge 2$ and $n \ge 1$, the pseudo-regret of any strategy satisfies, for at least one choice of reward parameters, a lower bound of order $\log n$ whose precise form in terms of the constants $c_i$ is referred to below as inequality (11).

Proof (sketch). For a detailed proof of Theorem 6, we refer the reader to Mannor and Tsitsiklis [36].
The main idea is to construct a collection of $k$ vectors $\{\theta^1, \dots, \theta^k\} \subseteq [0, 1]^k$ corresponding to the parameters of the reward distributions on arms. The vectors $\theta^i$, for $2 \le i \le k$, and the vector $\theta^1$ are defined such that the reward distribution of arm 1 is the same for all $k$ parameter settings, but in the case of vector $\theta^i$, the reward distribution for arm $i$ is slightly better than the reward distributions of the other arms. We then compute a weighted sum of the regret incurred in each parameter setting, where $\theta^1$ is given weight $\frac{1}{2}$ and all other $\theta^i$'s are given weight $\frac{1}{2(k-1)}$, so the weights sum to 1. We may show that this weighted regret is lower-bounded by the quantity appearing in inequality (11), implying the existence of at least one parameter setting that satisfies the desired bound. Computing the lower bound for the weighted regret is similar to the procedure adopted in the proof of Theorem 5, in that we compute a lower bound on the expected number of arm pulls of each suboptimal arm in each parameter setting.

Theorem 6 is a type of minimax result, stating that, for any player strategy, a distribution of Bernoulli rewards exists for which the problem incurs $\Omega(\log n)$ regret. The same UCB strategies of Auer et al. [33] may be used to obtain $O(\log n)$ upper bounds on the minimax regret even for the worst-case reward distribution, showing that the bound stated in Theorem 6 is tight.
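As an illustration of the upper-bound side, the following sketch simulates an index policy in the style of UCB1 on Bernoulli arms and evaluates the regret decomposition $\sum_j \Delta_j T_j(n)$ on the realized pull counts. The exploration bonus $\sqrt{2 \ln t / T_j}$, the arm means, the horizon, and the seed are assumptions chosen for illustration, not a transcription of the cited algorithms.

```python
import math
import random


def ucb1(means, horizon, rng):
    """Simulate a UCB1-style policy on Bernoulli arms; return the pull counts."""
    k = len(means)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialize the estimates
        else:
            # Pick the arm with the largest empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda j: totals[j] / pulls[j]
                + math.sqrt(2.0 * math.log(t) / pulls[j]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls


def realized_regret(means, pulls):
    """Sum of gap times pull count, using the realized (not expected) counts."""
    best = max(means)
    return sum((best - m) * c for m, c in zip(means, pulls))


means = [0.9, 0.5]
pulls = ucb1(means, horizon=2000, rng=random.Random(0))
regret = realized_regret(means, pulls)
```

Averaging `regret` over many independent runs would estimate the pseudo-regret; a single run only shows how the pull counts concentrate on the better arm.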

Adversarial Bandits
In the adversarial setting, we allow the reward distributions to vary arbitrarily over time. Thus, we assume that the reward distributions are chosen by an "adversary", where the class of permissible adversarial strategies is denoted by $\mathcal{P}$. For a player strategy $S$ and an adversarial strategy $P \in \mathcal{P}$, we define the pseudo-regret analogously to the stochastic case:

$$R_n(S, P) = \max_{1 \le j \le k} E\left[\sum_{t=1}^n X_{j, t}\right] - E\left[\sum_{t=1}^n X_{I_t, t}\right],$$

where the first expectation is taken with respect to possible randomization in the adversarial strategy, and the second expectation is taken with respect to randomization in the strategies of both the player and adversary.
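The following result provides a lower bound for the minimax pseudo-regret, where the supremum is taken over $\mathcal{P}_{Ber}$, the set of all Bernoulli reward distributions over the $k$ arms and $n$ time steps, and the infimum is taken over all player strategies [30,37]:

Theorem 7. The minimax pseudo-regret satisfies

$$\inf_S \sup_{P \in \mathcal{P}_{Ber}} R_n(S, P) \ge \frac{1}{20} \min\left\{\sqrt{nk},\, n\right\},$$

where the infimum is taken over all (possibly randomized) player strategies.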
Proof (sketch). Note that it suffices to prove the bound when the infimum is taken over deterministic player strategies, since the pseudo-regret for a randomized strategy will be a convex combination of the pseudo-regret of deterministic strategies. Fix a deterministic player strategy, and consider the reward distributions $P_1, \dots, P_k \in \mathcal{P}_{Ber}$, where $P_j$ corresponds to the distribution where the reward of each arm $i \ne j$ is i.i.d. Bernoulli$\big(\frac{1}{2}\big)$, and the reward of arm $j$ is i.i.d. Bernoulli$\big(\frac{1}{2} + \epsilon\big)$. Note that this construction bears some similarity to the proof outline for Theorem 6 provided above, in that the reward distribution $P_j$ slightly favors arm $j$. We will also compute a lower bound for the weighted regret, this time allocating uniform weights to each parameter setting, in order to conclude the existence of at least one assignment of reward distributions satisfying the desired lower bounds. Let $E_j$ denote the expectation with respect to the reward distribution $P_j$.
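We may compute

$$R_n(S, P_j) = \epsilon \big(n - E_j[T_j(n)]\big), \qquad (12)$$

where $T_i(n)$ denotes the number of pulls of arm $i$.
Let $P_0$ denote the reward distribution where all arms have a Bernoulli$\big(\frac{1}{2}\big)$ distribution, and let $E_0$ denote the corresponding expectation. We may obtain the following bound:

$$E_j[T_j(n)] - E_0[T_j(n)] \overset{(a)}{\le} n \sqrt{\tfrac{1}{2} D_{KL}(P_0 \,\|\, P_j)} \overset{(b)}{=} n \sqrt{\tfrac{1}{2} E_0[T_j(n)]\, D_{KL}\Big(\text{Bernoulli}\big(\tfrac{1}{2}\big) \,\Big\|\, \text{Bernoulli}\big(\tfrac{1}{2} + \epsilon\big)\Big)}, \qquad (13)$$

where the KL divergence on the left is between the distributions of the observed reward sequences, inequality (a) may be derived by first relating the difference in expectations for bounded random variables to total variation distance and then applying Pinsker's inequality, and equality (b) follows from a direct computation. Combining inequalities (12) and (13), we then obtain a lower bound on the average regret $\frac{1}{k} \sum_{j=1}^k R_n(S, P_j)$ for a suitable choice of $\epsilon$, and taking an infimum over all player strategies produces the desired result.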
Note that the lower bound provided in Theorem 7 clearly also holds when the supremum is taken over any class of adversarial strategies containing $\mathcal{P}_{Ber}$. In particular, one topic of study is that of oblivious adversaries, which are allowed to perform any strategy that is non-adaptive to the actions of the player (i.e., it is chosen before the start of the first round). The Exp3 algorithm provides an upper bound on the minimax pseudo-regret for oblivious adversaries that matches the lower bound in Theorem 7 up to a factor of $\sqrt{\log k}$ [37]. The study of non-oblivious adversaries refers to the setting where the adversary's actions may be chosen in response to the player's sequential choices, as well, and is also an active area of research [31,38].
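For readers unfamiliar with Exp3, the following is a minimal sketch of the exponential-weights scheme with uniform exploration. The mixing parameter `gamma`, the reward function, the horizon, and the seed are hypothetical choices for illustration; rewards are assumed to lie in $[0, 1]$, and no claim is made that these settings achieve the optimal regret constant.

```python
import math
import random


def exp3(reward_fn, k, horizon, gamma, rng):
    """Run an Exp3-style algorithm; reward_fn(t, arm) must return a value in [0, 1]."""
    weights = [1.0] * k
    total_reward = 0.0
    for t in range(horizon):
        total_weight = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / total_weight + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance-weighted estimate: unbiased for the reward of every arm.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)
    return weights, total_reward


# Hypothetical oblivious adversary: arm 0 always pays 1, all other arms pay 0.
weights, total = exp3(
    lambda t, arm: 1.0 if arm == 0 else 0.0,
    k=3,
    horizon=500,
    gamma=0.1,
    rng=random.Random(0),
)
```

Only the weight of the played arm is updated, and dividing by the sampling probability compensates for the fact that rewards of unplayed arms are never observed.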

Discussion
In this article, we have presented several distinct approaches for deriving lower bounds in various statistical learning problems. In each of the settings described (statistical estimation, community recovery, and online learning), we have shown how to simplify the problem to one involving channel decoding, and leverage information-theoretic bounds on the hardness of the decoding problem to bound the hardness of the corresponding statistical problem. It is worth reflecting on the similarities between the techniques employed in each of the approaches. Although the specific interpretation involving channel decoding looks quite different in each of the settings, the trick is to find an appropriate discretization of parameter space so that pairs of parameters are relatively far apart, but the corresponding data-generating distributions are close. In the context of statistical estimation, this means that we construct a packing of parameter space. In the community recovery setting, we consider pairs of community partitions that differ only in the assignment of a single node. In the multi-armed bandit setting, we consider pairs of arm parameters that flip the assignment of the optimal arm, while perturbing the parameter values as little as possible.
On a more applied note, information-theoretic tools have made an appearance in various machine learning algorithms involving maximizing independence between observed quantities. Some examples include decision tree learning via information gain [39]; independent component analysis by mutual information minimization [40]; causal inference algorithms maximizing independence [41]; minimal-redundancy-maximal-relevance (mRMR) methods for feature selection [42]; and image registration via mutual information maximization in medical imaging [43]. As a result, quantities such as mutual information have become increasingly mainstream in data science applications. Note, however, that such applications of information theory to machine learning have no connection to the channel decoding techniques or hardness results discussed in this article. In terms of statistical theory, these applications have created a renewed interest in deriving efficient estimators of entropy and other related information measures based on finite samples [44][45][46][47], but a detailed discussion of such methods is somewhat orthogonal to the main topic of this survey.
denotes the number of edges between j and the nodes in C. Note that, if event F H occurs and all nodes in H are connected to at most log n log log n other nodes in H, then event F A must occur.Furthermore, one can show that, with high probability, every node in H is connected to at most