Smoothing of Binary Codes, Uniform Distributions, and Applications

The action of a noise operator on a code transforms it into a distribution on the respective space. Some common examples from information theory include Bernoulli noise acting on a code in the Hamming space and Gaussian noise acting on a lattice in the Euclidean space. We aim to characterize the cases when the output distribution is close to the uniform distribution on the space, as measured by the Rényi divergence of order α∈(1,∞]. A version of this question is known as the channel resolvability problem in information theory, and it has implications for security guarantees in wiretap channels, error correction, discrepancy, worst-to-average case complexity reductions, and many other problems. Our work quantifies the requirements for asymptotic uniformity (perfect smoothing) and identifies explicit code families that achieve it under the action of the Bernoulli and ball noise operators on the code. We derive expressions for the minimum rate of codes required to attain asymptotically perfect smoothing. In proving our results, we leverage recent results from harmonic analysis of functions on the Hamming space. Another result pertains to the use of code families in Wyner’s transmission scheme on the binary wiretap channel. We identify explicit families that guarantee strong secrecy when applied in this scheme, showing that nested Reed–Muller codes can transmit messages reliably and securely over a binary symmetric wiretap channel with a positive rate. Finally, we establish a connection between smoothing and error correction in the binary symmetric channel.


INTRODUCTION
Many problems of information theory involve the action of a noise operator on a code distribution, transforming it into some other distribution. For instance, one can think of Bernoulli noise acting on a code in the Hamming space or Gaussian noise acting on a lattice in the Euclidean space. We are interested in characterizing the cases when the output distribution is close to the uniform distribution on the space. Versions of this problem have been considered under different names, including resolvability [24,25,64], smoothing [18,33], discrepancy [53,11], and entropy of noisy functions [47,48,49]. Direct applications of smoothing include secrecy guarantees in both the binary symmetric wiretap channel [7,25,64] and the Gaussian wiretap channel [5,30], error correction in the binary symmetric channel (BSC) [27,43], converse coding theorems of information theory [2,24,41,42], strong coordination [14,17,7,16,13], secret key generation [12,30], and worst-to-average case reductions in cryptography [33,10]. Some aspects of this problem also touch upon approximation problems in statistics and machine learning [34,21,22].
Our main results are formulated for smoothing in the binary Hamming space H^n. For r : H^n → R_{≥0} and f : H^n → R, let

(T_r f)(x) := (r * f)(x) = Σ_{y ∈ H^n} r(x ⊕ y) f(y)

be the action of r on the functions on the space. We set r to be a probability mass function (pmf) and call the function T_r f the noisy version of f with respect to r, and refer to r and T_r as a noise kernel and a noise operator, respectively. By smoothing f with respect to r we mean applying the noise kernel r to f. We often assume that r(x) is a radial kernel, i.e., its value on the argument x ∈ H^n depends only on the Hamming weight of x.
There are several ways to view the smoothing operation. Interpreting it as a shift-invariant linear operator, we note that, from Young's inequality, ‖T_r f‖_α = ‖f * r‖_α ≤ ‖f‖_α for 1 ≤ α ≤ ∞, so smoothing contracts the α-norm. Upon applying T_r , the noisy version of f becomes "flatter", hence the designation "smoothing". Note that if f is a pmf, then T_r f is also a pmf, and so this view allows us to model the effect of communication channels with additive noise.
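To make these definitions concrete, here is a minimal sketch (our own illustration, not code from the paper) of the noise operator T_r acting on the code pmf of the length-4 repetition code. It verifies that T_r f is again a pmf and that smoothing contracts the α-norm, as Young's inequality asserts:

```python
# Toy illustration: the noise operator T_r f = r * f on the Hamming space H^4,
# with r the Bernoulli kernel beta_delta.
from itertools import product

n, delta = 4, 0.1
space = list(product([0, 1], repeat=n))

def beta(x):  # Bernoulli noise kernel: depends only on the Hamming weight of x
    w = sum(x)
    return delta**w * (1 - delta)**(n - w)

def xor(x, y):
    return tuple(a ^ b for a, b in zip(x, y))

def T(r, f):  # (T_r f)(x) = sum_y r(x + y) f(y)
    return {x: sum(r(xor(x, y)) * f[y] for y in space) for x in space}

def norm(f, alpha):  # alpha-norm w.r.t. the uniform (normalized) measure on H^n
    return (2**-n * sum(abs(v)**alpha for v in f.values()))**(1 / alpha)

# f_C: uniform pmf on the repetition code C = {0000, 1111}
C = [(0, 0, 0, 0), (1, 1, 1, 1)]
fC = {x: (1 / len(C) if x in C else 0.0) for x in space}
g = T(beta, fC)

assert abs(sum(g.values()) - 1) < 1e-12   # T_r f is again a pmf
assert norm(g, 2) <= norm(fC, 2) + 1e-12  # Young: smoothing contracts norms
```

Since T_r averages shifted copies of f with weights r(z), the contraction follows from the triangle inequality, which the assertions confirm numerically.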
The class of functions that we consider are (normalized) indicators of subsets (codes) in H^n. A code C ⊂ H^n defines a pmf f_C = ½_C / |C|, and thus T_r f_C can be viewed as a noisy version of the code (we also sometimes call it a noisy distribution) with respect to the kernel r. The main question of interest for us is the proximity of this distribution to U_n, or the "smoothness" of the noisy code distributions. To quantify closeness to U_n, we use the Kullback–Leibler (KL) and Rényi divergences (equivalently, L_α norms), and the smoothness measured in D_α(·‖·) is termed D_α-smoothness (L_α-smoothness).
We say that a code is perfectly smoothable with respect to the noise kernel r if the resultant noisy distribution becomes uniform. Our main emphasis is on the asymptotic version of perfect smoothing and its implications for some of the basic information-theoretic problems. A sequence of codes (C_n)_n is asymptotically smoothed by the kernel sequence (r_n)_n if the distance between T_{r_n} f_{C_n} and U_n approaches 0 as n increases. This property is closely related to the more general problem of channel resolvability introduced by Han and Verdú in [24]. Given a discrete memoryless channel W(Y|X) and a distribution P_X, we observe a distribution P_Y on the output of the channel. The task of channel resolvability is to find P_X supported on a subset C ⊂ H^n that approximates P_Y with respect to KL divergence. As shown in [24], there exists a threshold value of the rate such that it is impossible to approximate P_Y using codes of lower rate, while any output process can be approximated by a well-chosen code of rate larger than the threshold. Other proximity measures between distributions were considered for this problem in [55,29,64]. Following the setting in [64], we consider Rényi divergences for measuring closeness to uniformity. We call the minimum rate required to achieve perfect asymptotic smoothing the D_α-smoothing capacity of the noise kernels (r_n)_n, where the proximity to uniformity is measured by the α-Rényi divergence. In this work, we characterize the D_α-smoothing capacity of the sequence (r_n)_n using its Rényi entropy rate.

Asymptotic smoothing. We will limit ourselves to studying smoothing bounds under the action of the Bernoulli or ball noise kernels. A common approach to deriving bounds on the norm of a noisy function is through hypercontractivity inequalities [35,39,63]. In its basic version, given a code C of size M, it yields the estimate

‖T_{β_δ} f_C‖_α ≤ ‖f_C‖_{α′},

where β_δ is the Bernoulli kernel (see Section 2 for formal definitions) and α′ = 1 + (1 − 2δ)^2 (α − 1). This upper bound does not differentiate codes yielding higher or lower smoothness, which in many situations may not be sufficiently informative. Note that other tools, such as "Mrs. Gerber's lemma" [61,35] or strong data processing inequalities, suffer from the same limitation.
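The hypercontractive estimate can be checked numerically on small instances. The sketch below (our own illustration, with norms taken under the uniform measure as elsewhere in this paper) verifies ‖T_δ f‖_α ≤ ‖f‖_{α′} with α′ = 1 + (1 − 2δ)²(α − 1) for random nonnegative functions on H³:

```python
# Numerical check of the hypercontractive estimate ||T_delta f||_alpha <= ||f||_alpha'
# with alpha' = 1 + (1-2*delta)^2 * (alpha - 1); norms under the uniform measure on H^n.
from itertools import product
import random

n, delta, alpha = 3, 0.2, 4.0
alpha_p = 1 + (1 - 2 * delta)**2 * (alpha - 1)
space = list(product([0, 1], repeat=n))

def beta(x):  # Bernoulli kernel
    w = sum(x)
    return delta**w * (1 - delta)**(n - w)

def T(f):  # noise operator: convolution with beta_delta
    return {x: sum(beta(tuple(a ^ b for a, b in zip(x, y))) * f[y] for y in space)
            for x in space}

def norm(f, a):
    return (2**-n * sum(abs(v)**a for v in f.values()))**(1 / a)

random.seed(0)
for _ in range(100):  # random nonnegative test functions
    f = {x: random.random() for x in space}
    assert norm(T(f), alpha) <= norm(f, alpha_p) + 1e-12
```

Note that α′ < α whenever δ ∈ (0, 1/2), so the right-hand side is a weaker norm of f; this is exactly why the bound cannot distinguish between codes with better or worse smoothing.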
A new perspective on bounds for smoothing has been recently introduced in the works of Samorodnitsky [47,48,49]. Essentially, his results imply that codes satisfying certain regularity conditions have good smoothing properties. Their efficiency is highlighted in recent papers [27,26], which leveraged results for code performance on the binary erasure channel (BEC) to prove strong claims about the error correction capabilities of the codes when used on the BSC. Using Samorodnitsky's inequalities, we show that duals of some BEC capacity-achieving codes achieve D_α-smoothing capacity for α ∈ {2, 3, . . . , ∞} with respect to Bernoulli noise. This includes duals of polar codes and doubly transitive codes such as Reed–Muller (RM) codes.
Smoothing and the wiretap channel. Wyner's wiretap channel [62] models communication in the presence of an eavesdropper. Code design for this channel pursues reliable communication between the legitimate parties, while at the same time leaking as little information as possible about the transmitted messages to the eavesdropper. The connection between secrecy in wiretap channels and resolvability was first mentioned by Csiszár [15] and later developed by Hayashi [25]. It rests on the observation that to achieve secrecy it suffices to make the distribution of the eavesdropper's observations conditioned on the transmitted message nearly independent of the message. The idea of characterizing secrecy based on smoothness works irrespective of the measure of secrecy [25,7,64], and it was also employed for nested lattice codes used over the Gaussian wiretap channel in [5].
Secrecy on the wiretap channel can be defined in two ways, measured by the information gained by the eavesdropper, and it depends on whether this quantity is normalized to the number of channel uses (weak secrecy) or not (strong secrecy). This distinction was first highlighted by Maurer [32], and it has been adopted widely in recent literature. Early papers devoted to code design for the wiretap channel relied on random codes, but (for simple channel models such as the BSC or BEC) this has changed with the advent of explicit capacity-approaching code families. Weak secrecy results based on LDPC codes were presented in [57], but initial attempts to attain strong secrecy encountered some obstacles. To circumvent them, the first works on code construction [56,31] had to assume that the main channel is noiseless. The problem of combining strong secrecy and reliability for general wiretap channels was resolved in [23], but that work had to assume that the two communicating parties share a small number of random bits unavailable to the eavesdropper. Apart from the polar coding scheme of [23], explicit code families that support reliable communication with positive rate and strong secrecy have not previously appeared in the literature. In this work, we show that nested RM codes perform well in binary symmetric wiretap channels based on their smoothing properties. While our work falls short of proving that nested RM codes achieve capacity, we show that they can transmit messages reliably and secretly at rates close to capacity.
Ball noise and decoding error. Ball-noise smoothing provides a tool for estimating the error probability of decoding on the BSC. We derive impossibility and achievability bounds for D_α-smoothness of noisy distributions with respect to the ball noise. Smoothing of a code with respect to the L_2 norm plays a special role because in this case the second norm (the variance) of the resulting distribution can be expressed via pairwise distances between codewords, enabling one to rely on tools from Fourier analysis. The recent paper by Debris-Alazard et al. [18] established universal bounds for smoothing of codes or lattices, with cryptographic reductions in mind. The paper by Sprumont and Rao [43] addressed bounds for the error probability of list decoding at rates above BSC capacity. A paper by one of the present authors [4] studied the variance of the number of codewords in balls of different radii (a quantity known as quadratic discrepancy [6,54]).
The main contributions of this paper are the following: (1) characterizing D_α-smoothing capacities of radial noise operators on the Hamming space for α ∈ (1, ∞]; (2) identifying some explicit code families that attain the smoothing capacity of Bernoulli noise for α ∈ {2, 3, . . . , ∞}; (3) obtaining rate estimates for RM codes used on the BSC wiretap channel under the strong secrecy condition; (4) showing that codes possessing sufficiently good smoothing properties are suitable for error correction. In Section 2, we set up the notation and introduce the relevant basic concepts. Then, in Section 3, we derive expressions for D_α-smoothing capacities for α ∈ (1, ∞], and in Section 4 we use these results to analyze smoothing of code families under the action of Bernoulli noise. Section 5 is devoted to the application of these results to the binary symmetric wiretap channel. In particular, we show that RM codes can achieve rates close to the capacity of the BSC wiretap channel, while at the same time guaranteeing strong secrecy. In Section 6 we establish threshold rates for smoothing under ball noise, and derive bounds for the error probability of decoding on the BSC, including the list case, based on the distance distribution. Concluding the paper, Section 7 briefly points out that the well-known class of uniformly packed codes is perfectly smoothable with respect to "small" noise kernels.

Notation. Let C ⊂ H^n be a code and let (A_i, i = 0, . . . , n) be the distance distribution of the code. If the code C forms an F_2-linear subspace in H^n, we denote by C^⊥ := {y ∈ H^n : Σ_i x_i y_i = 0 for all x ∈ C} its dual code.
The function ½_C denotes the indicator of a subset C ⊂ H^n, and f_C = ½_C / |C| is the corresponding pmf, i.e., the uniform distribution over the set; we call it a code distribution. We use the special notation s_t for this pmf when C = S(0, t) and similarly write b_t for it when C = B(0, t). Finally, β_δ is the Bernoulli noise distribution on H^n, given by (3), and U_n is the uniform distribution, given by U_n(x) = 2^{−n} for all x.
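For concreteness, the three radial kernels can be instantiated on H⁵ as follows (a sketch of the definitions above; the names s_t, b_t, β_δ follow the paper's notation):

```python
# The three radial kernels used in the paper, on H^5: the sphere kernel s_t,
# the ball kernel b_t, and the Bernoulli kernel beta_delta.
from itertools import product
from math import comb

n, t, delta = 5, 2, 0.1
space = list(product([0, 1], repeat=n))
w = lambda x: sum(x)  # Hamming weight

sphere = comb(n, t)                           # |S(0, t)|
ball = sum(comb(n, i) for i in range(t + 1))  # |B(0, t)|

s_t = {x: (1 / sphere if w(x) == t else 0.0) for x in space}
b_t = {x: (1 / ball if w(x) <= t else 0.0) for x in space}
beta = {x: delta**w(x) * (1 - delta)**(n - w(x)) for x in space}

for r in (s_t, b_t, beta):
    assert abs(sum(r.values()) - 1) < 1e-12   # each kernel is a pmf...
    vals = {}                                 # ...and radial: its value depends
    for x in space:                           # only on the Hamming weight of x
        vals.setdefault(w(x), set()).add(round(r[x], 12))
    assert all(len(v) == 1 for v in vals.values())
```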
Entropies and norms. For a function f : H^n → R, we define its α-norm as

‖f‖_α := (2^{−n} Σ_{x ∈ H^n} |f(x)|^α)^{1/α}, 1 ≤ α < ∞, and ‖f‖_∞ := max_{x ∈ H^n} |f(x)|,

i.e., the norms are taken with respect to the uniform (normalized counting) measure on H^n.
Given a pmf P, let

H(P) := −Σ_x P(x) log P(x) and H_α(P) := (1/(1 − α)) log Σ_x P(x)^α, α ≠ 1,

denote its Shannon entropy and Rényi entropy of order α, respectively. If P is supported on two points, we write h(P) and h_α(P) instead (all logarithms are to base 2). The limiting cases α = 0, 1, ∞ are well defined; in particular, for α = 1, H_α(P) reduces to H(P).
For two discrete probability distributions P and Q, the α-Rényi divergence (or simply α-divergence) is defined as

D_α(P ‖ Q) := (1/(α − 1)) log Σ_x P(x)^α Q(x)^{1−α}.

Channels. In this paper, a channel is a conditional probability distribution W : {0, 1} → Y, where Y is a finite set, so that W(y|x) is the conditional probability of the output y for the input x. We frequently consider the binary symmetric channel with crossover probability δ and the binary erasure channel with erasure probability λ, abbreviating them as BSC(δ) and BEC(λ), respectively. We are often interested in the n-fold channel W^(n), i.e., the conditional probability distribution corresponding to n uses of the channel. For the input X, let Y(X, W) be the random output of the channel W^(n). If the input sequences are chosen from a uniform distribution on a code C, we denote the input by X_C. Since the number of uses of the channel is usually clear from the context, we suppress the dependence on n from the notation for channels and sequences.
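A minimal sketch of these entropy and divergence definitions, with a numerical check that D_α tends to the KL divergence as α → 1 and that H_α is nonincreasing in α:

```python
# Shannon/Renyi entropy and the alpha-divergence, as defined in this section.
from math import log2

def H(P):  # Shannon entropy
    return -sum(p * log2(p) for p in P if p > 0)

def H_alpha(P, a):  # Renyi entropy of order a != 1
    return log2(sum(p**a for p in P)) / (1 - a)

def D_alpha(P, Q, a):  # Renyi divergence of order a != 1
    return log2(sum(p**a * q**(1 - a) for p, q in zip(P, Q) if p > 0)) / (a - 1)

def D_KL(P, Q):  # the alpha -> 1 limit
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P, Q = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]
assert abs(D_alpha(P, Q, 1.001) - D_KL(P, Q)) < 1e-2    # D_alpha -> KL as a -> 1
assert H_alpha(P, 2) <= H(P) <= H_alpha(P, 0.5)         # H_alpha nonincreasing in a
```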
Let C be a code of length n. For a channel W and input X_C, the block-MAP decoder is defined as

x̂(y) := argmax_{x ∈ C} W^(n)(y | x).

For a given code and channel, denote the error probability of block-MAP decoding by P_B(W, C) := Pr(x̂(Y) ≠ X_C).

2.2. D_α- and L_α-smoothness. Recall that in the introduction, we expressed the smoothness of a distribution as its proximity to uniformity. Here we formalize this notion based on two (equivalent) proximity measures.
Let g be a pmf on H^n. A natural measure of the uniformity of g is its divergence from the uniform distribution, D_α(g ‖ U_n). We call this the D_α-smoothness of g. Observe that D_α(g ‖ U_n) ≥ 0, with equality iff g = U_n; likewise, 2^n ‖g‖_α attains the value 1 only at g = U_n. Thus, the better the pmf g approximates uniformity, the closer 2^n ‖g‖_α is to 1 (the factor 2^n is simply a normalization that allows dimension-agnostic analysis). Therefore, 2^n ‖g‖_α (α ∈ (0, 1) ∪ (1, ∞]) can be considered as another measure of proximity. We call 2^n ‖g‖_α the L_α-smoothness of g. From (7) and (8), it follows that D_α-smoothness and L_α-smoothness are equivalent.
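The equivalence of the two smoothness measures can be seen from the identity D_α(g ‖ U_n) = (α/(α−1)) log₂(2ⁿ‖g‖_α), which follows by expanding both sides under the normalized-measure convention used here (our own derivation, offered as an illustration of the equivalence rather than a restatement of (7)–(8)):

```python
# Check: D_alpha(g || U_n) = (alpha/(alpha-1)) * log2( 2^n * ||g||_alpha )
# for a random pmf g on H^3, norms under the normalized (uniform) measure.
from itertools import product
from math import log2
import random

n, alpha = 3, 3.0
space = list(product([0, 1], repeat=n))

random.seed(1)
raw = {x: random.random() for x in space}
Z = sum(raw.values())
g = {x: v / Z for x, v in raw.items()}  # a random pmf on H^3

# L_alpha-smoothness ingredient: the alpha-norm of g
norm_a = (2**-n * sum(v**alpha for v in g.values()))**(1 / alpha)
# D_alpha-smoothness: Renyi divergence from the uniform distribution
D_a = log2(sum(v**alpha * (2**-n)**(1 - alpha) for v in g.values())) / (alpha - 1)

assert abs(D_a - (alpha / (alpha - 1)) * log2(2**n * norm_a)) < 1e-12
```

In particular, D_α(g ‖ U_n) → 0 iff 2ⁿ‖g‖_α → 1, which is the equivalence used throughout the paper.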
Recall that for a given code C and a noise kernel r, T_r f_C = r * f_C is the noisy distribution of the code C with respect to r. We intend to study smoothing properties of such noisy distributions of codes. In particular, we characterize the necessary conditions for D_α(T_r f_C ‖ U_n) to be close to zero (equivalently, for 2^n ‖T_r f_C‖_α to be close to one). In Section 3, we quantify these requirements in the asymptotic setting.

2.3. Resolvability. The problem of channel resolvability was introduced by Han and Verdú [24] under the name of approximating the output statistics of the channel. The objective of channel resolvability is to approximate the output distribution of a given input by the output distribution of a code with a smaller support size. In this work, we are interested in code families whose noisy distributions approximate uniformity. Resolvability characterizes the necessary conditions for this to happen in terms of the rate of the code.
Let W be a (discrete memoryless) channel with input alphabet X and output alphabet Y. Let X = {X_n}_{n=1}^∞ be a discrete-time random process where the RVs X_n take values in X. Denote by Y_n the random output of W for the input X_n and let Y = {Y_n}_{n=1}^∞. Denote by P_Y the distribution of Y and let P_{Y^(n)} be the pmf of the n-tuple Y^(n). For a legitimate (realizable) output process Y, define the minimum rate of code sequences (C_n)_n such that ∆(P_{Y(X_{C_n}, W)}, P_{Y^(n)}) → 0, where ∆ is a measure of closeness of a pair of probability distributions. In words, we look for sequences of distributions (f_{C_n})_n of the smallest possible rate that approximate P_Y on the output of W.
The original problem as formulated by Han and Verdú in [24] seeks to find the resolvability of the channel, defined by optimizing this rate over all realizable output processes, where ∆ is either the variational distance or the normalized KL divergence (1/n) D(· ‖ ·). Hayashi [25] considered the same problem with the proximity measured by the unnormalized KL divergence. In each case, resolvability equals the Shannon capacity of the channel W.
Theorem 2.1 ([24,25]). Let W be a discrete memoryless channel. Suppose that ∆ is either the KL divergence (normalized or not) or the variational distance; then the resolvability of W equals its Shannon capacity C(W).

The authors of [24] proved this result under the additional assumption that the channel W satisfies the strong converse, and Hayashi [25] later showed that this assumption is unessential.
In addition to the proximity measures considered in Theorem 2.1, papers [55,29,64] considered other possibilities. In particular, Yu and Tan [64] studied the resolvability problem for a specific target distribution P_Y and for the Rényi divergence ∆ = D_α (6). Their main result is as follows.
Theorem 2.2 ([64], Theorem 2). Let W be a channel and let P_Y be an output distribution. Then the α-resolvability of the pair (W, P_Y) is given by a minimization over P(W, P_Y), the set of input distributions P_X consistent with the output P_Y.
A direct corollary of Theorem 2.2 is the following.

Corollary 2.3 ([64], Eq. (55)). Let Y* be the output process where, for each n, Y*_n ∼ Ber(1/2). Then the minimum rate of codes whose noisy distributions approximate Y* admits an explicit expression.

This corollary gives necessary conditions on the rate of codes that can approximate the uniform distribution via smoothing. We will connect this result to the problem of finding smoothing thresholds in Section 4.

PERFECT SMOOTHING -THE ASYMPTOTIC CASE
For a given family of noise kernels (T_{r_n})_n, there exists a threshold rate such that it is impossible to approximate uniformity with codes of rate below the threshold, irrespective of the chosen code, while at the same time there exist families of codes with rate above the threshold that allow perfect approximation in the limit of infinite length. For instance, for the Bernoulli(δ) noise applied to a code C, the smoothed distribution is nonuniform unless C = H^n or δ = 1/2. At the same time, it is possible to approach the uniform distribution asymptotically for large n once the code sequence satisfies certain conditions. Intuitively, it is clear that, for a fixed noise kernel, it is easier to approximate uniformity if the code rate is sufficiently high. In this section, we characterize the threshold rate for (asymptotically) perfect smoothing. Of course, the threshold also depends on the proximity measure ∆ that we are using. In this section, we use perfect smoothing to mean "asymptotically perfect"; if the proximity measure ∆ for smoothing is not specified, we are using the KL divergence. We obtain threshold rates for perfect smoothing measured with respect to the α-divergence for several values of α. In the subsequent sections, we work out the details for the Bernoulli and ball noise operators, which also have implications for communication problems.
Definition 3.1. Let (C_n)_n be a sequence of codes of increasing length n and let 0 ≤ α ≤ ∞. We say that the sequence (C_n)_n is asymptotically perfectly D_α-smoothable with respect to the noise kernels (r_n)_n if

lim_{n→∞} D_α(T_{r_n} f_{C_n} ‖ U_n) = 0.

One can also define a dimensionless measure for perfect asymptotic smoothing by considering the limiting process in (13).

Proposition 3.1. Convergence in (13) implies perfect smoothing for all 1 < α ≤ ∞ and is equivalent to it for α = ∞.
Proof. By the triangle inequality, the mode of convergence in (13) is not weaker than the mode of convergence in Definition 3.1; the comparison between the divergence and the norm relies on standard inequalities [52, p. 388]. Their form depends on α: one bound applies for 2 ≤ α < ∞ and another for 1 < α < 2, where α′ = α/(α − 1) is the Hölder conjugate. These inequalities show the claim for α ∈ (1, ∞).

Definition 3.2. Let (r_n)_n be a sequence of noise kernels. We say that rate R is achievable for perfect D_α-smoothing if there exists a sequence of codes (C_n)_n of rate at most R that is asymptotically perfectly D_α-smoothable with respect to (r_n)_n. Note that if a rate R_1 is achievable, then any rate R_2 > R_1 is also achievable. Indeed, consider a (linear) code C_1 of rate R_1 that has good smoothing properties. Construct C_2 by taking the union of 2^{n(R_2−R_1)} non-overlapping shifts of C_1. Then the rate of C_2 is R_2, and since each shift has good smoothing properties, the same is true for C_2. Therefore, let us define the main concept of this section.

Definition 3.3. Given a sequence of kernels r = (r_n)_n, define the D_α-smoothing capacity as

(14) S^r_α := inf{R : R is achievable for perfect D_α-smoothing with respect to r}.

Note that this quantity is closely related to resolvability: if, rather than optimizing over the output process in (12), we set the output distribution to uniform and take ∆ = D_α, then S^r_α equals J^{(D_α)}(W, P_Y) for the channel W given by the noise kernel r. To avoid future confusion, we refer to the capacity of reliable transmission as Shannon's capacity.
The following lemma provides a lower bound for D α -smoothness.
Lemma 3.2. Let C ⊂ H^n be a code of size M = 2^{nR} and let r be a noise kernel. Then for α ∈ [0, ∞]

D_α(T_r f_C ‖ U_n) ≥ n − log M − H_α(r).

Proof. We will first prove the corresponding statement for the L_α-smoothness. Together with (7), this implies that the claimed inequality holds for α ∈ (1, ∞).
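For α = ∞ a bound of this flavor can be derived directly: the smoothed pmf peaks at a value of at least max_x r(x)/M, so D_∞(T_r f_C ‖ U_n) ≥ n − log M − H_∞(r). The sketch below (our own derivation, offered as a plausibility check rather than the paper's proof) verifies this for random codes under Bernoulli noise:

```python
# Sanity check of the alpha = infinity bound: for any code C of size M and
# kernel r, D_inf(T_r f_C || U_n) >= n - log M - H_inf(r), because the smoothed
# pmf at any codeword is at least max_x r(x) / M.
from itertools import product
from math import log2
import random

n, delta = 4, 0.15
space = list(product([0, 1], repeat=n))
beta = {x: delta**sum(x) * (1 - delta)**(n - sum(x)) for x in space}
H_inf = -log2(max(beta.values()))  # min-entropy of the kernel

random.seed(2)
for _ in range(50):
    C = random.sample(space, 4)  # a random code, M = 4
    g = {x: sum(beta[tuple(a ^ b for a, b in zip(x, c))] for c in C) / len(C)
         for x in space}
    D_inf = log2(2**n * max(g.values()))  # D_inf(g || U_n)
    assert D_inf >= n - log2(len(C)) - H_inf - 1e-12
```

The bound makes the rate requirement transparent: the divergence can vanish only if log M + H_∞(r) is close to n, i.e., only if the code rate compensates for the entropy deficit of the kernel.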
A question of interest is whether there exist sequences of codes of rate R > 1 − π(α) that achieve perfect D_α-smoothing, where π(α) denotes the Rényi entropy rate of the kernel sequence.
The next theorem shows that this is the case for α ∈ (1, ∞].
The proof relies on a random coding argument and is given in Appendix B. This result will be used below to characterize smoothing capacity of the Bernoulli and ball noise operators.

BERNOULLI NOISE
In this section, we characterize the value S^{β_δ}_α for a range of values of α. Then we provide explicit code families that attain the D_α-smoothing capacities.
As already mentioned, resolvability for β_δ with respect to the α-divergence was considered by Yu and Tan [64]. Their results, stated in Corollary 2.3, yield an expression for S^{β_δ}_α for α ∈ [0, 2] ∪ {∞}. The next theorem summarizes the current knowledge about S^{β_δ}_α, where the claims for 2 < α < ∞ are new.

Theorem 4.1. For α ∈ (1, ∞], the smoothing capacity of the Bernoulli kernel is S^{β_δ}_α = 1 − h_α(δ), while for α ∈ [0, 1] it equals 1 − h(δ).

Having quantified smoothing capacities, let us examine code families with strong smoothing properties. Since the D_1-smoothing capacity and Shannon capacity coincide, it is natural to speculate that codes that achieve Shannon capacity when used on the BSC(δ) would also attain D_1-smoothing capacity. However, the following result demonstrates that capacity-achieving codes do not yield perfect smoothing. For typographical reasons, we abbreviate T_{β_δ} by T_δ from this section onward.

Proposition 4.2. Let (C_n)_n be a sequence of codes achieving the capacity of the BSC(δ). Then the sequence (C_n)_n is not perfectly D_1-smoothable with respect to β_δ.

Proof. The second part of the statement is Theorem 2 in [51]. The first part is obtained as follows. Let (C_n)_n be a capacity-achieving sequence of codes for the BSC(δ). Then from [40] (Theorem 49), there exists a constant a > 0 such that the rate of C_n is at most 1 − h(δ) − a/√n for all large n, which falls below the rate required for perfect smoothing.

Apart from random codes, only polar codes are known to achieve D_1-smoothing capacity. Before stating the formal result, recall that polar codes are formed by applying several iterations of a linear transformation to the input, which results in creating virtual channels for individual bits with Shannon capacity close to zero or to one, plus a vanishing proportion of intermediate-capacity channels. While, by Proposition 4.2, polar codes that achieve the BSC capacity cannot achieve D_1-smoothing capacity, adding some intermediate-bit channels to the set of data bits makes this possible. This idea was first introduced in [31]^1 and expressed in terms of resolvability in [8].
Theorem 4.3 ([31,8]). Let W^{(i)}_n, i = 1, . . . , n, be the virtual channels formed after applying n steps of the polarization procedure. For γ ∈ (0, 1/2), let the set of data bits include, in addition to the indices of the near-perfect virtual channels, the indices of the intermediate-capacity channels determined by γ.

Hence, the polar code construction presented above achieves the perfect smoothing threshold with respect to KL divergence. Furthermore, since convergence in α-divergence for α < 1 is weaker than convergence for α = 1, the same polar code sequence is perfectly D_α-smoothable for α < 1. Noticing that the smoothing threshold for α < 1 is 1 − h(δ) by Theorem 4.1, we conclude that the above polar code sequence achieves the smoothing capacity in α-divergence for α < 1.

^1 The authors of [31] had to include these channels to attain secrecy for the wiretap channel. At the same time, it is this inclusion that did not allow them to also attain transmission reliability. See Sec. 5 for more details about this issue.
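The thresholds in this discussion involve the binary Rényi entropy h_α(δ). The sketch below tabulates the candidate threshold 1 − h_α(δ) for several orders (the threshold form is our assumption here, consistent with the rate expressions quoted in Section 5.1) and checks that stricter orders demand higher rates:

```python
# Binary Renyi entropy h_alpha(delta) and the hypothesized smoothing threshold
# 1 - h_alpha(delta); h_1 recovers the Shannon (KL-divergence) case.
from math import log2

def h_alpha(delta, a):
    if a == 1:  # Shannon limit
        return -delta * log2(delta) - (1 - delta) * log2(1 - delta)
    return log2(delta**a + (1 - delta)**a) / (1 - a)

delta = 0.11
for a in (1, 2, 3, 10):
    S = 1 - h_alpha(delta, a)
    print(f"alpha = {a}: threshold rate = {S:.4f}")

# h_alpha is nonincreasing in alpha, so stricter (larger-alpha) proximity
# measures demand higher code rates:
assert h_alpha(delta, 3) <= h_alpha(delta, 2) <= h_alpha(delta, 1)
```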
As mentioned earlier, the smoothing properties of code families other than random codes and polar codes have not been extensively studied. We show that duals of capacity-achieving codes for the BEC exhibit good smoothing properties, using the tools developed in [49]. As the first step, we establish a connection between the smoothing of a generic linear code and the erasure-correction performance of its dual code.

Lemma 4.4. Let C be a linear code and let X_{C^⊥} be a random uniform codeword of C^⊥. Let Y_{X_{C^⊥}, BEC(λ)} be the output of the erasure channel BEC(λ) for the input X_{C^⊥}. Then the smoothness of T_δ f_C is controlled by the erasure-decoding performance of C^⊥.

The proof is given in Appendix D. Using this lemma, we show that duals of BEC capacity-achieving codes (with growing distance) exhibit good smoothing properties. In particular, they achieve D_α-smoothing capacities for α ∈ {2, 3, . . . , ∞}.

Theorem 4.5. Let (C_n)_n be a sequence of linear codes with rate R_n → R. Suppose that the dual sequence (C^⊥_n)_n achieves Shannon's capacity of the BEC(λ) with λ = R, and assume that d(C^⊥_n) → ∞. Then the sequence (C_n)_n achieves the D_α-smoothing capacity S^{β_δ}_α for α ∈ {2, 3, . . . , ∞}.

Proof. Since the dual codes achieve the capacity of the BEC, it follows from [58, Theorem 5.2] that, if their distance grows with n, then their decoding error probability vanishes. Combining this with Lemma 4.4 and Theorem 4.1 proves the claim.
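The dual-code connection can be made concrete via the standard Fourier identity (T_δ f_C)(x) = 2^{−n} Σ_{c ∈ C^⊥} (1 − 2δ)^{w(c)} (−1)^{⟨x,c⟩}, valid for any linear code C; this is textbook Fourier analysis on the cube rather than the statement of Lemma 4.4 itself. The check below compares it against direct convolution for a small self-dual code:

```python
# Fourier form of the smoothed pmf of a linear code, checked against direct
# convolution for the [4,2] code spanned by 1100 and 0011 (which is self-dual).
from itertools import product

n, delta = 4, 0.2
space = list(product([0, 1], repeat=n))
xor = lambda x, y: tuple(a ^ b for a, b in zip(x, y))
dot = lambda x, y: sum(a & b for a, b in zip(x, y)) % 2

def span(gens):  # F2-linear span of the generators
    C = {(0,) * n}
    for g in gens:
        C |= {xor(c, g) for c in C}
    return C

C = span([(1, 1, 0, 0), (0, 0, 1, 1)])
C_perp = {x for x in space if all(dot(x, c) == 0 for c in C)}

beta = {x: delta**sum(x) * (1 - delta)**(n - sum(x)) for x in space}
# direct convolution: (T_delta f_C)(x) = (1/|C|) sum_{c in C} beta(x + c)
direct = {x: sum(beta[xor(x, c)] for c in C) / len(C) for x in space}
# Fourier form: characters of the dual code weighted by (1-2*delta)^w(c)
fourier = {x: 2**-n * sum((1 - 2 * delta)**sum(c) * (-1)**dot(x, c) for c in C_perp)
           for x in space}

assert all(abs(direct[x] - fourier[x]) < 1e-12 for x in space)
```

The identity explains why low-weight dual codewords are the obstruction to smoothing: each c ∈ C^⊥ contributes a non-uniform term of magnitude (1 − 2δ)^{w(c)}.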
The known code families that achieve the capacity of the BEC include polar codes, LDPC codes, and doubly transitive codes, such as constant-rate RM codes. LDPC codes do not fit the assumptions because of low dual distance, but the other codes do. This yields explicit families of codes that achieve the D_α-smoothing capacity.
We illustrate the results of this section in Fig. 1, where the curves show the achievability and impossibility rates for perfect smoothing with respect to Bernoulli noise. A code (sequence) of rate R below the Shannon capacity cannot be perfectly smoothed by the noise β_δ. The sequence of polar codes from [31], cited in Theorem 4.3, is smoothable at rates equal to the Shannon capacity (we stress again that they do not provide a decoding guarantee at that noise level; see footnote 1). At the second curve from the bottom, the duals of codes that achieve Shannon's capacity on the BEC achieve perfect D_1-smoothing; at the third (fourth) curve, these codes are perfectly D_2- (resp., D_∞-) smoothable, and they achieve the corresponding smoothing capacity.

Remark 4. Observe that the strong converse of the channel coding theorem does not imply perfect smoothing. To give a quick example, consider a code C_n = B(0, δ′n) formed of all vectors in the ball. Let 0 < δ < 1/2 and let us use this code on a BSC(δ), where h(δ) + h(δ′) > 1. From the choice of the parameters, the rate of C_n is above capacity, and therefore P_B(BSC(δ), C_n) ≈ 1 by the strong converse. At the same time, the noisy distribution of this code is far from uniform, where the transition from the ball noise to Bernoulli noise is shown in [35].

Remark 5. In this paper, we mostly study the tradeoff between the rate of codes and the level of the noise needed to achieve perfect smoothing. A recent work of Debris-Alazard et al. [18] considered guarantees for smoothing derived from the distance distribution of codes and their dual distance (earlier, similar calculations were performed in [4,3]). Our approach enables us to find conditions for perfect smoothing similar to [18] but relying on fewer assumptions.
Proposition 4.6. Let (C_n)_n be a sequence of codes whose dual distance d(C^⊥_n) grows with n. With this assumption, the proof is a straightforward application of Lemma 4.4.
Compared to [18], this claim removes the restrictions on the support of the dual distance distribution of the codes C n .

BINARY SYMMETRIC WIRETAP CHANNELS
In this section, we discuss applications of perfect smoothing to the BSC wiretap channel. Wyner's wiretap channel model V [62], for the case of BSCs, is defined as follows. The system is formed of three terminals, A, B, and E. Terminal A sends messages to terminal B over a BSC(δ_b) (the main channel W_b), while the eavesdropper E observes the transmissions through a BSC(δ_e) (the wiretap channel W_e). We assume that messages are encoded into a subset of H^n, which imposes some probability distribution on the input of the channels. The goal of the encoding is to ensure reliability and secrecy of communication. The reliability requirement amounts to the condition Pr(M̂ ≠ M) → 0 as n → ∞, where M̂ is the estimate of M made by B. To ensure secrecy, we require the strong secrecy condition I(M; Z) → 0. This is in contrast to the condition (1/n) I(M; Z) → 0 studied in the early works on the wiretap channel, which is now called weak secrecy. Denote by R = (1/n) log |M| the transmission rate. The secrecy capacity C_s(V) is defined as the supremum of the rates that permit reliable transmission while also conforming to the secrecy condition.
The nested coding scheme, proposed by Wyner [62], has been the principal tool for constructing well-performing transmission protocols for the wiretap channel [57,31,23]. In this scheme, a code C_b is partitioned into cosets of a subcode C_e ⊂ C_b; the message indexes a coset, and the transmitted word is chosen uniformly at random from that coset. Strong secrecy can be achieved relying on perfect smoothing. Denote by c_m a leader of the coset that corresponds to the message m. The basic idea is that if P_{Z|M=m} = (T_{δ_e} f_{C_e})(· + c_m) is close to the uniform distribution U_n for all m, these conditional pmfs are almost indistinguishable from each other, and terminal E has no means of inferring the transmitted message from the observed bit string Z.
As mentioned earlier, weak secrecy results for the wiretap channel based on LDPC codes and on polar codes were presented in [57] and [31], respectively. The problem that these schemes faced, highlighted in Proposition 4.2 and Theorem 4.3, is that code sequences that achieve BSC capacity have a rate gap of at least order 1/√n to the capacity value. At the same time, the rate of perfectly smoothable codes must exceed capacity by a similar quantity [60]. For this reason, the authors of [31] included the intermediate virtual channels in their polar coding scheme, which gave them strong secrecy but required a noiseless main channel (cf. footnote 1). A similar general issue arose earlier in attempts to use LDPC codes for the wiretap channel [56].
Contributing to the line of work connecting smoothing and the wiretap channel [25,7,64], we show that nested coding schemes C_e ⊂ C_b, where the code C_b is good for error correction on the BSC(δ_b) and C_e is perfectly smoothable with respect to β_{δ_e}, attain strong secrecy and reliability for a BSC wiretap channel (δ_b, δ_e). As observed in Lemma 4.4, duals of good erasure-correcting codes are perfectly smoothable for certain noise levels, and hence they form a good choice for C_e in this scenario.
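A toy instance of the nested scheme just described (the code parameters are our own choice for illustration): the message selects a coset of C_e inside C_b, and a uniformly random element of that coset is transmitted.

```python
# Wyner's nested scheme in miniature: C_e inside the [4,3] even-weight code C_b.
from itertools import product
import random

n = 4
space = list(product([0, 1], repeat=n))
xor = lambda x, y: tuple(a ^ b for a, b in zip(x, y))

C_b = {x for x in space if sum(x) % 2 == 0}                 # [4,3] even-weight code
C_e = {(0, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)}  # [4,2] subcode

# coset leaders: one representative per coset of C_e in C_b
leaders, seen = [], set()
for x in sorted(C_b):
    if x not in seen:
        leaders.append(x)
        seen |= {xor(x, c) for c in C_e}

assert len(leaders) == len(C_b) // len(C_e)  # 2 cosets -> 1 message bit

def encode(m):  # message m indexes a coset; the codeword inside it is random
    return xor(leaders[m], random.choice(sorted(C_e)))

random.seed(3)
assert all(encode(m) in C_b for m in range(len(leaders)) for _ in range(10))
```

The transmission rate of the scheme is R_b − R_e, matching the rate expression used in the text.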
The following lemma establishes a connection between the smoothness of a noisy distribution of a code and strong secrecy.
Lemma 5.1. Consider the nested coding scheme for the BSC wiretap channel introduced above. If D(T_{δ_e} f_{C_e} ‖ U_n) < ε, then I(M; Z) < ε.
Proof. We have I(M; Z) = Σ_m P_M(m) D(P_{Z|M=m} ‖ P_Z) ≤ Σ_m P_M(m) D(P_{Z|M=m} ‖ U_n) = D(T_{δ_e} f_{C_e} ‖ U_n), where the last equality holds because the divergence to U_n is invariant under the shift by the coset leader.

This lemma enables us to formulate conditions for reliable communication while guaranteeing the strong secrecy condition. Namely, it suffices to take a pair (a sequence of pairs) of nested codes C_e ⊂ C_b such that D(T_{δ_e} f_{C_e} ‖ U_n) → 0 as n → ∞. If at the same time the code C_b corrects errors on a BSC(δ_b), then the scheme fulfills both the reliability and strong secrecy requirements under noise levels δ_b and δ_e for channels W_b and W_e, respectively, supporting transmission from A to B at rate R_b − R_e. Together with the results established earlier, we can now make this claim more specific.

Theorem 5.2. Let (C^n_e)_n and (C^n_b)_n, with C^n_e ⊂ C^n_b, be sequences of linear codes that achieve the capacity of the BEC for their respective rates. Suppose that their distances and dual distances grow with n; then the nested scheme based on these codes supports reliable transmission with strong secrecy on the BSC wiretap channel (δ_b, δ_e).

To give an example of a code family that satisfies the assumptions of this theorem, consider RM codes of constant rate. Namely, let C^n_e ⊂ C^n_b be two sequences of RM codes whose rates converge to R_e and R_b, respectively. Note that duals of RM codes are themselves RM codes. By a well-known result [28], RM codes achieve the capacity of the BEC, and for any sequence of constant-rate RM codes, the distance scales as 2^{Θ(√n)}. Therefore, RM codes satisfy the assumptions of Theorem 5.2. Note that for RM codes we can obtain a stronger result, based on their error-correction properties on the BSC. Invoking this additional argument brings them closer to secrecy capacity under the strong secrecy condition.

Theorem 5.3. Nested RM codes C^n_e ⊂ C^n_b with R_b < 1 − h(δ_b) support reliable transmission with strong secrecy on the BSC wiretap channel (δ_b, δ_e).

Proof. Very recently, Abbe and Sandon [1], building upon the work of Reeves and Pfister [44], proved that RM codes achieve capacity on symmetric channels. Therefore, the condition R_b < 1 − h(δ_b) guarantees reliability. The rest of the proof is similar to that of Theorem 5.2.

Theorems 5.2 and 5.3 stop short of constructing codes that attain the secrecy capacity of the channel (this is similar to the results of [27] for the transmission problem over the BSC). To quantify the gap to capacity, we plot the smoothing and decodability rate bounds in Fig. 2. As an example, let us set the noise parameters δ_b = 0.05 and δ_e = 0.3 and denote the corresponding secrecy capacity by C_s. Suppose that we use a BEC capacity-achieving code as the code C_b and a dual of a BEC capacity-achieving code as the code C_e in the nested scheme. The value R′ is the largest rate at which we can guarantee both reliability and strong secrecy; in the example in Fig. 2, R′ falls short of C_s. As noted, RM codes generally support a higher communication rate than R′. Let R′′ be their achievable rate; for the same noise parameters as above, R′′ exceeds R′ while still falling short of C_s.

Remark 6. The fact that RM codes achieve capacity on symmetric channels immediately implies that nested RM codes achieve the secrecy capacity of the BSC wiretap channel under weak secrecy. While it is tempting to assume that, coupled with the channel duality theorems of [45,46], this result also implies that RM codes fulfill the strong secrecy requirement on the BSC wiretap channel, an immediate proof looks out of reach [37].
5.1. Secrecy from α-divergence. Classically, (strong) secrecy in the wiretap channel is measured by I(M; Z). In [7], slightly weaker secrecy measures were considered besides the mutual information. However, more stringent secrecy measures may be required in certain scenarios. α-divergence-based secrecy measures were introduced by Yu and Tan [64] as a solution to this problem.
Observe that secrecy measured by D_α(P_{Z|M} ‖ U_n | M) for α ≥ 1 is stronger than mutual-information-based secrecy. This is because for α ≥ 1, by monotonicity of the Rényi divergence in its order, D_α(P_{Z|M} ‖ U_n | M) ≥ D_1(P_{Z|M} ‖ U_n | M) ≥ I(M; Z). Given a wiretap channel with an encoding-decoding scheme, we say that α-secrecy is satisfied if D_α(P_{Z|M} ‖ U_n | M) → 0 as n → ∞. The following theorem establishes that it is possible to achieve the rate C(δ_b) − S_α^{β_{δe}} while guaranteeing α-secrecy. Evidently, to achieve a stringent version of secrecy, it is necessary to reduce the rate of the message. The capacity of the (δ_b, δ_e)-wiretap channel is h(δ_e) − h(δ_b), while the highest known rate that assures both α-secrecy and reliability is h_α(δ_e) − h(δ_b). Hence, to achieve α-secrecy, we must give up h(δ_e) − h_α(δ_e) of the attainable rate.
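The monotonicity underlying this claim is easy to check numerically. Below is a minimal sketch (the toy distribution p and the grid of orders are arbitrary choices for illustration, not taken from the text):

```python
import math

def renyi_div(p, q, alpha):
    """Renyi divergence D_alpha(p || q) in bits; alpha = 1 recovers the KL divergence."""
    if alpha == 1:
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

n = 3
u = [2 ** -n] * 2 ** n                                  # uniform distribution on H_3
p = [0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05]        # a toy output distribution

orders = [1, 1.5, 2, 4, 10]
divs = [renyi_div(p, u, a) for a in orders]
# D_alpha(p || u) is nondecreasing in alpha, so a secrecy criterion with any
# alpha >= 1 dominates the KL-based (alpha = 1) criterion that bounds I(M; Z).
assert all(d2 >= d1 - 1e-12 for d1, d2 in zip(divs, divs[1:]))
```

In particular, driving D_α(P_{Z|M} ‖ U_n | M) to zero for any α > 1 forces the KL divergence, and hence I(M; Z), to zero as well.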

BALL NOISE AND ERROR PROBABILITY OF DECODING
This section focuses on achieving the best possible smoothing with respect to ball noise.As an application, we show that codes that possess good smoothing properties with respect to ball noise are suitable for error correction in the BSC.
6.1. Ball noise. Recall that perfect smoothing of a sequence of codes is only possible if the rate is greater than the corresponding D_α-smoothing capacity. In addition to characterizing the D_α-smoothing capacities of ball noise, we quantify the best smoothing one can expect at rates below the D_α-smoothing capacity. We will use these results in the next subsection, where we derive upper bounds for the decoding error probability on a BSC. The next theorem summarizes our main result on smoothing with respect to ball noise.

Theorem 6.1. Let (b_{δn})_n be the sequence of ball noise operators, where δn is the radius of the ball. Let δ ∈ [0, 1/2], α ∈ [0, ∞], and let C_n be a code of length n and rate R_n. Then the bounds (19) and (20) hold. There exist sequences of codes of rate R_n → R that achieve asymptotic equality in (19) for all R > 1 − h(δ). At the same time, if R < 1 − h(δ), then there exist sequences of codes achieving asymptotic equality in (20).
Proof. The inequality in (19) is trivial. Let us prove that asymptotically it can be achieved with equality. From Theorem 3.3, there exists a sequence of codes with the required smoothing property. Hence the equality case in (19) is achievable for all α ∈ [0, ∞].
Let us prove (20). From Lemma 3.2, we obtain the stated lower bound. We are left to show that for R < 1 − h(δ), (20) can be achieved with equality in the limit of large n. We use a random coding argument to prove this. Let C_n be an (n, 2^{nR_n}) code whose codewords are chosen independently and uniformly. In Eq. (30), Appendix B, we define the expected norm of the noisy function.
Here we use this quantity for the ball noise kernel. For α ∈ [0, ∞), define Q_n(α) as in (30). Using Lemma B.2, we prove (21) for rational values of α by induction. Let α ∈ [1, 2] be rational, so that α = 1 + p/q with p ≤ q. Since Q_n(·) ≤ 1 when its argument is less than 1, we can bound (21) accordingly. Now assume that (21) holds for all rational α ∈ [1, m] for some integer m ≥ 2 and prove that in this case it holds also for α ∈ (m, m + 1]; this follows from the induction hypothesis. Therefore, for every rational α ∈ (1, ∞) there exists a sequence of codes satisfying the desired norm bound, which is equivalent to the equality in (20).
The above theorem characterizes the D α -smoothing capacities with respect to ball noise.
Corollary 6.2. Let (b_{δn})_n be a sequence of ball noise operators, where δn is the radius corresponding to the n-th kernel. Then the D_α-smoothing capacity with respect to this sequence equals 1 − h(δ) for all α ∈ [0, ∞].

Norms of T_{b_t} f_C can be used to bound the decoding error probability on a BSC. While estimating these norms for a given code is generally complicated, the second norm affords a compact expression based on the distance distribution of the code. In the next subsection, we bound the decoding error probability using the second norm of T_{b_t} f_C. The following proposition provides closed-form expressions for 2^n ‖T_{b_t} f_C‖_2^2.

Proposition 6.3. 2^n ‖T_{b_t} f_C‖_2^2 = (2^n / (|C| V_t^2)) Σ_{i=0}^n μ_t(i) A_i = (1/V_t^2) Σ_{j=0}^n L_t(j)^2 A_j^⊥,
where µ t (i) is defined in (1) and L t is the Lloyd polynomial of degree t (26).
The proof is immediate from Proposition A.1 in combination with (27) and (28).
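For a small code, the distance-distribution form of Proposition 6.3 can be checked against a brute-force computation. A sketch, assuming the identity 2^n ‖T_{b_t} f_C‖_2^2 = (2^n/(|C| V_t^2)) Σ_i μ_t(i) A_i as stated above, with the [7,4] Hamming code and t = 1 as illustrative choices:

```python
from itertools import product

n, t = 7, 1
G = [[1,0,0,0,0,1,1],
     [0,1,0,0,1,0,1],
     [0,0,1,0,1,1,0],
     [0,0,0,1,1,1,1]]                       # a generator of the [7,4] Hamming code

add = lambda u, v: tuple(a ^ b for a, b in zip(u, v))
code = set()
for m in product([0, 1], repeat=4):
    c = tuple([0] * n)
    for mi, row in zip(m, G):
        if mi:
            c = add(c, tuple(row))
    code.add(c)

space = list(product([0, 1], repeat=n))
wt = lambda x: sum(x)
V_t = sum(1 for x in space if wt(x) <= t)   # volume of a ball of radius t

# brute force: T_{b_t} f_C (x) = |C ∩ B(x,t)| / (|C| V_t)
Tf = lambda x: sum(1 for c in code if wt(add(x, c)) <= t) / (len(code) * V_t)
lhs = 2 ** n * sum(Tf(x) ** 2 for x in space)

# distance distribution A_i and intersection volumes mu_t(i)
A = [0] * (n + 1)
for c1 in code:
    for c2 in code:
        A[wt(add(c1, c2))] += 1
A = [a / len(code) for a in A]

def mu(i):                                   # mu_t(i) = |B(0,t) ∩ B(x,t)|, |x| = i
    x = tuple([1] * i + [0] * (n - i))
    return sum(1 for y in space if wt(y) <= t and wt(add(y, x)) <= t)

rhs = 2 ** n * sum(mu(i) * A[i] for i in range(n + 1)) / (len(code) * V_t ** 2)
assert abs(lhs - rhs) < 1e-9
assert abs(lhs - 1.0) < 1e-9    # the Hamming code is perfect: exact smoothing at t = 1
```

Since the Hamming code is perfect, the radius-1 balls around its codewords tile H_7, and both sides equal 1, the value attained by the uniform distribution.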
6.2. Probability of decoding error on a BSC(δ). The idea that smoothing of codes under some conditions implies good decoding performance has appeared in a number of papers using different language. Smoothing of capacity-achieving codes was considered in [51, 42]. Hązła et al. [27] showed that if a code (sequence) is perfectly smoothable with respect to Bernoulli noise, then the dual code is good for decoding (see Theorem C.3, Corollary C.4). Going from smoothing to decodability involves representing the D_2-smoothness of codes with respect to Bernoulli noise as a potential energy form and comparing it to the Bhattacharyya bound for the dual codes. One limitation of this approach is that it cannot infer decodability for rates R > 1 − log(1 + 2√(δ(1 − δ))) (this is the region above the blue solid curve in Figure 2). Rao and Sprumont [43] and Hązła [26] proved that sufficient smoothing of codes implies decodability of the codes themselves rather than of their duals. However, these results are concerned with list decoding at rates above the Shannon capacity, resulting in exponential list size, which is arguably less relevant from the perspective of communication.
Except for [43], the cited papers utilize perfect or near-perfect smoothing to infer decodability. For codes whose rates are below capacity, perfect smoothing is impossible. At the same time, codes that possess sufficiently good smoothing properties are good for decoding. This property is at the root of the results for list decoding in [43]; however, their bounds were insufficient to draw conclusions about list decoding below capacity.
Consider a channel where, for the input X ∼ f_C, the output Y is given by Y = X + W with W ∼ b_t. Define F_t(y) := |C ∩ B(y, t)| to be the number of codewords in the ball B(y, t). For a received vector y, the number of codewords that could have produced y is given by F_t(y). Intuitively, the decoding error is small if F_t(y) ≈ 1 for typical errors. Therefore, F_t is of paramount interest in decoding problems. Since the typical errors for ball noise and Bernoulli noise are almost the same, this allows us to obtain a bound for decodability on the BSC. Using this approach, we show that the error probability of decoding on a BSC(δ) can be expressed via the second moment of the number of codewords in a ball of radius t ≈ δn.
Assume without loss of generality that C is a linear code and that 0^n is used for transmission. Let Y be the random Bernoulli vector of errors, and note that Y ∼ β_δ. The calculation below does not depend on whether we rely on unique or list decoding within a ball of radius t, so let us assume that the decoder outputs L ≥ 1 candidate codewords conditioned on the received vector y, which is a realization of Y.
In this case, the list decoding error can be bounded as follows.

Theorem 6.4. Let t and t′ be integers such that 0 < t′ < t < n. Then for any L ≥ 1, the bound (24) holds.

Proof. Define S_{t′,t} = B(0, t) \ B(0, t′). Clearly, the error probability splits into two terms. Let us estimate the first of these probabilities.
Remark 7. In the case of L = 1, the bound in (24) can be considered a slightly weaker version of Poltyrev's bound [38, Lemma 1]. By allowing this weakening, we obtain a bound in a somewhat more closed form, also connecting decodability with smoothing. We also prove a simple bound for the error probability of list decoding expressed in terms of the code's distance distribution (and, by (28), also in terms of the dual distance distribution). The latter result seems not to have appeared in earlier literature.
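The role of F_t(y) in decoding can be illustrated by simulation. A sketch on the [7,4] Hamming code over a BSC(0.05), where the decoder outputs the candidate list C ∩ B(y, t) (the code and the parameters are illustrative choices, not taken from the text):

```python
import random
from itertools import product

random.seed(1)
n, t, delta = 7, 1, 0.05
G = [[1,0,0,0,0,1,1],
     [0,1,0,0,1,0,1],
     [0,0,1,0,1,1,0],
     [0,0,0,1,1,1,1]]                      # [7,4] Hamming code
code = set()
for m in product([0, 1], repeat=4):
    c = [0] * n
    for mi, row in zip(m, G):
        if mi:
            c = [a ^ b for a, b in zip(c, row)]
    code.add(tuple(c))

def F_list(y):                              # C ∩ B(y, t): the decoder's candidate list
    return [c for c in code if sum(a ^ b for a, b in zip(c, y)) <= t]

zero = tuple([0] * n)                       # linear code: assume 0^n was transmitted
trials, errors = 20000, 0
for _ in range(trials):
    y = tuple(1 if random.random() < delta else 0 for _ in range(n))
    if F_list(y) != [zero]:                 # error unless the list is exactly {0^n}
        errors += 1

# Exact error probability: the code is perfect with t = 1, so an error
# occurs if and only if the Bernoulli noise has weight >= 2.
p_exact = 1 - (1 - delta) ** n - n * delta * (1 - delta) ** (n - 1)
assert abs(errors / trials - p_exact) < 0.01
```

For this perfect code, F_t(y) = 1 for every y, so the decoder fails exactly when the noise leaves the ball of radius t around the transmitted word, matching the intuition behind Theorem 6.4.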
The following version of this lemma provides an error bound, which is useful in the asymptotic setting.
By the Hoeffding bound, the probability that the weight of Y exceeds t is exponentially small in n. Together with Theorem 6.4, this implies our statements.
A question of prime importance is whether the right-hand side quantities in Proposition 6.5 converge to 0. For R < 1 − h(δ), one can easily check that for random codes Σ_{w=1}^n (μ_t(w)/V_t) A_w = 2^{−Θ(n)}, where t = δn + n^θ, showing that this is in fact the case.
From Proposition 6.3, it is clear that the potential energy Σ_{w=1}^n μ_t(w) A_w is a measure of the smoothness of T_{b_t} f_C. This implies that codes that are sufficiently smoothable with respect to b_t are decodable on the BSC with vanishing error probability. In other words, Proposition 6.5 establishes a connection between smoothing and the decoding error probability.

PERFECT SMOOTHING-THE FINITE CASE
In this section, we briefly overview another form of perfect smoothing, which is historically the earliest application of these ideas in coding theory.It is not immediately related to the information-theoretic problems considered in the other parts.
We are interested in radial kernels that yield perfect smoothing for a given code. We call ρ(r) := max{i : r(i) ≠ 0} the radius of r. Note that the logarithm of the support size of r (as a function on the space H_n) is exactly the 0-Rényi entropy of r. Therefore, kernels with smaller radii can be perceived as less random, supporting the view of the radius ρ(r) as a general measure of randomness.

Definition 7.1. We say a code C is perfectly smoothable with respect to r if T_r f_C(x) = 2^{−n} for all x ∈ H_n, and in this case we say that r is a perfectly smoothing kernel for C.
Intuitively, such a kernel should have a sufficiently large radius. In particular, it should be at least as large as the covering radius ρ(C) of the code; otherwise, smoothing does not affect the vectors that are at distance ρ(C) from the code. To obtain a stronger condition, recall that the external distance of a code C is the number of indices i ≥ 1 with A_i^⊥ ≠ 0.

Proof. Perfect smoothing of C with respect to r is equivalent to ‖T_r f_C‖_2^2 = 2^{−n}, which by Proposition A.1 is equivalent to the condition that r̂(j) = 0 for every j ≥ 1 with A_j^⊥ ≠ 0. Since r̂, viewed as a function of j, is a combination of the Krawtchouk polynomials K_0, …, K_{ρ(r)} and thus a polynomial of degree ρ(r), it has at most ρ(r) zeros, and the claim follows.

Since the covering radius never exceeds the external distance, this inequality strengthens the obvious condition ρ(r) ≥ ρ(C). At the same time, there are codes that are perfectly smoothable by a radial kernel r such that ρ(r) = ρ(C).

Definition 7.2 ([50]). A code C is uniformly packed in the wide sense if there exist rational numbers {α_i}_{i=0}^ρ such that Σ_{i=0}^ρ α_i A_i(x) = 1 for all x ∈ H_n, where A_i(x) is the weight distribution of the code C − x.
Our main observation here is that some uniformly packed codes are perfectly smoothable with respect to noise kernels that are minimal in a certain sense. The following proposition states this more precisely.

Proposition 7.2. Let C be a code that is perfectly smoothable by a radial kernel r of radius ρ(r) = ρ(C). Then C is uniformly packed in the wide sense with α_i ≥ 0 for all i.
Proof. By definition, if C is perfectly smoothable with respect to r, then 2^n T_r f_C = 1, which is tantamount to Σ_{i=0}^{ρ(C)} (2^n r(i)/|C|) A_i(x) = 1 for all x ∈ H_n. Thus the definition is satisfied with α_i = 2^n r(i)/|C| ≥ 0, completing the proof.
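A concrete sanity check of this proof (a sketch; the [7,4] Hamming code, with covering radius 1, is the simplest example): the uniform kernel on the ball of radius 1 smooths the code perfectly, and the quantities α_i = 2^n r(i)/|C| witness uniform packing.

```python
from itertools import product

n = 7
# [7,4] Hamming code: the columns of H are the numbers 1..7 in binary
H = [[0,0,0,1,1,1,1],
     [0,1,1,0,0,1,1],
     [1,0,1,0,1,0,1]]
space = list(product([0, 1], repeat=n))
code = [x for x in space
        if all(sum(h * xi for h, xi in zip(row, x)) % 2 == 0 for row in H)]
wt = lambda x: sum(x)

rho, V1 = 1, 1 + n                                # covering radius, |B(*, 1)|
r = lambda x: 1.0 / V1 if wt(x) <= rho else 0.0   # radial kernel with radius rho(C)

# T_r f_C is exactly uniform: the code is perfectly smoothable
Tf = [sum(r(tuple(a ^ b for a, b in zip(x, c))) for c in code) / len(code)
      for x in space]
assert all(abs(v - 2 ** -n) < 1e-12 for v in Tf)

# ... and alpha_i = 2^n r(i) / |C| certify uniform packing in the wide sense
alpha = [2 ** n * (1.0 / V1 if i <= rho else 0.0) / len(code) for i in range(n + 1)]
for x in space:
    Ax = [0] * (n + 1)
    for c in code:
        Ax[wt(tuple(a ^ b for a, b in zip(x, c)))] += 1
    assert abs(sum(a * b for a, b in zip(alpha, Ax)) - 1) < 1e-12
```

Here α_0 = α_1 = 2^7/(8 · 16) = 1 and A_0(x) + A_1(x) = 1 for every x, reflecting the fact that the balls of radius 1 around the codewords tile H_7.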
To illustrate this claim, we list several families of uniformly packed codes ([50, 20]; see also [59]) that are perfectly smoothable by a kernel of radius equal to the covering radius of the code.
(ii) 2-error-correcting BCH codes of length 2^{2m+1}, m ≥ 2. (iii) Preparata codes. (iv) Binary (2^m − 1, 2^{2^m − 3m + 2}, 7) Goethals-like codes [20]. In each case, the smoothing kernel r is a radial kernel of radius ρ(C); here L is a generic notation for the normalizing factor. More examples are found in the related class of completely regular codes [9]. Definition 7.2 does not include the condition that α_i ≥ 0, and in fact there are codes that are uniformly packed in the wide sense for which some of the α_i's are negative; such codes are not smoothable by a noise kernel of radius ρ(C). One such family is the 3-error-correcting binary BCH codes of length 2^{2m+1}, m ≥ 2 [20].
APPENDIX A. FOURIER ANALYSIS ON THE HAMMING SPACE
The Fourier transform of a function f : H_n → R is a function on the dual group of H_n, which we identify with H_n:
(25) f̂(y) = Σ_{x∈H_n} f(x) (−1)^{⟨x,y⟩}.
The Fourier transform of the indicator function of the sphere is given by ½̂_{S(0,t)}(y) = K_t(|y|), where K_t is the Krawtchouk polynomial of degree t. Given a code C ⊂ H_n, we define the dual distance distribution of C as the set of numbers A_j^⊥ := (1/|C|) Σ_{i=0}^n A_i K_j(i), 0 ≤ j ≤ n, where (A_i) is the distance distribution defined in (2). Note that when C is linear, the set (A_j^⊥)_{j=0}^n coincides with the distance distribution of its dual code C^⊥. For a radial potential V on H_n and a code C we have
(28) Σ_{i=0}^n A_i V(i) = (|C|/2^n) Σ_{j=0}^n V̂(j) A_j^⊥.
The L_2-smoothness of a noisy code distribution can be written in terms of the distance distribution or of the dual distance distribution.
Proposition A.1. Let C be a code and let r be a noise kernel. Then
2^n ‖T_r f_C‖_2^2 = (2^n/|C|) Σ_{i=0}^n (r ∗ r)(i) A_i = Σ_{j=0}^n r̂(j)^2 A_j^⊥.
Proof. The first equality follows by expanding ‖T_r f_C‖_2^2 = (1/|C|^2) Σ_{c,c′∈C} (r ∗ r)(c − c′) and grouping the codeword pairs by their distance. The second equality is immediate by noticing that the Fourier transform of r ∗ r is r̂^2 and using (28).
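The dual-side identity in Proposition A.1 can be confirmed numerically. A sketch, assuming 2^n ‖T_r f_C‖_2^2 = Σ_j r̂(j)^2 A_j^⊥ as stated above, with the [7,4] Hamming code and a Bernoulli(0.1) kernel as illustrative choices (for Bernoulli noise r̂(j) = (1 − 2δ)^j, and the dual distance distribution is that of the [7,3] simplex code):

```python
from itertools import product

n, delta = 7, 0.1
H = [[0,0,0,1,1,1,1],
     [0,1,1,0,0,1,1],
     [1,0,1,0,1,0,1]]
space = list(product([0, 1], repeat=n))
wt = lambda x: sum(x)
inner = lambda x, y: sum(a * b for a, b in zip(x, y)) % 2
code = [x for x in space if all(inner(row, x) == 0 for row in H)]
dual = [x for x in space if all(inner(c, x) == 0 for c in code)]

r = lambda z: delta ** wt(z) * (1 - delta) ** (n - wt(z))   # Bernoulli kernel (pmf)

# left-hand side: 2^n || T_r f_C ||_2^2 by direct summation
Tf = lambda x: sum(r(tuple(a ^ b for a, b in zip(x, c))) for c in code) / len(code)
lhs = 2 ** n * sum(Tf(x) ** 2 for x in space)

# right-hand side: sum_j rhat(j)^2 A_j^perp, with A^perp the dual weight distribution
A_perp = [0] * (n + 1)
for d in dual:
    A_perp[wt(d)] += 1
rhat = lambda j: (1 - 2 * delta) ** j      # Fourier transform of the Bernoulli kernel
rhs = sum(A_perp[j] * rhat(j) ** 2 for j in range(n + 1))
assert abs(lhs - rhs) < 1e-9
```

Only the weights j with A_j^⊥ ≠ 0 contribute, which is exactly the mechanism behind the external-distance condition for perfect smoothing in Section 7.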

APPENDIX B. PROOF OF THEOREM 3.3
We will first establish Theorem 3.3 when α is rational, and then use a density argument to extend the proof to all real numbers.The case α = ∞ is handled separately at the end of this appendix.
We will use the following technical claim.
Lemma B.1. Let x and y be two non-negative reals, and let p and q be positive integers. Then (x + y)^{p/q} ≤ Σ_{k=0}^p (C(p, k) x^k y^{p−k})^{1/q}.

Proof. Clearly (x + y)^p = Σ_{k=0}^p C(p, k) x^k y^{p−k}, and the claim follows since z ↦ z^{1/q} is subadditive on sums of non-negative terms.

For M ≥ 1 let C = (c_0, c_1, …, c_{M−1}) be a code whose codewords are chosen randomly and independently from H_n. For α ∈ [0, ∞), define the expected norm Q_n(α) as in (30). In the next lemma, we obtain a recursive bound for Q_n. We will then use an induction argument to show the full result.
Lemma B.2. Let α = p/q + 1 and let C ⊂ H_n be a random code of size M = 2^{nR}. Then the recursive bound on Q_n(α) holds.

Proof. In the calculation below we write E for the expectation over the choice of C. Starting with (30), we obtain an expansion in which c_i, i = 1, …, M, are random codewords of the code C. Recalling that ‖r(· − c_i)‖_a^a = ‖r‖_a^a for any a > 0, we continue the estimate, where we used (5) and the fact that r is a pmf.
Proof. This statement is true for all 0 ≤ α < 1, and hence for all rational α in [0, 1). Assume that it holds for all rational α in [0, m), where m is a positive integer. Let α ∈ [m, m + 1) and choose integers p ≥ 0 and q ≥ 1 such that α = 1 + p/q. By Lemma B.2, the recursive bound holds for all k > 0. Since p/q < m, by the induction hypothesis we have lim sup_n Q_n((p − k)/q) ≤ 1 for k = 0, 1, …, p. Therefore, all the terms except the one with k = 0 vanish, yielding lim sup_n Q_n(α) ≤ 1.
Proof. From the monotonicity of Rényi entropies, for α < α′ we obtain the corresponding inequality. Now let us choose a subsequence (r_{n_k})_k along which the limit is attained. Note that H_α is a continuous function of the order α for α > 1. We use the mean value theorem to claim that there is a value γ_k ∈ (α, α′) at which the increments agree. Taking these remarks together, we conclude that π(α) is continuous on (1, ∞).
It remains to address the case α = ∞. We obtain the following upper bound, whose proof is inspired by [64].

APPENDIX C. SAMORODNITSKY'S INEQUALITIES AND THEIR IMPLICATIONS
Samorodnitsky [47, 49] recently proved certain powerful inequalities for α-norms of noisy functions, which permit us to estimate proximity to uniformity under the action of Bernoulli noise kernels. We state some of them in this appendix after introducing a few more elements of notation. These results are used in Theorem 5.2 and in Appendix D, where we prove Lemma 4.4.
For a subset Γ ⊂ [n], write x|_Γ to denote the coordinate projection of a vector x ∈ H_n on Γ. If the subset Γ is formed by random choice with Pr(i ∈ Γ) = λ independently for all i ∈ [n], we write Γ ∼ λ. The entropy of a function f : H_n → R is defined as Ent(f) := E[f log f] − E[f] log E[f], where the expectation is taken with respect to the uniform measure on H_n. This quantity can be thought of as the KL divergence between the distribution induced by f on H_n and the uniform distribution: if f itself is a pmf, then D(f ‖ U_n) = 2^n Ent(f) = Ent(2^n f).
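With Ent(f) = E[f log f] − E[f] log E[f] under the uniform measure (as above), these identities are easy to verify. A small numeric sketch with an arbitrary pmf on H_3 (the pmf is an illustrative choice):

```python
import math

n = 3
N = 2 ** n
f = [0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05]    # a toy pmf on H_3

def Ent(g):
    """Ent(g) = E[g log g] - E[g] log E[g], expectation uniform on H_n, log base 2."""
    Eg = sum(g) / N
    Eglg = sum(x * math.log2(x) for x in g if x > 0) / N
    return Eglg - Eg * math.log2(Eg)

D = sum(p * math.log2(p * N) for p in f if p > 0)   # D(f || U_n) directly
assert abs(D - N * Ent(f)) < 1e-12                  # D(f || U_n) = 2^n Ent(f)
assert abs(D - Ent([N * p for p in f])) < 1e-12     # D(f || U_n) = Ent(2^n f)
```

Both equalities follow from E[f] = 2^{−n} for a pmf f, which makes the subtracted term in Ent(f) supply exactly the +n bits separating Σ f log f from the divergence to uniform.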

1.1. Notation. Throughout this paper, H_n is the binary n-dimensional Hamming space.

Balls and spheres. Denote by B(x, t) := {y ∈ H_n : |y − x| ≤ t} the metric ball of radius t in H_n with center at x, and denote by S(x, t) := {y ∈ H_n : |y − x| = t} the sphere of radius t. Let V_t = |B(x, t)| be the volume of the ball, and let μ_t(i) be the intersection volume of two balls of radius t whose centers are distance i apart:
(1) μ_t(i) = |B(0, t) ∩ B(x, t)|, where |x| = i.

Codes and distributions. A code C is a subset of H_n. The rate and distance of the code are denoted by R(C) := log |C|/n and d(C), respectively. Let
(2) A_i := (1/|C|) |{(x, y) ∈ C × C : |x − y| = i}|, 0 ≤ i ≤ n,
be the distance distribution of C.
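The intersection volume μ_t(i) in (1) admits a standard combinatorial expression, stated here as an illustration (an assumption, not taken from the text): splitting the ones of y into a positions inside the support of x and b positions outside gives |y| = a + b and |y − x| = (i − a) + b. A brute-force check:

```python
from math import comb
from itertools import product

n, t = 8, 3

def mu_brute(i):                         # mu_t(i) by direct enumeration of H_n
    x = [1] * i + [0] * (n - i)
    return sum(1 for y in product([0, 1], repeat=n)
               if sum(y) <= t and sum(a ^ b for a, b in zip(y, x)) <= t)

def mu_formula(i):
    # a ones of y inside supp(x), b ones outside: |y| = a+b, |y - x| = (i-a)+b
    return sum(comb(i, a) * comb(n - i, b)
               for a in range(i + 1) for b in range(n - i + 1)
               if a + b <= t and (i - a) + b <= t)

assert all(mu_brute(i) == mu_formula(i) for i in range(n + 1))
V_t = sum(comb(n, w) for w in range(t + 1))
assert mu_formula(0) == V_t              # coinciding centers give the full ball
```

In particular, μ_t(0) = V_t and μ_t(i) = 0 once i > 2t, since two balls of radius t with centers further than 2t apart are disjoint.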

FIGURE 1. Capacities and achievable rates for perfect smoothing. The lowermost curve gives the Shannon capacity of the BSC(δ), the second curve from the bottom is the smoothing threshold for duals of BEC capacity-achieving codes, the third one is S_2^{β_δ}, and the top one is S_∞^{β_δ}.
A communicates with B by sending messages M chosen from a finite set M. Communication from A to B occurs over a BSC W_b with crossover probability δ_b, and it is observed by the eavesdropper E via another BSC W_e with crossover probability δ_e > δ_b. A message M ∈ M is encoded into a bit sequence X ∈ H_n and sent from A to B in n uses of the channel W_b. Terminal B observes the sequence Y = X + W_b, where W_b ∼ β_{δ_b} is the noise vector, while terminal E observes the sequence Z = X + W_e with W_e ∼ β_{δ_e}.
The scheme uses a pair of nested linear codes. To describe it, let C_e and C_b be two linear codes such that C_e ⊂ C_b and |M| = |C_b|/|C_e|. We assign each message m to a unique coset of C_e in C_b. The sequence transmitted by A is a uniform random vector from the corresponding coset. As long as the rate of the code C_b is below the capacity of W_b, we can ensure the reliability of communication from A to B.
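A minimal sketch of this coset encoder (the generator matrix and the split into message and randomness rows are illustrative choices): C_b is the [7,4] Hamming code, C_e is spanned by its last two generator rows, so |M| = 4 and each message indexes a coset of C_e in C_b.

```python
import random

random.seed(0)
n = 7
G_b = [[1,0,0,0,0,1,1],     # rows 0-1 select the coset (carry the message)
       [0,1,0,0,1,0,1],
       [0,0,1,0,1,1,0],     # rows 2-3 span C_e (carry the local randomness)
       [0,0,0,1,1,1,1]]

def encode(m):
    """Map m in {0,...,3} to a uniformly random vector of the m-th coset of C_e."""
    bits = [(m >> 1) & 1, m & 1, random.randint(0, 1), random.randint(0, 1)]
    c = [0] * n
    for b, row in zip(bits, G_b):
        if b:
            c = [x ^ y for x, y in zip(c, row)]
    return tuple(c)

def decode(c):
    """Identify the coset; with this systematic G_b the message bits are c[0], c[1]."""
    return (c[0] << 1) | c[1]

# without channel noise, every element of a coset decodes to its message
for m in range(4):
    for _ in range(10):
        assert decode(encode(m)) == m
```

The randomness rows are zero in the first two (systematic) coordinates, so the coset, and hence the message, is read off directly; over a noisy channel B would first decode to the nearest codeword of C_b before extracting the coset.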

Theorem 5.2. Let ((C_e^n)^⊥)_n and (C_b^n)_n be sequences of linear codes that achieve capacity of the BEC for their respective rates. Suppose that d(C_b^n) = ω(log n), d((C_e^n)^⊥) = ω(log n), R_b < 1 − log(1 + 2√(δ_b(1 − δ_b))), and R_e > 4δ_e(1 − δ_e). Then the nested coding scheme based on C_e^n and C_b^n can transmit messages with rate R_b − R_e from A to B, satisfying the reliability and strong secrecy conditions.

Proof. From Corollary C.4, the conditions d(C_b^n) = ω(log n) and R_b < 1 − log(1 + 2√(δ_b(1 − δ_b))) guarantee transmission reliability. Furthermore, by Theorem 4.5, the conditions d((C_e^n)^⊥) = ω(log n) and R_e > 4δ_e(1 − δ_e) imply that D(T_{δe} f_{Ce} ‖ U_n) → 0, which in turn implies strong secrecy by Lemma 5.1.

Theorem 5.3. Let C_e^n and C_b^n be two sequences of RM codes satisfying C_e^n ⊂ C_b^n whose rates approach R_e > 0 and R_b > 0, respectively. If R_b < 1 − h(δ_b) and R_e > 4δ_e(1 − δ_e), then the nested coding scheme based on C_e^n and C_b^n supports transmission over a BSC wiretap channel (δ_b, δ_e) with rate R_b − R_e, guaranteeing communication reliability and strong secrecy.
The only assumption required here is that the codes C_e^⊥ and C_b have good erasure correction properties.

Proposition 7.1. Let r be a perfectly smoothing kernel for the code C. Then ρ(r) is at least as large as the external distance of C.
FIGURE 2. Achievable rates in the BSC wiretap channel with BEC capacity-achieving codes. The bottom curve is the lower bound on the code rate that guarantees decodability on a BSC(δ). The middle curve shows the Shannon capacity, and the top one is the D_1-smoothing threshold for Bernoulli noise T_δ.
Let K_t denote the Krawtchouk polynomial of degree t [4]. Then the Fourier transform of the indicator of the ball is ½̂_{B(0,t)}(y) = L_t(|y|), where
(26) L_t(x) := Σ_{i=0}^t K_i(x)
is called the Lloyd polynomial [19, p. 64]. The intersection of balls in (1) can be written as the convolution ½_{B(0,t)} ∗ ½_{B(x,t)}, which implies the expression [4, Lemma 4.1]
(27) μ_t(i) = 2^{−n} Σ_{j=0}^n L_t(j)^2 K_j(i).
Together with Lemma 3.2, to prove Theorem 3.3 we need to prove the following.

Theorem B.3. Consider a sequence of ensembles of random codes of increasing length n and rate R_n.