Polar Codes for Covert Communications over Asynchronous Discrete Memoryless Channels

This paper introduces an explicit covert communication code for binary-input asynchronous discrete memoryless channels based on binary polar codes, in which legitimate parties exploit uncertainty created by both the channel noise and the time of transmission to avoid detection by an adversary. The proposed code jointly ensures reliable communication for a legitimate receiver and low probability of detection with respect to the adversary, both observing noisy versions of the codewords. Binary polar codes are used to shape the weight distribution of codewords and ensure that the average weight decays as the block length grows. The performance of the proposed code is severely limited by the speed of polarization, which in turn controls the decay of the average codeword weight with the block length. Although the proposed construction falls largely short of achieving the performance of random codes, it inherits the low-complexity properties of polar codes.


Notation
Before describing the asynchronous covert communication model, we briefly introduce the notation used throughout the paper. Random variables are denoted by upper case letters, e.g., $X$, and their realizations by lower case letters, e.g., $x$. Sets are denoted with calligraphic fonts, e.g., $\mathcal{X}$. Vectors of length $n$ are denoted as $X^{1:n} = (X_1, \cdots, X_n)$ and $x^{1:n} = (x_1, \cdots, x_n)$ when the length needs to be explicit, and by boldface fonts, e.g., $\mathbf{X}$ and $\mathbf{x}$, when the length can be inferred from the context without ambiguity. When multiple blocks of length $n$ are used, we denote the block index as a subscript, e.g., $X^{1:n}_{1:b}$ denotes a sequence of $b$ blocks of length $n$. The function $\log$ is understood in base 2, while $\ln$ denotes the logarithm to base $e$. For two distributions $P, Q$ on some countable set $\mathcal{X}$, we write the Kullback-Leibler divergence and the total variation distance as
$$\mathbb{D}(P\|Q) \triangleq \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)} \quad\text{and}\quad \mathbb{V}(P,Q) \triangleq \frac{1}{2}\sum_{x\in\mathcal{X}}\left|P(x)-Q(x)\right|,$$
respectively. We also denote by $P^{\otimes n}$ the product distribution $P^{\otimes n}(x) \triangleq \prod_{i=1}^{n} P(x_i)$ for $x \in \mathcal{X}^n$. We make repeated use of the Landau notation. In particular, for two real-valued functions $f(n)$ and $g(n)$ of $n \in \mathbb{N}$, we write $f(n) = o(g(n))$ if $\forall \alpha > 0$ $\exists n_0 \in \mathbb{N}^*$ such that $\forall n \geq n_0$, $|f(n)| \leq \alpha|g(n)|$; $f(n) = O(g(n))$ if $\exists \alpha > 0$ $\exists n_0 \in \mathbb{N}^*$ such that $\forall n \geq n_0$, $|f(n)| \leq \alpha|g(n)|$; and $f(n) = \omega(g(n))$ if $\forall \alpha > 0$ $\exists n_0 \in \mathbb{N}^*$ such that $\forall n \geq n_0$, $|f(n)| \geq \alpha|g(n)|$.
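For concreteness, both quantities are easy to compute on finite alphabets. The following minimal Python sketch implements the two definitions above; the example distributions are arbitrary.

import numpy as np

def kl_divergence(P, Q):
    """D(P||Q) in bits over a common finite alphabet, with 0*log(0/q) = 0;
    returns inf if P is not absolutely continuous w.r.t. Q."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    support = P > 0
    if np.any(Q[support] == 0):
        return np.inf
    return float(np.sum(P[support] * np.log2(P[support] / Q[support])))

def total_variation(P, Q):
    """V(P,Q) = (1/2) * sum_x |P(x) - Q(x)|."""
    return 0.5 * float(np.abs(np.asarray(P, dtype=float) - np.asarray(Q, dtype=float)).sum())

# Example: adversary's output distributions with and without a sparse covert input.
Q0, Q_alpha = [0.9, 0.1], [0.885, 0.115]
print(kl_divergence(Q_alpha, Q0), total_variation(Q_alpha, Q0))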
The polarization kernel matrix $G_2 \triangleq \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$ will be merely denoted $G$. We denote by $G^{\otimes\nu}$ the matrix representing the recursive transformation over $\nu$ levels of polarization. Thus, the corresponding polar code is of length $n = 2^\nu$. Since the length of binary polar codes is a power of two, we restrict our attention to block lengths $n \in \mathcal{D} \triangleq \{2^\nu : \nu \in \mathbb{N}^*\}$.
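The recursive transformation is simply the $\nu$-fold Kronecker power of the kernel, reduced modulo 2. A short Python sketch, assuming the standard kernel orientation above, illustrates the construction and the fact that $G^{\otimes\nu}$ is its own inverse over GF(2):

import numpy as np

def polar_transform_matrix(nu):
    """Return G^{⊗nu}, the nu-fold Kronecker power of the kernel G over GF(2)."""
    G = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    M = np.array([[1]], dtype=np.uint8)
    for _ in range(nu):
        M = np.kron(M, G) % 2
    return M

nu = 3                                      # block length n = 2**nu = 8
G_nu = polar_transform_matrix(nu)
u = np.random.randint(0, 2, size=2 ** nu).astype(np.uint8)
x = u.dot(G_nu) % 2                         # encoding: x = u G^{⊗nu}
assert np.array_equal(x.dot(G_nu) % 2, u)   # G^{⊗nu} is an involution over GF(2)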

Channel Model
The channel model for covert communication is illustrated in Figures 1 and 2. A legitimate transmitter (Alice) attempts to reliably communicate to a legitimate receiver (Bob) over a DMC $(\mathcal{X}, W_{Y|X}, \mathcal{Y})$, while avoiding detection from an adversary (Willie) who observes signals through another DMC $(\mathcal{X}, W_{Z|X}, \mathcal{Z})$. In the remainder of the paper, we restrict our attention to a binary input alphabet $\mathcal{X} = \{0, 1\}$, with 0 representing the innocent input symbol in the absence of communication. We denote by $P_0 \triangleq W_{Y|X=0}$ and $Q_0 \triangleq W_{Z|X=0}$ the output distributions induced by the innocent symbol 0. Similarly, we denote by $P_1 \triangleq W_{Y|X=1}$ and $Q_1 \triangleq W_{Z|X=1}$ the output distributions induced by the symbol 1. We assume that both $P_1$ and $Q_1$ are absolutely continuous with respect to (w.r.t.) $P_0$ and $Q_0$, respectively, to avoid the special situations discussed in Appendix V of [3].

Formally, a message $W \in \llbracket 1, M_n \rrbracket$ with uniform distribution is encoded into a codeword of length $n$, possibly with the help of a secret key $S \in \llbracket 1, K_n \rrbracket$ only known to Alice and Bob but using a public codebook known to all parties; the codeword is hidden within a larger transmission window of size $N > n$, with $N$ a function of $n$, by choosing the starting index $T$ of the codeword uniformly at random in $\llbracket 1, N - n + 1 \rrbracket$. The set of indices corresponding to the codeword forms the codeword window. The sequence transmitted during the transmission window is denoted $X^{1:N}$, and the corresponding observations of Bob and Willie are denoted $Y^{1:N}$ and $Z^{1:N}$, respectively.

It is convenient to introduce the following distributions. The distribution induced at the output of the adversary's channel in a codeword window is denoted $\widehat{Q}^n$. When the codeword is embedded in a transmission window starting at a known index $t$, the distribution induced at the output of the adversary's channel in the transmission window is $\widehat{Q}^N_t \triangleq Q_0^{\otimes(t-1)} \otimes \widehat{Q}^n \otimes Q_0^{\otimes(N-n-t+1)}$. Finally, the distribution induced at the output of the adversary's channel in the transmission window when randomizing the start index $T$ is $\widehat{Q}^N \triangleq \mathbb{E}_T\big[\widehat{Q}^N_T\big]$. Given the secret key $S$ and the observation $Y^{1:N}$, Bob forms an estimate $\widehat{W}$ of the original message $W$, whose performance is measured by the probability of error $P_e^{(n)} \triangleq \mathbb{E}_S\big[\mathbb{P}\big(\widehat{W} \neq W \big| S\big)\big]$. Given the observation $Z^{1:N}$ and the knowledge of Alice's codebook, Willie performs a hypothesis test to determine whether communication took place. Hypothesis $H_0$ corresponds to the absence of communication, in which case the distribution of $Z^{1:N}$ is $Q_0^{\otimes N}$; hypothesis $H_1$ corresponds to communication, in which case the distribution induced by the code over the transmission window is $\widehat{Q}^N$. Note that $\widehat{Q}^N$ can be computed using knowledge of the codebook and the distribution of $T$. The covertness of the transmission is measured by the total variation $V^{(n)} \triangleq \mathbb{V}\big(\widehat{Q}^N, Q_0^{\otimes N}\big)$. A small value of $V^{(n)}$ ensures that the best binary hypothesis test is not significantly better than a "blind" test that ignores the observation $Z^{1:N}$ [3].
Our objective is to construct sequences of codes such that $\lim_{n\to\infty} P_e^{(n)} = 0$ and $\lim_{n\to\infty} V^{(n)} = 0$.
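As a minimal illustration of the model, the following Python sketch embeds a codeword at a uniformly chosen position of an otherwise innocent window and generates Willie's observation; the channel is a hypothetical binary-output DMC whose parameters p0 and p1 are illustrative placeholders, not values from the paper.

import numpy as np
rng = np.random.default_rng(0)

def transmit_window(codeword, N, p0, p1):
    """Embed `codeword` (length n) at a uniformly chosen start index of an
    otherwise all-zero window of size N, then produce Willie's observation:
    input 0 (resp. 1) yields output 1 with probability p0 (resp. p1)."""
    n = codeword.shape[0]
    T = rng.integers(0, N - n + 1)               # uniform start index (0-based here)
    x = np.zeros(N, dtype=np.uint8)              # innocent symbol everywhere else
    x[T:T + n] = codeword
    flip = np.where(x == 1, p1, p0)
    z = (rng.random(N) < flip).astype(np.uint8)  # Willie's noisy observation
    return T, x, z

T, x, z = transmit_window(rng.integers(0, 2, 16, dtype=np.uint8), N=256, p0=0.1, p1=0.6)
print(T, z.sum())   # without side information, z resembles 256 Bernoulli(~0.1) samples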

Main Results
We start by recalling a known result established with a random coding argument, which serves as a benchmark for our code construction.

Proposition 1 (adapted from [6]). Consider sequences of positive numbers $\{\alpha_n\}_{n\in\mathbb{N}^*}$ and $\{\beta_n\}_{n\in\mathbb{N}^*}$ such that $\alpha_n \in \omega\big(\tfrac{1}{\sqrt{n}}\big) \cap o(1)$ and $\beta_n \in o(1)$ as $n$ goes to infinity. Let $N = \frac{2^{n\alpha_n^2}}{\beta_n\alpha_n^2}$. There exist codes of increasing block length $n$ hidden in transmission windows of size $N$ such that
$$\lim_{n\to\infty} \frac{\log M_n}{n\alpha_n} \geq \mathbb{D}(P_1\|P_0), \quad \lim_{n\to\infty} \frac{\log K_n}{n\alpha_n} \leq \big[\mathbb{D}(Q_1\|Q_0) - \mathbb{D}(P_1\|P_0)\big]^+, \quad \lim_{n\to\infty} P_e^{(n)} = 0, \quad \lim_{n\to\infty} V^{(n)} = 0.$$

Proposition 1 states that the number of bits $\log M_n$ scales as $n\alpha_n$ with a constant pre-factor at least equal to $\mathbb{D}(P_1\|P_0)$ for all admissible choices of $\alpha_n$. As $\alpha_n$ increases, so does the scaling of $\log M_n$, but at the expense of increasingly larger monitoring windows. Proposition 1 captures the correct scaling of the transmission window size, the number of message bits, and the number of key bits with the block length $n$, as shown by the converse proof in [5]. While this result has been obtained with a random coding argument, in which codewords are sampled independently according to product distributions, the main contribution of the present paper is to establish a similar result using polar codes in place of random codes.
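To get a feel for these scalings, the following sketch evaluates the window size $N$ and the first-order number of message bits for a few block lengths; the value used for $\mathbb{D}(P_1\|P_0)$ and the choice $\alpha_n = n^{-0.4}$ are illustrative placeholders, not values from the paper.

import numpy as np

def window_size(n, alpha, beta):
    """Transmission window size N = 2^{n alpha^2} / (beta alpha^2) of Proposition 1."""
    return 2.0 ** (n * alpha ** 2) / (beta * alpha ** 2)

def covert_bits(n, alpha, D10):
    """First-order number of message bits, log M_n ≈ n alpha_n D(P_1||P_0)."""
    return n * alpha * D10

D10 = 0.5                                    # hypothetical value of D(P_1||P_0), in bits
for n in [2 ** 10, 2 ** 14, 2 ** 18]:
    alpha, beta = n ** -0.4, 1 / np.log2(n)  # one admissible pair of sequences
    print(f"n={n}: log2(N) ~ {np.log2(window_size(n, alpha, beta)):.1f}, "
          f"log M_n ~ {covert_bits(n, alpha, D10):.0f} bits")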
In the following, we allow ourselves a slight modification of the coding scheme defined in Section 2.2 and consider $b_n$ consecutive transmission windows of size $N$, where $b_n$ will be specified later. The messages and keys used in the transmission windows might be dependent, but the codeword in each of them is otherwise created as defined earlier. The probability of error $P_e^{(n)}$ is appropriately modified to consider the set of all messages, and $V^{(n)}$ considers the distribution induced over the $b_n$ consecutive transmission windows. Our results also depend on a constant $\kappa$, whose value results from the analysis of finite length polarization and is further discussed in Section 3.

Proposition 2.
There exists a constant $\kappa \in \left]0, \frac{1}{2}\right[$ such that, for all sequences of positive numbers $\{\alpha_n\}_{n\in\mathcal{D}} \in \omega\big(\frac{1}{n^\kappa}\big) \cap o(1)$ and $\{\beta_n\}_{n\in\mathcal{D}} \in \omega\big(2^{-\frac{n\alpha_n}{\log n}}\big) \cap o\big(\frac{1}{\log n}\big)$, and all sequences of integers $\{b_n\}_{n\in\mathcal{D}} \in \omega(\log n) \cap o\big(\frac{1}{\beta_n}\big) \cap o(n)$ as $n$ goes to infinity, there exist low-complexity polar-code based schemes operating over $b_n$ transmission windows of size $N = \frac{2^{n\alpha_n^2}}{\beta_n\alpha_n^2}$, each embedding a codeword window of length $n$, with
$$\lim_{n\in\mathcal{D}\to\infty} \frac{\log M_n}{n b_n \alpha_n} \geq \mathbb{D}(P_1\|P_0), \quad \lim_{n\in\mathcal{D}\to\infty} \frac{\log K_n}{n b_n \alpha_n} \leq \big[\mathbb{D}(Q_1\|Q_0) - \mathbb{D}(P_1\|P_0)\big]^+, \quad \lim_{n\in\mathcal{D}\to\infty} P_e^{(n)} = 0, \quad \lim_{n\in\mathcal{D}\to\infty} V^{(n)} = 0.$$

Proof. See Section 4.
The constant $\kappa$ in the statement of Proposition 2 is more precisely identified in Proposition 3. The precise code construction behind the statement is provided in Section 4, and the exact encoding and decoding algorithms are given in Algorithms 1 and 2 in Section 4.2. The complexity of both algorithms scales linearly with the number of transmission windows $b_n$ and as $n\log n$ with the codeword length $n$. Note that Proposition 2 differs from Proposition 1 in two respects. First, the polar-code based scheme only holds for a limited range of scalings of $\alpha_n$. A numerical investigation suggests that $\kappa$ is on the order of $10^{-3}$, which completely precludes our codes from operating in the square-root-law regime and requires absurdly large block lengths; however, if one backs away from the optimal scalings identified above, our approach does provide a low-complexity construction with provable guarantees. As further discussed in Section 3, this limitation results from our inability to establish a faster polarization speed. In particular, as will be clear from our analysis, we rely on a fine polarization result from [12] to show that covertness holds, and the value of $\kappa$ is therefore much more constrained than what would be expected by only looking at the inverse scaling exponent [14,15]. Our results might be improved by considering a "moderate deviation regime" in the same spirit as [14], but this would require a non-trivial extension of existing results, which we defer to future work. Second, the proposed scheme requires chaining over $b_n$ transmission windows; we shall see in Section 4 that the chaining allows us to "realign" polarization sets. Although this chaining does not fall into the exact situation of Section 2.2, in which a single block is considered, covertness is guaranteed over the entire chain of blocks; in addition, a mild scaling such as $b_n = \omega(\log n)$ is valid, so that the number of blocks may be much smaller than the block length. Finally, the proposed code construction is non-trivial, but its performance is still far from that of the random codes in Proposition 1. Section 5 discusses several ongoing efforts to improve performance.

Algorithm 1 Alice's encoder
Require: Vector $C$ of $|V_C|$ uniformly distributed key bits; vectors $S_1, \ldots, S_{b_n}$ of $\log N$ uniformly distributed key bits each; …
1: for block $i = 1$ to $b_n$ do
2: if $i = 1$ then …
⋮
6: Successively draw the components of $U^{1:n}_i$ …
⋮
12: Transmit $X^{1:n}_i \triangleq U^{1:n}_i G^{\otimes\nu}$ over the channel $W_{Y|X}$, which gives the output $Y^{1:n}_i$, and over the channel $W_{Z|X}$, which gives the output $Z^{1:n}_i$. Assume that $C'_i \oplus S'''_i$, with $C'_i \triangleq U^{1:n}_i[V_{C'}]$, is made available at the decoder. Randomize the position of the codeword window using $S_i$.
13: end for

Algorithm 2 Bob's decoder
Require: Vector $C$ of $|V_C|$ uniformly distributed key bits; …
⋮
8: Form the estimate $\widehat{U}^{1:n}_i = \widehat{X}^{1:n}_i G^{\otimes\nu}$
⋮
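Since the listings above survive only in fragmentary form, the following Python sketch renders the overall structure they describe: assigned positions carry key and message bits, the remaining positions are drawn successively from the source's conditional law, the result is multiplied by $G^{\otimes\nu}$, and the message of each block is chained into the next. The helper names (`sample_bit`, `sets`, `encode_chain`) are hypothetical, and the conditional sampler is left abstract; this is a sketch of the structure, not the authors' exact algorithm.

import numpy as np
rng = np.random.default_rng(1)

def encode_block(n, assigned, sample_bit, G_nu):
    """One codeword window: positions listed in `assigned` carry key or message
    bits; every other position is drawn successively from the conditional law
    of the source given the past (stochastic encoding)."""
    u = np.zeros(n, dtype=np.uint8)
    for j in range(n):
        u[j] = assigned[j] if j in assigned else sample_bit(u[:j], j)
    return u.dot(G_nu) % 2                         # X^{1:n} = U^{1:n} G^{⊗ν}

def encode_chain(b_n, n, sets, C, S_pp, sample_bit, G_nu):
    """Chaining over b_n blocks: the message W_i of block i fills V_S' in
    block i+1, and the bits C in V_C are reused in every block."""
    key_prev = rng.integers(0, 2, len(sets['V_Sp']))     # block 1 uses a fresh key
    codewords = []
    for i in range(b_n):
        W_i = rng.integers(0, 2, len(sets['V_W']))       # message bits W_i
        Wp_i = rng.integers(0, 2, len(sets['V_Wp']))     # additional message bits W'_i
        assigned = {}
        for idx, bits in ((sets['V_C'], C), (sets['V_W'], W_i), (sets['V_Wp'], Wp_i),
                          (sets['V_Sp'], key_prev), (sets['V_Spp'], S_pp[i])):
            assigned.update(zip(idx, bits))
        codewords.append(encode_block(n, assigned, sample_bit, G_nu))
        key_prev = W_i[:len(sets['V_Sp'])]               # chain W_i into the next block
    return codewords

# A trivial stand-in sampler, e.g. sample_bit = lambda prefix, j: int(rng.random() < 0.05),
# lets the sketch run end to end once index sets and keys are provided.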

Preliminaries: Polarization of Sources with Vanishing Entropy Rate
Our code construction exploits recent results on polar codes that suggest how information-theoretic proofs exploiting source coding with side information and privacy amplification as primitives [16,17] may be converted into polar coding schemes by a suitable identification of polarization sets [11,18]. Specifically, the approach consists in recognizing that both primitives have counterparts based on polar codes, see Lemma 3 and Lemma 4 of [11], as well as [19,20]. Before we pursue a similar approach here, we must first extend Lemmas 3 and 4 of [11] to the case relevant for covert communications.
Formally, consider sequences of positive numbers $\{\alpha_n\}_{n\in\mathcal{D}}$ such that $\alpha_n \in \omega\big(\frac{1}{\sqrt{n}}\big) \cap o(1)$. For every $n \in \mathcal{D}$, define the Bernoulli distribution $\Pi_{\alpha_n}$ over $\{0,1\}$ as $\Pi_{\alpha_n}(1) = 1 - \Pi_{\alpha_n}(0) = \alpha_n$, and its associated product distribution $\Pi_{\alpha_n}^{\otimes n}$. Define the joint distribution of sequences in $\mathcal{X}^n \times \mathcal{Y}^n$ as $q_{X^{1:n}Y^{1:n}} \triangleq \Pi_{\alpha_n}^{\otimes n} W_{Y|X}^{\otimes n}$, with $W_{Y|X}$ defined in Section 2.2. In other words, for a fixed $n$, the process $X^{1:n}Y^{1:n}$ has a product distribution, but the process $\{X^{1:n}Y^{1:n}\}_{n\in\mathcal{D}}$ is not stationary and the entropy rate $\frac{1}{n}H\big(X^{1:n}|Y^{1:n}\big)$ vanishes. We refer to such a source as a "vanishing entropy rate source". Assume now that the random vector $X^{1:n} \in \mathcal{X}^n$ is transformed into $U^{1:n} = X^{1:n}G^{\otimes\nu}$. For $\delta_n \in \left]0, \frac{1}{2}\right[$, the set of high entropy bits is defined as
$$H_{X|Y}(\delta_n) \triangleq \left\{ j \in \llbracket 1, n \rrbracket : H\big(U^j \big| U^{1:j-1} Y^{1:n}\big) \geq \delta_n \right\},$$
and the set of very high entropy bits is defined as
$$V_{X|Y}(\delta_n) \triangleq \left\{ j \in \llbracket 1, n \rrbracket : H\big(U^j \big| U^{1:j-1} Y^{1:n}\big) \geq 1 - \delta_n \right\}.$$
The following proposition shows that the sets $H_{X|Y}$ and $V_{X|Y}$ can still polarize for vanishing entropy rate sources.
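For intuition, the conditional entropies defining these sets can be computed exactly for toy block lengths by brute-force enumeration. The sketch below does so for a Bernoulli($\alpha$) source without side information, i.e., for the sets $H_X(\delta_n)$ and $V_X(\delta_n)$; exact enumeration is of course limited to very small $\nu$.

import numpy as np
from itertools import product

def polar_G(nu):
    G = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    M = np.array([[1]], dtype=np.uint8)
    for _ in range(nu):
        M = np.kron(M, G) % 2
    return M

def conditional_entropies(alpha, nu):
    """Exact H(U^j | U^{1:j-1}) in bits for U^{1:n} = X^{1:n} G^{⊗nu} with
    X^{1:n} i.i.d. Bernoulli(alpha), by enumerating all 2^n source sequences."""
    n, G = 2 ** nu, polar_G(nu)
    joint = {}
    for x in product((0, 1), repeat=n):
        u = tuple(np.array(x, dtype=np.uint8).dot(G) % 2)
        joint[u] = joint.get(u, 0.0) + alpha ** sum(x) * (1 - alpha) ** (n - sum(x))
    def prefix_entropy(length):
        marg = {}
        for u, p in joint.items():
            marg[u[:length]] = marg.get(u[:length], 0.0) + p
        return -sum(p * np.log2(p) for p in marg.values() if p > 0)
    H_prefix = [prefix_entropy(l) for l in range(n + 1)]
    return [H_prefix[j + 1] - H_prefix[j] for j in range(n)]   # chain rule of entropy

H = conditional_entropies(alpha=0.05, nu=3)
print(np.round(H, 3))    # entries drift toward 0 or 1 as nu grows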

Proposition 3 (Fine polarization of vanishing entropy sources).
For any $\delta \in \left]0, \frac{1}{2}\right[$, set $\delta_n = 2^{-n^\delta}$. For any $\varepsilon \in \left]0, 1-2\delta\right[$, there exist $\kappa_{\delta,\varepsilon} > 0$, $A_{\delta,\varepsilon} > 0$ and $C_{\delta,\varepsilon}$ such that, for any vanishing entropy rate source $q_{X^{1:n}Y^{1:n}}(x,y)$ as in (6) and for any integer $n \in \mathcal{D}$ with $n > 2^{C_{\delta,\varepsilon}}$, we have
$$\left| \left\{ j \in \llbracket 1, n \rrbracket : H\big(U^j \big| U^{1:j-1} Y^{1:n}\big) \in \left]\delta_n, 1-\delta_n\right[ \right\} \right| \leq A_{\delta,\varepsilon}\, n^{1-\kappa_{\delta,\varepsilon}\varepsilon}.$$
Proof. The proof adapts the approach developed for finite length channel polarization [12] to source polarization. The idea is to first analyze a "rough" polarization to obtain a bound on the cardinality of the set of unpolarized indices, followed by a "fine" polarization that boosts the polarization. The details require a careful adaptation but are otherwise similar to [12], and are therefore provided as supplementary material.
For Proposition 3 to be meaningful, the relative size of the sets $H_{X|Y}(\delta_n)$ and $V_{X|Y}(\delta_n)$ in (10) and (11) should be asymptotically equivalent to the entropy rate $\frac{1}{n}H\big(X^{1:n}|Y^{1:n}\big)$. This is possible if $\frac{1}{n}H\big(X^{1:n}|Y^{1:n}\big) = \omega\big(n^{-\kappa_{\delta,\varepsilon}\varepsilon}\big)$, i.e., polarization happens "fast enough" and the relative number of unpolarized symbols in (9) decays faster than the entropy rate. Therefore, our result only ensures the polarization of vanishing entropy rate sources for values of $\alpha_n$ that do not decay too rapidly; specifically, we require $\alpha_n \in \omega\big(n^{-\kappa_{\delta,\varepsilon}\varepsilon}\big) \cap o(1)$. Numerical analysis shows, for instance, that for $\delta = 0.1$ and $\varepsilon = 0.59$, $\kappa_{\delta,\varepsilon}\varepsilon \approx 6.53 \times 10^{-3}$. Note that this falls short of the scaling $\frac{1}{\sqrt{n}}$ that would be required for the square-root law of covert communication. Nevertheless, we are now able to extend Lemma 3 and Lemma 4 of [11] to the finite length regime, which forms the basis of our construction for covert communications.
The proofs of these two extensions follow those of Lemma 3 and Lemma 4 in [11], respectively, using Proposition 3 instead of the standard polarization result.
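To put the constant $\kappa_{\delta,\varepsilon}\varepsilon \approx 6.53 \times 10^{-3}$ in perspective, a short computation shows the block lengths at which $n^{-\kappa_{\delta,\varepsilon}\varepsilon}$ first drops below a modest target, which is why the square-root-law regime is out of reach for the present construction.

import numpy as np

kappa_eps = 6.53e-3      # value of κ_{δ,ε}·ε for δ = 0.1 and ε = 0.59, as computed above
for target in (0.5, 0.1):
    n_required = target ** (-1 / kappa_eps)
    print(f"n^(-{kappa_eps}) <= {target} requires n of order 10^{np.log10(n_required):.0f}")
# By contrast, the square-root law alpha_n ~ n^(-1/2) would reach 0.1 already at n = 100.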

Polar Codes for Covert Communication
In this section, we describe our proposed polar-code based scheme for covert communication. After preliminaries regarding covert processes in Section 4.1, the algorithms used for encoding and decoding are described in Section 4.2, and their performance is analyzed in Sections 4.3-4.5.

Covert Process
Our code construction follows the idea put forward in [3,6], which suggests having the code induce a "covert process" at the output of the adversary's channel by leveraging the notion of channel resolvability [8], and then showing that the covert process is itself indistinguishable from the product distribution of $Q_0$.
Formally, consider any sequence of positive numbers $\{\alpha_n\}_{n\in\mathcal{D}}$ such that $\alpha_n \in \omega\big(\frac{1}{\sqrt{n}}\big) \cap o(1)$. For every $n \in \mathcal{D}$, recall the definition of the Bernoulli distribution $\Pi_{\alpha_n}$ over $\{0,1\}$ as $\Pi_{\alpha_n}(1) = 1 - \Pi_{\alpha_n}(0) = \alpha_n$, and its associated product distribution $\Pi_{\alpha_n}^{\otimes n}$; this distribution induces the mixture $Q_{\alpha_n} \triangleq \alpha_n Q_1 + (1-\alpha_n)Q_0$ at the output of the channel $(\mathcal{X}, W_{Z|X}, \mathcal{Z})$, for which we also define the product distribution $Q_{\alpha_n}^{\otimes n}$. The "covert process" is the distribution
$$Q^N_{\alpha_n} \triangleq \mathbb{E}_T\left[ Q_0^{\otimes(T-1)} \otimes Q_{\alpha_n}^{\otimes n} \otimes Q_0^{\otimes(N-n-T+1)} \right].$$
In other words, $Q^N_{\alpha_n}$ is the distribution at the output of the channel $(\mathcal{X}, W_{Z|X}, \mathcal{Z})$ obtained when randomizing the start index $T \in \llbracket 1, N-n+1 \rrbracket$ of a block of $n$ consecutive bits sampled according to $\Pi_{\alpha_n}$. The name "covert process" is justified by the following lemma, which provides the scaling of the parameters $\alpha_n$ and $N$ under which the distribution $Q^N_{\alpha_n}$ becomes asymptotically indistinguishable from the distribution $Q_0^{\otimes N}$.
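The convergence promised by this scaling can be checked exactly for toy sizes: the sketch below enumerates all binary output sequences and computes $\mathbb{V}\big(Q^N_{\alpha_n}, Q_0^{\otimes N}\big)$ for a fixed codeword length $n$ and growing window size $N$, with illustrative Bernoulli output distributions standing in for $Q_0$ and $Q_1$.

import numpy as np
from itertools import product

def covert_process_tv(N, n, alpha, q0, q1):
    """Exact V(Q^N_alpha, Q_0^{⊗N}) for toy sizes by enumerating all binary
    output sequences; Q_0 and Q_1 are Bernoulli(q0) and Bernoulli(q1), and
    Q_alpha = alpha Q_1 + (1 - alpha) Q_0."""
    q_alpha = alpha * q1 + (1 - alpha) * q0
    def seq_prob(z, start):
        # positions [start, start + n) follow Q_alpha, all others follow Q_0
        p = 1.0
        for i, zi in enumerate(z):
            q = q_alpha if start is not None and start <= i < start + n else q0
            p *= q if zi else 1 - q
        return p
    starts = range(N - n + 1)
    tv = 0.0
    for z in product((0, 1), repeat=N):
        mixture = np.mean([seq_prob(z, t) for t in starts])   # E_T[Q^N_T]
        tv += abs(mixture - seq_prob(z, None))
    return 0.5 * tv

for N in (4, 8, 12):
    print(N, round(covert_process_tv(N, n=2, alpha=0.3, q0=0.1, q1=0.6), 4))
# the total variation decreases as the window size N grows for fixed n and alpha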

Encoding and Decoding Algorithms
Let $n \in \mathcal{D}$ be the length of the codeword window. We propose a scheme that operates over $b_n$ transmission windows of length $N$, where $b_n$ will be specified later. In every transmission window $i \in \llbracket 1, b_n \rrbracket$:
1. Transmitter and receiver use a secret key $S_i$ of $\log N$ bits to determine the position of the codeword window within the transmission window. Note that this secret key is not required in the random coding proof of [6], but is required here to maintain a low complexity at the decoder; fortunately, this change has a negligible effect on the scaling of the key.
2. The content of each codeword window is obtained through a polar-code based scheme that ensures reliable decoding for the receiver and approximates the process $Q_{\alpha_n}^{\otimes n}$ at the adversary's output, which we describe next.
In the remainder of this section we fix $\delta \in \left]0, \frac{1}{2}\right[$, $\varepsilon \in \left]0, 1-2\delta\right[$, and $\delta_n \triangleq 2^{-n^\delta}$. We let $\kappa \triangleq \kappa_{\delta,\varepsilon}\varepsilon$ and $A \triangleq A_{\delta,\varepsilon}$, where $\kappa_{\delta,\varepsilon}$ and $A_{\delta,\varepsilon}$ are the constants identified in Proposition 3. We consider sequences of positive numbers $\{\alpha_n\}_{n\in\mathcal{D}} \in \omega\big(\frac{1}{n^\kappa}\big) \cap o(1)$ and $\{\beta_n\}_{n\in\mathcal{D}} \in \omega\big(2^{-\frac{n\alpha_n}{\log n}}\big) \cap o\big(\frac{1}{\log n}\big)$, a sequence of integers $\{b_n\}_{n\in\mathcal{D}} \in \omega(\log n) \cap o\big(\frac{1}{\beta_n}\big) \cap o(n)$, and we set $N = \frac{2^{n\alpha_n^2}}{\beta_n\alpha_n^2}$. Finally, we consider a vanishing entropy rate source $q_{X^{1:n}Y^{1:n}Z^{1:n}} \triangleq \Pi_{\alpha_n}^{\otimes n} W_{Y|X}^{\otimes n} W_{Z|X}^{\otimes n}$ (the marginal $q_{Z^{1:n}}$ is $Q_{\alpha_n}^{\otimes n}$), and we define the sets $H_X$, $V_X$, $H_{X|Y}$, and $V_{X|Z}$ as in Section 3, where all entropies should be computed based on $q_{X^{1:n}Y^{1:n}Z^{1:n}}$. To alleviate the notation, we drop the dependence on $\delta_n$ in the sets from now on, and write for instance $V_X$ in place of $V_X(\delta_n)$. We also write $H(X)$ and $H(X|Y)$, although these quantities should be understood for the independent and identically distributed (i.i.d.) random variables obtained as marginals of $q_{X^{1:n}Y^{1:n}Z^{1:n}}$. As illustrated in Figure 3a, the construction is based on the following sets (a sketch of the corresponding index partition follows the list):
• $V_C \triangleq H_{X|Y} \cap V_{X|Z}$, which will contain uniformly distributed bits $C$ representing the code;
• $V_{C'} \triangleq H_{X|Y} \cap V_X^c$, which will contain non-uniformly distributed bits $C'$ computed from the other bits;
• $V_W \triangleq H_{X|Y}^c \cap V_{X|Z}$, which will contain uniformly distributed messages $W$;
• $V_{W'} \triangleq H_{X|Y}^c \cap V_{X|Z}^c \cap V_X$, which will contain additional uniformly distributed messages $W'$;
• $V_{S'}$, any subset of $H_{X|Y} \cap V_{X|Z}^c \cap V_X$ such that $|V_{S'}| = |V_W|$, which will use messages $W$ transmitted in the previous transmission window as a key;
• $V_{S''} \triangleq \big(H_{X|Y} \cap V_{X|Z}^c \cap V_X\big) \setminus V_{S'}$, which will contain uniformly distributed secret key symbols $S''$.
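The sketch announced above computes this partition from the per-index conditional entropies; it follows the set definitions of the list verbatim, and the entropy arrays are assumed given (e.g., from an enumeration routine such as the one in Section 3).

import numpy as np

def partition_indices(H_X, H_XgY, H_XgZ, delta_n):
    """Partition indices into the sets of Section 4.2 from the per-index
    entropies H(U^j|U^{1:j-1}), H(U^j|U^{1:j-1} Y^{1:n}), H(U^j|U^{1:j-1} Z^{1:n})."""
    H_X, H_XgY, H_XgZ = map(np.asarray, (H_X, H_XgY, H_XgZ))
    V_X = H_X >= 1 - delta_n                 # very high entropy
    V_XgZ = H_XgZ >= 1 - delta_n             # very high entropy given Z (secret)
    HgY = H_XgY >= delta_n                   # high entropy given Y (Bob needs help)
    sets = {
        'V_C':  np.flatnonzero(HgY & V_XgZ),             # shared code bits C
        'V_Cp': np.flatnonzero(HgY & ~V_X),              # bits C' sent via one-time pad
        'V_W':  np.flatnonzero(~HgY & V_XgZ),            # secret message W
        'V_Wp': np.flatnonzero(~HgY & ~V_XgZ & V_X),     # additional message W'
    }
    pool = np.flatnonzero(HgY & ~V_XgZ & V_X)
    k = min(len(sets['V_W']), len(pool))
    sets['V_Sp'], sets['V_Spp'] = pool[:k], pool[k:]     # |V_S'| = |V_W| when possible
    return sets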
Alice's encoder is formally provided in Algorithm 1, while Bob's decoder is provided in Algorithm 2; the chaining of the transmission windows over $b_n$ blocks is illustrated in Figure 3b, and we discuss here the salient features of the algorithms. In every block $i \in \llbracket 1, b_n \rrbracket$, a message $W_i$ is transmitted with the assistance of a secret key, as expected from the model of Section 2.2. In addition, the chaining exploits the property that the bits in $V_W$ are held secret from Willie and can therefore be used as a secret key in the next block, which is formally proved in Section 4.5; this chaining allows us to transmit an additional message $W'_i$ in every block, which is crucial to achieve the scalings of Proposition 2, as shown in Section 4.3. The chaining also relies on the secrecy of the bits in $V_C$, which allows us to reuse the same random bits $C$ across all blocks. Finally, some bits of shared randomness $C'_i$ must be transmitted secretly, covertly, and reliably to the receiver. As we show in Section 4.3, the number of such bits is negligible compared to the number of covert bits transmitted; we therefore ensure their secrecy by performing a one-time pad $C'_i \oplus S'''_i$ with another secret key $S'''_i$, and we ensure reliability and covertness in a single additional block at the end, e.g., using the somewhat inefficient scheme of [1]. In the remainder, we ignore this last block for simplicity and assume that $C'_i \oplus S'''_i$ is made available to the decoder "for free".
Ultimately, the messages transmitted consist of the messages $W_i$ and $W'_i$ transmitted in every block $i$; the keys required consist of the keys $S_i$, $S''_i$, $S'''_i$ used in every block $i$, the key filling $V_{S'}$ in the first block, and the bits $C$.

Remark 1. The proposed chaining scheme could be further modified as follows. First, since the bits of $C$ are secret from the perspective of Willie, they could be publicly disclosed and not counted as part of the secret keys, without compromising the performance. We have opted to count $C$ as part of the key to make the analysis slightly more concise. Second, the bits of $C'_i \oplus S'''_i$ could be chained by sacrificing part of the message $W_i$; since their number is negligible, this would again not affect performance. We have opted to avoid this chaining since a last transmission for the bits $C'_{b_n} \oplus S'''_{b_n}$ would be necessary anyway.

Remark 2.
Because of the stochastic encoding in Algorithm 1, our codes are neither linear codes nor cosets of linear codes. In that regard, calling our codes "polar codes" is a slight abuse of terminology but follows standard practice [11,18,20]. Strictly speaking, our codes are only "polarization-based".

Analysis of Normalized Set Sizes
We start by analyzing the normalized set sizes of the proposed scheme. Specifically, we are interested in characterizing the asymptotic total number of message bits $\log M_n$ and the total number of key bits $\log K_n$, normalized by $nb_n\alpha_n$.
Over the $b_n$ transmission windows, the total number of message bits consists of those in $V_W$ and $V_{W'}$ in every transmission window. Hence, for every $n \in \mathcal{D}$, $\log M_n = b_n|V_W| + b_n|V_{W'}|$. Similarly, the total number of key bits consists of those in $V_{S''}$ (plus, for the first block only, those in $V_{S'}$), the bits for the one-time pads in $V_{C'}$, the bits required to identify the codeword window within the transmission window, and the bits in $V_C$, so that $\log K_n = b_n|V_{S''}| + |V_{S'}| + b_n|V_{C'}| + b_n\log N + |V_C|$.
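These two budgets are straightforward to evaluate once the index sets are known; the helper below follows the accounting above verbatim.

def bit_budgets(sets, b_n, log_N):
    """Total message and key bits over b_n transmission windows, following
    log M_n = b_n(|V_W| + |V_W'|) and
    log K_n = b_n|V_S''| + |V_S'| + b_n|V_C'| + b_n log N + |V_C|."""
    size = {k: len(v) for k, v in sets.items()}
    log_M = b_n * (size['V_W'] + size['V_Wp'])
    log_K = (b_n * size['V_Spp']      # fresh keys S''_i in every block
             + size['V_Sp']           # key filling V_S' in the first block only
             + b_n * size['V_Cp']     # one-time pads for the bits C'_i
             + b_n * log_N            # codeword-window positions S_i
             + size['V_C'])           # shared randomness C, reused across blocks
    return log_M, log_K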
Proof. We now verify the scaling of $\log K_n$. We first assume that $|H_{X|Y}^c \cap V_{X|Z}| \leq |H_{X|Y} \cap V_{X|Z}^c \cap V_X|$, so that a subset $V_{S'}$ with $|V_{S'}| = |V_W|$ exists. We analyze the terms on the right-hand side in order. First, since $V_{X|Z} \subset V_X$, we have $|V_{W'}| + |V_{S'}| + |V_{S''}| = |V_{X|Z}^c \cap V_X| = |V_X| - |V_{X|Z}|$, and by Proposition 3 applied to the vanishing entropy rate sources $q_{X^{1:n}}$ and $q_{X^{1:n}Z^{1:n}}$,
$$|V_X| - |V_{X|Z}| \leq H\big(X^{1:n}\big) - H\big(X^{1:n}|Z^{1:n}\big) + 2An^{1-\kappa}.$$
Since $H(X) - H(X|Z) = I(X;Z) = \alpha_n\mathbb{D}(Q_1\|Q_0) + o(\alpha_n)$ (Lemma 1, [3]), and remembering the choice of $\alpha_n$, $\delta_n$ earlier, it follows that $\frac{|V_{W'}| + |V_{S'}| + |V_{S''}|}{n\alpha_n} \leq \mathbb{D}(Q_1\|Q_0) + o(1)$. This also implies that $\frac{|V_{S'}|}{nb_n\alpha_n} = o(1)$. Next, since $V_X^c \subset V_{X|Y}^c$, we have with Proposition 3 that
$$\frac{|V_{C'}|}{n\alpha_n} \leq \frac{\big|H_{X|Y} \cap V_{X|Y}^c\big|}{n\alpha_n} \leq \frac{A}{n^\kappa\alpha_n},$$
which vanishes by definition of $\alpha_n$. Similarly, since $V_C \subset H_X$, Proposition 3 applied to the vanishing entropy rate source $q_{X^{1:n}}$ ensures that
$$\frac{|V_C|}{nb_n\alpha_n} \leq \frac{H(X) + \frac{A}{n^\kappa}}{b_n\alpha_n},$$
which vanishes with our choice of $\alpha_n$ and $b_n$ (note that we use the condition $b_n = \omega(\log n)$ here).
Finally, since $N = \frac{2^{n\alpha_n^2}}{\beta_n\alpha_n^2}$, we have
$$\frac{\log N}{n\alpha_n} = \alpha_n - \frac{\log\beta_n}{n\alpha_n} - \frac{2\log\alpha_n}{n\alpha_n},$$
which vanishes with the choice of $\alpha_n$, $\beta_n$ (note that we use the condition $\beta_n \in \omega\big(2^{-\frac{n\alpha_n}{\log n}}\big)$ here).
We finally assume that $|H_{X|Y}^c \cap V_{X|Z}| > |H_{X|Y} \cap V_{X|Z}^c \cap V_X|$. With Proposition 3 and Lemma 1 of [3], this implies that $\mathbb{D}(P_1\|P_0) > \mathbb{D}(Q_1\|Q_0) + o(1)$. Since we have $V_{S''} = \emptyset$ in this case, following the same steps as earlier we now obtain $\lim_{n\in\mathcal{D}\to\infty} \frac{\log K_n}{nb_n\alpha_n} = 0$, which is the desired result.

Reliability Analysis
In this section, we prove that the proposed scheme ensures reliable communication. To avoid any confusion between the distribution induced by the algorithms and the underlying vanishing entropy rate source, we denote the distribution induced by Algorithm 1 by $\widetilde{p}$; accordingly, all random variables generated according to this distribution carry a tilde, e.g., $\widetilde{X}$ has distribution $\widetilde{p}_X$. The estimates obtained from Algorithm 2 are denoted with a hat, e.g., $\widehat{X}$. Since the location of the transmission window is known to the legitimate receiver, it is sufficient to show that $\lim_{n\to\infty}\mathbb{P}\big(\widehat{X}^{1:n}_{1:b_n} \neq \widetilde{X}^{1:n}_{1:b_n}\big) = 0$. We proceed to prove this with a series of lemmas.

Lemma 6. For any transmission window $i \in \llbracket 1, b_n \rrbracket$, $\mathbb{D}\big(q_{X^{1:n}Y^{1:n}} \big\| \widetilde{p}_{X_i^{1:n}Y_i^{1:n}}\big) \leq n\delta_n$.

Proof. We have
$$\mathbb{D}\big(q_{X^{1:n}Y^{1:n}} \big\| \widetilde{p}_{X_i^{1:n}Y_i^{1:n}}\big) = \mathbb{D}\big(q_{X^{1:n}} \big\| \widetilde{p}_{X_i^{1:n}}\big) + \mathbb{E}_{q_{X^{1:n}}}\Big[\mathbb{D}\big(q_{Y^{1:n}|X^{1:n}} \big\| \widetilde{p}_{Y_i^{1:n}|X_i^{1:n}}\big)\Big] = \mathbb{D}\big(q_{U^{1:n}} \big\| \widetilde{p}_{U_i^{1:n}}\big) = \sum_{j\in V_X}\Big(1 - H\big(U^j\big|U^{1:j-1}\big)\Big) \leq |V_X|\,\delta_n \leq n\delta_n,$$
where the first equality comes from the chain rule of divergence; the second from $\mathbb{E}_{q_{X^{1:n}}}\big[\mathbb{D}\big(q_{Y^{1:n}|X^{1:n}} \big\| \widetilde{p}_{Y_i^{1:n}|X_i^{1:n}}\big)\big] = \mathbb{D}\big(W_{Y|X}^{\otimes n} \big\| W_{Y|X}^{\otimes n}\big) = 0$ and the invertibility of the transformations $X^{1:n} = U^{1:n}G^{\otimes\nu}$ and $\widetilde{X}^{1:n} = \widetilde{U}^{1:n}G^{\otimes\nu}$; the third from the chain rule of divergence, the definition of the encoder for $j \in V_X^c$, and the uniformity of the symbols in $V_X$; and the final inequalities from the definition of $V_X$.
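The telescoped form of this divergence can be evaluated exactly for toy parameters, reusing the brute-force enumeration of Section 3; the sketch below returns both the divergence and the bound $|V_X|\delta_n$ (the value of $\delta_n$ is illustrative, far larger than the $2^{-n^\delta}$ used in the analysis).

import numpy as np
from itertools import product

def polar_G(nu):
    G = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    M = np.array([[1]], dtype=np.uint8)
    for _ in range(nu):
        M = np.kron(M, G) % 2
    return M

def lemma6_divergence(alpha, nu, delta_n):
    """Exact D(q_U || p~_U) for a toy Bernoulli(alpha) source when the encoder
    replaces the conditionals of indices in V_X(delta_n) by uniform bits; the
    divergence telescopes to sum_{j in V_X} (1 - H(U^j|U^{1:j-1}))."""
    n, G = 2 ** nu, polar_G(nu)
    joint = {}
    for x in product((0, 1), repeat=n):
        u = tuple(np.array(x, dtype=np.uint8).dot(G) % 2)
        joint[u] = joint.get(u, 0.0) + alpha ** sum(x) * (1 - alpha) ** (n - sum(x))
    def prefix_entropy(length):
        marg = {}
        for u, p in joint.items():
            marg[u[:length]] = marg.get(u[:length], 0.0) + p
        return -sum(p * np.log2(p) for p in marg.values() if p > 0)
    H = [prefix_entropy(j + 1) - prefix_entropy(j) for j in range(n)]
    V_X = [j for j in range(n) if H[j] >= 1 - delta_n]
    divergence = sum(1 - H[j] for j in V_X)
    return divergence, len(V_X) * delta_n     # divergence and the bound of Lemma 6

print(lemma6_divergence(alpha=0.05, nu=3, delta_n=0.2))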

Lemma 7.
For any transmission window $i \in \llbracket 1, b_n \rrbracket$, define the event $E_i \triangleq \big\{\widehat{X}^{1:n}_i \neq \widetilde{X}^{1:n}_i\big\}$. Then,
$$\mathbb{P}\Big(E_i \,\Big|\, \bigcap_{j=1}^{i-1} E_j^c\Big) \leq \delta_n^{(2)},$$
where $\delta_n^{(2)} = o\big(\frac{1}{n}\big)$ with our choice of $\delta_n$.

Proof. For $i \in \llbracket 1, b_n \rrbracket$, consider the event that the sequence produced by the polar encoder differs from one generated by the vanishing entropy rate source, and introduce an optimal coupling under which the probability of this event equals $\mathbb{V}\big(q_{X^{1:n}Y^{1:n}}, \widetilde{p}_{X^{1:n}_iY^{1:n}_i}\big) \leq \sqrt{2\,\mathbb{D}\big(q_{X^{1:n}Y^{1:n}} \big\| \widetilde{p}_{X^{1:n}_iY^{1:n}_i}\big)}$, by Pinsker's inequality and Lemma 6. The result then follows from the law of total probabilities and the reliability of polar source coding with side information under successive cancellation decoding.

We may now conclude the reliability analysis. Partitioning the error event according to the first block in error, we have
$$\mathbb{P}\Big(\widehat{X}^{1:n}_{1:b_n} \neq \widetilde{X}^{1:n}_{1:b_n}\Big) = \sum_{i=1}^{b_n} \mathbb{P}\Big(E_i \cap \bigcap_{j=1}^{i-1}E_j^c\Big) \leq \sum_{i=1}^{b_n}\mathbb{P}\Big(E_i \,\Big|\, \bigcap_{j=1}^{i-1}E_j^c\Big) \leq b_n\,\delta_n^{(2)},$$
where the first inequality uses the definition of conditional probability and $\mathbb{P}\big(\bigcap_{j=1}^{i-1}E_j^c\big) \leq 1$, and the second uses Lemma 7. The choice $b_n = o(n)$ then ensures that $\mathbb{P}\big(\widehat{X}^{1:n}_{1:b_n} \neq \widetilde{X}^{1:n}_{1:b_n}\big)$ vanishes.

Covertness Analysis
In this section, we prove that the proposed scheme is covert, in the sense that $\lim_{n\in\mathcal{D}\to\infty}\mathbb{V}\big(\widetilde{p}_{Z^{1:N}_{1:b_n}}, Q_0^{\otimes Nb_n}\big) = 0$. The proof again proceeds through a series of lemmas: we first show that, in every transmission window, the distribution induced by the encoder at the adversary's output is close to the covert process of Section 4.1, using the Markov chain $Z^{1:n}_{1:i-1} - CW_i - Z^{1:n}_i$ to decouple successive blocks; the result then follows by Lemma 10, which yields a total variation bound of the form $\delta_n^{(6)} \triangleq b_n\big(\delta_n^{(5)} + \delta_n^{(4)}\big)$ over the $b_n$ transmission windows.

With our choice of $\beta_n$, $b_n$, and $\delta_n$, note that $\lim_{n\to\infty}\delta_n^{(6)} = 0$ and $\lim_{n\to\infty} b_n\beta_n = 0$, hence establishing covertness (note that we use the condition $b_n \in o\big(\frac{1}{\beta_n}\big)$ here).

Conclusions
In this paper, we have proposed a coding scheme for covert communication based on polar codes. Although our scheme offers a first explicit solution to covert communication in a non-trivial regime, its performance is still far from that of random codes. The proven speed of polarization severely limits the rate at which the average weight of codewords can decay; in particular, we cannot approach the average codeword weight on the order of $\sqrt{n}$ required by the square-root law. We have circumvented this issue by hiding the transmission window within a larger window as in [5,6], and, at least in the regime for which our proofs hold, the proposed scheme achieves the best known rates. Several extensions and improvements are currently under investigation, particularly the refinement of Proposition 3 to improve the constant $\kappa$ and the use of non-binary polar codes in conjunction with pulse-position modulation [22].