Coarsely Quantized Decoding and Construction of Polar Codes Using the Information Bottleneck Method

The information bottleneck method is a generic clustering framework from the field of machine learning which allows compressing an observed quantity while retaining as much of the mutual information it shares with the quantity of primary relevance as possible. The framework was recently used to design message-passing decoders for low-density parity-check codes in which all the arithmetic operations on log-likelihood ratios are replaced by table lookups of unsigned integers. This paper presents, in detail, the application of the information bottleneck method to polar codes, where the framework is used to compress the virtual bit channels defined in the code structure, and shows that the benefits are twofold. On the one hand, the compression restricts the output alphabet of the bit channels to a manageable size. This facilitates computing the capacities of the bit channels in order to identify the ones with larger capacities. On the other hand, the intermediate steps of the compression process can be used to replace the log-likelihood ratio computations in the decoder with table lookups of unsigned integers. Hence, a single procedure produces a polar encoder as well as its tailored, quantized decoder. Moreover, we use a technique called message alignment to reduce the space complexity of the quantized decoder obtained using the information bottleneck framework.


Introduction
In a typical transmission chain, forward error correction is a resource-hungry and computationally complex component. Therefore, it is always desirable to reduce the decoder's complexity. The commonly used soft decision decoders perform many intensive arithmetic operations on soft values, i.e., log-likelihood ratios (LLRs). In practice, the computational complexity of a decoder is reduced by replacing the intensive operations with simpler approximations. The space complexity of the decoder, i.e., the memory for storing the soft values and the number of wires for exchanging them between hardware components, depends on the representation of the soft values in the hardware. The space complexity of the decoder is reduced by coarsely quantizing the soft values, where each soft value is represented using a small number of bits. These complexity reduction measures degrade the error-correcting performance of a decoder to some extent.
An interesting remedy for reducing the decoder complexity of low-density parity-check (LDPC) codes was presented in [1,2], where a quantized decoder for regular LDPC codes was developed using the information bottleneck method [3,4]. The design principle of such a quantized decoder combines the information bottleneck method with discrete density evolution [5]. Discrete density evolution is the quantized counterpart of density evolution [6], an analysis tool for LDPC codes. In the quantized decoder design of [1], referred to as the information bottleneck decoder here, the information bottleneck method is used to design mappings between the quantized input and output messages of each node in the Tanner graph of the code which maximize the mutual information between the outgoing messages and the bits they correspond to. Consequently, all the node operations in the information bottleneck decoder are deterministic mappings that can be realized as simple table lookups. Further, the decoder can be implemented using only unsigned integer arithmetic since all the quantized messages are unsigned integers. Moreover, no internal high resolution is needed for the node operations. Surprisingly, the computational simplicity of the information bottleneck decoders causes only a minute degradation in the error rate as compared to a double-precision floating-point sum-product decoder. It was reported that an information bottleneck decoder having a bitwidth as low as 4 bits performs very close to the belief propagation decoder [1]. The idea of information bottleneck decoders was extended to irregular LDPC codes in [7,8]. Design of quantized message passing decoders that utilize information preserving table lookup operations was also addressed in [9-12].
The results of [1,2,7,12] encourage exploring the application of the information bottleneck method to polar codes [13], an important class of channel codes that are seen as a breakthrough in coding theory. They are the first linear block codes that can provably achieve the symmetric capacity of binary input memoryless channels. Despite being asymptotically good, the error correcting performance of polar codes at practical block lengths under successive cancellation (SC) [13] decoding is not impressive. This ordinary performance is partly due to their weak Hamming distance properties [14,15] and partly due to the SC decoder being suboptimal [16]. To circumvent these issues, the authors in [16] proposed a serial concatenation of the polar code with an outer cyclic redundancy check (CRC) code and the use of the successive cancellation list (SCL) decoder. On the one hand, the SCL decoder is more powerful than the originally proposed SC decoder. On the other hand, the outer CRC code improves the distance properties of the polar code as observed in [17]. The resulting CRC-aided successive cancellation list decoding scheme can outperform state-of-the-art coding schemes. The outer CRC code can also be replaced with generalized parity check codes [18], resulting in parity check aided SCL decoding. Eventually, polar codes were included in the 5G standard for control channels in the eMBB scenario [19]. Although proposals for using other decoders, e.g., belief propagation, for improving the performance of polar codes appeared in the literature as early as [20], the CRC or parity check aided SCL decoder remains the practical decoding choice [21].
Polar codes exploit the channel polarization phenomenon where the effective channels, known as the bit channels, experienced by the bits at the input of the polar encoder exhibit polarized behavior: the bit channels tend to be either extremely reliable or completely unreliable. The code uses the reliable bit channels to carry information bits, whereas the input of the unreliable bit channels is fixed to predetermined values. Hence, the order of the bit channels with respect to their reliabilities is determined in order to identify the most reliable ones, a process commonly referred to as code construction. This order of the bit channels is not universal and depends on the underlying physical channel. For channel models of practical importance, e.g., AWGN channels, the output alphabet of the bit channels expands exponentially in the codeword length, making exact and efficient computation of the bit channel reliabilities challenging. Therefore, approximate construction methods are used for such channel models.
An important contribution regarding approximate construction methods was made in [22], where it was shown that the density evolution tool can also be used for the construction of polar codes. However, the precision requirements for an exact implementation of the method of [22] become impractical as the codeword length increases. The idea of [22] was extended by Tal and Vardy in [23], where they provided an efficient implementation of the quantized density evolution for polar code construction. In order to keep track of the inaccuracy introduced by the quantization, the method of [23] determines an upper and a lower bound on the probability of error of each bit channel. Performing the density evolution under a Gaussian approximation provides another notable construction method [24] that tracks the reliability of each bit channel using a single scalar quantity instead of its transition probabilities, hence considerably reducing the computational complexity of density evolution. For a binary input AWGN channel, the sets of reliable bit channels of polar codes constructed with the Gaussian approximation and the Tal and Vardy method differ only in a few elements.
The idea of combining the information bottleneck method with density evolution [1] can also be utilized for polar code construction. However, unlike other construction methods, the benefits of this approach are twofold. Firstly, the discrete density evolution using the information bottleneck method compresses the output alphabet of the bit channels of a polar code to a small size, while keeping the information loss due to the compression as small as possible. This compression facilitates computing the reliabilities of the bit channels in terms of their capacities, keeping track of the loss of mutual information caused by the quantization [25]. Secondly, the intermediate steps of the discrete density evolution provide deterministic mappings between the quantized input and output messages for each node in the Tanner graph of the polar code. These deterministic mappings can replace the LLR computations of a conventional decoder, constituting an information bottleneck SC or SCL decoder [26]. Hence, a polar code and its tailored quantized decoder are obtained from the same process. Similar to the information bottleneck LDPC decoders [1,7], all the operations in the information bottleneck SC decoder are table lookups of unsigned integers. However, the path metric computation in an information bottleneck SCL decoder is done similarly to that in a conventional decoder as in [27]. The error rate of a 4-bit information bottleneck SCL decoder is only slightly worse than that of a double-precision floating-point SCL decoder. The lookup tables used in the information bottleneck SCL decoders are categorized into decoding and translation tables. The decoding tables are used for the successive cancellation part of the decoder, while the translation tables are used to translate the integer valued messages into LLRs for path metric updates. For a codeword length N, 2N − 2 distinct decoding tables and N distinct translation tables are used [26]. Thus, the space complexity of an information bottleneck SCL decoder increases with the codeword length N.
Quantized SC and SCL decoders have been studied in, e.g., [27-31]. There are two major differences between these quantization approaches from the literature and the one proposed in this paper. Firstly, unlike the aforementioned quantization approaches, the quantization performed using the information bottleneck method is nonuniform in nature. Secondly, the quantization in [27-31] assigns a numerical value, e.g., reconstructed from the quantization labels, to each quantized LLR, which is then used in the relevant arithmetic operation. By contrast, the information bottleneck SCL decoder works entirely with the integer valued quantization labels and translates them into LLRs only when required, i.e., for the path metric update. In addition to addressing the trade-off between quantization bitwidth and rate loss of an SC decoder, the authors in [28] proposed a very coarse 3-level, i.e., 2-bit, quantization scheme. Moreover, it was shown that the performance loss caused by the quantization is more pronounced for low code rates. The 3-level quantization scheme of [28] was deployed in short block length polar codes in [31], where it leads to significant degradation in the error rate. However, the author proposed improvements to the SCL decoder which drastically reduce this degradation.
In this paper, we present the polar code construction using the information bottleneck method [25] in detail, along with its comparison to the closely related construction method of Tal and Vardy. Afterwards, the information bottleneck decoders of [26] are reviewed, accompanied by detailed insights into their working. The huge number of lookup tables used in the information bottleneck decoders of [26] can be viewed as a space complexity cost for their computational simplicity. We demonstrate that using the message alignment technique of [7], the required number of tables and, hence, the space complexity of the information bottleneck SCL decoders can be reduced.
The rest of the paper is structured as follows. After a brief recap of polar codes and the information bottleneck method in Section 2, polar code construction using the information bottleneck method and Tal and Vardy's method is discussed in Section 3. Section 4 shows how the results of Section 3 are used for decoding. Section 5 presents our latest work on the subject: space complexity reduction of the information bottleneck SCL decoder. Section 6 presents simulation results before concluding the paper in Section 7.

Notations
Random variables are denoted by capital italic font, e.g., $X$ with realizations $x \in \mathcal{X}$. $I(X; Y)$, $p(y|x)$ and $p(x, y)$ represent the mutual information, conditional probability, and joint probability distribution of the two random variables $X$ and $Y$. Multivariate random variables are represented with capital bold italic font, e.g., $\boldsymbol{Y} = [Y_0, Y_1]^T$. Matrices are represented with capital bold font, e.g., $\mathbf{Y}$, and vectors are represented using a lower case bold font, e.g., $\mathbf{y}$. All vectors are column vectors.

Prerequisites
This section presents a brief recap of the relevant aspects of polar codes and introduces the information bottleneck framework.

Polar Codes
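A polar code with length $N = 2^n$, where $n = 1, 2, \ldots$, is described by its $N \times N$ generator matrix

$$\mathbf{G}_N = \mathbf{B}\,\mathbf{F}^{\otimes n}, \tag{1}$$

where $\mathbf{F} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$, $\mathbf{F}^{\otimes n}$ is the $n$-fold Kronecker power of $\mathbf{F}$, and $\mathbf{B}$ is the bit reversal permutation matrix [13]. While the encoding follows $\mathbf{x} = \mathbf{G}_N \mathbf{u}$ (over GF(2)), the FFT-like structure of $\mathbf{F}^{\otimes n}$ allows efficient encoding with a complexity of $O(N \log_2 N)$.
For a code rate of $K/N$, $(N − K)$ bits in the encoder input $\mathbf{u} = [u_0, \ldots, u_{N−1}]^T$ are set to fixed values, e.g., 0's, and referred to as frozen bits. The values and locations of the frozen bits are known to the decoder. The remaining $K$ positions in $\mathbf{u}$, specified in the information set $\mathcal{A}$, carry the information bits.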
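As a minimal sketch of the encoding rule stated above, the following Python snippet builds the generator matrix from the Kronecker power of F and a bit reversal permutation, and encodes a column vector u over GF(2). The function names (`bit_reversal_permutation`, `generator_matrix`, `polar_encode`) are illustrative and not taken from the paper, and the dense matrix product is used only for clarity; it does not exploit the O(N log N) butterfly structure.

```python
import numpy as np

def bit_reversal_permutation(n):
    """Indices 0..2^n-1 with their n-bit binary representations reversed (n >= 1)."""
    return np.array([int(format(i, f"0{n}b")[::-1], 2) for i in range(1 << n)])

def generator_matrix(n):
    """Dense N x N generator matrix B * F^(kron n) over GF(2), with N = 2^n."""
    F = np.array([[1, 1], [0, 1]], dtype=int)
    G = np.array([[1]], dtype=int)
    for _ in range(n):
        G = np.kron(G, F)
    # Row i of B is the unit vector e_{rev(i)}, so B @ v reorders v in bit-reversed order.
    B = np.eye(1 << n, dtype=int)[bit_reversal_permutation(n)]
    return (B @ G) % 2

def polar_encode(u):
    """Encode the column vector u (length a power of two) as x = G_N u over GF(2)."""
    u = np.asarray(u, dtype=int)
    n = u.size.bit_length() - 1
    return (generator_matrix(n) @ u) % 2

# For N = 2 the rule reduces to x0 = u0 XOR u1 and x1 = u1.
print(polar_encode([0, 1]))
```

For frozen-bit encoding, one would place zeros at the frozen positions of u and the K information bits at the positions listed in the information set before calling `polar_encode`.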
The matrix $\mathbf{F}$ serves as the building block of polar codes in the sense that all the processing steps required for encoding, code construction, and decoding are defined over it. From an encoding perspective, $\mathbf{F}$ encodes the bits in $\mathbf{u} = [u_0, u_1]^T$ into the codeword $\mathbf{x} = [x_0, x_1]^T$, which is received as $\mathbf{y} = [y_0, y_1]^T$ after transmission through a symmetric discrete memoryless channel described by transition probabilities $p(y|x)$. Figure 1 depicts the factor graph of this elementary setup of polar codes. The key concept of polar codes lies in the definition of virtual bit channels between each element in $\mathbf{u}$ and the channel output $\mathbf{y}$. Two bit channels are defined over the building block. The first bit channel, described by transition probabilities $p(y_0, y_1|u_0)$, is established between $\mathbf{y}_0 = [y_0, y_1]^T$ and $u_0$ and ignores $u_1$:

$$p(y_0, y_1|u_0) = \frac{1}{2} \sum_{u_1} p(y_0|u_0 \oplus u_1)\, p(y_1|u_1). \tag{2}$$

Figure 1. Factor graph of the building block (dashed rectangle) of a polar code along with the transmission channel.

The second virtual channel, described by transition probabilities $p(y_0, y_1, u_0|u_1)$, treats $u_1$ as input and $\mathbf{y}_1 = [y_0, y_1, u_0]^T$ as output, assuming the true knowledge of $u_0$:

$$p(y_0, y_1, u_0|u_1) = \frac{1}{2}\, p(y_0|u_0 \oplus u_1)\, p(y_1|u_1). \tag{3}$$

If $P_e$, $P_e^0$, and $P_e^1$ are the probabilities of error of the physical channel, the bit channel of $u_0$, and the bit channel of $u_1$, respectively, then $P_e^0 \geq P_e \geq P_e^1$ [13]. Hence, the setup of Figure 1 transforms two instances of the physical channel into qualitatively different synthetic channels.
Figure 2 shows the factor graph of a polar code for $N = 4$, where the transmission channel is implicitly included, i.e., the right-most variable nodes correspond to received coded bits. The code structure is composed of the building blocks arranged in vertical columns, referred to as levels, $j$, with the right-most being $j = 1$ and the left-most being $j = n$. Within a level $j$, variable nodes to the left of the check nodes are labeled with $j$, while those on the right of the check nodes are labeled with the level $j − 1$. Thus, every variable node has a label $v_{i,j}$ with $i = 0, 1, \ldots, N − 1$ and $j = 0, \ldots, n$. Then, $v_{i,n}$ are the encoder inputs $u_i$ and $v_{i,0}$ are the channel outputs $y_i$. It can be seen in Figure 2 that the bit channels of a polar code are obtained after recursive application of the polarizing transform of Figure 1. Each application of the polarization transform at level $j$ in the code structure manufactures a pair of bit channels from a pair of qualitatively similar (bit) channels from level $j − 1$, similar to Equations (2) and (3), where the bit channel corresponding to the upper branch is worse, while the one corresponding to the lower branch is better.

Figure 2. Factor graph of a polar code with $N = 4$. The node labels $v_{i,j}$ indicate the vertical stage, $i = 0, 1, \ldots, N − 1$, and horizontal level, $j = 0, \ldots, n$. For all $i$, $v_{i,0} = y_i$, i.e., the channel output at level $j = 0$, while $v_{i,2} = u_i$, i.e., the encoder input bits at level $j = 2$.
The bit channels for a polar code with $N > 2$ are the effective channels experienced by the individual bits $u_i$ under successive cancellation decoding, where the true values of $u_0, \ldots, u_{i−1}$ are provided by a helpful genie regardless of any previous decoding error, and the bits $u_{i+1}, \ldots, u_{N−1}$ are treated as unknowns. The transition probabilities of the bit channel of $u_i$ with equilikely inputs are given as:

$$p(\mathbf{y}_i|u_i) = \frac{1}{2^{N−1}} \sum_{u_{i+1}, \ldots, u_{N−1}} p(\mathbf{y}|\mathbf{x}), \tag{4}$$

where the concise notation $\mathbf{y}_i = [\mathbf{y}, u_0, \ldots, u_{i−1}]^T$ is used for the bit channel output, and $\mathbf{x} = [x_0, x_1, \ldots, x_{N−1}]^T$ is the codeword.

Successive Cancellation List Decoder
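The SC decoder works on the structure of Figure 2 and estimates the bit values $u_i$ in a sequential manner from $i = 0$ to $N − 1$. At each decoding stage $i$, $\hat{u}_i$ is determined as:

$$\hat{u}_i = \begin{cases} 0 & \text{if } i \notin \mathcal{A}, \\ 0 & \text{if } i \in \mathcal{A} \text{ and } L_{u_i}(\mathbf{y}_i) \geq 0, \\ 1 & \text{otherwise}, \end{cases} \tag{5}$$

where $L_{u_i}(\mathbf{y}_i) = \log \frac{p(\mathbf{y}_i|u_i = 0)}{p(\mathbf{y}_i|u_i = 1)}$ is the LLR of the bit $u_i$, and $\mathcal{A}$ denotes the information set. When ties in the case of $L_{u_i}(\mathbf{y}_i) = 0$ are broken randomly, Equation (5) is a maximum likelihood decoding of the $i$th bit channel. The LLR $L_{u_i}(\mathbf{y}_i)$ is computed from subsequent levels in recursive steps as follows:

$$L_{v_{i,j}}(\mathbf{y}_{i,j}) = \begin{cases} L_{v_{i',j−1}}(\mathbf{y}_{i',j−1}) \boxplus L_{v_{i'',j−1}}(\mathbf{y}_{i'',j−1}) & \text{if } i \text{ is even}, \\ b_{i,j}\, L_{v_{i',j−1}}(\mathbf{y}_{i',j−1}) + L_{v_{i'',j−1}}(\mathbf{y}_{i'',j−1}) & \text{if } i \text{ is odd}, \end{cases} \tag{6}$$

with $i' = \lfloor i/2 \rfloor + \lfloor i/2^j \rfloor 2^{j−1}$, $i'' = i' + 2^{j−1}$ and $b_{i,j} = (−1)^{\hat{v}_{i−1,j}}$, where the bit value $\hat{v}_{i−1,j}$ is available from the previous decoding stages. $\mathbf{y}_{i,j}$ represents the output of the intermediate bit channel experienced by $v_{i,j}$. The operator $\boxplus$ for two LLR values $L_1$ and $L_2$ is defined as [32]:

$$L_1 \boxplus L_2 = \log \frac{1 + e^{L_1 + L_2}}{e^{L_1} + e^{L_2}}. \tag{7}$$

The LLR values $L_{v_{i,0}}(\mathbf{y}_{i,0})$ in the recursion are given by the channel LLRs:

$$L_{v_{i,0}}(\mathbf{y}_{i,0}) = \log \frac{p(y_i|x_i = 0)}{p(y_i|x_i = 1)}. \tag{8}$$

The successive cancellation list decoder can be viewed as $N_L$ successive cancellation decoders working in parallel. In SCL decoding, every time an estimate $\hat{u}_i$ for $i \in \mathcal{A}$ has to be made, the decoder proceeds as an SC decoder for both possible decisions of $\hat{u}_i$, instead of using Equation (5). The decoder maintains a list of possible decoding results where the number of decoding paths doubles after estimation of each information bit. If at any stage the number of decoding paths in the list exceeds the maximum allowed list size, $N_L$, the decoder retains only the $N_L$ most likely decoding paths, dropping the rest. The likelihood of path $l \in \{0, 1, \ldots, N_L − 1\}$ in the list at stage $i$ is captured by the path metric $M_{i,l}$ [27]:

$$M_{i,l} = M_{i−1,l} + \ln\left(1 + e^{−(1 − 2\hat{u}_{i,l})\, L_{u_i}(\mathbf{y}_i)_l}\right), \tag{9}$$

where $M_{i−1,l}$ is the metric of the $l$th path at decoding stage $i − 1$, $\hat{u}_{i,l}$ is the bit value with which the path is being extended, and $L_{u_i}(\mathbf{y}_i)_l$ is the LLR value $L_{v_{i,j=n}}(\mathbf{y}_{i,j=n})$ for the $l$th path according to Equation (6). The path with the smallest metric is the most likely decoding path. The path increment $\Delta M_{i,l} = \ln(1 + e^{−(1 − 2\hat{u}_{i,l}) L_{u_i}(\mathbf{y}_i)})$ is well-approximated as [27]:

$$\Delta M_{i,l} \approx \begin{cases} 0 & \text{if } \hat{u}_{i,l} = \frac{1}{2}\left(1 − \mathrm{sign}(L_{u_i}(\mathbf{y}_i))\right), \\ |L_{u_i}(\mathbf{y}_i)| & \text{otherwise}, \end{cases} \tag{10}$$

where $\mathrm{sign}(L_{u_i}(\mathbf{y}_i))$ denotes the sign of the LLR value $L_{u_i}(\mathbf{y}_i)$.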
After the last decoding stage, i.e., $i = N − 1$, the most likely decoding path from the list is selected as the decoder output. In the CRC-aided setting, a CRC checksum of $N_{crc}$ bits is appended after the $K$ information bits, and the $K + N_{crc}$ bits are then polar encoded into an $N$-bit codeword. The decoder output is then the most likely decoding path in the final list that passes the CRC check.
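The LLR operations used by the SC and SCL decoders can be illustrated on a single building block. The sketch below is not the paper's implementation: it assumes frozen bits are set to 0, writes the box-plus operator in an algebraically equivalent but numerically stable form, and uses hypothetical function names. It shows the check-node combination, the update with a known partial sum, the hard decision, and the path metric increment in both its exact and approximated form.

```python
import math

def boxplus(l1, l2):
    """Box-plus of two LLRs, i.e. log((1 + e^(l1+l2)) / (e^l1 + e^l2)), in a stable form."""
    return (math.copysign(1.0, l1) * math.copysign(1.0, l2) * min(abs(l1), abs(l2))
            + math.log1p(math.exp(-abs(l1 + l2)))
            - math.log1p(math.exp(-abs(l1 - l2))))

def lower_branch(l1, l2, u_hat):
    """LLR of the second bit given the decision u_hat on the first: (-1)^u_hat * l1 + l2."""
    return (1 - 2 * u_hat) * l1 + l2

def sc_decode_block(l0, l1, frozen=(False, False)):
    """SC decoding of one N = 2 building block from the two channel LLRs."""
    lu0 = boxplus(l0, l1)
    u0 = 0 if frozen[0] or lu0 >= 0 else 1
    lu1 = lower_branch(l0, l1, u0)
    u1 = 0 if frozen[1] or lu1 >= 0 else 1
    return u0, u1

def path_metric_update(metric, llr, u_hat, exact=False):
    """Extend a path metric with bit u_hat; smaller metrics mean more likely paths."""
    if exact:
        x = -(1 - 2 * u_hat) * llr
        # Stable evaluation of ln(1 + e^x).
        return metric + max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    hard = 0 if llr >= 0 else 1
    # Approximation: no penalty when u_hat agrees with the sign of the LLR.
    return metric if u_hat == hard else metric + abs(llr)

print(sc_decode_block(-1.0, 4.0))
```

An SCL decoder would call `path_metric_update` for both values of `u_hat` at every information bit and keep only the paths with the smallest metrics.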

Polar Code Construction
The process of determining the position of information bits in $\mathbf{u}$ is known as code construction. Ideally, the $K$ qualitatively best bit channels are included in the information set $\mathcal{A}$ during the code construction phase. The quality of the bit channels can be quantified in terms of their error probability or capacity, both of which can be computed from their transition probabilities. The probability of error, $P_e^i$, of the $i$th bit channel is computed as ([23], Equation (13)):

$$P_e^i = \frac{1}{2} \sum_{\mathbf{y}_i} \min\{p(\mathbf{y}_i|u_i = 0),\, p(\mathbf{y}_i|u_i = 1)\}, \tag{11}$$

while its (symmetric) capacity is given by the mutual information $I(U_i; \boldsymbol{Y}_i)$:

$$I(U_i; \boldsymbol{Y}_i) = \frac{1}{2} \sum_{u_i} \sum_{\mathbf{y}_i} p(\mathbf{y}_i|u_i) \log_2 \frac{p(\mathbf{y}_i|u_i)}{\frac{1}{2} p(\mathbf{y}_i|u_i = 0) + \frac{1}{2} p(\mathbf{y}_i|u_i = 1)}. \tag{12}$$

However, exact computations according to Equations (11) or (12) are not easy when the underlying channel has a continuous output alphabet, e.g., the AWGN channel. Arikan proposed in [13] to approximate the AWGN channel as a binary erasure channel with equivalent Bhattacharyya parameter, in which case bounds on the error probabilities of the bit channels in terms of their Bhattacharyya parameters can be computed efficiently.
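The binary erasure channel surrogate mentioned above admits a very short construction routine, sketched here under stated assumptions. For an erasure channel with Bhattacharyya parameter z, one polarization step yields 2z − z² for the worse (upper) channel and z² for the better (lower) channel. The helper names and the mapping from noise variance to an initial parameter, exp(−1/(2σ²)) for BPSK with unit-energy symbols, are assumptions of this sketch rather than details taken from the paper.

```python
import math

def awgn_bhattacharyya(sigma2):
    """Bhattacharyya parameter of BPSK (+1/-1) over AWGN with noise variance sigma2."""
    return math.exp(-1.0 / (2.0 * sigma2))

def bec_construction(n, z0, K):
    """Return (information set, Bhattacharyya parameters) for N = 2^n using the BEC recursion."""
    z = [z0]
    for _ in range(n):
        # Each channel splits into a worse (even index) and a better (odd index) channel.
        z = [val for zi in z for val in (2 * zi - zi * zi, zi * zi)]
    ranked = sorted(range(len(z)), key=lambda i: z[i])  # most reliable first
    return sorted(ranked[:K]), z

info_set, z = bec_construction(n=2, z0=0.5, K=2)
print(info_set, z)
```

The K indices with the smallest parameters form the information set; all remaining positions are frozen.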
Mori and Tanaka [22] showed that the error probability of a bit channel can be determined using density evolution since decoding at each stage i of the SC decoder can be treated as belief propagation on a cycle-free graph.However, precise implementation of the check node and variable node convolutions during the density evolution is challenging for a channel with continuous output, like the AWGN channel.Discretizing the continuous channel output using a quantizer also does not solve this problem since the output alphabet of bit channels is exponential in the codeword length N. Therefore, if the codeword length N is not too small, the sizes of the conditional probability mass functions p(y i |u i ) of the bit channels, required for capacity or probability of error computation, become impractically large.For example, if the 4-bit channel quantizer is used that quantizes the physical channel output into 16 bins, the output alphabet size of the bit channels of a polar code will be of the order of 16 8 even for a short codeword length of N = 256.The information bottleneck as well as Tal and Vardy's construction methods circumvent the exponential growth of the bit channel output alphabet by introducing quantization at each level of the polarization transform during the evolution of the conditional probability densities and restrict the output alphabet size to a small finite number.

Information Bottleneck Method
The information bottleneck method [3,4] is a generic clustering framework with its roots in the field of machine learning. Figure 3a visualizes the elementary setup of the framework. By designating a random variable $X$ as the relevant random variable, the principal aim of the information bottleneck method is to extract as much information about $X$ contained in an observation $Y$ as possible and capture this relevant information in a compression variable $T$. More precisely, a compression mapping $p(t|y)$ is determined, which maps the event space $\mathcal{Y}$ of the observed random variable onto the event space $\mathcal{T}$ of a compression variable, such that $|\mathcal{T}| \ll |\mathcal{Y}|$ while $I(X; T)$ is maximized. In other words, the framework aims to simultaneously minimize the compression information $I(Y; T)$ and maximize the relevant information $I(X; T)$. The realizations of $T$ are referred to as clusters, whose labels can be chosen arbitrarily. We label these clusters with unsigned integer values, i.e., $\mathcal{T} = \{0, 1, \ldots, |\mathcal{T}| − 1\}$. The so-called information bottleneck graph [33] provides a compact visualization of this principle in Figure 3b, where the observed random variable $Y$ is compressed into $T$, while retaining the relevant information about $X$. This problem can be posed as a Lagrangian optimization problem and solved by minimizing the so-called information bottleneck functional [3]

$$\mathcal{F}\{p(t|y)\} = I(Y; T) − \beta I(X; T). \tag{13}$$

The non-negative Lagrangian multiplier $\beta$ in (13) serves as a trade-off parameter between information preservation and compression. In the extreme cases, $\beta \to \infty$ corresponds to maximum information preservation, while $\beta \to 0$ corresponds to maximum compression. In this work, we are interested in obtaining the maximum relevant information $I(X; T)$, i.e., $\beta \to \infty$. The compression results from the choice of the (constrained) cardinality of $\mathcal{T}$, i.e., $|\mathcal{T}|$.
Note that after performing a mapping $p(t|y)$, any physical meaning contained in $Y$ is lost, i.e., the compression variable $T$ on its own has no direct meaning and belongs purely to an abstract domain. Hence, a coupling between the relevant variable $X$ and $T$ is needed, which is given by $p(x|t)$. Only with the distribution $p(x|t)$ can the meaning of a cluster index $t$ for a certain $x$ be recaptured.
Several information bottleneck algorithms exist which obtain the distributions $p(t|y)$, $p(x|t)$ and $p(t)$ from the joint distribution $p(x, y)$ [3,4]. It can be shown that choosing $\beta \to \infty$ yields a deterministic clustering $p(t|y) \in \{0, 1\}$ [4]. A deterministic clustering is convenient from an implementation perspective as it can be realized as a static lookup table. An algorithm optimized for deterministic mappings that is used in this work is the modified sequential information bottleneck algorithm from [1,2], whose Python implementation is available at [34].
The information bottleneck method is a generic framework in the sense that it can be used for quantization of an observed quantity in any problem where the joint distribution $p(x, y)$ of observation $y$ and relevant quantity $x$ is available. With the appropriate joint distribution $p(x, y)$ at hand, the framework has been used in the field of communication for channel estimation [35], to quantize the AWGN channel and the node operations of LDPC decoders [1,2,7], the bit channels of a polar code [25], as well as the decoding operations of SC and SCL decoding of polar codes [26]. Among these applications, we explain the AWGN channel quantization using the information bottleneck framework [1,2] as an illustrative example in the following. It is important to point out that such a channel quantizer is an integral part of the information bottleneck decoders designed for LDPC and polar codes, i.e., the decoders assume that the physical channel output is clustered into a finite number of bins by a quantizer designed using the information bottleneck method or another principle.
Let $x \in \{0, 1\}$ be a coded bit which is modulated into transmit symbols $s(x)$ using BPSK as $s(x) = 1 − 2x$. The symbol $s(x)$ is affected by AWGN with variance $\sigma_N^2$, resulting in the continuous channel output $\tilde{y}$. Then, for the channel quantizer design using the information bottleneck, the observed and the relevant quantities are the channel output $\tilde{y}$ and the transmitted code bit $x$, respectively. The compression variable is labeled $Y$ such that the quantized channel output is $y \in \mathcal{Y} = \{0, 1, \ldots, |\mathcal{Y}| − 1\}$. For the a priori distribution $p(x) = 1/2$, the joint distribution $p(x, \tilde{y})$ is given by:

$$p(x, \tilde{y}) = \frac{1}{2\sqrt{2\pi\sigma_N^2}} \exp\left(−\frac{|\tilde{y} − s(x)|^2}{2\sigma_N^2}\right). \tag{14}$$

The distribution $p(x, \tilde{y})$ is fed into the modified sequential information bottleneck algorithm [1,2], which determines the mapping $p(y|\tilde{y})$ that divides the AWGN channel output into $|\mathcal{Y}|$ clusters such that $I(X; Y)$ is maximized. Note that the selected algorithm works only with discrete random variables. Therefore, the continuous channel output is first discretized using a fine uniform resolution. Figure 4 illustrates the relevant information preserving clustering of the AWGN channel output into $|\mathcal{Y}| = 8$ clusters for noise variance $\sigma_N^2 = 0.5$. The algorithm randomly initializes the cluster boundaries in a symmetric manner, as exemplarily depicted in Figure 4a, and computes the mutual information $I(X; Y)$ for the resulting compressed joint distribution $p(x, y)$. The clusters are labeled such that the left-most cluster is $y = 0$, while the right-most cluster is $y = 7$. When the event space $\tilde{\mathcal{Y}}$ is sorted with respect to its LLRs $2\tilde{y}/\sigma_N^2$, as is the case in Figure 4, each cluster is a subset of contiguous elements of $\tilde{\mathcal{Y}}$. Then, the task of finding the desired clustering $p(y|\tilde{y})$ reduces to optimizing the cluster boundaries due to the separating hyperplane condition [5]. The algorithm, therefore, adjusts each cluster boundary such that $I(X; Y)$ is maximized and updates the distributions $p(x, y)$ and $p(y|\tilde{y})$. The algorithm stops when $I(X; Y)$ does not increase further and returns the mapping $p(y|\tilde{y})$ and the distributions $p(y)$ and $p(x|y)$. Figure 4b shows the optimized boundaries of the quantization bins or clusters, which assign all channel outputs $\tilde{y} < −0.99$ to cluster $y = 0$, $−0.99 \leq \tilde{y} < −0.56$ to cluster $y = 1$, and so on. The quantizer does not assign any real valued representative to each quantization bin but instead produces the index $y$ of the cluster in which $\tilde{y}$ falls. Should the quantizer be used with a decoder that requires LLRs as input, the cluster indices $y$ can be translated into channel level LLRs as:

$$L_x(y) = \log \frac{p(y|x = 0)}{p(y|x = 1)} = \log \frac{p(x = 0|y)}{p(x = 1|y)}, \tag{15}$$

where the distribution $p(x|y)$ is delivered by the information bottleneck algorithm along with the quantizer mapping $p(y|\tilde{y})$. The second equality in Equation (15) holds due to the equiprobable input assumption, i.e., $p(x) = 0.5$.
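To make the quantities in this example concrete, the following sketch evaluates a fixed set of cluster boundaries rather than optimizing them; it is not the modified sequential information bottleneck algorithm of [1,2]. It assumes BPSK with s(x) = 1 − 2x and equiprobable bits as in the text, integrates the Gaussian densities over each cluster to obtain p(x, y), and then computes I(X; Y) and the LLR translation of the cluster indices. The uniform boundaries and all function names are hypothetical choices for illustration.

```python
import math

def gauss_cdf(v, mean, sigma):
    """Cumulative distribution function of a Gaussian with the given mean and std."""
    return 0.5 * (1.0 + math.erf((v - mean) / (sigma * math.sqrt(2.0))))

def quantized_joint(boundaries, sigma2):
    """p(x, y) as joint[x][y] for interior cluster boundaries sorted in ascending order."""
    sigma = math.sqrt(sigma2)
    edges = [-math.inf] + list(boundaries) + [math.inf]
    joint = []
    for x in (0, 1):
        mean = 1 - 2 * x  # BPSK symbol s(x)
        joint.append([0.5 * (gauss_cdf(hi, mean, sigma) - gauss_cdf(lo, mean, sigma))
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return joint

def mutual_information(joint):
    """I(X; Y) in bits from the joint distribution joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (px[x] * py[y]))
               for x, row in enumerate(joint) for y, p in enumerate(row) if p > 0)

def llr_table(joint):
    """Translate each cluster index y into log p(x=0|y) / p(x=1|y)."""
    return [math.log(p0 / p1) for p0, p1 in zip(joint[0], joint[1])]

# Hypothetical, non-optimized boundaries for 8 clusters at noise variance 0.5.
boundaries = [-0.9, -0.6, -0.3, 0.0, 0.3, 0.6, 0.9]
joint = quantized_joint(boundaries, sigma2=0.5)
print(mutual_information(joint), llr_table(joint))
```

An information bottleneck algorithm would move the boundaries until `mutual_information` stops increasing; the resulting `llr_table` then plays the role of the translation from cluster indices to channel LLRs.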

Polar Code Construction Using the Information Bottleneck Method
This section describes the construction of polar codes using the information bottleneck method and compares it to the closely related construction method proposed by Tal and Vardy. Thanks to the recursive structure of polar codes, describing the construction steps of each method for a single building block, i.e., Figure 1, is sufficient.

Information Bottleneck Construction
The information bottleneck construction method starts by quantizing the AWGN channel using the quantizer of Section 2.4. The quantizer is designed for a target noise variance $\sigma_N^2$ in Equation (14), computed from the design $E_b/N_0$ of the polar code. As a result, the quantized AWGN channel becomes a binary input discrete memoryless channel with output alphabet size $|\mathcal{Y}|$ and transition probabilities $p(y|x)$ in the building block of Figure 1. The output alphabets of the bit channels of $u_0$ and $u_1$ have sizes $|\mathcal{Y}|^2$ and $2 \cdot |\mathcal{Y}|^2$, respectively. The goal here is to quantize both of the bit channels using the information bottleneck method and reduce their output alphabets to a desired size $|\mathcal{T}|$, which is chosen to be a power of 2.
Initially, the two bit channels defined over the building block of Figure 1 are cast into the information bottleneck setup of Figure 3a. For the bit channel of $u_i$, with $i \in \{0, 1\}$, the output $\mathbf{y}_i$ is the observation which we use to infer the input $u_i$. Therefore, the multivariate random variable $\boldsymbol{Y}_i$ is the observed random variable, while the bit channel input $U_i$ is the relevant random variable. Our goal here is to cluster the realizations of the observed variable $\boldsymbol{Y}_i$ into a new compressed random variable $T_i$ with realizations $t_i \in \{0, 1, \ldots, |\mathcal{T}| − 1\}$, while keeping the quantization loss in terms of mutual information, $I(U_i; \boldsymbol{Y}_i) − I(U_i; T_i)$, as small as possible. The information bottleneck graphs for quantization of the two bit channels are depicted in Figure 5. An information bottleneck algorithm requires the joint distribution $p(u_i, \mathbf{y}_i)$ relating the relevant and the observed random variables. For the bit channel of $u_0$, we have:

$$p(u_0, \mathbf{y}_0) = p(u_0, y_0, y_1) = \sum_{u_1} p(u_0, u_1, y_0, y_1), \tag{16}$$

where, from the factor graph in Figure 1:

$$p(u_0, u_1, y_0, y_1) = p(u_0)\, p(u_1) \sum_{x_0, x_1} p(x_0|u_0, u_1)\, p(x_1|u_1)\, p(y_0|x_0)\, p(y_1|x_1). \tag{17}$$

Here, $p(y_i|x_i)$ and $p(u_i) = 0.5$ for $i \in \{0, 1\}$ are the channel transition probabilities and priors, respectively, while $p(x_0|u_0, u_1) \in \{0, 1\}$ and $p(x_1|u_1) \in \{0, 1\}$ are deterministic mappings according to $x_0 = u_0 \oplus u_1$ and $x_1 = u_1$, respectively. Applying the selected information bottleneck algorithm to the joint distribution $p(u_0, \mathbf{y}_0)$ returns the distributions $p(u_0|t_0)$ and $p(t_0)$, as well as the clustering $p(t_0|\mathbf{y}_0)$, which maps the observed sequence $\mathbf{y}_0$ onto a cluster index $t_0 \in \mathcal{T}$. Hence, the $|\mathcal{Y}|^2$ outputs of the bit channel are quantized into $|\mathcal{T}|$ clusters. The joint distribution of the bit channel of $u_1$ equals:

$$p(u_1, \mathbf{y}_1) = p(u_1, y_0, y_1, u_0) = p(u_0, u_1, y_0, y_1), \tag{18}$$

with the right-hand side given by Equation (17). The information bottleneck algorithm produces the distributions $p(u_1|t_1)$, $p(t_1)$ and $p(t_1|\mathbf{y}_1)$ for the joint distribution of Equation (18). The mapping $p(t_1|\mathbf{y}_1)$ clusters the $2 \cdot |\mathcal{Y}|^2$ outputs of the bit channel into $|\mathcal{T}|$ clusters. Thus, the output alphabet of both bit channels is compressed to size $|\mathcal{T}|$. Further, the compressed joint distributions $p(u_i, t_i) = p(u_i|t_i)\, p(t_i)$ also facilitate computing the capacities of the bit channels.
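The joint distributions that serve as input to the information bottleneck algorithm can be assembled directly from the channel transition probabilities. The sketch below assumes uniform priors and the mappings x0 = u0 ⊕ u1, x1 = u1 from the text; the clustering step itself is not included, and the function names and the binary symmetric channel used as input are illustrative assumptions. As a sanity check, it computes the mutual information of both unquantized bit channels, which should sum to twice the mutual information of the underlying channel.

```python
import numpy as np

def mutual_information(joint):
    """I(U; Y) in bits from a joint pmf whose first axis is the relevant variable."""
    joint = joint.reshape(joint.shape[0], -1)
    pu = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pu @ py)[mask])))

def building_block_joints(p_y_given_x):
    """Return p(u0, y0, y1) and p(u1, y0, y1, u0) for a channel given as W[x, y]."""
    W = np.asarray(p_y_given_x, dtype=float)
    M = W.shape[1]
    joint1 = np.zeros((2, M, M, 2))  # axes: u1, y0, y1, u0
    for u0 in (0, 1):
        for u1 in (0, 1):
            # p(u0) p(u1) p(y0 | x0 = u0 xor u1) p(y1 | x1 = u1)
            joint1[u1, :, :, u0] = 0.25 * np.outer(W[u0 ^ u1], W[u1])
    # Marginalize u1, then reorder the axes (y0, y1, u0) -> (u0, y0, y1).
    joint0 = joint1.sum(axis=0).transpose(2, 0, 1)
    return joint0, joint1

# Illustrative input: binary symmetric channel with crossover probability 0.1.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
joint0, joint1 = building_block_joints(W)
I = mutual_information(0.5 * W)
I0, I1 = mutual_information(joint0), mutual_information(joint1)
print(I0, I, I1)
```

In the construction described here, `joint0` and `joint1` would each be passed to the information bottleneck algorithm, and the resulting compressed distributions would replace `W` at the next level.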
For the construction of a polar code, this treatment is applied to all the building blocks in the structure of the code, starting at level $j = 1$ and progressing step by step to higher levels. In our running example of Figure 2, the aforementioned procedure produces quantized bit channels at level $j = 1$. Specifically, the quantized bit channels of the intermediate variable nodes labeled $v_{0,1}$ and $v_{2,1}$ are identical to the one obtained for $u_0$ of Figure 1, i.e., $p(v_{0,1}, t_{0,1}) = p(v_{2,1}, t_{2,1}) = p(u_0, t_0)$. Moreover, the quantized bit channels of the intermediate variable nodes labeled $v_{1,1}$ and $v_{3,1}$ are identical to the one obtained for $u_1$ of Figure 1, i.e., $p(v_{1,1}, t_{1,1}) = p(v_{3,1}, t_{3,1}) = p(u_1, t_1)$. At level $j = 2$, the bit channel of $v_{0,2} = u_0$ now has $t_{0,1}$ and $t_{2,1}$ as outputs, with an output alphabet of size $|\mathcal{T}|^2$. This bit channel is quantized using the selected information bottleneck algorithm, which provides the distributions $p(u_0|t_{0,2})$, $p(t_{0,2})$ and $p(t_{0,2}|t_{0,1}, t_{2,1})$ from the input joint distribution $p(u_0, t_{0,1}, t_{2,1})$, compressing the output alphabet to $|\mathcal{T}|$ elements (recall that in Figure 2, $v_{0,2} = u_0$). The input joint distribution $p(u_0, t_{0,1}, t_{2,1})$ for the algorithm is computed from Equations (16) and (17) with appropriate replacements, e.g., by replacing $p(y_0|x_0) = p(y_1|x_1)$, $x_0$ and $x_1$ with $p(t_{0,1}|v_{0,1})$, $v_{0,1}$ and $v_{2,1}$, respectively. The joint distribution $p(u_1, t_{0,1}, t_{2,1}, u_0)$ of the bit channel of $u_1$ in Figure 2 can be obtained similarly from Equations (17) and (18) with appropriate replacements, and quantized using the information bottleneck algorithm. Figure 6 shows the resulting information bottleneck graph for the polar code of Figure 2, obtained after quantizing all the building blocks, where the output alphabet of all the bit channels is restricted to $|\mathcal{T}|$. Consequently, the compressed joint distribution $p(u_i, t_{i,2})$ for each bit channel of the code (level $j = n$) can be used to compute its capacity using Equation (12), and thus to construct the information set $\mathcal{A}$ [25].
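The last step, ranking the bit channels by the capacity of their compressed joint distributions, can be sketched as follows. The sketch assumes that the capacity of a bit channel is evaluated as the mutual information $I(U_i; T_i)$ of its compressed joint distribution; the helper names and the toy joint distributions with $|\mathcal{T}| = 2$ are hypothetical and do not stem from an actual density evolution.

```python
from math import log2


def mutual_information(joint):
    """I(U;T) in bits for a joint distribution given as joint[u][t]."""
    p_u = [sum(row) for row in joint]
    p_t = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for u, row in enumerate(joint):
        for t, p in enumerate(row):
            if p > 0.0:
                mi += p * log2(p / (p_u[u] * p_t[t]))
    return mi


def information_set(capacities, num_info_bits):
    """Indices of the num_info_bits bit channels with the largest capacities."""
    ranked = sorted(range(len(capacities)),
                    key=lambda i: capacities[i], reverse=True)
    return sorted(ranked[:num_info_bits])


def toy_joint(q):
    """Hypothetical compressed joint p(u, t) with |T| = 2 and error rate q."""
    return [[0.5 * (1 - q), 0.5 * q], [0.5 * q, 0.5 * (1 - q)]]


# Four made-up bit channels of an N = 4 code, ordered by index i.
capacities = [mutual_information(toy_joint(q)) for q in (0.4, 0.2, 0.15, 0.01)]
info_set = information_set(capacities, 3)
```

For these toy values the least capable channel is the one with index 0, so it is the only one left out of the information set.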

Tal and Vardy's Construction
The key step of Tal and Vardy's construction method is the degrading procedure, which merges elements of the output alphabet of a binary-input discrete memoryless symmetric channel. Assume that two elements of the output alphabet $\mathcal{Y}$ of the channel characterized by $p(y|x)$ in Figure 1 are merged into a single element. Such a merging always causes a loss in the capacity of the channel amounting to $\delta I \geq 0$. Tal and Vardy's degrading merge procedure chooses the elements to be merged greedily: the pair of elements to be merged is selected such that their merging causes the smallest loss $\delta I$. Such a degrading merge is performed repeatedly until the output alphabet size of the channel is reduced to the desired value of $|\mathcal{T}|$. The bit channels synthesized from such a channel using Equations (2) and (3) will have output alphabet sizes of $|\mathcal{T}|^2$ and $2|\mathcal{T}|^2$, respectively. The aforementioned degrading merge is applied to each bit channel such that their output alphabet sizes are reduced to $|\mathcal{T}|$. For constructing a polar code with $N > 2$, the degrading merge is used after application of the polarization transform at each level, and the output alphabet size of the (intermediate) bit channels is kept restricted to $|\mathcal{T}|$. Equation (11) is then used to compute the error probability of each degraded bit channel, which serves as an upper bound on the error probability of the unquantized bit channel.
Tal and Vardy performed the degrading merge in an efficient way by exploiting the symmetry of the channel. Firstly, the elements of $\mathcal{Y}$ are sorted with respect to their LLRs. In the sorted output alphabet, only pairs of consecutive elements need to be considered for merging ([23], Theorem 8). Secondly, due to the channel symmetry, for each $y \in \mathcal{Y}$ there exists a $\bar{y} \in \mathcal{Y}$, referred to as the conjugate of $y$, such that $p(y|x = 0) = p(\bar{y}|x = 1)$. This fact is exploited in the sense that computations for mergers are done for only half of the channel output alphabet, $\mathcal{Y}^+ \subset \mathcal{Y}$, which contains one element of each conjugate pair. When a pair $y_a, y_b \in \mathcal{Y}^+$ is merged into $y_{ab}$, their conjugates $\bar{y}_a, \bar{y}_b \in \mathcal{Y} \setminus \mathcal{Y}^+$ are also merged into $\bar{y}_{ab}$.
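The greedy principle behind the degrading merge can be sketched in a few lines. The following is a simplified illustration only: it sorts the outputs by LLR and repeatedly merges the neighbouring pair with the smallest loss of mutual information, but it neither exploits the conjugate symmetry nor uses the efficient data structures of ([23], Algorithm C). The function names and the six-output toy channel are assumptions.

```python
from math import log, log2


def info_term(p0, p1):
    """Contribution of one output symbol to I(X;Y) in bits, for p(x) = 0.5.

    p0 = p(x=0, y) and p1 = p(x=1, y) are joint probabilities.
    """
    total = p0 + p1
    return sum(p * log2(p / (0.5 * total)) for p in (p0, p1) if p > 0.0)


def llr(p0, p1):
    """Log-likelihood ratio of an output symbol, with infinities for zeros."""
    if p1 == 0.0:
        return float("inf")
    if p0 == 0.0:
        return float("-inf")
    return log(p0 / p1)


def greedy_degrading_merge(columns, target_size):
    """columns: list of (p(x=0,y), p(x=1,y)) pairs, one per output symbol y."""
    cols = sorted(columns, key=lambda c: llr(*c))  # sort outputs by LLR
    while len(cols) > target_size:
        # Loss caused by merging each pair of neighbours in the sorted alphabet.
        losses = [
            info_term(*cols[k]) + info_term(*cols[k + 1])
            - info_term(cols[k][0] + cols[k + 1][0],
                        cols[k][1] + cols[k + 1][1])
            for k in range(len(cols) - 1)
        ]
        k = min(range(len(losses)), key=losses.__getitem__)
        merged = (cols[k][0] + cols[k + 1][0], cols[k][1] + cols[k + 1][1])
        cols[k:k + 2] = [merged]  # replace the pair by the merged symbol
    return cols


# Toy symmetric channel with six outputs (assumption), equiprobable input.
p_y_given_0 = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
columns = [(0.5 * a, 0.5 * b)
           for a, b in zip(p_y_given_0, reversed(p_y_given_0))]
merged = greedy_degrading_merge(columns, 4)
```

Summing `info_term` over the columns before and after the merge gives the mutual information of the original and the degraded channel, so the accumulated loss can be read off directly.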
The final code construction algorithm computes the upper bounds on the error probabilities of the bit channels using the degrading merge procedure and also computes the Bhattacharyya parameter for each bit channel. The smaller of the two is then considered for including a bit channel in the information set $\mathcal{A}$. In order to track the inaccuracy introduced by the quantization, another approximation, namely the upgrading merge, is used. For every bit channel which is quantized using the degrading merge, the upgrading merge synthesizes another bit channel with alphabet size $|\mathcal{T}|$ which is upgraded, i.e., its probability of error is equal to or smaller than that of the channel under consideration. The error probability of the upgraded channel provides a lower bound on the probability of error of a bit channel. The value of $|\mathcal{T}|$ is then selected such that the two bounds are very close to each other. When such a value for $|\mathcal{T}|$ is found, the upgrading merge can be discarded. $|\mathcal{T}| = 256$ or $512$ was shown to suffice ([23], Figure 2 and Table I).
For the binary-input AWGN channel $p(\tilde{y}|x)$, the continuous output $\tilde{y}$ is first discretized with a fine resolution, $|\mathcal{T}_{ch}| \gg |\mathcal{T}|$. However, the bit channels synthesized from this finely quantized channel have their output alphabets reduced to $|\mathcal{T}|$ using the aforementioned degrading merge procedure. Therefore, the code construction at higher levels $j > 1$ in the code structure for $N > 2$ proceeds as explained for channels with discrete output.

Information Bottleneck vs. Tal and Vardy Construction
In this section, polar code construction using the information bottleneck method and the Tal and Vardy method are compared. The working principle behind both construction methods is the same: compress the output alphabet of the bit channels at each level $j$ while keeping the loss of mutual information caused by the compression as small as possible. In fact, the degrading merge procedure of Tal and Vardy's construction ([23], Algorithm C) is known as the modified agglomerative information bottleneck algorithm in the information bottleneck literature [36]. However, since Tal and Vardy's construction method also uses the Bhattacharyya parameter for deciding whether a bit channel is frozen, it is a fusion of the information bottleneck method and Arikan's approximation [13]. Some further differences between the two methods are as follows.
The rule used to choose the quantized alphabet size $|\mathcal{T}|$ in Tal and Vardy's method is different from that used in the information bottleneck method. Tal and Vardy's method chooses $|\mathcal{T}|$ such that the bounds obtained from the degrading and upgrading merges are very close, and thus the values $|\mathcal{T}| = 256$ or $512$ are suggested. On the other hand, the choice of $|\mathcal{T}|$ in the information bottleneck method is driven by tracking the amount of relevant mutual information retained during the density evolution across the levels of a polar code. For that purpose, the rate of preserved mutual information for quantizing the $i$th (intermediate) bit channel at a level $j$ is defined as $R_{i,j} = I(V_{i,j}; T_{i,j}) / I(V_{i,j}; \mathbf{Y}_{i,j})$, where $I(V_{i,j}; T_{i,j})$ is the preserved mutual information after compression and $I(V_{i,j}; \mathbf{Y}_{i,j})$ is the mutual information before the compression. Then, the rate of mutual information preservation $R_j$ for the $j$th level is defined as the mean of the $R_{i,j}$ values. For a codeword length of $N = 2^n$, the overall rate of information preservation is computed as a cumulative product, $R_{cum}$:

$$R_{cum} = \prod_{j=1}^{n} R_j. \tag{20}$$

Figure 7 shows the evolution of $R_{cum}$ over the levels in the polar code structure for different compressed cardinalities $|\mathcal{T}|$. For a polar code of length $N = 512$, as few as 16 clusters already preserve approximately 98% of the relevant information over all levels. Increasing the number of clusters further to 32 yields only a small gain, i.e., 99% of the relevant information is preserved. Hence, the information bottleneck construction method is computationally more efficient than Tal and Vardy's method due to the use of small values of $|\mathcal{T}|$.
The information bottleneck construction method benefits from the availability of different algorithms and hence allows choosing an algorithm according to the problem at hand, e.g., an information bottleneck algorithm with a nonbinary relevant variable can be chosen for constructing nonbinary polar codes.
Another difference between the two construction methods is that while Tal and Vardy's method incorporates the channel quantizer into the merging operations at level $j = 1$, the information bottleneck method separates the channel quantizer design from the code structure. This adds flexibility to the information bottleneck construction method and facilitates the construction of polar codes for other channel models or modulation schemes. For instance, by replacing the channel quantizer of Section 2.4 with the one from [37], i.e., one designed using the information bottleneck method for an AWGN channel with 16-QAM input symbols, the construction method can be adapted for constructing a polar code for a higher-order modulation scheme.
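The bookkeeping behind the cumulative preservation rate $R_{cum}$ is simple enough to state in code. In the sketch below, the per-channel rates $R_{i,j}$ are made-up numbers, the function name is hypothetical, and it is an assumption that the product runs over the levels $j = 1, \ldots, n$ of the code structure only.

```python
def cumulative_preservation(rates_per_level):
    """Evolution of R_cum over the levels of the code structure.

    rates_per_level[k] lists the rates R_{i,j} of all bit channels i at
    level j = k + 1.  Returns R_cum after each level.
    """
    r_cum, history = 1.0, []
    for rates in rates_per_level:
        r_level = sum(rates) / len(rates)  # R_j: mean over the level
        r_cum *= r_level                   # cumulative product over levels
        history.append(r_cum)
    return history


# Made-up preservation rates for two levels with two bit channels each.
history = cumulative_preservation([[0.99, 0.995], [0.98, 1.0]])
```

The last entry of the returned list is the overall rate of information preservation, and the full list corresponds to one curve of the kind plotted over the levels for a given $|\mathcal{T}|$.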
Lastly, the Tal and Vardy method does not fully exploit the benefits of the discrete density evolution performed for code construction, i.e., the intermediate results could be used for decoding. The benefits of applying the information bottleneck method to polar codes go beyond the code construction. For each variable node $v_{i,j}$, $0 \leq i < N$ and $1 \leq j \leq n$, in Figure 2, the distributions $p(t_{i,j}|\mathbf{y}_{i,j})$ and $p(v_{i,j}|t_{i,j})$ are obtained. Here, $\mathbf{y}_{i,j}$ represents the output of the virtual bit channel experienced by $v_{i,j}$, and $t_{i,j}$ is the index of the cluster into which $\mathbf{y}_{i,j}$ falls. With appropriate bookkeeping, we save all the work done during the discrete density evolution for computing these distributions and use them for coarsely quantized decoding, as shown in the next section.

Information Bottleneck Polar Decoders
In this section, we show how the results of polar code construction using the information bottleneck method can also be used for coarsely quantized SC and SCL decoding [26]. Note that the lookup tables used for decoding in the following section are generated offline for a single design $E_b/N_0$ and are used mismatched over the whole operating $E_b/N_0$ range. First, decoding on a single building block is demonstrated, followed by a demonstration of SCL decoding on the code of Figure 2.

Lookup Tables for Decoding on a Building Block
In Section 3.1, the distributions $p(t_i|\mathbf{y}_i)$ and $p(u_i|t_i)$, $i \in \{0, 1\}$, were obtained from $p(u_i, \mathbf{y}_i)$ using the information bottleneck method on the building block of Figure 1. The joint distribution $p(u_i, \mathbf{y}_i)$ of a bit channel embeds the relation of its output $\mathbf{y}_i$ to its input $u_i$, i.e., knowing $\mathbf{y}_i$, one can determine whether $u_i = 0$ or $u_i = 1$ is more probable. By ensuring that $I(U_i; T_i) \approx I(U_i; \mathbf{Y}_i)$, the information bottleneck method strives to preserve this relation when compressing and mapping the event space of $\mathbf{Y}_i$ onto that of $T_i$. Therefore, knowing the value of $T_i = t_i$ (i.e., the label of the cluster to which the channel output belongs) after using the mapping $p(t_i|\mathbf{y}_i)$, we can determine the most likely value of the channel input.
The deterministic mapping $p(t_i|\mathbf{y}_i)$ in tabular form is referred to as the decoding table for the bit $u_i$, e.g., Figure 8a shows the decoding table for $u_0$. The decoding table of $u_0$ is essentially a quantizer of the bit channel $p(y_0, y_1|u_0)$ which produces the cluster index $t_0 \in \mathcal{T}$ when the outputs of the underlying AWGN channel are put in clusters $y_0$ and $y_1$ by the channel quantizer. Typically, the cluster index alone does not provide any meaning for the relevant quantity. However, due to the symmetry of the bit channel, the knowledge of $t_0$ can be used for a hard decision on $u_0$ since half of the $|\mathcal{T}|$ clusters correspond to $u_0 = 0$ and the remaining half to $u_0 = 1$. In order to perform soft decision estimation, the conditional distribution $p(u_0|t_0)$ is used to compute the LLR $L_{u_0}(t_0)$ as:

$$L_{u_0}(t_0) = \ln \frac{p(t_0|u_0 = 0)}{p(t_0|u_0 = 1)} = \ln \frac{p(u_0 = 0|t_0)}{p(u_0 = 1|t_0)}, \tag{21}$$

where the second equality in Equation (21) is due to the equiprobable input, i.e., $p(u_0) = 0.5$. Equation (21) is tabulated with exemplary values in Figure 8b and referred to as the translation table since it translates the cluster indices $t_0 \in \{0, 1, \ldots, |\mathcal{T}| - 1\}$ to LLRs $L_{u_0}(t_0)$. It can be seen in the translation table of Figure 8b that $\hat{u}_0 = 1$ is most likely for all the bit channel outputs $\mathbf{y}_0$ put in the cluster $t_0 = 0$, i.e., the largest LLR magnitude with a negative sign. When the decoding table lookup produces $t_0 = 1$, $\hat{u}_0 = 1$ is still more probable but with less confidence, i.e., a smaller LLR magnitude. The lookup tables of Figure 8 can be used to decode $u_0$ as follows:
1. Use the decoding table $p(t_0|\mathbf{y}_0)$ of Figure 8a to determine the cluster index to which the observed channel output $\mathbf{y}_0$ belongs. For example, $t_0 = |\mathcal{T}| - 1$ when $y_0 = y_1 = 0$.
2. Use $t_0$ from step 1 for a hard decision on $u_0$ or translate it into an LLR value using the translation table of Figure 8b. For the example of $t_0 = |\mathcal{T}| - 1$, $\hat{u}_0 = 0$ and $L_{u_0}(t_0) = 2.19$.
The bit $u_1$ can also be decoded in a similar fashion using its decoding and translation tables $p(t_1|\mathbf{y}_1)$ and $p(u_1|t_1)$, respectively. Similarly to conventional SC decoding, the decoding table of $u_1$ assumes the knowledge of $u_0$ from the previous decoding stage. Thus, the decoding on a building block can be done using lookup tables instead of computations according to Equations (6a) and (6b). The decoding tables $p(t_0|\mathbf{y}_0)$ and $p(t_1|\mathbf{y}_1)$ have $|\mathcal{Y}|^2$ and $2 \cdot |\mathcal{Y}|^2$ integer-valued entries, respectively. Both translation tables have only $|\mathcal{T}|$ real-valued entries, i.e., LLRs.
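The two-step lookup procedure for $u_0$ can be mimicked with plain arrays. The sketch below uses $|\mathcal{Y}| = |\mathcal{T}| = 4$; both the decoding table and the conditional distribution $p(u_0|t_0)$ are hand-made, hypothetical values rather than the output of an information bottleneck algorithm, and the function names are assumptions.

```python
from math import log


def translation_table(p_u_given_t):
    """LLR per cluster index; p_u_given_t[t] = (p(u=0|t), p(u=1|t))."""
    return [log(p0 / p1) for p0, p1 in p_u_given_t]


def decode_u0(decoding_table, llr_table, y0, y1):
    """Decode u0 from the two quantized channel outputs by table lookup."""
    t0 = decoding_table[y0][y1]            # step 1: integer lookup only
    llr = llr_table[t0]                    # step 2: translate index to LLR
    hard_decision = 0 if llr >= 0 else 1   # sign of the LLR decides the bit
    return t0, hard_decision, llr


# Hypothetical decoding table p(t0 | y0, y1), indexed as [y0][y1].
decoding_table = [
    [3, 2, 1, 0],
    [2, 2, 1, 1],
    [1, 1, 2, 2],
    [0, 1, 2, 3],
]
# Hypothetical p(u0 | t0); cluster 0 favours u0 = 1 most strongly.
llr_table = translation_table([(0.1, 0.9), (0.4, 0.6), (0.6, 0.4), (0.9, 0.1)])

t0, u0_hat, llr_u0 = decode_u0(decoding_table, llr_table, 0, 0)
```

No arithmetic on soft values is needed until the LLR is actually requested; inside a larger decoder only the integer `t0` would be passed on to the next lookup.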

Information Bottleneck Successive Cancellation List Decoder
Having replaced the decoding steps on a single building block with table lookups, we proceed to demonstrate successive cancellation list decoding on the code of Figure 2 using lookup tables. For each variable node $v_{i,j}$, $0 \leq i < N$ and $1 \leq j \leq n$, in Figure 2, the decoding table $p(t_{i,j}|\mathbf{y}_{i,j})$ and the translation table $p(v_{i,j}|t_{i,j})$ are available from the application of the information bottleneck framework to the whole structure of Figure 2. Assume that the decoding list size is $N_L = 2$ and $\mathcal{A} = \{1, 2, 3\}$, i.e., only $u_0 = 0$ is frozen.
In the next decoding stage, the first information bit $u_1$ is decoded. The cluster index $t_{1,2}$ is obtained from the decoding table $p(t_{1,2}|\mathbf{y}_{1,2}) = p(t_{1,2}|t_{0,1}, t_{2,1}, u_0)$, for which all three inputs are known from the previous decoding stage. The translation table $p(u_1|t_{1,2})$ is used to translate $t_{1,2}$ to the LLR $L_{u_1}(t_1)$. Instead of using Equation (5) for the information bit, the decoder extends the existing decoding path with both possible decisions $\hat{u}_1 = 0$ and $\hat{u}_1 = 1$ and updates the path metrics $M_{1,l}$, $l \in \{0, 1\}$, for both paths using $L_{u_1}(t_1)$ in Equation (9). Similar steps are taken to estimate the remaining bits, i.e., $u_2$ and $u_3$. At each decoding stage $i \in \mathcal{A}$, the number of decoding paths in the list doubles, and for $i \geq 2$, the decoding list size will exceed the maximum allowed value $N_L = 2$. Therefore, the $N_L$ decoding paths with the largest path metrics are dropped from the decoding list.
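The path extension and pruning described above can be sketched as follows. Equations (9) and (10) are not reproduced in this section, so the two update rules below assume the commonly used LLR-based path metric and its hardware-friendly approximation; all function names and the LLR values in the example are hypothetical.

```python
from math import exp, log


def exact_update(metric, llr, u_hat):
    """Assumed form of the exact LLR-based path metric update."""
    return metric + log(1.0 + exp(-(1 - 2 * u_hat) * llr))


def approx_update(metric, llr, u_hat):
    """Assumed approximation: penalize only decisions against the LLR sign."""
    hard_decision = 0 if llr >= 0 else 1
    return metric if u_hat == hard_decision else metric + abs(llr)


def extend_and_prune(paths, llrs, list_size, update=exact_update):
    """Extend every path with both bit decisions, keep the best list_size.

    paths: list of (bits, metric); llrs[l] is the translated LLR of path l.
    """
    candidates = []
    for (bits, metric), llr in zip(paths, llrs):
        for u_hat in (0, 1):
            candidates.append((bits + [u_hat], update(metric, llr, u_hat)))
    candidates.sort(key=lambda c: c[1])  # smaller metric = more likely path
    return candidates[:list_size]        # drop the paths with largest metrics


# u0 is frozen to 0, so decoding starts with a single path of metric 0.
paths = [([0], 0.0)]
paths = extend_and_prune(paths, [1.5], list_size=2)        # stage i = 1
paths = extend_and_prune(paths, [-0.7, 2.0], list_size=2)  # stage i = 2
```

In an information bottleneck decoder the values in `llrs` would come from translation table lookups at the decision level; everything else in the decoder operates on integer cluster indices.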
The schedule of computations in the lookup table-based decoder is the same as in a conventional SCL decoder. However, all the computations, except the path metric update, are replaced by table lookups. Further, the inputs and outputs of all the decoding tables are unsigned integers, i.e., $0, 1, \ldots, |\mathcal{T}| - 1$, with $|\mathcal{T}|$ being a small number. The translation tables hold LLR values, i.e., real numbers. For a polar code of length $N$, $2N - 2$ distinct decoding tables are required. It is important to note that the translation tables are required only at the decision level, i.e., $j = n$, in the information bottleneck SCL decoders, when an LLR value is needed for Equation (9). Therefore, we require only $N$ translation tables. Moreover, at any level $j$, every decoding table at an even stage $i$ requires $|\mathcal{T}|^2 \cdot Q_t$ bits of memory, while each decoding table at an odd stage $i$ requires $2 \cdot |\mathcal{T}|^2 \cdot Q_t$ bits of memory, where $Q_t = \log_2 |\mathcal{T}|$ is the number of bits required to represent a cluster index $t_{i,j}$. If the LLR values in the translation tables are stored using a resolution of $Q_{LLR}$ bits, the $N$ translation tables will consume a memory of $N \cdot |\mathcal{T}| \cdot Q_{LLR}$ bits.
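These memory figures are easily evaluated for concrete parameters. The sketch below makes two assumptions that are not stated in the text: the $2N - 2$ distinct decoding tables are split equally between even and odd stages, and the tables at level $j = 1$ also have inputs of cardinality $|\mathcal{T}|$, i.e., $|\mathcal{Y}| = |\mathcal{T}|$. The function name is hypothetical.

```python
from math import ceil, log2


def decoder_memory_bits(n_code, t_card, q_llr):
    """Memory of the decoding and translation tables in bits.

    n_code: codeword length N, t_card: |T|, q_llr: bits per stored LLR.
    """
    q_t = ceil(log2(t_card))     # bits per cluster index
    num_tables = 2 * n_code - 2  # distinct decoding tables
    even = odd = num_tables // 2  # assumption: equal split over stage parity
    decoding = even * t_card ** 2 * q_t + odd * 2 * t_card ** 2 * q_t
    translation = n_code * t_card * q_llr  # N tables with |T| LLRs each
    return decoding, translation


decoding_bits, translation_bits = decoder_memory_bits(128, 16, 6)
```

For $N = 128$, $|\mathcal{T}| = 16$ and $Q_{LLR} = 6$, the $N$ translation tables alone account for $128 \cdot 16 \cdot 6 = 12{,}288$ bits under these assumptions.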

Space-Efficient Information Bottleneck Successive Cancellation List Decoder
The computational simplicity of the information bottleneck SCL decoding comes at the cost of additional memory for storing the decoding and translation tables. In this section, we show how to reduce the memory requirement of the translation tables in an information bottleneck SCL decoder. First, the role of translation tables is discussed in detail. Then, the message alignment principle of [7] is exploited to obtain a single translation table that can be used instead of the $N$ distinct translation tables. For the following discussion, we drop the level subscript from cluster indices at the $n$th level for the sake of brevity and write $t_{i,n}$ as $t_i$.

The Role of Translation Tables
The output of all the decoding table lookups in the structure of an information bottleneck SCL decoder has the same alphabet, $t_{i,j} \in \{0, 1, \ldots, |\mathcal{T}| - 1\}$. The purpose of the translation tables is to decipher the cluster index $t_i$, an abstract quantity obtained from a decoding table lookup, at the decision level into the LLR $L_{u_i}(t_i)$, i.e., its meaning for the bit $u_i$. The reason for requiring a distinct translation table for each bit lies in the fact that the translation tables of two qualitatively different bit channels translate the same cluster index into an LLR value differently. Figure 9 shows the translation tables for bits $u_{14}$, $u_{83}$, and $u_{124}$ of an information bottleneck SCL decoder with $|\mathcal{T}| = 16$ generated for a half-rate polar code with $N = 128$ and design $E_b/N_0 = 3$ dB. In the figure, cluster indices are given along the x-axis, while the LLR values they correspond to are given along the y-axis. It can be seen that $L_{u_{14}}(t_{14}) \neq L_{u_{83}}(t_{83}) \neq L_{u_{124}}(t_{124})$ if $t_{14}$, $t_{83}$, and $t_{124}$ take the same numerical value, e.g., $t_{14} = t_{83} = t_{124} = 0$. Further, the same cluster index translates to different LLR magnitudes not only for frozen and information bits, but also for different information bits. However, one notices that $L_{u_{14}}(t_{14}) = L_{u_{83}}(t_{83})$ for $t_{14} = 1$ and $t_{83} = 5$. Similarly, $L_{u_{14}}(t_{14}) = L_{u_{124}}(t_{124})$ for $t_{14} = 2$ and $t_{124} = 7$. The translation tables therefore need to be aligned such that a cluster index $t$ translates to the same LLR value regardless of the bit channel $i$, $0 \leq i < N$, for which it is used.

Message Alignment for Successive Cancellation List Decoder
Recall that the cluster indices $t_i$ take values from the same finite alphabet $\mathcal{T} = \{0, 1, \ldots, |\mathcal{T}| - 1\}$ regardless of the decoding stage $i$, e.g., we can have $t_{14} = 1$ and $t_{83} = 1$. However, the appropriate translation table, obtained from the conditional distribution $p(u_i|t_i)$ according to Equation (21), translates the cluster index $t_i$ into an LLR $L_{u_i}(t_i)$. Now let us make some notational changes. We separate the stage subscript from the cluster index and express $t_i$ as a pair $[t, i]^T$. In our new notation, $t_{14} = 1$ and $t_{83} = 1$ are expressed as $[1, 14]^T$ and $[1, 83]^T$, respectively. We drop the subscript $i$ from the bit $u_i$ altogether such that a bit value is denoted by $u$ regardless of the decoding stage $i$. Thus, the use of distinct translation tables with this new notation can be stated as follows: a correct translation of a cluster index $t \in \{0, 1, \ldots, |\mathcal{T}| - 1\}$ into an LLR $L_u(t)$ requires the knowledge of the decoding stage $i \in \{0, 1, \ldots, N - 1\}$. In other words, we observe $t$ and $i$ and infer the value of bit $u$. This task of message alignment can be treated as an information bottleneck problem: the pair $\mathbf{y}^* = [t, i]^T$ is the observed quantity, while the bit $u$ is the relevant quantity, as depicted in the information bottleneck graph of Figure 10. Casting the problem to the setup of Figure 3a, the pairs $\mathbf{y}^*$ are compressed into aligned cluster indices $t^* \in \{0, 1, \ldots, |\mathcal{T}^*| - 1\}$. The application of an information bottleneck algorithm requires the joint distribution of the observation and the relevant quantity, i.e., $p(u, \mathbf{y}^*)$. This joint distribution is given by:

$$p(u, \mathbf{y}^*) = p(u, t, i) = p(u_i, t_i)\, p(i), \tag{22}$$

where $p(u_i, t_i)$ is available from the information bottleneck decoder construction via density evolution and $p(i) = \frac{1}{N}$ since there are $N$ translation tables, each occurring once. The algorithm returns the distributions $p(u|t^*)$ and $p(t^*)$, as well as the mapping $p(t^*|\mathbf{y}^*)$, which clusters the pairs $[t, i]^T$ to $t^*$. The alignment mapping $p(t^*|\mathbf{y}^*)$ need not be implemented as an extra table lookup and can be incorporated in the decoding tables at the $j = n$ level. The mapping $p(t^*|\mathbf{y}^*)$ is used to replace the cluster indices $\mathbf{y}^* = [t, i]^T = t_i$ in the decoding tables of the decision level, i.e., $p(t_i|\mathbf{y}_{i,n})$, with the cluster indices $t^*$. In other words, the decoding tables at the decision level are aligned. As a result, the aligned cluster indices $t^*$ are translated using a single new translation table obtained from $p(u|t^*)$ in line with Equation (21). Figure 11 shows the resulting aligned translation table. It is pointed out that we are free to choose the alignment cardinality $|\mathcal{T}^*|$. The choice of $|\mathcal{T}^*|$, however, affects the performance of the quantized decoders, as seen in the next section. Moreover, the aligned decoding tables at the decision level require a memory of $|\mathcal{T}|^2 \cdot Q_{t^*}$ and $2 \cdot |\mathcal{T}|^2 \cdot Q_{t^*}$ bits at even and odd stages $i$, respectively, where $Q_{t^*} = \log_2 |\mathcal{T}^*|$. Therefore, when $|\mathcal{T}^*|$ is chosen to be larger than $|\mathcal{T}|$, the size, i.e., the number of elements in each decoding table at the decision level, remains unchanged, but the memory required for each decoding table increases since $Q_{t^*} > Q_t$.
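The construction of the joint distribution $p(u, \mathbf{y}^*)$ and of the single aligned translation table can be sketched as follows. The sketch does not implement an information bottleneck algorithm: the alignment mapping `mapping` is hand-picked as a stand-in for the clustering $p(t^*|\mathbf{y}^*)$ that such an algorithm would return, and the three per-stage joint distributions with $|\mathcal{T}| = 2$ are made-up values. All names are hypothetical.

```python
from math import log


def alignment_joint(joints):
    """p(u, y*) for y* = (t, i) with p(i) = 1/N.

    joints[i][t] = (p(u_i=0, t_i=t), p(u_i=1, t_i=t)) for decoding stage i.
    """
    n = len(joints)
    return {(t, i): (p0 / n, p1 / n)
            for i, table in enumerate(joints)
            for t, (p0, p1) in enumerate(table)}


def aligned_translation_table(joint, mapping, num_aligned):
    """LLR of each aligned cluster t*, given mapping[(t, i)] = t*."""
    acc = [[0.0, 0.0] for _ in range(num_aligned)]
    for key, (p0, p1) in joint.items():
        acc[mapping[key]][0] += p0  # accumulate p(u=0, t*)
        acc[mapping[key]][1] += p1  # accumulate p(u=1, t*)
    return [log(p0 / p1) for p0, p1 in acc]


# Made-up joints p(u_i, t_i) for N = 3 stages with |T| = 2 clusters each.
joints = [
    [(0.20, 0.30), (0.30, 0.20)],  # stage 0: weak bit channel
    [(0.05, 0.45), (0.45, 0.05)],  # stage 1: strong bit channel
    [(0.22, 0.28), (0.28, 0.22)],  # stage 2: weak bit channel
]
joint = alignment_joint(joints)

# Hand-picked alignment with |T*| = 4: pairs with similar LLRs share a t*.
mapping = {(0, 1): 0, (0, 0): 1, (0, 2): 1, (1, 0): 2, (1, 2): 2, (1, 1): 3}
aligned_llrs = aligned_translation_table(joint, mapping, 4)
```

Pairs $[t, i]^T$ with similar meaning for $u$ end up in the same aligned cluster, so one table of $|\mathcal{T}^*|$ LLRs serves all decoding stages; the LLR of a merged cluster lies between the LLRs of its members.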

Numerical Results
This section provides a comparison of the information bottleneck and Tal and Vardy code construction and discussion of the simulation results for the information bottleneck decoders.

Code Construction
Figures 12 and 13 present the so-called frozen charts, a compact visualization of information sets from [38], for polar codes constructed using Tal and Vardy's method and the information bottleneck method, as well as the Gaussian approximation [24]. A frozen chart is a grid of squares where each square denotes a bit channel, starting from the top left and reading columnwise. A colored square denotes a frozen bit position, while a white square represents an information bit position. The bit channels in the frozen charts are sorted in descending order of their error probabilities obtained from Tal and Vardy's method with $|\mathcal{T}| = 512$, such that the top left square represents the least reliable bit channel, while the bottom right square represents the most reliable bit channel. Each polar code is constructed for rate 0.5 such that the $K = N/2$ worst bit channels are frozen. The design $E_b/N_0$ for any polar code presented in this paper is selected in the following way. First, the block error probability of the code with the target code rate and codeword length $N$ is computed using Equation (3) from [39] for various candidate $E_b/N_0$ values. Then, the candidate value which achieves the block error probability of $10^{-3}$ at the smallest $E_b/N_0$ is selected as the design $E_b/N_0$ for the polar code. Figure 12 shows the frozen charts obtained for $N = 128$. It is seen that the codes obtained from all of the construction methods are the same. In particular, the Tal and Vardy polar codes for $|\mathcal{T}| = 512$ and $|\mathcal{T}| = 16$ are the same. This result suggests that although the bounds on the error probability of the bit channels from Tal and Vardy's method might not be tight with $|\mathcal{T}| = 16$, the resulting information set is the same as the one obtained for $|\mathcal{T}| = 512$. The frozen charts of the codes constructed using the information bottleneck method for $|\mathcal{T}| = 16$ and $|\mathcal{T}| = 32$ are also the same. Hence, for this codeword length and a given code rate and design $E_b/N_0$, all the methods compared here produce the same information sets, regardless of the choice of $|\mathcal{T}|$.
Figure 13 shows the frozen charts obtained for $N = 1024$. It is seen that the codes obtained from the various construction methods are slightly different for the larger codeword length. The Tal and Vardy polar codes for $|\mathcal{T}| = 512$ and $|\mathcal{T}| = 16$ differ in a few positions (cf. Figure 13a,b). The same is true for codes constructed with the information bottleneck method (cf. Figure 13c,d). Interestingly, the code constructed using Tal and Vardy's method with $|\mathcal{T}| = 512$ in Figure 13a is exactly the same as the one constructed using the Gaussian approximation (cf. Figure 13e) and differs from the frozen chart of the information bottleneck polar code of Figure 13d in only two positions. Hence, the codes constructed for coarsely quantized decoders do not differ significantly from those designed for high-resolution decoding. The important question here is how much such small differences affect the error correction performance of the polar code. Figure 14 shows the block error rates obtained using the information sets of Figure 13. It is evident from Figure 14 that the minor differences in the information sets obtained from the various construction methods and different quantization parameters $|\mathcal{T}|$ have a negligible effect on the block error rates of the polar code under SC decoding and CRC-aided SCL decoding with different list sizes. Hence, we are encouraged to consider the code and the decoder design separately and choose the least computationally intensive code construction method, i.e., the Gaussian approximation.

Information Bottleneck Decoders
In this section, the simulation results on the error correction performance of the coarsely quantized information bottleneck decoders are discussed. In all the simulations in the following discussion, an AWGN channel with BPSK modulation is assumed. The information bottleneck decoders and the information bottleneck quantizer have the same resolution, i.e., $|\mathcal{Y}| = |\mathcal{T}|$. The conventional SCL decoders use double-precision floating-point arithmetic, i.e., they assume virtually no quantization and have a 64-bit resolution. The CRC-aided settings are used where a value for $N_{crc}$ is specified. It should be pointed out that we did not attempt to optimize the design $E_b/N_0$ over the error rate of the information bottleneck decoders in this work. The design $E_b/N_0$ is selected as mentioned in Section 6.1 to obtain the information set $\mathcal{A}$ of a polar code. Then, the error correction performance of a conventional SCL (SC) decoder is compared to that of an information bottleneck SCL (SC) decoder generated for that design $E_b/N_0$.
First, Figures 15 and 16 show the block error rate of the double-precision floating-point SCL decoder when used with channel quantizers of various bitwidths designed using the information bottleneck method. Figure 15 shows the results for $N = 1024$, $N_{crc} = 16$, list size $N_L = 32$, and quantizer cardinalities $|\mathcal{Y}| = 4, 8, 16$ or $32$. It can be seen that a 5-bit quantizer causes virtually no error rate degradation, whereas the 2-bit quantizer causes ≈0.6 dB degradation at a block error rate of $10^{-3}$. Figure 16 shows the results of the same experiment repeated for $N = 128$ and $N_L = 32$ but without the outer CRC code. The CRC is dropped for the sake of a fair comparison with the results of [31], which did not use the CRC-aided setting. In particular, the results for 2-bit resolution are comparable to those of [31]. Firstly, we see that a channel quantizer with $|\mathcal{Y}| = 16$ shows negligible degradation for the shorter codeword length. Secondly, the 2-bit quantizer causes approximately 0.55 dB degradation, which is slightly less than the 0.8 dB degradation reported for the 3-level, i.e., 2-bit, quantizer in Figure 3.1 in [31]. It is, however, pointed out that the code construction methods used in [31] are different from the one used in this document. Based on Figures 15 and 16, $|\mathcal{Y}| = 16$ or $32$ seems a good choice to minimize the effect of channel quantization on the decoder performance.
Figures 17–22 present simulation results for the information bottleneck decoders without the message alignment. Figure 17 illustrates the trade-off between the space complexity and the performance of the information bottleneck SCL decoders, where the quantized decoders for different choices of $|\mathcal{T}|$ are compared to a conventional SCL decoder for $N = 1024$, list size $N_L = 32$, and $N_{crc} = 16$. As expected, the information bottleneck decoder having the lowest resolution, i.e., $|\mathcal{T}| = 8$, shows the largest degradation of ≈0.65 dB, of which the channel quantizer contributes 0.25 dB (as in Figure 15). The performance degradation of the quantized decoding can be reduced to ≈0.33 dB and ≈0.2 dB using decoders constructed for $|\mathcal{T}| = 16$ and $|\mathcal{T}| = 32$, respectively. Figure 18 compares the block error rate of information bottleneck and conventional SCL decoders for $N = 128$ and $N_L = 32$ without the outer CRC code (cf. Figure 16). It can be seen that the information bottleneck decoders with $|\mathcal{T}| = 16$, $8$ and $4$ show 0.2, 0.55 and 1.9 dB degradation, respectively, at a block error rate of $10^{-3}$. Compared to the information bottleneck decoder with 2-bit resolution, the 3-level quantized decoder of [31] exhibits a degradation of 2.6 dB for similar decoder parameters (Figure 3.4 in [31]). However, the enhancements proposed in [31] drastically improve the performance of the 3-level quantized SCL decoder and limit the degradation caused by the very coarse quantization to well below 1.5 dB. For the case of information bottleneck decoders, as presented in this paper, the performance degradation can only be reduced using higher bitwidths, e.g., $|\mathcal{T}| = 16$. Figure 19 shows the block error rate of an information bottleneck SCL decoder for a low code rate of ≈0.145 with $N = 256$, $N_L = 32$, no CRC, and design $E_b/N_0 = 4$ dB. The information bottleneck decoders with $|\mathcal{T}| = 16$, $8$, and $4$ show a degradation of 0.3, 0.85, and 2.9 dB, respectively, at a block error rate of $10^{-3}$. The degradation due to quantization in the information bottleneck decoders is starker (cf. Figure 18) at low code rates, in accordance with the theoretical predictions of [28].
In the remainder of this document, the investigations are restricted to information bottleneck decoders having a 4-bit resolution, i.e., $|\mathcal{T}| = 16$. Figure 20 shows the effect of the choice of design $E_b/N_0$ for generating information bottleneck decoders that are used mismatched to the operating $E_b/N_0$ range. The decoders in the figure are used in the CRC-aided setting with $N_{crc} = 16$, $N_L = 32$, code rate of 0.5, and design $E_b/N_0 = 3$ or $4$ dB. For $N = 1024$, the information bottleneck decoder constructed for design $E_b/N_0 = 4$ dB shows a degradation of 0.15 dB compared to the one designed for $E_b/N_0 = 3$ dB at a block error rate of $10^{-3}$. This degradation is partly due to the difference in the information sets obtained for the two design $E_b/N_0$ values and partly due to the decoder itself. The evidence for the second part of this statement is provided by the results for $N = 128$ in the same figure. Although the information sets generated for design $E_b/N_0 = 3$ and $4$ dB are the same in the case of $N = 128$, the decoder designed for $E_b/N_0 = 4$ dB shows a degradation of 0.1 dB. Hence, the design $E_b/N_0$ of an information bottleneck decoder needs to be carefully chosen. We have observed that the sensitivity of information bottleneck decoders to the design $E_b/N_0$ depends on the codeword length $N$, the code rate, and the decoder setting, i.e., with or without the outer CRC.
Figure 21a compares the block error rate of the conventional and information bottleneck SCL decoders for a list size of $N_L = 8$ and various codeword lengths $N$. The information bottleneck decoder shows a degradation of approximately 0.16, 0.25, and 0.39 dB at a block error rate of $10^{-3}$ for codeword lengths of $N = 128$, $256$, and $1024$, respectively. Figure 21b shows the same comparison for a list size of $N_L = 32$, where a gap of 0.13, 0.22, and 0.33 dB is observed at a block error rate of $10^{-3}$ for codeword lengths of $N = 128$, $256$, and $1024$, respectively, between the block error rate of the information bottleneck decoder and the conventional scheme. These results suggest that the information bottleneck decoders show a larger performance loss for larger codeword lengths.
The exact computation of the path metric in an SCL decoding scheme according to Equation (9) involves logarithm and exponential functions, which are not hardware-friendly. Therefore, the approximation of Equation (10) is preferred for hardware implementation, since it has a negligible effect on the error rate of the conventional SCL decoder [27]. Figure 22 shows the effect of using the approximate path metric rule in an information bottleneck decoder for N = 128 and different list sizes. It can be seen that although the use of the approximate path metric update does not affect the error correction performance of the decoder for a list size of N_L = 2, it does cause a minor performance degradation for larger list sizes. The performance loss caused by the use of the approximate update rule seems to increase with the list size.
In the following, results on the aligned information bottleneck decoder of Section 5 are presented. First, Figure 23 compares the information bottleneck decoder of Section 4 with its aligned version for N = 128 and |T| = 16. An alignment cardinality of |T*| = 16 has been used for the translation table alignment in the aligned decoder. It can be seen that the translation table alignment causes a minor degradation of approximately 0.1 dB for a large list size of N_L = 32, but the degradation vanishes as the list size becomes smaller. Figure 24 shows that the degradation caused by the translation table alignment can effectively be eliminated if a larger alignment cardinality is used, e.g., |T*| = 32 or 128.

The aligned information bottleneck decoder offers several benefits over the information bottleneck decoder of [26]. Firstly, the aligned decoder requires only a single translation table for converting the integer cluster labels into LLR values, regardless of the codeword length N. Recall that the information bottleneck decoder without translation table alignment requires N distinct translation tables. If a Q_LLR = 6 bit resolution is assumed for the LLR values in the translation tables, the information bottleneck decoder without the table alignment of Figure 23 will require N · |T| · Q_LLR = 12,288 bits to store the N translation tables. On the other hand, the aligned decoder of Figure 23 with |T*| = 16 will require only |T*| · Q_LLR = 96 bits for the aligned translation table, thereby reducing the memory required for saving the translation tables by more than 99%. Hence, the aligned decoder is space-efficient in the sense that the space required for storing or implementing the translation table is far smaller and independent of the codeword length. Secondly, the dynamic range of LLR values stored in the single aligned translation table is significantly smaller than that for N distinct translation tables (cf. Figures 10 and 11). For the polar code of Figure 23, the magnitudes of LLR values stored in the N = 128 distinct translation tables ranged between 0 and 63. By contrast, the aligned translation table stores LLR magnitudes between 0 and 10, i.e., the LLR values in the aligned translation table require a smaller bit resolution. The aligned translation table for |T*| = 32 or 128 (cf. Figure 24) will require 192 or 768 bits of memory, respectively, which is still a considerable reduction. However, recall that translation table alignment with |T*| > |T| increases the bitwidth of the aligned decoding tables at the decision level, which increases their memory requirement. Unfortunately, this increase in the memory requirement of the aligned decoding tables outweighs the savings achieved from the translation table alignment.
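The memory figures quoted above follow from a simple product of table count, table cardinality, and LLR bit resolution. The short sketch below reproduces that arithmetic; the function name `translation_table_bits` and the variable names are illustrative choices and do not appear in the paper, and the sketch only counts translation tables, not the aligned decoding tables.

```python
def translation_table_bits(num_tables: int, cardinality: int, llr_bits: int) -> int:
    """Bits needed to store `num_tables` translation tables, each holding
    `cardinality` LLR entries of `llr_bits` bits."""
    return num_tables * cardinality * llr_bits


# Parameters taken from the text: N = 128, |T| = 16, Q_LLR = 6 bits.
N, T, Q_LLR = 128, 16, 6

# Without alignment: one translation table per bit channel.
unaligned = translation_table_bits(N, T, Q_LLR)

# With alignment: a single table with |T*| entries, for several |T*|.
aligned = {c: translation_table_bits(1, c, Q_LLR) for c in (16, 32, 128)}

# Relative saving for |T*| = 16.
saving = 1 - aligned[16] / unaligned

print(f"unaligned: {unaligned} bits")
for c, bits in aligned.items():
    print(f"aligned, |T*| = {c}: {bits} bits")
print(f"saving for |T*| = 16: {saving:.2%}")
```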
Finally, the fact that only |T*| distinct LLR values appear in the aligned information bottleneck SCL decoder can be exploited to circumvent the performance degradation caused by the use of the approximate path metric update rule. Since a single LLR value leads to a single path metric increment in Equation (9) for û = 0, there exist only |T*| path metric increments when a path is extended with û = 0 at any stage. Owing to the odd symmetry of the translation table (cf. Figure 11), the metric increments for û = 1 are readily obtained from the increments for û = 0, as shown in Figure 25. A path metric increment ∆M in (9) for t* = 0, 1, 2, ..., 15, when the decoding path is extended with û = 0, is equal to ∆M for t* = 15, 14, 13, ..., 0, respectively, if the path is extended with û = 1. The path metric update values in Figure 25 are computed according to the exact formulation of Equation (9). Therefore, the |T*| LLR values in the aligned translation table can be replaced by their respective path metric increments, computed offline for either value of û, and used to translate cluster values t* ∈ {0, 1, ..., |T*| − 1} directly into path metric increments for both choices of û ∈ {0, 1}. As a result, the exact metric update in the aligned information bottleneck decoder is implementation-friendly, i.e., just an addition of real numbers, similar to the approximation of Equation (10).

Perhaps the biggest question on the mind of the reader at this point is how much complexity reduction is achieved using an information bottleneck decoder. Unfortunately, the answer to this question is not straightforward and depends on how the proposed decoder is implemented. The focal point of this paper is to provide a new way to look at quantized polar decoders. Detailed implementation issues are subject to ongoing work.
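The offline precomputation of path metric increments described above can be sketched as follows. This is a minimal illustration under stated assumptions: Equation (9) is assumed to have the standard LLR-based form ∆M = ln(1 + exp(−(1 − 2û)·L)) with L = ln(p(u = 0)/p(u = 1)), and Equation (10) the usual approximation that adds |L| only when û disagrees with the sign of L. The LLR values in `llr_table` are made-up, odd-symmetric placeholders and are not the entries of Figure 11.

```python
import math


def exact_increment(llr: float, u: int) -> float:
    """Assumed exact rule: ln(1 + exp(-(1 - 2u) * llr)), evaluated stably."""
    x = -(1 - 2 * u) * llr
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def approx_increment(llr: float, u: int) -> float:
    """Assumed approximate rule: penalize by |llr| only if u contradicts the LLR sign."""
    hard_decision = 0 if llr >= 0 else 1
    return 0.0 if u == hard_decision else abs(llr)


def build_increment_table(llr_table):
    """Map each cluster index t* to its pair of increments (for u = 0, for u = 1)."""
    return [(exact_increment(l, 0), exact_increment(l, 1)) for l in llr_table]


# Hypothetical odd-symmetric aligned translation table with |T*| = 16.
llr_table = [-10, -7, -5, -4, -3, -2, -1, -0.3, 0.3, 1, 2, 3, 4, 5, 7, 10]
table = build_increment_table(llr_table)

# At run time, the path metric update is one lookup and one addition.
path_metric = 0.0
for t_star, u_hat in [(12, 0), (3, 1), (9, 1)]:
    path_metric += table[t_star][u_hat]
print(f"path metric after three extensions: {path_metric:.4f}")
```

Because the placeholder table is odd-symmetric, `table[t][0]` equals `table[15 - t][1]`, so storing only the increments for û = 0 and reading the table in reverse order for û = 1 would halve the storage.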

Conclusions
This paper presented the application of the information bottleneck method for construction and coarsely quantized decoding of polar codes. The construction of a polar code requires computing the capacities or error probabilities of its bit channels, whose output alphabet grows exponentially in the codeword length. The information bottleneck framework was used in the discrete density evolution of these virtual bit channels in order to restrict their output alphabet to a small finite size. The quantized bit channels are suitable for capacity or error probability computations. To that end, the famous Tal and Vardy method for constructing polar codes turns out to be equivalent to a particular information bottleneck algorithm. However, they did not fully exploit the benefits offered by the framework. We showed that with the appropriate book-keeping, the hard work done during the discrete density evolution can also be used for decoding. More precisely, the intermediate results of the density evolution, in the form of discrete mappings or lookup tables, are used to replace the LLR computations of conventional successive cancellation list (SCL) decoding. All the operations, except the path metric update, in the information bottleneck SCL decoder are table lookups of unsigned integers. This computational simplicity of the coarsely quantized information bottleneck decoder causes only a small block error rate degradation. The number of lookup tables required in the information bottleneck SCL decoder increases with the codeword length N.
We used message alignment to reduce the number of translation tables required to translate the unsigned integer messages into LLR values for the path metric update. The aligned information bottleneck decoder requires only a single translation table regardless of the codeword length and fewer bits to represent the LLR values, which is a reduction in its space complexity. Moreover, the aligned information bottleneck decoders also facilitate a hardware-friendly implementation of the path metric update.

Figure 2. Structure of a polar code with N = 4. The graph is partitioned into n = log_2 N = 2 levels. The node labels v_{i,j} indicate the vertical stage, i = 0, 1, ..., N − 1, and horizontal level, j = 0, ..., n. For all i, v_{i,0} = y_i, i.e., the channel output at level j = 0, while v_{i,2} = u_i, i.e., the encoder input bits at level j = 2.

Figure 3. (a) Information bottleneck setup, where I(X; T) is the relevant information, I(X; Y) is the original mutual information, and I(Y; T) is the compression information. The goal is to determine the mapping p(t|y) which maximizes I(X; T) and minimizes I(Y; T). (b) Information bottleneck graph for the elementary setup of (a). The realizations of Y are mapped onto those of T such that their relevance to X is preserved, i.e., I(X; T) ≈ I(X; Y), and |T| < |Y|.

Figure 4. Transition probability of the AWGN channel, where the continuous channel output ỹ is clustered into |Y| = 8 bins or clusters for σ_N^2 = 0.5. (a) Randomly initialized symmetric cluster boundaries. (b) Cluster boundaries optimized such that I(X; Y) is maximum for |Y| = 8.

Figure 5. Information bottleneck graph for (a) the bit channel of u_0, which maps the outputs y_0, y_1 onto |T| clusters, labeled t_0, treating u_0 as the relevant variable; (b) the bit channel of u_1, which maps the outputs y_0, y_1, u_0 onto |T| clusters, labeled t_1, treating u_1 as the relevant variable.

Figure 6. Information bottleneck graph of a polar code with length N = 4. The outputs of the bit channel experienced by each node v_{i,j} are clustered into a compressed random variable T_{i,j} = t_{i,j} ∈ {0, ..., |T| − 1}, where i = 0, 1, ..., N − 1 indicates the stage, while j = 0, ..., n represents the level in the code structure.
has |Y| elements which we wish to reduce to |T| < |Y|. The target alphabet size |T| is referred to as the fidelity parameter in [23] and denoted by µ. The output alphabet size is reduced by one if two elements y_a, y_b ∈ Y are merged into a single element y_ab. The output alphabet becomes (Y \ {y_a, y_b}) ∪ {y_ab}, and the transition probabilities are updated as:

$$p(y|x) = \begin{cases} p(y_a|x) + p(y_b|x), & y = y_{ab}, \\ p(y|x), & \text{otherwise.} \end{cases}$$
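The merging step above amounts to adding two columns of the channel transition matrix. The following minimal sketch illustrates only that update; the function name `merge_outputs` and the example channel are hypothetical, and the rule for choosing which pair to merge is deliberately left out.

```python
def merge_outputs(channel, a, b):
    """Merge output letters with column indices a and b into one letter.

    `channel` is a list of rows, one per input x, where row[k] = p(y_k | x).
    The merged letter y_ab is appended as the last column, and its
    probability is p(y_a | x) + p(y_b | x) for every input x.
    """
    if a == b:
        raise ValueError("a and b must be distinct output letters")
    merged = []
    for row in channel:
        new_row = [p for k, p in enumerate(row) if k not in (a, b)]
        new_row.append(row[a] + row[b])
        merged.append(new_row)
    return merged


# Hypothetical binary-input channel with |Y| = 4 output letters.
channel = [
    [0.50, 0.30, 0.15, 0.05],  # p(y | x = 0)
    [0.05, 0.15, 0.30, 0.50],  # p(y | x = 1)
]

# Merging y_0 and y_1 reduces the output alphabet size from 4 to 3.
reduced = merge_outputs(channel, 0, 1)
for x, row in enumerate(reduced):
    print(f"p(y | x = {x}) = {row}")
```

Each row of the reduced matrix still sums to one, so the result remains a valid channel with one output letter fewer.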

Figure 7. Plot of the cumulative product R_cum indicating the amount of mutual information preserved over the levels of a half-rate polar code with N = 512 depending on the number of clusters |T|.

Figure 9. Translation tables of an information bottleneck decoder with |T| = 16 for different bit channels of a half-rate polar code with N = 128, design E_b/N_0 = 3 dB. u_14 is a frozen bit, while u_83 and u_124 are information bits.

Figure 10. Information bottleneck graph for the alignment of the translation tables of a polar code. The decision level cluster indices t_i = [t, i]^T are clustered into aligned indices t* such that the relevant information I(U; T*) is maximized.

Figure 11. Translation table for aligned decoding tables at the decision level with |T*| = 16 for the polar code with N = 128, rate = 0.5, and design E_b/N_0 = 3 dB.

Figure 15. Block error rate of double-precision floating-point SCL decoder with channel quantizers of different resolutions designed using the information bottleneck method. N_L = 32, N_crc = 16, N = 1024, code rate 0.5, and design E_b/N_0 = 3 dB.

Figure 22. Effect of using the approximate path metric update rule of Equation (10) on the block error rate of a 4-bit information bottleneck decoder with N_L = 2, 8, or 32, 16-bit CRC, N = 128, rate 0.5, and design E_b/N_0 = 3 dB.
will require N · |T| · Q_LLR = 12,288 bits to store the N translation tables. On the other hand, the aligned decoder of Figure 23 with |T*| = 16 will require only |T*| · Q_LLR = 96 bits for the aligned translation table,

Figure 25. Path metric increments computed for the aligned translation table of Figure 11 according to Equation (9). |T*| = 16 for the polar code with N = 128, rate = 0.5, and design E_b/N_0 = 3 dB.
(a) Clustering p(t_0|y_0) that maps the |Y_0| · |Y_1| outputs of the bit channel of u_0 to |T| clusters, written as a lookup table. (b) Translation table of u_0 obtained from the conditional distribution p(u_0|t_0) as in Equation (21).
Block error rate of polar codes constructed using Gaussian approximation, Tal and Vardy's method, and the information bottleneck method using a conventional successive cancellation (SC) and successive cancellation list (SCL) decoder with N_L = 8, 32, 16-bit cyclic redundancy check (CRC), N = 1024, rate 0.5, and design E_b/N_0 = 3 dB.