Minimum-Integer Computation Finite Alphabet Message Passing Decoder: From Theory to Decoder Implementations towards 1 Tb/s

In Message Passing (MP) decoding of Low-Density Parity Check (LDPC) codes, extrinsic information is exchanged between Check Nodes (CNs) and Variable Nodes (VNs). In a practical implementation, this information exchange is limited by quantization to a small number of bits. In recent investigations, a novel class of Finite Alphabet Message Passing (FA-MP) decoders has been designed to maximize the Mutual Information (MI) using only a small number of bits per message (e.g., 3 or 4 bits) with a communication performance close to high-precision Belief Propagation (BP) decoding. In contrast to the conventional BP decoder, operations are given as discrete-input discrete-output mappings that can be described by multidimensional Lookup-Tables (mLUTs). A common approach to avoid the exponential growth of the mLUT size with the node degree is the sequential LUT (sLUT) design, i.e., using a sequence of two-dimensional Lookup-Tables (LUTs) for the design, which leads to a slight performance degradation. Recently, approaches such as Reconstruction-Computation-Quantization (RCQ) and Mutual Information-Maximizing Quantized Belief Propagation (MIM-QBP) have been proposed to avoid the complexity drawback of mLUTs by using pre-designed functions that require calculations over a computational domain. It has been shown that these calculations are able to represent the mLUT mapping exactly by executing computations with infinite precision over real numbers. Based on the framework of MIM-QBP and RCQ, the Minimum-Integer Computation (MIC) decoder design generates low-bit integer computations that are derived from the Log-Likelihood Ratio (LLR) separation property of the information maximizing quantizer to replace the mLUT mappings either exactly or approximately. We derive a novel criterion for the bit resolution that is required to represent the mLUT mappings exactly.
Furthermore, we show that our MIC decoder has exactly the communication performance of the corresponding mLUT decoder, but with much lower implementation complexity. We also perform an objective comparison between the state-of-the-art Min-Sum (MS) and the FA-MP decoder implementations for throughput towards 1 Tb/s in a state-of-the-art 28 nm Fully-Depleted Silicon-on-Insulator (FD-SOI) technology. Furthermore, we demonstrate that our new MIC decoder implementation outperforms previous FA-MP decoders and MS decoders in terms of reduced routing complexity, area efficiency and energy efficiency.


Introduction
Beyond-5G and 6G wireless communication systems target peak data rates of 100 Gb/s to 1 Tb/s with processing latencies between 10 and 100 ns [1]. For such high data rate and low latency requirements, the implementation of a Forward Error Correction (FEC) decoder, which is one of the most complex and computationally intense components in the baseband processing chain, is a major challenge [2]. Low-Density Parity Check (LDPC) codes [3] are FEC codes with capacity-approaching error correction performance [4] and are part of many communication standards, e.g., DVB-S2x, Wi-Fi, and 3GPP 5G-NR. In contrast to other competitive FEC codes, like Polar and Turbo codes, the decoding of LDPC codes is dominated by data transfers [2], which makes very high-throughput decoders in advanced silicon technologies challenging, especially from routing and energy efficiency perspectives. For example, in a state-of-the-art 14 nm silicon technology, the transfer of 8 bits on a 1 mm wire costs about 1 pJ, whereas an 8 bit integer addition costs only 10 fJ, i.e., two orders of magnitude less than the wiring energy. During Message Passing (MP) decoding, two sets of nodes, the Check Nodes (CNs) and the Variable Nodes (VNs), iteratively exchange messages over the edges of a bipartite graph (the Tanner graph of the LDPC code). High-throughput decoding can be achieved by mapping the Tanner graph one-to-one onto hardware, i.e., dedicated processing units are instantiated for each node and the edges of the Tanner graph are hardwired. Unrolling and pipelining the decoding iterations can further boost the throughput towards 1 Tb/s [5]; such decoders are called unrolled full parallel (FP) decoders in the following. However, FP decoders imply large routing challenges, since every edge in the Tanner graph corresponds to $2 \cdot I \cdot n_E$ wires, with $I$ being the number of decoding iterations and $n_E$ being the quantization-width of the exchanged messages.
Moreover, to enable good error correction performance, the Tanner graph exhibits limited locality and regularity, which makes efficient routing even more difficult. This problem is further exacerbated in advanced silicon technologies, as routing scales much worse than transistor density [6].
Finite Alphabet Message Passing (FA-MP) decoding has been investigated as a method to mitigate the routing challenges in FP LDPC decoders by reducing the bit-width, i.e., the quantization-width $n_E$, of the exchanged messages and, thus, the number of necessary wires [7][8][9]. In contrast to conventional MP decoding algorithms like Belief Propagation (BP) and its approximations, i.e., Min-Sum (MS), Offset Min-Sum (OMS) and Normalized Min-Sum (NMS) [10], FA-MP decoders use non-uniform quantizers, and the node operations are derived by maximizing the Mutual Information (MI) between exchanged messages. Nodes in state-of-the-art FA-MP decoders have to be implemented as Lookup-Tables (LUTs). Since the size of the LUT increases exponentially with the node degree and $n_E$, investigations were performed to decompose this multidimensional LUT (mLUT) into a chain or tree of two-input LUTs (denoted as sequential LUT (sLUT) in this paper), yielding only a linear dependency on the node degree but at the cost of a decreased communication performance [11,12]. The Minimum-LUT (Min-LUT) decoder [13] approximates the CN update by a simple minimum search and can be implemented as Minimum-mLUT (Min-mLUT) or Minimum-sLUT (Min-sLUT), i.e., with mLUT or sLUT for VNs, respectively. Other approaches, e.g., Mutual Information-Maximizing Quantized Belief Propagation (MIM-QBP) [14][15][16] and Reconstruction-Computation-Quantization (RCQ) [17,18], add non-uniform quantizers and reconstruction mappings to the outputs and inputs of the nodes, respectively, and perform the standard functional operations inside the nodes, e.g., additions for VNs and minimum search for CNs. The reconstruction mappings generally increase the bit resolution required for node internal representation and processing. It can be shown that this approach is equivalent to the mLUT in terms of error correction performance if the internal bit resolution after the reconstruction mapping is sufficiently large.
Based on the framework of MIM-QBP and RCQ, the proposed MIC decoder [19] realizes CN updates by a minimum search and VN updates by integer computations that are designed to realize the information maximizing mLUT mappings either exactly or approximately. In this paper, we provide more detailed explanations, extend the discussion to irregular LDPC codes and present a comprehensive implementation analysis. The new contributions of this paper are summarized as follows:
• We provide a novel criterion for the resolution of internal node operations to ensure that the MIC decoder can always replace the information maximizing VN mLUT exactly;
• we show that this MIC decoder has the same communication performance as an MI maximizing Min-mLUT decoder;
• we make an objective comparison between different FA-MP decoder implementations (Min-mLUT, Min-sLUT, MIC) in an advanced silicon technology and compare them with a state-of-the-art MS decoder for throughput towards 1 Tb/s;
• we show that our MIC decoder implementation outperforms state-of-the-art FP decoders in terms of routing complexity, area efficiency and energy efficiency and enables the processing of larger block sizes in state-of-the-art FP decoders since the routing complexity is largely reduced.
Notation: Random variables are denoted by sans-serif letters $\mathsf{x}$, random vectors by bold sans-serif letters $\boldsymbol{\mathsf{x}}$, realizations by serif letters $x$ and vector-valued realizations by bold serif letters $\mathbf{x}$. Sets are denoted by calligraphic letters $\mathcal{X}$. The distribution $p_{\mathsf{x}}(x)$ of a random variable $\mathsf{x}$ is abbreviated as $p(x)$. $\mathsf{x} \to \mathsf{y} \to \mathsf{z}$ denotes a Markov chain, and $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{F}_2$ denote the real numbers, the integers and the Galois field of order 2, respectively.
The remainder of this paper is structured as follows: Section 2 reviews the system model, conventional decoding techniques for LDPC codes such as BP and NMS decoding, and Information Bottleneck (IB) based quantization. Section 3 describes the Min-mLUT and Min-sLUT decoder design for regular and irregular LDPC codes. In Section 4, we introduce the proposed MIC decoder and, in Section 5, we discuss the MIC decoder implementation along with a detailed comparison with state-of-the-art FP MP decoders. Finally, Section 6 concludes the paper.

Preliminaries
This section briefly reviews the transmission model, conventional decoding techniques for LDPC codes, and the quantizer design based on IB.

Transmission Model
The transmission model is shown in Figure 1. An information word $\mathbf{u} \in \mathbb{F}_2^K$ is encoded into the codeword $\mathbf{c} \in \mathbb{F}_2^N$ via a binary LDPC code [3] of rate $R = K/N$. The Binary Phase Shift Keying (BPSK) modulated vector $\mathbf{x} = 1 - 2\mathbf{c}$ is transmitted over an Additive White Gaussian Noise (AWGN) channel, leading to the received vector $\mathbf{y} \in \mathbb{R}^N$ given by $\mathbf{y} = \mathbf{x} + \mathbf{n}$ with AWGN $\mathbf{n}$ of variance $\sigma_n^2$. A particular LDPC code is defined via a sparse parity check matrix $\mathbf{H} \in \mathbb{F}_2^{M \times N}$. The Tanner graph [20] of an LDPC code is a visual representation of its parity check matrix $\mathbf{H}$ and consists of a CN for each parity check equation $\chi_m$ with $m = 1, \ldots, M$ and a VN for each codebit $c_n$ with $n = 1, \ldots, N$. An edge connects VN $n$ and CN $m$ if and only if $H_{m,n} = 1$. The degree of a node is determined by the number of connected edges. Furthermore, the fraction of edges that is connected to a node of a specific degree is characterized by the edge-degree distributions
$$\lambda(\xi) = \sum_{d_V \in \mathcal{D}_V} \lambda_{d_V}\, \xi^{d_V - 1}, \qquad \rho(\xi) = \sum_{d_C \in \mathcal{D}_C} \rho_{d_C}\, \xi^{d_C - 1}, \tag{1}$$
where $\lambda_{d_V}$ is the fraction of edges that are connected to VNs of degree $d_V \in \mathcal{D}_V$, and $\rho_{d_C}$ denotes the fraction of edges that are connected to CNs of degree $d_C \in \mathcal{D}_C$.
Figure 1. Transmission model with the blocks LDPC encoder, BPSK modulation, AWGN channel, quantizer and decoder.

Iterative Decoding via Belief-Propagation (BP)
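LDPC codes are usually decoded by iterative BP, where extrinsic information for each codebit $c_n$ is propagated along the edges of the Tanner graph. Figure 2 shows the CN $\chi_1$ that generates extrinsic information for the VN $c_n$ by processing the incoming Variable Node to Check Node (VN-to-CN) messages from the other VNs connected to CN $\chi_1$, i.e., $c_1$ and $c_2$. For BP decoding, we define the VN-to-CN messages as $L_{n \to m} \in \mathbb{R}$ and the Check Node to Variable Node (CN-to-VN) messages as $L_{n \leftarrow m} \in \mathbb{R}$. In the first iteration, all messages are initialized by the channel LLRs
$$L_{n \to m}^{(0)} = L(y_n) = \log \frac{p(y_n \mid x_n = +1)}{p(y_n \mid x_n = -1)} = \frac{2 y_n}{\sigma_n^2}. \tag{2}$$
In iteration $i$, a CN $m$ generates extrinsic information for the connected VNs $n \in \mathcal{M}_m$ via the box-plus operation
$$L_{n \leftarrow m}^{(i)} = 2 \operatorname{artanh}\Bigg( \prod_{n' \in \mathcal{M}_m \setminus n} \tanh\bigg( \frac{L_{n' \to m}^{(i-1)}}{2} \bigg) \Bigg). \tag{3}$$
In the case of Normalized Min-Sum (NMS) decoding, the CN update (3) is approximated by
$$L_{n \leftarrow m}^{(i)} \approx \gamma \prod_{n' \in \mathcal{M}_m \setminus n} \operatorname{sign}\big( L_{n' \to m}^{(i-1)} \big) \cdot \min_{n' \in \mathcal{M}_m \setminus n} \big| L_{n' \to m}^{(i-1)} \big|, \tag{4}$$
where $\gamma$ is the normalization factor. In the case of $\gamma = 1$, (4) is the CN update of the MS decoder.
In a similar fashion, a VN $n$ generates extrinsic information for the connected CNs $m \in \mathcal{N}_n$ by adding the corresponding LLRs
$$L_{n \to m}^{(i)} = L(y_n) + \sum_{m' \in \mathcal{N}_n \setminus m} L_{n \leftarrow m'}^{(i)}. \tag{5}$$
The final bit decision $\hat{c}_n$ is obtained from the sign of the a posteriori LLR after the last iteration $I$, i.e.,
$$\hat{c}_n = \begin{cases} 0, & L(y_n) + \sum_{m' \in \mathcal{N}_n} L_{n \leftarrow m'}^{(I)} \ge 0, \\ 1, & \text{otherwise}. \end{cases} \tag{6}$$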
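The node updates reviewed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the function names `boxplus`, `nms_update` and `vn_update` are hypothetical, each function receives only the extrinsic inputs (the messages from all other neighbours), and the default value 0.75 for the normalization factor is taken from the NMS decoder described later in the implementation section.

```python
import math

def boxplus(llrs):
    """Exact BP check-node update: box-plus of the extrinsic input LLRs."""
    prod = 1.0
    for llr in llrs:
        prod *= math.tanh(llr / 2.0)
    # Clip so that atanh stays finite for very reliable inputs.
    prod = max(min(prod, 1.0 - 1e-15), -1.0 + 1e-15)
    return 2.0 * math.atanh(prod)

def nms_update(llrs, gamma=0.75):
    """Normalized Min-Sum approximation; gamma = 1 gives plain Min-Sum."""
    sign = 1.0
    for llr in llrs:
        if llr < 0:
            sign = -sign
    return gamma * sign * min(abs(llr) for llr in llrs)

def vn_update(channel_llr, cn_llrs):
    """Variable-node update: channel LLR plus the extrinsic CN-to-VN LLRs."""
    return channel_llr + sum(cn_llrs)

if __name__ == "__main__":
    inputs = [1.0, -2.0, 3.0]
    print("box-plus :", boxplus(inputs))
    print("Min-Sum  :", nms_update(inputs, gamma=1.0))
    print("NMS      :", nms_update(inputs))
```

The magnitude of the box-plus result never exceeds the smallest input magnitude, which is why the minimum search with a normalization factor below one is a reasonable approximation.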

Information Bottleneck Based Quantizer Design
For the design of our proposed MIC decoder, we utilize MI maximizing quantization to design an information optimized processing chain that uses only quantizer labels instead of real valued representations [12]. To that end, we first review the principal idea of the MI based quantizer design approach. The considered system model is visualized in Figure 3. The observed signal $y \in \mathcal{Y}$ is mapped to a compressed representation $z \in \mathcal{Z}$ via the scalar quantization function $Q: \mathcal{Y} \to \mathcal{Z}$. The objective is to find a quantizer function $Q$ that maximizes the MI $I(\mathsf{x}; \mathsf{z})$ between the relevant source $x \in \mathcal{X}$ and the quantizer output $Q(y) = z \in \mathcal{Z}$ under the condition that the three random variables form a Markov chain $\mathsf{x} \to \mathsf{y} \to \mathsf{z}$. Given the joint distribution $p(x, y) = p(y|x)p(x)$, the mapping of the information maximizing quantizer $Q^\star$ is determined by solving the optimization problem
$$Q^\star = \arg\max_{Q} I(\mathsf{x}; \mathsf{z}) \quad \text{subject to} \quad |\mathcal{Z}| = 2^{n_Q}, \tag{7}$$
where the number of possible quantizer outputs is set to $2^{n_Q}$. The optimization problem in (7) is a special case of the Information Bottleneck Method (IBM) [12,[21][22][23]. The optimal solution is a deterministic quantization function where the conditional probability of the quantizer output $z$ given the relevant source $x$ is
$$p(z|x) = \sum_{y \in \mathcal{Y}_z} p(y|x), \tag{8}$$
with $\mathcal{Y}_z = \{y \in \mathcal{Y} \mid Q^\star(y) = z\}$ as the set of observed signals $y$ that are mapped to one specific quantizer output $z$. Since the maximum of (7) depends only on the cardinality of $\mathcal{Z}$, we utilize a convenient signed integer based representation $\mathcal{Z} = \{-2^{n_Q}/2, \ldots, -1, 1, \ldots, 2^{n_Q}/2\}$ that simplifies the MIC decoder processing. For the special case where the relevant source $\mathsf{x}$ is a binary random variable (i.e., $|\mathcal{X}| = 2$), the algorithm that finds the optimal quantizer via dynamic programming has been derived in [24]. We denote the LLRs of the quantizer output $z \in \mathcal{Z}$ by
$$L(z) = \log \frac{p(z \mid x = +1)}{p(z \mid x = -1)}. \tag{9}$$
An important property of the MI maximizing quantizer for binary input is that any two different sets of LLRs $\mathcal{L}_z = \{L(y) \mid y \in \mathcal{Y}_z\}$ and $\mathcal{L}_{z'} = \{L(y) \mid y \in \mathcal{Y}_{z'}\}$ for $z, z' \in \mathcal{Z}$ and $z \neq z'$ are separated by a single threshold [19,24,25].
This property will be exploited in the design of the MIC decoder in Section 4.
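As a rough numerical illustration of the quantities above, the following Python sketch evaluates the transition probabilities, the output LLRs and the MI of a threshold quantizer for BPSK over AWGN. It is a sketch under stated assumptions: the thresholds are chosen by hand (they are not the MI maximizing thresholds, and the dynamic programming search of [24] is not implemented), the symbols are assumed equiprobable, and all function names are hypothetical.

```python
import math

def gauss_cdf(x, mean, sigma):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def transition_probs(thresholds, sigma):
    """p(z|x) for x = +1 and x = -1; z indexes the regions from left to right."""
    edges = [-math.inf] + sorted(thresholds) + [math.inf]
    return {
        x: [gauss_cdf(hi, x, sigma) - gauss_cdf(lo, x, sigma)
            for lo, hi in zip(edges[:-1], edges[1:])]
        for x in (+1, -1)
    }

def mutual_information(thresholds, sigma):
    """I(x; z) in bits for equiprobable BPSK symbols."""
    p = transition_probs(thresholds, sigma)
    mi = 0.0
    for k in range(len(thresholds) + 1):
        pz = 0.5 * (p[+1][k] + p[-1][k])
        for x in (+1, -1):
            if p[x][k] > 0.0:
                mi += 0.5 * p[x][k] * math.log2(p[x][k] / pz)
    return mi

def region_llrs(thresholds, sigma):
    """LLR L(z) = log p(z|x=+1)/p(z|x=-1) of every quantizer region."""
    p = transition_probs(thresholds, sigma)
    return [math.log(a / b) for a, b in zip(p[+1], p[-1])]

if __name__ == "__main__":
    sigma = 0.8  # hypothetical noise standard deviation
    for thresholds in ([0.0], [-1.0, 0.0, 1.0]):
        print(thresholds, mutual_information(thresholds, sigma),
              region_llrs(thresholds, sigma))
```

Because the channel LLR $2y/\sigma_n^2$ grows monotonically with $y$, thresholds on $y$ are equivalent to thresholds on the LLR, and the region LLRs come out in increasing order.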

LUT Decoder Design
This section describes the design of the LUT decoder that is optimized via Discrete Density Evolution (DDE) [11] to maximize the extrinsic information between the codebits and their messages, under the assumption that the Tanner graph is cycle free. In contrast to the BP algorithm, the LUT is optimized to process the quantizer labels $z$ in (7) directly, and the bit resolution of the message exchange on the Tanner graph is limited to $n_E$ bits, e.g., 3 or 4 bits. Furthermore, we exploit the signed integer-based representation to simplify the CN update by using the label-based minimum search [13]. In the Min-mLUT decoder design, the VN update functions are optimized to maximize MI. For the Min-sLUT decoder design, the VN update is decomposed into a sequence of two-dimensional updates, which generally results in an MI loss compared to the Min-mLUT decoder design.
In the following, we review the calculation of the CN and VN distributions for each iteration that are required for the design of the MI maximizing VN update. As illustrated in Figure 4, we omit the iteration index $i$ and consider messages of an arbitrary CN and VN for CN degrees $d_C \in \mathcal{D}_C$ and VN degrees $d_V \in \mathcal{D}_V$ to calculate the distributions that are required for the Min-mLUT design.

Check Node LUT Design
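The LUT decoder design is based on discrete alphabets $\mathcal{Z}$, $\mathcal{T}$ and $\mathcal{A}$ for the channel information, the VN-to-CN and the CN-to-VN messages, respectively. For the first iteration, the VN-to-CN messages $t_j$ for $j = 1, \ldots, d_C - 1$ are initialized by the signed integer valued channel information, i.e., $t_j = z_j \in \mathcal{Z}$. The distribution of the $d_C - 1$ VN-to-CN messages $\mathbf{t} = [t_1, \ldots, t_{d_C - 1}] \in \mathcal{T}^{d_C - 1}$ and an arbitrary codebit $c$ of a check equation $\chi$ is [11]
$$p(\mathbf{t}, c \mid d_C) = \Big(\frac{1}{2}\Big)^{d_C - 1} \sum_{\mathbf{c}:\, c_1 \oplus \cdots \oplus c_{d_C - 1} = c} \; \prod_{j=1}^{d_C - 1} p(t_j \mid c_j), \tag{10}$$
where the codebits are assumed to be equiprobable and $c = c_1 \oplus \cdots \oplus c_{d_C - 1}$ is the modulo 2 sum of the connected codebits. The VN-to-CN messages $t_j$ are processed by a CN update function that generates quantized output messages $a \in \mathcal{A}$ that are represented by only $n_E$ bits. (We keep the node degrees $d_C$ or $d_V$ as index of random variables to indicate that the distribution changes with the corresponding node degree.) Given the distribution in (10), the MI maximizing CN update is the solution of the optimization problem (11), which maximizes the MI between the codebit $c$ and the output message $a$ over all CN update functions with $|\mathcal{A}| = 2^{n_E}$ outputs. As discussed in Section 2.3, the optimal solution of (11) is found via dynamic programming. However, we utilize the minimum update [13] as the CN update for all iterations as an approximation of the MI maximizing CN update in (11). We observed that the output of the minimum update is quite close to the optimal IB update. As visualized for a degree 3 CN in Figure 5, the difference between the optimal IB CN update and the minimum update can be interpreted as an additive correction LUT where only a small fraction of entries are nonzero. For the label-based minimum search, the CN update rule reads
$$a = f_{\min}(\mathbf{t}) = \prod_{j=1}^{d_C - 1} \operatorname{sign}(t_j) \cdot \min_{j = 1, \ldots, d_C - 1} |t_j|. \tag{12}$$
If the CN update function is given, the conditional distribution of the CN-to-VN messages is obtained by collecting all input vectors that are mapped to the same output,
$$p(a \mid c, d_C) = \sum_{\mathbf{t}:\, f_{\min}(\mathbf{t}) = a} p(\mathbf{t} \mid c, d_C). \tag{13}$$
In the design via DDE, the connections between VNs and CNs are considered on average by the degree distribution [26]. Hence, the design considers only the marginal CN-to-VN message distribution $p(a|c)$ that includes averaging over all possible CN degrees by
$$p(a \mid c) = \sum_{d_C \in \mathcal{D}_C} \rho_{d_C}\, p(a \mid c, d_C). \tag{14}$$
Figure 5. Graphical representation of a discrete CN update using $n_E = 4$ bit input messages $t_1$ and $t_2$ (labels $-8, \ldots, -1, 1, \ldots, 8$) and a color-coded output message. The difference between the optimal IB update and the minimum update, shown in panel (c), contains only a few non-zero elements and can be interpreted as a correction LUT.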
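The label-based minimum search used as CN update can be sketched as follows. This is a minimal illustration that assumes the signed integer labels $\{-2^{n_E}/2, \ldots, -1, 1, \ldots, 2^{n_E}/2\}$ introduced in Section 2.3, where the sign of a label carries the hard decision and the magnitude the reliability class; the function name `cn_min_update` is hypothetical.

```python
def cn_min_update(labels):
    """Label-based minimum search: product of signs times smallest magnitude.

    `labels` holds the extrinsic VN-to-CN labels (signed integers, never 0).
    """
    sign = 1
    for t in labels:
        if t == 0:
            raise ValueError("0 is not a valid label in the signed alphabet")
        if t < 0:
            sign = -sign
    return sign * min(abs(t) for t in labels)

if __name__ == "__main__":
    # Degree-3 CN with n_E = 4: two extrinsic inputs t1, t2 in {-8..-1, 1..8}.
    alphabet = [t for t in range(-8, 9) if t != 0]
    table = {(t1, t2): cn_min_update([t1, t2]) for t1 in alphabet for t2 in alphabet}
    print(len(table), "entries, e.g. (3, -5) ->", table[(3, -5)])
```

No table has to be stored for this update; the dictionary above only mirrors the two-input view of Figure 5.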

Variable Node LUT Design
For designing the VN update, we require the joint distribution of the discrete channel information $z \in \mathcal{Z}$ together with the CN-to-VN messages $a_m \in \mathcal{A}$ combined in $\mathbf{a} = [z, a_1, \ldots, a_{d_V - 1}] \in \mathcal{V}$, i.e.,
$$p(\mathbf{a}, c \mid d_V) = p(c)\, p(z \mid c) \prod_{m=1}^{d_V - 1} p(a_m \mid c), \tag{15}$$
where $p(a_m|c) = p(a|c)$ for $m = 1, \ldots, d_{V,\max} - 1$ and $\mathcal{V}$ is the set of all possible states of the vector $\mathbf{a}$, i.e., $|\mathcal{V}| = 2^{n_Q + (d_V - 1) n_E}$. The MI maximizing VN update is then given by
$$g_{\mathrm{MI}} = \arg\max_{g} I(\mathsf{c}; \mathsf{t}) \quad \text{with } t = g(\mathbf{a}) \text{ and } |\mathcal{T}| = 2^{n_E}. \tag{16}$$
The parameter $n_E$ defines the bit-width of the messages exchanged between VN and CN and controls the complexity of the message exchange. The optimization problem in (16) is the channel quantization problem for binary input (Section 2.3). The optimal solution is a deterministic input-output relation that can be stored as a $d_V$-dimensional LUT with $2^{n_Q + (d_V - 1) n_E}$ entries, e.g., for $d_V = 6$ and $n_E = n_Q = 4$, we have approximately 16.8 million entries. Furthermore, the communication performance can be increased by considering the degree distribution in the design of the node updates [13,26]. The gain in communication performance generally depends on the degree distribution and the message resolution $n_E$ [13]. However, a comparison of the different design approaches in [13,26] is beyond the scope of this paper. The distribution of the VN-to-CN messages for the next iteration in (10) is
$$p(t \mid c, d_V) = \sum_{\mathbf{a}:\, g_{\mathrm{MI}}(\mathbf{a}) = t} p(\mathbf{a} \mid c, d_V). \tag{17}$$
Again, the marginal distribution is determined by averaging over all possible VN degrees, i.e.,
$$p(t \mid c) = \sum_{d_V \in \mathcal{D}_V} \lambda_{d_V}\, p(t \mid c, d_V). \tag{18}$$
In the case of a regular LDPC code, there is only one possible degree for all VNs and CNs, i.e., the summations in (14) and (18) vanish but all other steps remain the same. For the design of the MI maximizing Min-mLUT decoder, we start with an initial VN-to-CN distribution $p(t_j|c_j)$, iterate over (10) and (13)-(18), and declare convergence if $I(\mathsf{c}; \mathsf{t})$ approaches the maximum value of one bit for binary input after $I$ iterations.
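The table size quoted above follows directly from the number of input bits of the VN update. The short Python sketch below evaluates the expression $2^{n_Q + (d_V - 1) n_E}$; the function name `mlut_entries` is hypothetical and the sketch counts the entries of a single output table only.

```python
def mlut_entries(d_v, n_q, n_e):
    """Number of entries of the d_V-dimensional VN LUT: 2^(n_Q + (d_V - 1) n_E)."""
    return 2 ** (n_q + (d_v - 1) * n_e)

if __name__ == "__main__":
    for d_v, n_q, n_e in [(3, 4, 4), (6, 3, 3), (6, 4, 4)]:
        print(f"d_V={d_v}, n_Q={n_q}, n_E={n_e}: {mlut_entries(d_v, n_q, n_e):,} entries")
```

For $d_V = 6$ and $n_E = n_Q = 4$ this gives $2^{24} = 16{,}777{,}216$ entries, and for $d_V = 6$ and $n_E = n_Q = 3$ it gives $2^{18} = 262{,}144$ entries, which illustrates the exponential growth with the node degree.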

Sequential LUT Design
For the sequential design approach sLUT, the node update is split into a sequence of degree two updates that are optimized independently to maximize MI. This approach serves as an approximation of the mLUT design described in Section 3.2 and reduces the number of possible memory locations within each update. In general, multidimensional optimization without decomposition conserves more MI compared to a design that decomposes the optimization problem into a sequence of two-dimensional updates [11,12] or more general nested tree decompositions [13].

Minimum-Integer Computation Decoder Design
The MI maximizing Min-mLUT decoder realizes the discrete VN updates by LUTs with $2^{n_Q + (d_V - 1) n_E}$ entries, leading to prohibitively large implementation complexity. Nevertheless, determining these multidimensional LUTs in the laboratory is feasible with sufficient computing resources. Thus, the idea is to search for the MI maximizing mLUTs but implement the corresponding discrete functions by relatively simple operations in order to avoid performance degradations. As visualized in Figure 6, the computational domain framework [14,16] replaces the VN update by an operation that is decomposed into (i) mappings $\phi_V$ and $\phi$ of the $n_E$-bit CN-to-VN messages $a_m$ and the $n_Q$-bit channel information $z$ into node internal $n_R$-bit signed integers with $n_R \ge n_E$ and $n_R \ge n_Q$, respectively; (ii) execution of integer additions for $n_R$-bit signed integers; (iii) threshold quantization $Q_V$ to $n_E$ bits determining the VN-to-CN message $t$.
For the MIC decoder design, we derive a criterion for sufficient internal node resolution n R such that the mLUT mapping is replaced exactly. Note that the information maximizing mLUT is generated offline and is replaced by an integer function that replaces its functionality exactly or approximately during execution.
To keep the notation simple, we omit the dependency on the iteration index $i$ and node degree in this section.
Figure 6. VN update for the computational domain framework [14,16]. The $n_Q$-bit channel information $z \in \mathcal{Z}$ and the $n_E$-bit CN-to-VN messages $a_1, \ldots, a_{d_V - 1} \in \mathcal{A}$ are transformed to $n_R$-bit signed integers. This transformation generally increases the required bit resolution for the representation, i.e., $n_R \ge n_Q$ and $n_R \ge n_E$. The internal signed integers are summed and quantized back into an $n_E$-bit VN-to-CN message $t \in \mathcal{T}$.

Equivalent LLR Quantizer
To motivate the integer calculation of the MIC approach, we review the connection between the equivalent LLR quantizer and the VN update of the BP algorithm. Analogous to the VN update of the BP algorithm in (5), the LLR of the combined message vector $\mathbf{a} \in \mathcal{V}$ equals the sum of the LLRs of the channel output $z$ and of the individual messages $a_m$, i.e., for every possible combination $\mathbf{a} \in \mathcal{V}$, the LLR of the combined message is
$$L(\mathbf{a}) = L(z) + \sum_{m=1}^{d_V - 1} L(a_m). \tag{19}$$
The LLRs $L(a_m)$ of the individual messages are determined by (14) during DDE. As described in Section 2.3, the information maximizing quantizer for binary input separates the LLR $L(\mathbf{a})$ by using a threshold quantizer $Q_L: \mathbb{R} \to \mathcal{T}$ with $|\mathcal{T}| - 1$ thresholds, i.e., the relation
$$t = g_{\mathrm{MI}}(\mathbf{a}) = Q_L\big(L(\mathbf{a})\big) = Q_L\Big( L(z) + \sum_{m=1}^{d_V - 1} L(a_m) \Big) \tag{20}$$
can be determined that achieves the same output as the information optimal mLUT in (16). However, to ensure that (20) produces the same output as the information optimal mLUT, calculations over real numbers are required. In the next subsection, we show that we can exploit (20) to find a calculation that requires only a finite resolution. We also provide a condition to limit the resolution that is required for exact calculation of the information optimal mLUT.

Computations over Integers
The VN update structure using the computational domain framework is visualized in Figure 6. As suggested by [14,16], a possible choice for the integer mappings $\phi_v(m)$ and $\phi_{ch}(z)$ is given by scaling and rounding the corresponding LLRs $L(m)$ and $L(z)$, respectively. In addition to [14,16], we provide further insights on the optimal choice of the scaling factor based on the relation between the VN update of the BP algorithm and the MI maximizing quantizer design. More precisely, based on the established relation in (20), we define an integer mapping for the channel information $z$ and the CN-to-VN messages $a_m$ in order to replace the computations over real numbers by computations over signed integers:
$$r = \phi(z) = \lfloor s L(z) \rceil, \quad r_m = \phi_V(a_m) = \lfloor s L(a_m) \rceil, \quad W_s(\mathbf{a}) = r + \sum_{m=1}^{d_V - 1} r_m, \quad t = g_{\mathrm{MIC}}(\mathbf{a}) = Q_V\big(W_s(\mathbf{a})\big), \tag{21}$$
where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer (away from 0 if the fractional part is .5). Compared to (20), the LLRs $L(z)$ and $L(a_m)$ have been multiplied by a non-negative scaling factor $s$ and rounded to the $n_R$-bit signed integers $r$ and $r_m$, respectively. Subsequently, the sum of integers is limited again to $n_E$ bits by the threshold quantizer $Q_V$. We can also interpret the scaling and rounding operation directly as a mapping of the signed integer messages $z$ and $a_m$ to signed integer messages whose representation requires $n_R$ bits, depending on the scaling factor $s$.
In the following, we show that we can always find a threshold quantizer $Q_V: \mathcal{W} \to \mathcal{T}$ that maps the summation $W_s(\mathbf{a})$ into a VN-to-CN message $t \in \mathcal{T}$ that is identical to the VN-to-CN message of the information optimal VN update in (20), i.e., $t = g_{\mathrm{MIC}}(\mathbf{a}) = g_{\mathrm{MI}}(\mathbf{a})$. First, we consider the set of messages $\mathcal{A}_t = \{\mathbf{a} \in \mathcal{V} : g_{\mathrm{MI}}(\mathbf{a}) = t \in \mathcal{T}\}$ that are mapped into a specific output $t$ via the information maximizing VN update $g_{\mathrm{MI}}(\mathbf{a})$ in (16). Thus, we can identify a corresponding set of integers $\mathcal{W}_t = \{W_s(\mathbf{a}) \in \mathcal{W} : \mathbf{a} \in \mathcal{A}_t\}$. By varying the scaling factor $s$, we can always find a scaling value $s \le d_V / \Delta_{\min}$ such that the sets of integer values $\mathcal{W}_t$ for all $t \in \mathcal{T}$ are non-overlapping intervals, i.e.,
$$E_t < D_{t'} \quad \text{for any two neighbouring clusters } t < t', \tag{23}$$
with $D_t = \min \mathcal{W}_t$ and $E_t = \max \mathcal{W}_t$. Condition (23) ensures that any two different clusters $t$ and $t'$ can be separated by a simple threshold operation. The value $\Delta_{\min}$ is the minimum separation between the LLRs $L(\mathbf{a})$ of the elements of any two neighbouring clusters in (20) and is always larger than zero since $Q_L$ is a threshold quantizer. If we consider a scaled version of the LLRs $s L(\mathbf{a})$ with any real valued scaling factor $s > 1$, we can always find a threshold quantizer $Q_{L,s}$ that achieves the same output as the information optimal mLUT. Scaling the LLRs $L(\mathbf{a})$ by a factor of $d_V / \Delta_{\min}$ ensures that the minimum separation between any two neighbouring clusters is $d_V$. Since each of the $d_V$ rounding operations changes its argument by at most $1/2$, the influence of the rounding can be bounded by $-\frac{d_V}{2} \le W_s(\mathbf{a}) - s L(\mathbf{a}) < \frac{d_V}{2}$. Hence, scaling with a factor of at least $d_V / \Delta_{\min}$ ensures that any two neighbouring clusters $\mathcal{W}_t$ and $\mathcal{W}_{t+1}$ are separated by at least one integer and, thus, condition (23) is satisfied. Consequently, we can always find a corresponding integer function $g_{\mathrm{MIC}}(\mathbf{a})$ in (21) that generates exactly the same output as $g_{\mathrm{MI}}(\mathbf{a})$ in (20).
Furthermore, an approximate integer calculation is found if the integer valued range of $\phi$ and $\phi_V$ is limited to $n_R$ bits, where $n_R = \lceil \log_2(\lfloor s L_{\max} \rceil) \rceil + 1$ is the bit resolution that is required for exact representation if the largest magnitude of the individual LLRs in (20) is $L_{\max}$. If condition (23) is not fulfilled, we select the output cluster that maximizes MI. If (24) is satisfied, the required bit resolution of the summation $W_s(\mathbf{a})$ in (21) is limited by the bound given in (25). To consider the influence of this new mapping in the design of subsequent iterations, we also update the VN-to-CN distribution in (17). We note that the MIC design approach can also be applied to the design of CN operations and can also be used to generate exact or approximate representations of nested tree decompositions similar to the sLUT method. However, the corresponding investigations are beyond the scope of this paper.
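The separability argument can be checked numerically with a small Python sketch. Everything in it is an illustrative assumption rather than data from the paper: the LLR tables `L_Z` and `L_A`, the thresholds of the LLR quantizer, the 2-bit label alphabet and the degree $d_V = 3$ are hypothetical, and the helper names are invented for this example. Exact rational arithmetic is used so that rounding ties are handled exactly as "away from zero".

```python
import math
from fractions import Fraction
from itertools import product

def round_half_away(x):
    """Round to the nearest integer, ties away from zero."""
    n = math.floor(abs(x) + Fraction(1, 2))
    return n if x >= 0 else -n

def quantize_llr(llr, thresholds, labels):
    """Threshold quantizer Q_L: the label index is the number of thresholds <= llr."""
    return labels[sum(llr >= th for th in thresholds)]

def clusters(l_z, l_a, d_v, thresholds, labels, s):
    """Per output label t, collect the LLR sums L(a) and the integer sums W_s(a)."""
    llr_sets = {t: [] for t in labels}
    int_sets = {t: [] for t in labels}
    for z in l_z:
        for a in product(l_a, repeat=d_v - 1):
            llr = l_z[z] + sum(l_a[m] for m in a)
            w = round_half_away(s * l_z[z]) + sum(round_half_away(s * l_a[m]) for m in a)
            t = quantize_llr(llr, thresholds, labels)
            llr_sets[t].append(llr)
            int_sets[t].append(w)
    return llr_sets, int_sets

def min_separation(sets, labels):
    """Smallest gap between neighbouring clusters (Delta_min for LLR sets)."""
    used = [t for t in labels if sets[t]]
    return min(min(sets[hi]) - max(sets[lo]) for lo, hi in zip(used, used[1:]))

def separable(sets, labels):
    """True if neighbouring clusters occupy non-overlapping ranges."""
    return min_separation(sets, labels) > 0

# Hypothetical 2-bit example (labels -2, -1, 1, 2) for a degree-3 VN.
LABELS = [-2, -1, 1, 2]
L_Z = {-2: Fraction(-29, 10), -1: Fraction(-9, 10), 1: Fraction(9, 10), 2: Fraction(29, 10)}
L_A = {-2: Fraction(-17, 10), -1: Fraction(-4, 10), 1: Fraction(4, 10), 2: Fraction(17, 10)}
THRESHOLDS = [Fraction(-2), Fraction(0), Fraction(2)]
D_V = 3

if __name__ == "__main__":
    llr_sets, _ = clusters(L_Z, L_A, D_V, THRESHOLDS, LABELS, 1)
    delta_min = min_separation(llr_sets, LABELS)
    s_exact = D_V / delta_min
    print("Delta_min =", delta_min, " sufficient scaling s =", s_exact)
    for s in (s_exact, Fraction(1, 2)):
        _, int_sets = clusters(L_Z, L_A, D_V, THRESHOLDS, LABELS, s)
        print("s =", s, " separable by integer thresholds:", separable(int_sets, LABELS))
```

In this toy setting the scaling $s = d_V / \Delta_{\min}$ yields non-overlapping integer clusters, whereas a very small scaling factor merges integer sums of different clusters, which corresponds to the approximate case discussed above. Smaller scaling factors than $d_V / \Delta_{\min}$ may still be sufficient, since the criterion is only a sufficient condition.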

Illustrative Example for MIC Calculations
To illustrate the proposed MIC approach, we consider the design of a VN update for a ($d_V = 3$, $d_C = 6$) regular LDPC code at iteration $i = 1$ with design $E_b/N_0 = 2.5$ dB. Figure 7a shows the equivalent LLR quantizer (20). In the case of $s = 10$, all output clusters are separated by using seven integer thresholds, which are indicated by dashed lines in Figure 7b. In this case, the integer computation fully replaces the original mLUT functionality by using only signed integers of low range. To clarify the example, the numeric values of the corresponding LLRs and integer mappings of (19) and (21) for $s = 10$ are shown in Table 1. For example, the quantized receive message $z = 2$ corresponds to an LLR of $L(z) = 1.56$, leading to the $n_R$-bit signed integer message $r = \phi(z) = \lfloor 15.6 \rceil = 16$. After summation of $r$ and $r_m$, all results $12 \le W_{10}(\mathbf{a}) \le 23$ are mapped back to the $n_E$-bit message $t = 2$.
Table 1. Numeric values of the LLRs and integer mappings of (19) and (21) for $s = 10$; among the output ranges of $W_{10}(\mathbf{a})$ are $\pm[1, 11]$ and $\pm[12, 23]$.
For $s = 3$ and $s = 1$, the integer range is further reduced, but the original mLUT functionality cannot be represented exactly since some integer sums are mapped to more than one output cluster of the original mLUT (e.g., some values of $W_s(\mathbf{a})$ are mapped into both cluster $t = 1$ and cluster $t = 2$, as highlighted in Figure 7c). If some values of $W_s(\mathbf{a})$ are assigned to more than one cluster of the information maximizing mLUT mapping, these values have to be merged into a single cluster. This merging generally leads to an inevitable loss of information. In order to find a corresponding threshold quantizer for this case, we select the output cluster that minimizes the information loss under the condition that (23) is fulfilled.

FER Results
In this section, we discuss the communication performance of the proposed MIC decoder for an irregular LDPC code from the 802.11n standard [27] of length $N = 648$ with rate $R = 0.75$ and edge degree distributions $\lambda(\xi) = 0.2083\xi^{1} + 0.3333\xi^{2} + 0.25\xi^{3} + 0.2083\xi^{5}$ and $\rho(\xi) = \frac{1}{3}\xi^{13} + \frac{2}{3}\xi^{15}$. The realization of the MIC decoder is characterized by three quantization parameters and specified by MIC($n_E$, $n_Q$, $n_R$). In contrast, the Min-mLUT decoder with label-based minimum operation as CN update has only two parameters and is denoted by Min-mLUT($n_E$, $n_Q$). Figure 8 shows the Frame Error Rate (FER) performance of Min-mLUT and MIC for $n_E = n_Q = 3$ and $I = 10$ iterations, with varying resolution of the internal messages $n_R \in \{4, 5, 6, 7, 11\}$ for MIC. The BP decoder with double precision serves as our benchmark simulation. The Min-mLUT decoder with $n_E = n_Q = 3$ bit quantization for the message exchange and channel information results in a minor performance degradation of only 0.2 dB at a FER of $10^{-3}$ w.r.t. the benchmark simulation. In comparison, the proposed MIC decoder that replaces the VN update of the Min-mLUT decoder by using the computational domain framework with internal messages of size $n_R = 4$ results in a loss of 0.25 dB compared to the Min-mLUT decoder. The performance gain of the MIC decoder with $n_R = 5$ compared to $n_R = 4$ is around 0.1 dB. The MIC decoder with $n_R = 7$ has essentially the same FER performance as the Min-mLUT decoder. If $n_R = 11$, the MIC decoder represents the mLUT functionality exactly by meeting the criterion (23), but the gain in communication performance compared to the MIC decoder with $n_R = 7$ is negligible. Additionally, MIC decoding does not require LUTs with up to 262k entries for each iteration.

Finite Alphabet Message Passing (FA-MP) Decoder Implementation
In this section, we investigate the implementation complexity of different LUT-based FA-MP decoders in terms of area, throughput, latency, power, area efficiency, and energy efficiency and compare them with a state-of-the-art Normalized Min-Sum (NMS) decoder. As already stated, we focus on unrolled full parallel (FP) decoder architectures that enable throughput towards 1 Tb/s. The architecture template is shown in Figure 9. The inputs to the decoder are the compressed messages $z$ from the channel quantizer. The decoder uses two-phase decoding. Hence, each iteration consists of two stages: one stage comprises $M$ Check Node Functional Units (CFUs) and the second stage comprises $N$ Variable Node Functional Units (VFUs). The stages are connected by hardwired routing networks, which implement the edges of the Tanner graph. Since the decoding iterations are unrolled, the decoder consists of $2 \cdot I$ stages. Deep pipelining is applied to increase the throughput. For more details on this architecture, the reader is referred to [5]. In FP decoders that use the NMS algorithm, node operations are implemented as additions and minimum searches on uniformly quantized messages [5]. In contrast, node functionality in Finite Alphabet (FA) decoders is implemented as LUTs. Implementing a single LUT as memory is impractical in Application-Specific Integrated Circuit (ASIC) technologies since the area and power overhead would be too large. Hence, a single LUT is transformed into $n_E$ Boolean functions $b: \mathbb{B}^{inp} \to \mathbb{B}$, with $inp$ being the number of inputs of the LUT, which is the node degree multiplied by $n_E$. $b$ can consist of up to $2^{inp}$ product terms if $b$ is represented in sum-of-product form. State-of-the-art logic synthesis tools try to minimize $b$ such that it can be mapped onto a minimum number of gates. Despite this optimization, the resulting logic can be very large for higher node degrees and/or $n_E$, making this approach unsuitable for efficient FP decoder implementation.
It was shown in [7] that the mLUT can be decomposed into a set of two-input sLUTs arranged in a tree structure, which largely reduces the resulting logic at the cost of a small degradation in error correction performance. To compare these approaches with our new decoder, we implemented four different types of FP decoders:
• NMS decoder with an extrinsic message scaling factor of 0.75;
• Two LUT-based decoders: in these decoders, we implemented the VN operation by LUTs and the CN operations by a minimum search on the quantized messages. The latter corresponds to the CN Processor implementation of [7]. The LUTs are implemented either as a single LUT (mLUT), or as a tree of two-input LUTs (sLUT);
• Our new MIC decoder, in which the VN is replaced by the new update algorithms presented in the previous section.
For the MIC and LUT-based decoders, we investigated message quantizations of $n_E = 3$ and $n_E = 4$. The respective references are NMS decoders with $n_E = 4$ and $n_E = 5$. For all decoders, the channel and message quantization were set to be identical, i.e., $n_E = n_Q$. We used a different code for our implementation investigation than in the previous sections. This code has a larger block size, which implies increased implementation complexity. The code is a (816, 406) regular LDPC code with $d_V = 3$ and $d_C = 6$, and the number of decoding iterations is $I = 8$.
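A rough feeling for the routing effort implied by these parameters can be obtained with the short Python sketch below. It assumes that every variable node has degree $d_V$ and that each message bit on each Tanner graph edge needs one wire per routing network; pipeline registers and buffers are ignored. Both assumptions and the function name are ours and serve only as an illustration.

```python
def routing_wires(n: int, d_v: int, n_e: int):
    """Return (edges, wires) for one routing network.

    edges: number of Tanner graph edges, N * d_V for a VN-regular code.
    wires: edges * n_E, assuming one wire per message bit (assumption).
    """
    edges = n * d_v
    return edges, edges * n_e


# Code parameters from the text: N = 816, d_V = 3.
for n_e in (3, 4, 5):
    edges, wires = routing_wires(816, 3, n_e)
    print(f"n_E = {n_e}: {edges} edges, {wires} wires per routing network")
```

Under these assumptions, every additional message bit adds 2448 wires to each routing network, which is one reason why a decoder that gets by with one bit less than the NMS reference is attractive from a routing perspective.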
We used Synopsys Design Compiler and IC Compiler II for implementation in a 28 nm Fully-Depleted Silicon-on-Insulator (FD-SOI) technology under worst-case Process, Voltage and Temperature (PVT) conditions (125 °C, 0.9 V for timing, 1.0 V for power). A process with eight metal layers was chosen. Metal layers 1 to 6 are used for routing, with metals 1 and 2 mainly intended for standard cells. Metal layers 7 and 8 are used only for the power supply. Power numbers were calculated with back-annotated wiring data and input data for a FER of $10^{-4}$. All designs were optimized for high throughput with a target frequency of 1 GHz during synthesis and back-end. To assess the routing congestion, we fixed the utilization to 70 % for all designs as a constraint. The utilization specifies the ratio between the logic cell area and the total area (logic cell area plus routing area). Thus, by fixing this parameter, all designs have the same routing area available in relation to their logic cell area.

Figures 10 and 11 show the FER performance of the different decoders. We compare the NMS decoder with the MIC decoder and the two LUT-based decoders. The LUTs of the FA-MP decoders are designed for a design Signal-to-Noise Ratio (SNR) that is optimized for an FER of $10^{-4}$. It should be noted that this may result in error floor behavior below the target FER. This phenomenon can be mitigated by selecting a larger design SNR at the cost of decreased performance in the waterfall region [13]. For comparison, we also added the BP performance with double-precision floating-point number representation.

FER Performance of Implemented FA-MP Decoders
In the previous section, we showed that, for the (648, 486) code, the MIC decoder achieves the same error correction performance as the Min-mLUT decoder for $n_R = 7$. A similar observation was made for the (816, 406) code considered here. In our implementation comparison, we reduced $n_R$ such that the MIC's FER stays below that of the NMS at the target FER of $10^{-4}$. In this way, we obtained $n_R = 5$, which yields a small degradation in the MIC FER compared to the Min-sLUT and Min-mLUT decoders, but still outperforms the NMS decoder. We observe that the MIC and Min-mLUT decoders, despite a message quantization $n_E$ that is one bit smaller, have better error correction capability than the NMS decoder at the target FER. In addition, due to the low message quantization and the resulting low dynamic range, the NMS runs into an error floor below an FER of $10^{-4}$.

Table 2 shows the implementation results for the MIC(3,3,5), Min-mLUT(3,3), Min-sLUT(3,3) and NMS(4,4) decoders, whereas Table 3 shows the implementation results for the MIC(4,4,5), Min-mLUT(4,4), Min-sLUT(4,4) and NMS(5,5) decoders. As already stated, we fixed the target frequency to 1 GHz and the utilization to 70 % for all decoders. The maximum achievable frequency $f$, final utilization, area $A$ and power consumption $P$ were extracted from the final layout data. From these data, we can derive the important implementation metrics: throughput, latency, area efficiency and energy efficiency. Since the decoders are pipelined, the coded decoder throughput is $T = f \cdot N$. The latency is $26 / f$: each iteration consists of three pipeline stages, and the decoder input and output are also buffered, yielding $8 \cdot 3 + 2 = 26$ pipeline stages in total. The area efficiency is defined as $T/A$ and the energy efficiency as $P/T$. The Min-mLUT decoder has the largest area, the worst area efficiency, and the worst energy efficiency. We see an improvement in these metrics for the Min-sLUT at the cost of a slightly decreased error correction performance.
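The derived metrics defined above can be written down as a small Python sketch. The function and key names are illustrative. The example call uses the 1 GHz target frequency, $N = 816$ and $I = 8$ from the text, while the area and power arguments are hypothetical placeholders and not values from Table 2 or Table 3.

```python
def decoder_metrics(f_hz, n, iterations, area_mm2, power_w,
                    stages_per_iteration=3, io_stages=2):
    """Derive throughput, latency and efficiencies of a pipelined,
    unrolled decoder from post-layout frequency, area and power."""
    throughput = f_hz * n                       # coded bits per second
    pipeline_stages = iterations * stages_per_iteration + io_stages
    latency = pipeline_stages / f_hz            # seconds
    return {
        "throughput_bps": throughput,
        "pipeline_stages": pipeline_stages,
        "latency_s": latency,
        "area_efficiency_bps_per_mm2": throughput / area_mm2,  # T / A
        "energy_efficiency_j_per_bit": power_w / throughput,   # P / T
    }


# Target frequency of 1 GHz; area and power are placeholder values.
m = decoder_metrics(f_hz=1e9, n=816, iterations=8,
                    area_mm2=1.0, power_w=1.0)
print(m["throughput_bps"] / 1e9, "Gb/s")
print(m["latency_s"] * 1e9, "ns")
```

At the 1 GHz target frequency, the throughput would be 816 Gb/s at a latency of 26 ns; the frequencies actually achieved after layout are lower, so the reported throughputs scale down accordingly. Note that with the definition $P/T$, a smaller energy efficiency value (energy per bit) is better.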
The difference in the implementation metrics increases considerably when $n_E$ is raised from 3 to 4. The area increases by a factor of 10 for the Min-mLUT(4,4) decoder, but only by a factor of 2.7 for the Min-sLUT(4,4) decoder. Moreover, we had to reduce the utilization to 50 % to achieve routing convergence for the Min-mLUT(4,4) decoder. The large area increase can be explained by the growth of the LUT sizes from 512 to 4096 entries per LUT when increasing $n_E$ from 3 to 4. In addition, the achievable frequency drops sharply, yielding a very low area efficiency and energy efficiency. The Min-sLUT decoders scale better with increasing $n_E$. Both Min-sLUT decoders outperform the corresponding NMS decoders in throughput and efficiency metrics.

FD-SOI Implementation Results
The MIC decoder has the best implementation metrics in all cases. It outperforms all other decoders in throughput, area, area efficiency and energy efficiency while having the same or even slightly improved error correction performance compared to the other decoders. It can also be seen that the MIC decoder has a lower routing complexity than the Min-sLUT and NMS decoders. We observe a large drop in frequency from 595 MHz down to 183 MHz (roughly 70 % decrease) when comparing NMS(4,4) with NMS(5,5) under the utilization constraint of 70 %. This drop can be explained by the increased routing complexity for the given routing area constraint, which yields longer wires and correspondingly larger delays. The problem is less severe for the Min-sLUT, where the frequency drops from 670 MHz to 492 MHz (27 % decrease). The MIC achieves the highest frequency in all cases and drops only from 775 MHz to 633 MHz (18 % decrease). This shows that the MIC scales much better with increasing $n_E$.
It should be noted that the CFU implementation is identical for the MIC, Min-mLUT and Min-sLUT decoders. Compared to the corresponding NMS, the CFU implementation is less complex [19] due to: (i) a 1 bit smaller message quantization, (ii) the omission of the scaling unit, and (iii) the omission of the sign-magnitude to two's complement conversion. Hence, the CFU complexity of the FA-MP decoders is always lower than that of the NMS, independent of the respective CN degree. Moreover, in contrast to the NMS decoder, the messages from the CFUs to the VFUs are transmitted in sign-magnitude representation via the routing network, which reduces the toggling rate and thus the average power consumption.

Table 2. Post-layout results of FA-MP decoders with $n_E = n_Q = 3$, $n_R = 5$ (columns: MIC, Min-mLUT, Min-sLUT, NMS).

Figure 12 shows the layouts of the MIC and the NMS decoder at the same scale. Each color represents one iteration stage, which is composed of CFUs, VFUs, and the routing between the nodes (see also Figure 9). When comparing the same iteration stages (same color) of the two decoders, we can observe that the iteration stages in the MIC decoder are smaller than the corresponding iteration stages in the NMS decoder, although the frequency of the MIC decoder is more than three times as high as that of the NMS decoder. This shows once again that the MIC has a lower implementation complexity, especially from a routing perspective.

Our analysis shows that the new MIC approach largely improves the implementation efficiency and exhibits better scaling compared to the state-of-the-art sLUT and NMS implementations of FP decoder architectures. Mainly due to the reduced routing complexity, this enables the processing of larger block sizes. Larger block sizes improve the error correction capability and further increase the throughput of FP architectures.

Conclusions
This paper provides a detailed investigation of the Minimum-Integer Computation (MIC) decoder for regular and irregular Low-Density Parity Check (LDPC) codes. The MIC decoder utilizes the computational domain framework to realize Variable Node (VN) updates by an equivalent low-range signed integer computation and Check Node (CN) updates by a minimum search. For the VN update, we provide further insights into the design of a Mutual Information (MI) maximizing signed integer computation. To discuss implementation issues of FA-MP decoding architectures, we considered different LUT-based decoder designs as examples. Furthermore, we compared MIC to state-of-the-art Normalized Min-Sum (NMS) decoder implementations to show the improvement in area efficiency and energy efficiency.

Funding: This research was funded by the German Ministry of Education and Research (BMBF) projects "FunKI" (grants 16KIS1180K and 16KIS1185) and "Open6GHub" (grants 16KISK016 and 16KISK004).
Institutional Review Board Statement: Not applicable.