Low-Complexity High-Throughput QC-LDPC Decoder for 5G New Radio Wireless Communication

This paper presents a pipelined layered quasi-cyclic low-density parity-check (QC-LDPC) decoder architecture targeting low complexity, high throughput, and efficient use of hardware resources, compliant with the specifications of the 5G new radio (NR) wireless communication standard. First, a combined min-sum (CMS) decoding algorithm, which is a combination of the offset min-sum (OMS) and the original min-sum (MS) algorithms, is proposed. Then, a low-complexity and high-throughput pipelined layered QC-LDPC decoder architecture for the enhanced mobile broadband (eMBB) specifications in 5G NR wireless standards, based on the CMS algorithm with pipelined layered scheduling, is presented. Enhanced versions of check node-based processor architectures are proposed to reduce the complexity of the LDPC decoders. An efficient minimum-finder for the check node unit architecture that reduces the hardware required for the computation of the first two minima is introduced. Moreover, a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, is presented. The proposed architecture shows significant improvements in terms of area and throughput compared to other QC-LDPC decoder architectures available in the literature.


Introduction
Low-density parity-check (LDPC) codes [1] were first introduced by R. Gallager in the early 1960s and later rediscovered by MacKay and Neal [2] in 1996. Owing to their excellent error correction performance and highly parallel implementation characteristics, LDPC codes have been among the most popular forward error correction (FEC) codes of the past several decades. Since their rediscovery, LDPC codes have served as a foundation of modern coding theory, which relies mainly on Shannon theory [3] together with extremely sparse code characterization and probabilistic message-passing algorithms [4]. Moreover, LDPC codes inherit some of the desirable characteristics of linear block codes, for instance, the simple and sparse structure of the parity-check matrix H, which can be represented as a bipartite graph called a Tanner graph [5]. This graphical approach makes it easier to analyze and visualize the underlying mathematical formulations [6].
As data rates increase while error correction performance must remain acceptable, the realization of channel coding schemes has become crucial for all modern communication systems. The decoding of an LDPC code can be deployed with a high degree of parallelism, which is essential to achieve high decoding throughput and low hardware complexity. Therefore, LDPC codes are promising solutions for high data rate applications such as wide-band wireless multimedia communications and magnetic storage systems [7,8]. Generally, well-constructed irregular LDPC codes perform much better than regular ones [9]. Although the very-large-scale integration (VLSI) architecture for irregular LDPC codes incurs higher complexity, many practical applications and standards such

Introduction
According to ITU-R, there are three primary 5G NR use cases defined by the 3GPP as part of its Study on New Services and Markets Technology Enablers (SMARTER) project. The three sets of use cases are [27]: eMBB, Ultra-Reliable Low Latency Communications (URLLC), and Massive Machine Type Communications (mMTC).
The initial phase of 5G deployments focuses on the eMBB use case. The eMBB traffic can be regarded as a direct extension of the 4G broadband service. The eMBB scenario is required to support a wider range of code rates, code lengths, and modulation orders than 4G Long Term Evolution (LTE). The eMBB offers peak data rates of 20 Gbps and provides user data rates of 100 Mbps to numerous users. Recently, LDPC codes have been selected as the coding scheme for the 5G eMBB data channel [14]. The NR access technology marks a great transition in channel coding for the 3GPP cellular technologies [28]. This section summarizes the basic features of the standard 5G QC-LDPC codes. Furthermore, the construction procedure for the parity-check matrix of the target LDPC codes is presented.

Quasi-Cyclic LDPC Codes
QC-LDPC codes [29], which are a class of structured LDPC codes, are widely used in many practical applications. A binary (N, K) QC-LDPC code is characterized by the null space of an M × N parity-check matrix H, which consists of an array of circulant matrices of the same size [30,31]. The parity-check matrix of a QC-LDPC code can be described by its base graph and shift coefficients. Each 1 in the base graph is replaced by a circulant permutation matrix of size Z × Z, and each 0 is replaced by a zero matrix of the same size.
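Denote $Z$ as the size of a circulant permutation matrix and $P_{m,n}$ as the shift coefficient value. For any integer value $P_{m,n}$, $0 \le P_{m,n} \le Z$, a $Z \times Z$ circulant permutation matrix is obtained by cyclically shifting the $Z \times Z$ identity matrix $I$ to the right by $P_{m,n}$ positions for the $(m, n)$th non-zero element in the base matrix. This binary circulant permutation matrix is denoted as $Q(P_{m,n})$. For simple notation, $Q(-1)$ denotes the null matrix (i.e., all elements equal to zero) of the same size. Considering $Q(1)$ as an example,

$$Q(1) = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{bmatrix} \tag{1}$$

For two positive integers $m_b$ and $n_b$, with $m_b \le n_b$, consider the QC-LDPC code expressed by the following $m_b \times n_b$ array of $Z \times Z$ circulants over GF(2):

$$H = \begin{bmatrix} Q(P_{1,1}) & Q(P_{1,2}) & \cdots & Q(P_{1,n_b}) \\ Q(P_{2,1}) & Q(P_{2,2}) & \cdots & Q(P_{2,n_b}) \\ \vdots & \vdots & \ddots & \vdots \\ Q(P_{m_b,1}) & Q(P_{m_b,2}) & \cdots & Q(P_{m_b,n_b}) \end{bmatrix} \tag{2}$$

The exponent matrix of $H$, denoted $E(H)$, has the following form:

$$E(H) = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,n_b} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,n_b} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m_b,1} & P_{m_b,2} & \cdots & P_{m_b,n_b} \end{bmatrix} \tag{3}$$

Each entry of the matrix $E$ is referred to as a shift value. It should be noted that the parity-check matrix $H$ in Equation (2) can be constructed by expanding the $m_b \times n_b$ exponent matrix $E(H)$. This procedure is referred to as protograph construction [32]. It follows that the parity-check matrix $H$ is of size $M \times N$, where $M = Z \times m_b$ and $N = Z \times n_b$.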
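The expansion of an exponent matrix into a binary parity-check matrix can be illustrated with a minimal Python sketch. The function names `circulant` and `expand` and the toy exponent matrix `E` are assumptions introduced here for illustration only; `E` is not one of the 5G base graphs.

```python
def circulant(shift, z):
    """Return Q(shift): the z x z identity cyclically shifted right by `shift`.

    A shift of -1 yields the all-zero (null) matrix.
    """
    if shift == -1:
        return [[0] * z for _ in range(z)]
    return [[1 if c == (r + shift) % z else 0 for c in range(z)]
            for r in range(z)]


def expand(exponent_matrix, z):
    """Expand an m_b x n_b exponent matrix into the (m_b*z) x (n_b*z) matrix H."""
    h = []
    for block_row in exponent_matrix:
        blocks = [circulant(p, z) for p in block_row]
        for r in range(z):
            # Concatenate row r of every circulant in this block row.
            h.append([bit for blk in blocks for bit in blk[r]])
    return h


# Toy exponent matrix (hypothetical, m_b = 2, n_b = 3) lifted with Z = 3.
E = [[0, 1, -1],
     [2, -1, 0]]
H = expand(E, 3)
print(len(H), len(H[0]))  # number of rows M and columns N of H
```

With this toy input, H has M = Z × m_b = 6 rows and N = Z × n_b = 9 columns, in line with the size relation stated above.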

5G New Radio QC-LDPC Characteristics
As stated before, QC-LDPC codes play a significant role in 5G NR communications and have been adopted as the error correction coding scheme for the 5G eMBB data channel in the 3GPP standardization meetings. Figure 1 shows the general structure of the base graph for the NR QC-LDPC codes. The columns are composed of three parts: information columns, core parity columns, and extension parity columns. The rows are divided into two parts: core check rows and extension check rows. As can be observed, the base matrix is partitioned into five submatrices, namely, A, B, O, C, and I [17]. Submatrix A corresponds to the systematic bits. Submatrix B corresponds to the first set of parity bits and is a square matrix with a dual-diagonal structure: its first column is of weight 3, whereas the submatrix composed of the remaining columns has an upper dual-diagonal structure. Submatrix O is an all-zero matrix. For the efficient support of Incremental Redundancy Hybrid Automatic Repeat Request (IR-HARQ), a single parity-check (SPC)-based extension is used to support lower rates, as shown in Figure 1. Submatrix C corresponds to the SPC rows, and I is an identity matrix that corresponds to the second set of parity bits, i.e., the SPC extension. The combination of A and B is referred to as the kernel, and the other parts (O, C, and I) are referred to as extensions. This code structure is similar to the Raptor-like extension, as described in [15].
The 3GPP has finalized two types of rate-compatible base graphs for the channel coding, namely BG1 and BG2. The two base graphs have similar structures. However, BG1 is designed for larger block lengths (500 ≤ K ≤ 8448) and higher rates (1/3 ≤ R ≤ 8/9), whereas BG2 is used for smaller block lengths (40 ≤ K ≤ 2560) and lower rates (1/5 ≤ R ≤ 2/3). The actual base graph usage and the definition of the two base matrices are provided in the NR standard specification TS 38.212 [15]. The base graph that supports $K_{\max}$ must support the set of shift sizes $Z = a \times 2^j$ for $a \in \{2, 3, 5, 7, 9, 11, 13, 15\}$ and $0 \le j \le 7$. Each of the base graphs BG1 and BG2 has eight shift coefficient designs. All lifting sizes are categorized into eight sets based on the parameter $a$ used in the definition of the lifting size $a \times 2^j$. Table 1 presents the set of shift coefficients.
The five steps for constructing the parity-check matrix of the (N, K) QC-LDPC code with a target information block size $K$ and code rate $R = K/N$ are listed below. For a base graph, let $k_b$ denote the number of information circulant columns; thus, if the lifting size is $Z$, nominally $K = Z \times k_b$.

• Step 1: Consider the base graph BG1 or BG2 and select the value of $k_b$ for the corresponding $K$ and $R$.
• Step 2: Determine $Z$ by choosing the minimum $Z$ value in Table 2 such that $k_b \times Z \ge K$.
• Step 3: After the lifting size $Z$ is determined, select the corresponding shift coefficient matrix from Table 1 {Set 1, Set 2, ..., Set 8} according to $Z$.
• Step 4: Calculate the shift coefficient value $P_{i,j}$ by the modulo-$Z$ operation, as defined in Equation (4).
• Step 5: Substitute each entry in the final exponent matrix by the corresponding circulant permutation matrix or zero matrix of size $Z \times Z$.

This completes the QC-LDPC code construction and yields a parity-check matrix $H$ of size $m_b Z \times n_b Z$. In 5G QC-LDPC codes, shortening and puncturing are performed to achieve the desired information lengths and rate adaptation. An illustration of the encoding process of 5G QC-LDPC codes is presented in Figure 2.
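Steps 2 to 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the cap `z_max = 384` is taken from TS 38.212 rather than from the text above, the symbol `v` for the tabulated shift entry is hypothetical, and the example values K = 1232 and k_b = 22 are inferred from Z = 56 and the BG1 dimensions n_b − m_b = 68 − 46 used later in this paper.

```python
A_VALUES = (2, 3, 5, 7, 9, 11, 13, 15)


def lifting_sizes(z_max=384):
    """All Z = a * 2**j with a in A_VALUES and 0 <= j <= 7, capped at z_max."""
    sizes = {a * 2 ** j for a in A_VALUES for j in range(8)}
    return sorted(z for z in sizes if z <= z_max)


def select_z(k, k_b):
    """Step 2: smallest lifting size Z such that k_b * Z >= K."""
    for z in lifting_sizes():
        if k_b * z >= k:
            return z
    raise ValueError("K is too large for this base graph")


def shift_set_index(z):
    """Step 3: index (1..8) of the set whose parameter a generates Z."""
    odd = z
    while odd % 2 == 0:
        odd //= 2
    a = 2 if odd == 1 else odd  # powers of two belong to the a = 2 set
    return A_VALUES.index(a) + 1


def shift_coefficient(v, z):
    """Step 4: keep -1 for a null block, otherwise reduce the entry modulo Z."""
    return -1 if v == -1 else v % z


z = select_z(k=1232, k_b=22)
print(z, shift_set_index(z), shift_coefficient(250, z))
```

For the assumed K = 1232 and k_b = 22, the sketch selects Z = 56, which belongs to the set generated by a = 7.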

Combined Min-Sum Algorithm
The standardized LDPC codes for 5G NR channel coding use two base matrices, BG1 and BG2 [15][16][17], to support rate-compatible and scalable data transmission. An example of the BG1 matrix structure with Z = 56 is presented in Figure 3. Efficient LDPC decoder implementation is a significant task when designing the physical layer for any wireless standard, and it poses many research challenges at the very early stage of a standard's deployment. It is therefore timely to devise an efficient LDPC decoder architecture that is compliant with the 5G NR standard. As shown in Figure 3, the main challenge is to design an LDPC decoder architecture that supports the very large BG1 matrix and decodes codewords of 3808 encoded bits. Our work aims to present an efficient LDPC decoder architecture for 5G NR wireless communication systems.
A notable characteristic of the 5G NR LDPC codes is that they are highly irregular for both check nodes and variable nodes. Table 3 shows the distribution of the check node degree $d_c$ in the two base graphs BG1 and BG2 specified for 5G NR LDPC codes. For BG1, the check node degree $d_c$ varies widely from 3 to 19. In particular, only four rows in the core part of BG1 have the highest check node degree of $d_c^{\max} = 19$, while most rows have low check node degrees of $d_c = 5$ and 6. Similarly, the check node degree $d_c$ in BG2 varies widely from 3 to 10, and only two rows have $d_c^{\max} = 10$. It can be concluded that the check node degrees in the two base matrices of 5G NR LDPC codes vary drastically.
The LDPC codes for 5G NR channel coding are in fact concatenated codes, obtained by combining a pure LDPC part with a low-density generator matrix (LDGM) part. The check node degrees of the 5G LDPC codes differ significantly: the check node degree of the pure LDPC part is more than ten, whereas the check node degree is only one in the LDGM part. This creates one of the biggest challenges in designing a high-performance decoder for 5G LDPC codes, since their numerous degree-1 variable nodes are more prone to errors. Most hardware implementations of traditional LDPC decoders adopt the min-sum (MS) algorithm or one of its modified versions, such as normalized min-sum (NMS) and offset min-sum (OMS). The NMS algorithm is the most straightforward scheme for hardware implementation. However, it is relatively challenging to optimize the normalization factor for the LDPC codes in 5G NR because of the particularly large variation in check node degrees. In general, the OMS algorithm outperforms the MS algorithm, since it is an enhanced version of the original MS algorithm. Nevertheless, this does not always hold for fixed-point decoders for 5G LDPC codes. As stated before, the degree-1 variable nodes in the LDGM part are more sensitive to errors because of the lack of check-to-variable node messages. Furthermore, the offset factor of the OMS algorithm is relatively difficult to optimize in fixed-point LDPC decoders because of the limited number of quantization bits. Therefore, the performance of the OMS algorithm is much lower than that of the MS algorithm for fixed-point 5G LDPC decoders. Based on the considerable variation of the check node degree in 5G NR LDPC codes, a combined min-sum algorithm is introduced in this section. The combined algorithm achieves improved error correction performance by using different decoding algorithms for the core and extension parts of the LDPC code.
The principle of the CMS algorithm is to apply the OMS algorithm to layers with a high check node degree in the core part and the original MS algorithm to the remaining layers with lower check node degrees. Hence, the combined algorithm retains all the characteristics of the MS algorithm while improving its decoding performance.
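The per-layer choice between OMS and MS can be illustrated with the following Python sketch of a single check-node update. It is a simplified floating-point-free model, not the authors' implementation: the names `check_node_update` and `cms_update`, the `degree_threshold` that decides which layers count as high degree, and the `offset` value are all assumptions made for illustration, since the text does not state them.

```python
def check_node_update(msgs, use_offset, offset=1):
    """Min-sum check-node update for one check row.

    msgs: signed variable-to-check messages (integers).
    Returns the signed check-to-variable messages, one per input.
    """
    mags = [abs(m) for m in msgs]
    idx = min(range(len(mags)), key=mags.__getitem__)  # position of min1
    min1 = mags[idx]
    min2 = min(m for i, m in enumerate(mags) if i != idx)
    if use_offset:  # OMS: subtract the offset and clip at zero
        min1 = max(min1 - offset, 0)
        min2 = max(min2 - offset, 0)
    total_sign = 1
    for m in msgs:
        if m < 0:
            total_sign = -total_sign
    out = []
    for i, m in enumerate(msgs):
        mag = min2 if i == idx else min1          # exclude the node's own input
        sign = total_sign * (-1 if m < 0 else 1)  # product of the other signs
        out.append(sign * mag)
    return out


def cms_update(msgs, layer_degree, degree_threshold=10, offset=1):
    """CMS: OMS for high-degree layers, plain MS otherwise (threshold assumed)."""
    use_offset = layer_degree >= degree_threshold
    return check_node_update(msgs, use_offset, offset)


msgs = [3, -2, 5, -1]
print(cms_update(msgs, layer_degree=4))   # low-degree layer: plain MS
print(cms_update(msgs, layer_degree=19))  # high-degree layer: OMS
```

Selecting the variant per layer keeps the datapath identical for both cases; only the offset subtraction is enabled or bypassed.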
The parity-check matrix $H$ is partitioned into $L$ horizontal decoding block layers. Each decoding block layer contains $m_b/L$ consecutive block rows of $H$, such that any variable node is connected at most once to any block layer. We denote $M_l$ as the set of consecutive block rows of $H$ corresponding to block layer $l \in \{1, \ldots, L\}$. For the sake of simplicity, each decoding block layer in this work is assumed to contain only one block row, i.e., $L = m_b$. In addition, $R_{m,n}$ denotes the check-to-variable message conveyed from check node $m$ of the $l$th layer to variable node $n$, and $L_{m,n}$ represents the variable-to-check message from variable node $n$ to check node $m$. In the $i$th iteration, the log-likelihood ratio (LLR) message passed from layer $(l-2)$ to layer $l$ for variable node $n$ is represented by $P_n^i[l-2]$. The CMS decoding algorithm based on pipelined layered scheduling is summarized in Algorithm 1.

Algorithm 1 Combined Min-Sum Algorithm
Input: $y = (y_1, y_2, \ldots, y_N) \in Y^N$ (received word)
Output:
Initialization:
for all $n = 1$ to $N$ do
$P_n = \log \frac{P(x_n = 0 \mid y_n)}{P(x_n = 1 \mid y_n)} =$

Proposed LDPC Decoder Architecture
The proposed low-complexity high-throughput QC-LDPC decoder architecture for 5G NR wireless standards is developed on the basis of the pipelined layered CMS algorithm described in Section 3 for the base matrix BG1, which supports the code rate of R = 1/3.

Overall Decoder Architecture
As reported in the literature review, the 3GPP has introduced two base graphs, BG1 and BG2, for 5G NR LDPC codes. In this section, we focus our description on BG1 with a size of $m_b \times n_b$ ($m_b = 46$ and $n_b = 68$), which is the main 5G NR high-rate base graph. Denote $Z = 56$ as the submatrix size of the intended QC-LDPC code. As mentioned in the previous section, the check node degree $d_c$ in the BG1 base matrix varies widely from 3 to 19. For simplicity, from this section onward $d_c$ denotes the maximum check node degree of 19. The overall decoder architecture is shown in Figure 4. First, the input MUX network selects between the channel intrinsic messages (input LLRs) and the LLR messages from the previous layer. It should be noted that intrinsic messages are used only at the initialization stage. Then, the input LLRs, quantized on $w = 4$ bits, are buffered in the input register memory banks, named PMEM. The specific configuration of PMEM is presented in a later subsection. PMEM has $n_b = 68$ output ports, each of which carries $Z = 56$ LLR values of $w = 4$ bits (i.e., $Z \times w = 224$ bits). Subsequently, these 68 outputs are fed to the switch network, which applies a circular shift to the LLR data read from PMEM in order to align the variable nodes with the appropriate check nodes according to the shift values of the H matrix. A decomposition unit (DCPU) converts the $d_c = 19$ LLR outputs of 224 bits from the switch network into $Z = 56$ LLR combinations of $d_c \times w = 76$ bits. Then, these 56 LLR combinations of 76 bits and the output data fetched from the check-to-variable message memory (RMEM) are passed to $Z = 56$ check node-based processors (CNBPs). After performing check node and variable node processing, each of these $Z = 56$ CNBPs produces 76-bit updated LLR messages $P_n$ and $2(w-1) + \lceil \log_2 d_c \rceil + d_c = 30$-bit extrinsic check-to-variable messages $R_{m,n}$ represented in compressed form.
The CNBP sends the $P_n$ messages to the combination unit (CMBU) for the next decoding layer and the compressed $R_{m,n}$ messages to the check-to-variable node memory RMEM for the next iteration. The CMBU performs the reverse operation of the DCPU: it converts $Z = 56$ updated LLR combinations of 76 bits into $d_c = 19$ LLR outputs of 224 bits. It should be noted that the compressed $R_{m,n}$ messages from RMEM are fed back to the CNBP units through the decompression units (DECOM). The blocks of the decoder architecture are explained in detail in the following subsections.
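The regrouping performed by the DCPU and undone by the CMBU amounts to a transpose of the LLR array. The sketch below models it in Python under the assumption that each word is a plain list of LLR integers rather than a packed bit vector; the function names `decompose` and `combine` are illustrative only.

```python
def decompose(words):
    """DCPU-like regrouping: d_c words of Z LLRs -> Z groups of d_c LLRs."""
    return [list(group) for group in zip(*words)]


def combine(groups):
    """CMBU-like regrouping: Z groups of d_c LLRs -> d_c words of Z LLRs."""
    return [list(word) for word in zip(*groups)]


# Toy sizes for readability: d_c = 2 words, Z = 3 LLRs per word.
words = [[1, 2, 3],
         [4, 5, 6]]
groups = decompose(words)
print(groups)
print(combine(groups) == words)
```

With the decoder parameters d_c = 19, Z = 56, and w = 4, the same regrouping turns 19 words of 224 bits into 56 groups of 76 bits.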

Memory Blocks
In our proposed architecture, two memory blocks are utilized, one for the input LLR values (PMEM) and one for the check-to-variable messages (RMEM). Assume that all input LLRs and exchanged messages are quantized on $w = 4$ bits. The PMEM memory is implemented as an input register that stores the input LLR values (prior values) or the posterior values $P_n$ from the previous layer. The memory is organized in $n_b = 68$ register memory blocks, denoted by $P_i$ ($i = 1, 2, \ldots, n_b$), corresponding to the columns of the base matrix. Each register memory block consists of $Z = 56$ memory locations with a word length of $w = 4$ bits, i.e., $Z \times w = 224$ bits, and the 56 stored LLRs are read from a register memory block in a single clock cycle. Thus, a total of $Z \times w \times n_b$ = 15,232 bits of input memory is implemented in the proposed decoder.
The RMEM memory is implemented as a dual-port RAM, which stores $Z = 56$ compressed check-to-variable messages $R_{m,n}$ for all $L = m_b$ layers. With the proposed CNBP architecture, two minimum values of bit-width $(w-1)$, a minimum index of bit-width $\lceil \log_2 d_c \rceil$, and $d_c$ sign values are stored in RMEM for each decoding layer. Hence, a total of $m_b \times Z \times (2 \times (w-1) + \lceil \log_2 d_c \rceil + d_c)$ = 77,280 bits of RMEM memory is required for all $Z = 56$ CNBPs.
Therefore, the overall memory size used in our LDPC decoder is 92.512 kb.
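The memory figures can be checked with a short calculation. The sketch below assumes that PMEM holds one Z × w-bit word per base-matrix column (n_b = 68 words) and that the index width is the ceiling of log2(d_c); the variable names are illustrative.

```python
from math import ceil, log2

Z, w = 56, 4          # lifting size and quantization width
n_b, m_b = 68, 46     # BG1 columns and rows
d_c = 19              # maximum check node degree

pmem_bits = Z * w * n_b                                  # one word per column
compressed_bits = 2 * (w - 1) + ceil(log2(d_c)) + d_c    # min1, min2, indx, signs
rmem_bits = m_b * Z * compressed_bits                    # one entry per CNBP and layer
total_bits = pmem_bits + rmem_bits

print(pmem_bits, compressed_bits, rmem_bits, total_bits)
```

The total of 92,512 bits corresponds to the 92.512 kb quoted above when 1 kb is taken as 1000 bits.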

Switch Network
A switch network (SN) is a $Z$-input, $Z$-output hardware structure that can deliver the input signals at the output in an arbitrary order. The switch network is an essential module in the implementation of a QC-LDPC decoder. The proposed decoder consists of $d_c = 19$ check node and $d_c = 19$ variable node circular shift networks, each with $Z$ $w$-bit inputs and $Z$ $w$-bit outputs. Barrel shifters implement the cyclic shift permutations according to the shift values provided by the cyclic shifter controllers. In this decoder, the switch networks are implemented with $\lceil \log_2 Z \rceil$-stage $Z \times Z$ barrel shifters. No re-shuffling network is needed in this architecture, as we apply the H-matrix reformulation technique proposed in [33].
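A logarithmic barrel shifter can be modeled in software as a cascade of conditional rotations by powers of two. The following Python sketch is a behavioral model only; the function name `barrel_rotate` and the rotation direction (output position i takes input position (i + shift) mod Z, which matches multiplication by a right-shifted identity circulant) are assumptions made for illustration.

```python
from math import ceil, log2


def barrel_rotate(values, shift):
    """Cyclically rotate `values` using ceil(log2 Z) conditional stages.

    Stage k rotates by 2**k positions if bit k of `shift` is set,
    so that out[i] == values[(i + shift) % Z] after the last stage.
    """
    z = len(values)
    stages = ceil(log2(z)) if z > 1 else 0
    shift %= z
    out = list(values)
    for k in range(stages):
        if (shift >> k) & 1:
            step = 1 << k
            out = out[step:] + out[:step]  # fixed rotation of this stage
    return out


llrs = list(range(56))            # Z = 56 placeholder LLR positions
rotated = barrel_rotate(llrs, 13)
print(ceil(log2(56)), rotated[:4])
```

For Z = 56 the model uses six stages, since 2^5 < 56 ≤ 2^6.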

Variable Node Units
The proposed decoder consists of $Z = 56$ VNUs, which implement the variable node update step shown in Algorithm 1. The detailed architecture of a VNU in the proposed decoder is shown in Figure 5. Each VNU takes $d_c$ input LLR messages of bit-width $w$. It should be noted that the check-to-variable node message is stored in RMEM as a compressed message of $2(w-1) + \lceil \log_2 d_c \rceil + d_c = 30$ bits. For a check node $m$, the corresponding $R_{m,n}$ message in decompressed format is given by the $d_c$ values $[R_{m,n_1}, R_{m,n_2}, \ldots, R_{m,n_{d_c}}]$, where $n_1, n_2, \ldots, n_{d_c}$ denote the variable nodes connected to $m$. In compressed format, it is given by the signs of these $d_c$ messages $R_{m,n_i}$, denoted by sign, their first and second minima, denoted by min1 and min2, and the index of the first minimum, denoted by indx. Figure 6 shows an $R_{m,n_i}$ message in decompressed and compressed formats. Hence, the decompression unit DECOM converts this compressed version of the LLRs into $d_c = 19$ decompressed extrinsic LLRs by using $d_c$ equalizers, MUXs, and sign-magnitude to two's complement conversion units (SMTCs), as shown in Figure 5.
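The decompression step can be sketched in Python as follows. This is a behavioral model under assumptions: sign bits are represented as 0 (positive) or 1 (negative), the function name `decompress` is illustrative, and the comments map each operation only loosely onto the equalizer, MUX, and SMTC blocks named above.

```python
def decompress(signs, min1, min2, indx):
    """Rebuild d_c signed check-to-variable messages from the compressed form.

    signs: list of d_c sign bits (1 means negative).
    min1, min2: first and second minimum magnitudes.
    indx: position of the first minimum.
    """
    out = []
    for i, s in enumerate(signs):
        mag = min2 if i == indx else min1   # index comparison selects the magnitude
        out.append(-mag if s else mag)      # sign-magnitude to signed integer
    return out


print(decompress(signs=[0, 1, 0, 1], min1=1, min2=2, indx=3))
```

Only the position that supplied the first minimum receives min2; every other position receives min1, which is why two magnitudes, one index, and the sign bits suffice.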

Check Node Units
The proposed decoder consists of $Z = 56$ CNUs, which implement the check node update step shown in Algorithm 1. A minimum-finder (19-FMVG) and a sign processor unit (SPU) are used in this step. In this section, we focus on the computation of the first two minima, min1 and min2, and the index of the first minimum, indx, since the signs of the output messages can be calculated simply by XORing the appropriate signs of the input messages. The structure of the minimum-finder is based on the tree structure (TS) architecture proposed by Wey [34]. However, it is modified in our proposed architecture so as to compute only the value and the index of the first minimum in the first clock cycle. The second minimum value is determined in the second clock cycle by reusing the same hardware.
The detailed architecture of a low-complexity CNU is shown in Figure 7. As can be observed, the proposed CNU architecture consists of $d_c = 19$ sign-magnitude conversion (SMC) units, a 19-input first minimum value generator (19-FMVG) unit, a sign processor unit, and a compare and select (CS) unit for generating control signals. Initially, each CNU takes $d_c$ inputs $L_{m,n}^i[l]$ of bit-width $w = 4$ from the VNU. The 4-bit LLRs in two's complement format are converted into sign-magnitude format by the SMC units. The sel control signal is set to either 0 or 1 to indicate the operation mode of the CNU. When sel is set to 0, i.e., the first clock cycle, the CNU generates the first minimum min1 and the index of the first minimum indx. When sel is set to 1, i.e., the second clock cycle, the CNU is re-executed in order to calculate the second minimum value min2. In the second clock cycle, the maximum value representable at bit-width $w$ is substituted for the input at index indx, rather than passing all input values directly to the 19-FMVG unit. Moreover, depending on the clock cycle being processed, the en control signal is set to pass the corresponding minimum value to the output.
The architecture of the FMVG for $d_c = 19$ inputs is also presented in Figure 7. Since $d_c = 19$ can be decomposed into the sum of 16 and 3, the $d_c$-FMVG block is realized by combining the corresponding blocks as described in [34]. The index generator block in Figure 7 creates the index of the first minimum value. The architecture of the 3-FMVG is shown in Figure 8a. Moreover, the $2^k$-FMVG block, which computes the value and the index of the first minimum among $2^k$ input messages of bit-width $w$, can be constructed by cascading multiple 2-FMVGs, as shown in Figure 8c. The 2-FMVG unit, detailed in Figure 8b, performs the comparisons discussed earlier. The comparison signal and output value of the 2-FMVG are defined as follows: For the 19-FMVG block, an input word contains $w - 1$ bits, so the comparator and MUX also require $w - 1$ bits. In this section, we denote by MUX a 1-bit 2-to-1 multiplexer and by MUX$_{w-1}$ a $(w-1)$-bit 2-to-1 multiplexer. Table 4 summarizes the number of components required for the proposed minimum-finder based on the FMVG and for the conventional method, which determines both the min1 and min2 values and is denoted as the two minimum values generator (TMVG). The results show that the proposed FMVG approach requires far fewer comparators and MUXs than the conventional TMVG method.
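The two-pass use of a single first-minimum finder can be modeled with the Python sketch below. It uses a generic pairwise tournament rather than the exact 16 + 3 decomposition of Figure 7, and it assumes that the substituted maximum is the largest (w − 1)-bit magnitude; the function names `first_min` and `two_minima` are illustrative.

```python
def first_min(values):
    """Tournament (tree) search returning (minimum value, index).

    Each round applies 2-input compare-and-select steps; ties keep
    the lower index.
    """
    nodes = [(v, i) for i, v in enumerate(values)]
    while len(nodes) > 1:
        nxt = []
        for k in range(0, len(nodes) - 1, 2):
            a, b = nodes[k], nodes[k + 1]
            nxt.append(a if a[0] <= b[0] else b)
        if len(nodes) % 2:          # odd element passes through unchanged
            nxt.append(nodes[-1])
        nodes = nxt
    return nodes[0]


def two_minima(magnitudes, w=4):
    """Find min1, min2, and indx by running the same finder twice."""
    max_mag = (1 << (w - 1)) - 1                # largest (w-1)-bit magnitude
    min1, indx = first_min(magnitudes)          # first clock cycle (sel = 0)
    masked = list(magnitudes)
    masked[indx] = max_mag                      # replace the winner by the maximum
    min2, _ = first_min(masked)                 # second clock cycle (sel = 1)
    return min1, min2, indx


print(two_minima([3, 2, 5, 1, 7, 2]))
```

Because the winning input is replaced by the maximum magnitude in the second pass, the same comparator tree returns the second minimum without any dedicated second-minimum hardware.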

A Posteriori Information Update Units
The proposed decoder consists of $Z = 56$ APUs, which implement the a posteriori information update step shown in Algorithm 1. The detailed architecture of a low-complexity APU with only one adder array is shown in Figure 9. Each APU takes $d_c$ inputs $P_n^i[l-2]$ of bit-width $w$ and two values from the check-to-variable node memory RMEM, i.e., $R_{m,n}^{i-1}[l-1]$ and $R_{m,n}^{i}[l-1]$. Similarly, the sel control signal in the APU is set to either 0 or 1 to indicate the operation mode. When sel is set to 0, i.e., the first clock cycle, the APU calculates the sum of $P_n^i[l-2]$ and $-R_{m,n}^{i-1}[l-1]$. When sel is set to 1, i.e., the second clock cycle, the APU is re-executed in order to calculate $P_n^i[l-1]$ by adding $R_{m,n}^{i}[l-1]$ to the output of the previous clock cycle. At the input, two MUXs select the input data according to the clock cycle being processed. In addition, a deMUX at the output indicates the valid output value of the APU. Moreover, depending on the clock cycle being processed, the en control signal is set to pass the computation result from the adder array either to the output or back to the input multiplexer. All outputs from the APUs are passed to the input multiplexers to start the processing of the next layer, as shown in Figure 4. As in the VNU architecture, the proposed APU uses two decompression units (DECOM) to convert the compressed LLRs taken from RMEM into $d_c = 19$ decompressed extrinsic LLRs.
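The reuse of one adder array over two clock cycles can be modeled with a small Python sketch. This is a behavioral illustration only: the function name `apu_update` is hypothetical, messages are plain integers, and no fixed-point saturation is modeled because the text does not describe it.

```python
def apu_update(p_in, r_old, r_new):
    """Two passes through one adder array.

    Cycle 0 (sel = 0): acc = P[l-2] - R_old
    Cycle 1 (sel = 1): acc = acc + R_new
    """
    acc = list(p_in)
    operands = ([-r for r in r_old], r_new)     # inputs chosen by the input MUXs
    for sel in (0, 1):
        # The same element-wise adder array is used in both cycles.
        acc = [a + b for a, b in zip(acc, operands[sel])]
    return acc


print(apu_update(p_in=[3, -2, 5], r_old=[1, -1, 2], r_new=[2, 0, -1]))
```

Spending two cycles on one adder array trades latency per layer for roughly half the adder hardware compared with separate subtract and add stages.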

Controller Block
This block generates control signals, such as data_sel to indicate the step being processed; cnt_layer_cnu and cnt_layer_vnu to indicate the layer being processed by CNUs, VNUs, and APUs; and mem_en, to enable write access to the check-to-variable node memory RMEM.

Implementation Results and Comparisons
The proposed low-complexity high-throughput pipelined layered QC-LDPC decoder architecture for the 5G NR wireless standard was simulated using BPSK modulation over an AWGN channel. Figure 10 illustrates the frame error rate (FER) performance of the proposed QC-LDPC decoder for the N = 3808, R = 1/3 code based on the BG1 base matrix of the 5G NR wireless standard. The results indicate that the proposed QC-LDPC decoder architecture with bit-width $w = 4$ and ten decoding iterations delivers an FER of $10^{-5}$ at 2.75 dB. It can be observed that CMS decoding provides better error correction performance than the original MS and OMS algorithms. The major advantage of the proposed design is the reduction in decoder complexity. It should be noted that more clock cycles are required to finish a decoding iteration than in the conventional design. By applying the hardware reuse approach, the critical path is shortened and, therefore, the operating frequency is increased. Hence, the throughput would remain the same in the case of an ideal clock frequency. The reported throughput is given by Equation (6), where $f_{\max}$ denotes the maximum operating frequency and $I_{\max}$ is the maximum number of iterations to decode one codeword. Based on the pipeline chart in Figure 11, the number of clock cycles per decoding iteration is $2 \times L$, since two clock cycles are required to decode each layer. The proposed decoder consumes a total of $(2 \times L \times I_{\max} + 2)$ clock cycles. To confirm the efficiency of our solution, we evaluated the proposed decoder for the expansion size Z = 56 and code rate R = 1/3 using the 5G base matrix BG1 LDPC code. The proposed LDPC decoder architecture was modeled in the Verilog hardware description language (HDL) and simulated to verify its functionality using a test pattern generated from a C simulator. After successful verification of the design functions, it was synthesized using appropriate timing and area constraints.
Both simulation and synthesis steps were executed using Synopsys design tools and TSMC 65-nm CMOS standard cell technology. The post-synthesis results are reported in Table 5. The proposed low-complexity layered LDPC decoder architecture occupies an area of 1.49 mm² and achieves a throughput of 3.04 Gbps at 750 MHz. The power and energy dissipations are 259 mW and 85.20 pJ/bit, respectively.
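The derived figures can be cross-checked from the reported numbers. The sketch below only recomputes the cycle count, the energy per bit, and the raw throughput-to-area ratio from the values quoted in the text; it does not reproduce Equation (6), and the variable names are illustrative.

```python
L = 46            # number of layers (L = m_b for BG1)
I_max = 10        # decoding iterations
cycles = 2 * L * I_max + 2            # total clock cycles per codeword

throughput_gbps = 3.04                # reported throughput
area_mm2 = 1.49                       # reported area
power_mw = 259                        # reported power

# Energy per bit = power / throughput, converted to pJ/bit.
energy_pj_per_bit = (power_mw * 1e-3) / (throughput_gbps * 1e9) * 1e12
# Throughput-to-area ratio in Gbps/mm^2.
tar = throughput_gbps / area_mm2

print(cycles, round(energy_pj_per_bit, 2), round(tar, 2))
```

Both derived values agree with the reported 85.20 pJ/bit and with a throughput-to-area ratio of about 2.04 Gbps/mm² at the 65-nm node.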
In addition, Table 5 compares the implementation and performance of the proposed low-complexity high-throughput decoder with various other LDPC decoders. The comparison confirms that the proposed design reduces the decoder complexity. Since these designs have different implementation parameters, including code length, CMOS technology, and number of quantization bits, performance normalization is necessary for a fair comparison. The normalization method in [35] is adopted. To compare hardware performance on an equal basis with respect to technology, area, and the number of iterations, designs are usually evaluated by the throughput-to-area ratio (TAR) metric, defined as TAR = Throughput/Area (Gbps/mm²). The proposed architecture is the most area-efficient among the reported decoders, yielding a normalized throughput-to-area ratio (NTAR) of 2.04 Gbps/mm². Specifically, the NTAR of our work is 10.3% better than that of the next most efficient decoder [36] and 38.7% better than the rest of the reported decoders. The decoder in [37], which offers the highest normalized throughput at 7.31 Gbps, occupies a very large area. Hence, its NTAR is significantly lower than that of our proposed design; specifically, the NTAR of our work is about 46.08% better than that of [37]. A high operating frequency usually comes at the cost of increased power consumption. Nevertheless, as shown in Table 5, the proposed decoder achieves very good energy efficiency, very close to that of the work presented in [38]. Moreover, the energy efficiency of our proposed LDPC decoder is 70.13% better than that of the next most area-efficient decoder in [36].
Based on the implementation results presented above, the design method offers significantly low complexity, high area efficiency, and high throughput, which is sufficient for the 5G NR wireless communication standard. However, it remains very challenging to design a decoder with strong error correction performance for 5G LDPC codes, since the offset factor is relatively difficult to optimize in quantized LDPC decoders. To improve the error correction performance while retaining cost efficiency and high throughput, future research should investigate adjusting the offset factor of the OMS decoder in the pure LDPC part.

Conclusions
In this paper, enhanced versions of CNBP architectures are proposed to reduce the complexity of LDPC decoders. First, an efficient minimum-finder for the CNU architecture, which reduces the hardware required for the computation of the first two minima, and a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, are introduced. Then, an area-efficient pipelined layered QC-LDPC decoder architecture for 5G NR communication systems is described in detail. Simulation results show that the proposed architecture achieves low complexity and high throughput compared to other QC-LDPC architectures available in the literature. Therefore, the proposed QC-LDPC decoder can be applied in 5G NR wireless communication standard applications.