Efficient QC-LDPC Encoder for 5G New Radio

Abstract: This paper presents a novel efficient encoding method and a high-throughput low-complexity encoder architecture for quasi-cyclic low-density parity-check (QC-LDPC) codes for the 5th-generation (5G) New Radio (NR) standard. By storing the quantized value of the permutation information for each submatrix instead of the whole parity check matrix, the required memory storage size is considerably reduced. In addition, sharing techniques are employed to reduce the hardware complexity. An analysis of the encoding complexity of the proposed method indicated a substantial reduction in the required area as well as memory storage when compared with existing state-of-the-art encoding approaches. The proposed method requires only 61% of the gate area and 11% of the ROM storage of a similar LDPC encoder using the Richardson–Urbanke method. Synthesis was carried out on TSMC 65-nm complementary metal-oxide semiconductor (CMOS) technology with different submatrix sizes, and the results confirmed that the design methodology is flexible and can be adapted for multiple submatrix sizes. For all the considered submatrix sizes, the throughput ranged from 22.1 to 202.4 Gbps, which meets the throughput requirement of the 5G NR standard.


Introduction
Low-density parity-check (LDPC) codes [1], which were first proposed by Gallager in the early 1960s and rediscovered by MacKay and Neal [2] in 1996, have attracted widespread attention thanks to their remarkable error correction capabilities near the Shannon limit and to advancements in very large-scale integration (VLSI). Moreover, LDPC codes are among the most widely used types of forward error correction (FEC) codes in several communications standards such as the wireless local area network (WLAN, IEEE 802.11n), wireless radio access network (WRAN, IEEE 802.22), digital video broadcast (DVB), and the Advanced Television System Committee (ATSC). Recently, fifth generation (5G) communication has been a hotspot of research and development [3]. More specifically, LDPC codes play an important role in 5G communication and have been selected as the coding scheme for the 5G enhanced Mobile Broadband (eMBB) data channel [4]. To support rate-compatible and scalable data transmission, the 3rd Generation Partnership Project (3GPP) has agreed to consider two rate-compatible base graphs, BG1 and BG2, for the channel coding [5]. Accordingly, several studies have been conducted on the 5G LDPC codes. In [6], a low-cost and flexible demonstration platform is designed and implemented to evaluate the real-time performance of LDPC over the air interface as defined by 5G New Radio (NR) specifications. An algebra-assisted method for constructing 5G LDPC codes is presented in [7].
Over recent years, research on LDPC codes has focused on structured LDPC codes known as quasi-cyclic low-density parity-check (QC-LDPC) codes [8–12], which exhibit advantages over other types of LDPC codes with respect to the hardware implementation of encoding and decoding using simple shift registers and logic circuits. A low-complexity encoder can be realized by using QC-LDPC codes, due to the sparseness of the parity check matrix. However, it is not straightforward to encode with low complexity, as LDPC codes are defined by their parity check matrix and the generator matrix is generally unknown. Various approaches have been suggested to reduce the hardware complexity of LDPC encoders [13–21]. One of the most conventional approaches is systematic encoding, in which the generator matrix is derived from the parity check matrix by Gaussian elimination. The main drawback of this method is that the storage overhead increases dramatically for large block sizes, which limits its practical applicability. The Richardson-Urbanke (RU) algorithm is a widely used LDPC encoding scheme developed by Richardson and Urbanke [13]. The underlying principle of the method is the transformation of the parity check matrix into an approximate lower triangular (ALT) form by using only row and column permutations, which preserves the sparseness of the matrix. This method suffers from a long critical path, which could make the LDPC encoder unsuitable for high-throughput applications. To overcome the limitations of the previous approaches, the design proposed in this paper, a low-complexity high-throughput LDPC encoder architecture for the 5G standard, requires significantly less area and memory storage while maintaining a high throughput.
This paper targets the design of low-complexity high-throughput QC-LDPC encoders for the 5G NR standard. In LDPC encoders, the memory and interconnecting blocks are the major factors influencing the overall area, delay, and power performance of the hardware design. Hence, the size of the read-only memory (ROM) was decreased by storing the quantized value of the permutation information for each submatrix instead of the entire parity check matrix H. The proposed architecture requires fewer matrix multiplications than the RU method, by exploiting the characteristics of the 5G NR base matrix. In addition, the proposed algorithm does not require the inverse of the component matrix, which is a primary advantage over the RU method. Moreover, block memories are not required to store the generator matrix G, and the number of required components is reduced. The ROM size of the proposed method is 98.2% and 88.9% lower than those of the G matrix method and RU method, respectively.
To assess the benefits of the proposed encoding approach, we further implement and synthesize several QC-LDPC encoder architectures with different submatrix sizes Z = 30, 64, 96, 144, and 352. The application-specific integrated circuit (ASIC) post-synthesis implementation results on TSMC 65-nm complementary metal-oxide semiconductor (CMOS) technology revealed an area efficiency of up to 597 Gbps/mm² when the proposed encoding method was implemented. Hence, it can be concluded that a promising encoding architecture design for 5G NR LDPC codes was developed in this study.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of the characteristics of 5G NR QC-LDPC codes. In Section 3, two conventional LDPC encoding algorithms from the literature are outlined. A novel 5G NR QC-LDPC encoding approach and a low-complexity high-throughput QC-LDPC encoder architecture are described in Section 4. Section 5 presents the implementation and comparison results, followed by the conclusions in Section 6.

5G NR QC-LDPC Codes
The NR access technology marks a transition in FEC coding for the 3GPP of cellular technologies [22]. In this section, the QC-LDPC codes are reviewed, and the characteristics of standard 5G QC-LDPC codes are summarized. In addition, procedures are presented for the construction of the parity check matrix of the target LDPC codes.

Preliminary
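Let $Z$ be the size of a circulant permutation matrix and $P_{i,j}$ be the shift value. For any integer value $P_{i,j}$, $0 \le P_{i,j} \le Z$, a $Z \times Z$ circulant permutation matrix shifts the $Z \times Z$ identity matrix $I$ to the right by $P_{i,j}$ positions for the $(i, j)$-th non-zero element in a base matrix. This binary circulant permutation matrix is denoted as $Q(P_{i,j})$. Considering $Q(1)$ as an example,

$$Q(1) = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{bmatrix}. \tag{1}$$

For simple notation, $Q(-1)$ denotes the null matrix (all elements equal to zero) of the same size.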
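The definition of the circulant permutation matrix can be illustrated with a minimal NumPy sketch. The helper name `circulant` and the toy size Z = 5 are assumptions made for illustration only. Note that, with this definition, multiplying Q(p) by a column block is a cyclic left shift of that block, while multiplying a row block by Q(p) is a cyclic right shift, so the shift direction quoted in any implementation depends on the vector convention.

```python
import numpy as np

def circulant(Z, p):
    """Return Q(p): the Z x Z identity shifted right by p columns; Q(-1) is the null matrix."""
    if p == -1:
        return np.zeros((Z, Z), dtype=np.uint8)
    return np.roll(np.eye(Z, dtype=np.uint8), p, axis=1)

Z = 5                      # toy submatrix size (assumption)
Q1 = circulant(Z, 1)       # first row is [0 1 0 0 0], last row is [1 0 0 0 0]
v = np.array([1, 0, 0, 1, 0], dtype=np.uint8)

col_product = Q1 @ v       # column-block convention: cyclic left shift by 1
row_product = v @ Q1       # row-block convention: cyclic right shift by 1
```

Because the product reduces to a cyclic shift, a hardware encoder never needs to store or multiply the Z × Z matrix itself; a Z-bit shifter and the shift value are sufficient.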

Introduction to QC-LDPC Codes
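A binary QC-LDPC code can be characterized by the null space of an array of sparse circulants of the same size [7,23,24]. For implementation purposes, the parity check matrix $H$ of a QC-LDPC code can be defined by its base graph and shift coefficients $P_{i,j}$. The 1s and 0s in the base graph are replaced by a circulant permutation matrix and a zero matrix of size $Z \times Z$, respectively. For two positive integers $m_b$ and $n_b$, with $m_b \le n_b$, consider the QC-LDPC code expressed by the following $m_b \times n_b$ array of $Z \times Z$ circulants over GF(2):

$$H = \begin{bmatrix} Q(P_{1,1}) & Q(P_{1,2}) & \cdots & Q(P_{1,n_b}) \\ Q(P_{2,1}) & Q(P_{2,2}) & \cdots & Q(P_{2,n_b}) \\ \vdots & \vdots & \ddots & \vdots \\ Q(P_{m_b,1}) & Q(P_{m_b,2}) & \cdots & Q(P_{m_b,n_b}) \end{bmatrix}. \tag{2}$$

The exponent matrix of $H$, which is $E(H)$, has the following form:

$$E(H) = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,n_b} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,n_b} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m_b,1} & P_{m_b,2} & \cdots & P_{m_b,n_b} \end{bmatrix}. \tag{3}$$

Each entry in the matrix $E$ is referred to as a shift value. It should be noted that the parity check matrix $H$ in Equation (2) can be constructed by expanding the $m_b \times n_b$ exponent matrix $E(H)$. This procedure is referred to as protograph construction [25].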
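The expansion of an exponent matrix into a parity check matrix can be sketched as follows. The function names `circulant` and `expand` and the 2 × 3 toy exponent matrix are hypothetical and are not taken from the standard; they only illustrate how each shift value becomes one Z × Z block.

```python
import numpy as np

def circulant(Z, p):
    """Q(p): identity shifted right by p columns; p = -1 gives the Z x Z zero block."""
    if p == -1:
        return np.zeros((Z, Z), dtype=np.uint8)
    return np.roll(np.eye(Z, dtype=np.uint8), p, axis=1)

def expand(E, Z):
    """Replace every shift value in the exponent matrix E by its Z x Z block."""
    return np.block([[circulant(Z, p) for p in row] for row in E])

E_toy = [[0, 1, -1],
         [2, -1, 0]]          # toy exponent matrix (assumption), -1 marks a zero block
H_toy = expand(E_toy, Z=4)    # binary matrix of size (2*4) x (3*4)
```

Only the m_b × n_b shift values need to be stored to regenerate H, which is the property the proposed encoder exploits to reduce ROM size.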

5G NR QC-LDPC Characteristics
As mentioned above, QC-LDPC codes play an important role in 5G communications and have been accepted as the channel coding scheme for the 5G eMBB data channel in the 3GPP standardization meetings. Figure 1 illustrates the general structure of the NR QC-LDPC base graph. The columns are divided into three parts: information columns, core parity columns, and extension parity columns. The rows are partitioned into two parts: core check rows and extension check rows. As shown in the figure, the base matrix is composed of five submatrices, namely, A, B, O, C, and I [22]. Submatrix A corresponds to the systematic bits. Submatrix B corresponds to the first set of parity bits and is a square matrix with a dual-diagonal structure: its first column is of weight 3, whereas the submatrix composed of the columns after the first column has an upper dual-diagonal structure. Submatrix O is an all-zero matrix. For the efficient support of incremental redundancy hybrid automatic repeat request (IR-HARQ), a single parity-check (SPC) based extension is used to support lower rates, as shown in Figure 1. Submatrix C corresponds to the SPC rows, and I is an identity matrix that corresponds to the second set of parity bits, i.e., the SPC extension. The combination of A and B is referred to as the kernel, and the other parts (O, C, and I) are referred to as extensions. This code structure is similar to the Raptor-like extension described in [26].
The 3GPP agreed to consider two rate-compatible base graphs, denoted by BG1 and BG2, for the channel coding. Base graphs BG1 and BG2 have similar structures. However, BG1 is targeted at larger block lengths ($500 \le K \le 8448$) and higher rates ($1/3 \le R \le 8/9$), whereas BG2 is targeted at smaller block lengths ($40 \le K \le 2560$) and lower rates ($1/5 \le R \le 2/3$). The actual base graph usage and the definition of the two matrices are detailed in the NR standard specification TS 38.212 [27]. The base graph that supports $K_{max}$ should support the following set of shift sizes $Z$, where $Z = a \times 2^j$ for $a \in \{2, 3, 5, 7, 9, 11, 13, 15\}$ and $0 \le j \le 7$. For base graphs BG1 and BG2, the number of shift coefficient designs is 8. All lifting sizes are divided into eight sets based on the parameter $a$ in the definition of the lifting size $a \times 2^j$. The sets of shift coefficients are listed in Table 1.
Table 1. Relationship between exponent matrices and sets of lifting size (columns: Exponent Matrix, Lifting Size Set).

The shift value $P_{i,j}$ can be calculated using the function $P_{i,j} = f(V_{i,j}, Z)$, where $V_{i,j}$ is the shift coefficient of the $(i, j)$-th element in the corresponding shift design. The function $f$ is defined in Equation (4), in which mod denotes modulo arithmetic:

$$P_{i,j} = f(V_{i,j}, Z) = \begin{cases} -1, & V_{i,j} = -1, \\ \operatorname{mod}(V_{i,j}, Z), & \text{otherwise}. \end{cases} \tag{4}$$

The following steps construct the parity check matrix of the target $(N, K)$ QC-LDPC code with a given information block size $K$ and code rate $R = K/N$. For a base graph, $k_b$ denotes the number of information circulant columns; thus, if the lifting size is $Z$, $K = Z \times k_b$ nominally.
Step 1: Obtain the base graph BG1 or BG2 and determine the value of $k_b$ for the given $K$ and $R$.
Step 2: Determine $Z$ by selecting the minimum $Z$ value in Table 2 such that $k_b \times Z \ge K$.
Step 3: After the lifting size $Z$ is determined, select the corresponding shift coefficient matrix from Table 1 {Set 1, Set 2, ..., Set 8} according to the set that contains $Z$.
Step 4: Calculate the shift value $P_{i,j}$ by the modulo-$Z$ operation, as given in Equation (4).
Step 5: Replace each entry in the final exponent matrix with the corresponding circulant permutation matrix or zero matrix of size $Z \times Z$. The QC-LDPC code construction is then complete, and a parity check matrix $H$ of size $m_b Z \times n_b Z$ is obtained. In 5G QC-LDPC codes, shortening and puncturing are carried out to obtain the desired information lengths and rate adaptation. Figure 2 illustrates the encoding process of these codes.
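Steps 2 to 4 above can be sketched in a few lines of Python. Two points are assumptions rather than statements from this text: the cap `Z_MAX = 384` on the lifting size is taken from TS 38.212 (Table 2 is not reproduced here), and the numbering of Set 1 to Set 8 in increasing order of a is chosen for illustration. All function names are hypothetical.

```python
A_VALUES = (2, 3, 5, 7, 9, 11, 13, 15)   # parameter a in Z = a * 2**j
Z_MAX = 384                               # assumption: largest lifting size in TS 38.212

def lifting_sizes():
    """All lifting sizes Z = a * 2**j with 0 <= j <= 7 that do not exceed Z_MAX."""
    return sorted({a * 2**j for a in A_VALUES for j in range(8) if a * 2**j <= Z_MAX})

def select_lifting_size(K, k_b):
    """Step 2: the minimum Z such that k_b * Z >= K."""
    return min(z for z in lifting_sizes() if k_b * z >= K)

def lifting_set(Z):
    """Step 3: 1-based index of the set containing Z (ordering by a is an assumption)."""
    odd = Z
    while odd % 2 == 0:
        odd //= 2
    a = 2 if odd == 1 else odd            # pure powers of two belong to a = 2
    return A_VALUES.index(a) + 1

def shift_value(V, Z):
    """Step 4, Equation (4): -1 stays -1, otherwise reduce V modulo Z."""
    return -1 if V == -1 else V % Z

Z = select_lifting_size(K=1000, k_b=22)   # smallest supported Z with 22 * Z >= 1000
P = shift_value(250, Z)                    # hypothetical shift coefficient V = 250
```

For K = 1000 and k_b = 22, the candidates 44 and 48 bracket the bound K / k_b ≈ 45.5, so the selection returns Z = 48, which belongs to the a = 3 set.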

LDPC Encoding Algorithms
Given a parity check matrix $H$, the objective of LDPC encoding is to solve the parity equations:

$$H C^T = 0^T, \tag{5}$$

where $C = [S \; P]$ is the systematic codeword, which consists of the information bit vector $S$ and the parity bit vector $P$. This section reviews two generic encoding methodologies for the implementation of the LDPC encoder: the Gaussian elimination method and the RU method.

LDPC Encoding with Gaussian Elimination
Gaussian elimination is the most conventional method of encoding LDPC codes. Encoding is carried out by multiplication with the generator matrix $G$ and has a complexity that is quadratic in the block length [19]. The unknown generator matrix $G$ can be derived from the parity check matrix $H$. A generator matrix for a code with parity check matrix $H$ can be obtained by carrying out Gauss-Jordan elimination on $H$ to bring it into the following form:

$$H = [A \;\; I_{N-K}], \tag{6}$$

where $A$ is an $(N - K) \times K$ binary matrix and $I_{N-K}$ is the identity matrix of order $(N - K)$.
The generator matrix is as follows:

$$G = [I_K \;\; A^T]. \tag{7}$$

The codeword $C$ is then obtained by multiplying the generator matrix $G$ by the systematic bits $S$ as follows:

$$C = S G = [S \;\; S A^T]. \tag{8}$$

The sequential LDPC encoder based on multiplication by the $G$ matrix requires a ROM to store the generator matrix used to compute the codeword $C$. The main drawback of this approach is that, unlike the parity check matrix $H$, the corresponding generator matrix $G$ will most likely not be sparse. The complexity of this straightforward encoding algorithm is $O(N^2)$, where $N$ is the number of bits in a codeword. Therefore, the implementation of the matrix multiplication at the encoder results in a very high complexity. For an arbitrary parity check matrix, the construction of $G$ should be avoided and encoding should be carried out using back substitution with $H$.
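The relation between the systematic forms of H and G can be checked on a toy code. The sketch below assumes H has already been brought into the form [A | I] (the Gauss-Jordan step itself is not shown), and the 3 × 4 matrix `A` is an arbitrary example rather than anything from the 5G base graphs.

```python
import numpy as np
from itertools import product

def generator_from_systematic(H, K):
    """Given H = [A | I_{N-K}] over GF(2), return G = [I_K | A^T]."""
    M = H.shape[0]
    # Sanity check that H really is in systematic form.
    assert np.array_equal(H[:, K:], np.eye(M, dtype=H.dtype))
    A = H[:, :K]
    return np.hstack([np.eye(K, dtype=H.dtype), A.T])

A = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=np.uint8)      # toy (N-K) x K matrix (assumption)
H = np.hstack([A, np.eye(3, dtype=np.uint8)])
G = generator_from_systematic(H, K=4)

# Encode every 4-bit message as C = S G (mod 2) and collect the syndromes H C^T.
messages = [np.array(m, dtype=np.uint8) for m in product([0, 1], repeat=4)]
codewords = [(S @ G) % 2 for S in messages]
syndromes = [(H @ C) % 2 for C in codewords]
```

Every syndrome is zero because H G^T = A + A = 0 over GF(2). The parity part A^T of G is generally dense, which is why storing G becomes expensive for long codes.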

LDPC Encoding with the RU Method
Instead of determining a generator matrix for H, an LDPC code can be directly encoded using the parity check matrix by transforming it into a lower triangular form and applying back substitution. The RU encoding method, which was proposed by Richardson and Urbanke [13], is a linear-time encoding method for sparse parity check matrices. The underlying principle is to transform the parity check matrix H using only row and column permutations, so that the transformed matrix remains sparse. Therefore, this approach has a lower complexity than the G matrix multiplication method. The RU algorithm consists of two steps: a pre-processing step and the actual encoding step.
First, in the pre-processing step, the parity check matrix $H$ is converted into the approximate lower triangular (ALT) form, as shown in Figure 3. The parity check matrix $H$ is an $M \times N$ matrix, where $N$ is the block length of the code and $M$ is the number of parity check equations. Given that the matrix transformation is realized solely by row and column permutations, the $H$ matrix remains sparse:

$$H = \begin{bmatrix} A & B & T \\ C & D & E \end{bmatrix}. \tag{9}$$

Here, the matrix $T$ has a lower triangular form with 1s along the diagonal, and all the entries above the diagonal are 0s. By multiplying $H$ from the left by

$$\begin{bmatrix} I & 0 \\ -ET^{-1} & I \end{bmatrix}, \tag{10}$$

the following is obtained:

$$\begin{bmatrix} A & B & T \\ \tilde{C} & \tilde{D} & 0 \end{bmatrix}, \tag{11}$$

where $\tilde{C} = -ET^{-1}A + C$ and $\tilde{D} = -ET^{-1}B + D$.

The actual encoding step is performed by matrix multiplication, forward substitution, and vector addition operations. Let the codeword $C = [s \; p_1 \; p_2]$, where $s$ represents the information bits, $p_1$ denotes the first $G$ parity check bits, and $p_2$ contains the remaining $(M - G)$ parity check bits. The codeword $C$ must satisfy the parity check equation $HC^T = 0^T$. The two equations are then expressed by:

$$A s^T + B p_1^T + T p_2^T = 0, \tag{12}$$

$$\tilde{C} s^T + \tilde{D} p_1^T = 0. \tag{13}$$

Figure 3. The parity check matrix H in approximate lower triangular form.
Using the RU method, the calculation of the parity bits in the first parity portion $p_1$ depends only on the information bits, given that $E$ was cleared. Hence, it can be calculated independently of the parity bits in $p_2$. If $\tilde{D} = -ET^{-1}B + D$ is nonsingular, then $p_1^T$ can be obtained from Equation (13):

$$p_1^T = -\tilde{D}^{-1}\left(-ET^{-1}A + C\right)s^T. \tag{14}$$

If $\tilde{D}$ is singular in GF(2), then it is necessary to further permute the columns of $H$ to eliminate this singularity. Once $p_1$ is known, $p_2$ can be determined using Equation (12):

$$p_2^T = -T^{-1}\left(As^T + Bp_1^T\right). \tag{15}$$

Given that $T$ is lower triangular, $p_2$ can be found using back substitution. The complexity of this encoding procedure can be kept low since $A$, $B$, and $T$ are sparse. Tables 3 and 4 present the complexity of the calculation of $p_1^T$ and $p_2^T$, respectively. The complexity of the RU algorithm is $O(N + G^2)$, where $N$ is the block length and $G$ is the gap to linear encoding. The gap is the number of rows of the parity check matrix that cannot be brought into triangular form using only row and column permutations. The smaller the gap $G$, the lower the encoding complexity of the code.
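The two-stage RU computation can be traced on a toy matrix that is already in ALT form. Everything in the sketch below is an assumption made for illustration: the six submatrices are hand-picked, the gap is fixed to 1 so that the inverse in the first stage reduces to a scalar check, and `forward_substitute` is a hypothetical helper. Over GF(2) the minus signs in the formulas vanish.

```python
import numpy as np
from itertools import product

def forward_substitute(T, b):
    """Solve T x = b over GF(2) for a lower-triangular T with 1s on the diagonal."""
    x = np.zeros_like(b)
    for i in range(len(b)):
        x[i] = (b[i] + T[i, :i] @ x[:i]) % 2
    return x

# Toy ALT partition H = [[A, B, T], [C, D, E]] with gap 1 (hand-picked, assumption).
A = np.array([[1, 0, 1], [0, 1, 1]])
B = np.array([[1], [0]])
T = np.array([[1, 0], [1, 1]])
C = np.array([[1, 1, 0]])
D = np.array([[0]])
E = np.array([[0, 1]])
H = np.block([[A, B, T], [C, D, E]])

def ru_encode(s):
    """Return the codeword [s, p1, p2] for a 3-bit message s."""
    As = (A @ s) % 2
    # First stage: (E T^-1 B + D) p1 = (E T^-1 A + C) s.
    d_tilde = (E[0] @ forward_substitute(T, B[:, 0]) + D[0, 0]) % 2
    assert d_tilde == 1, "singular: a further column permutation would be needed"
    p1 = (E @ forward_substitute(T, As) + C @ s) % 2
    # Second stage: T p2 = A s + B p1, solved by substitution.
    p2 = forward_substitute(T, (As + B @ p1) % 2)
    return np.concatenate([s, p1, p2])

codewords = [ru_encode(np.array(m)) for m in product([0, 1], repeat=3)]
syndromes = [(H @ c) % 2 for c in codewords]
```

For a larger gap, the scalar check would be replaced by a dense inversion of the gap-sized matrix, which is the source of the quadratic term in the complexity quoted above.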
Table 4. Complexity analysis of $p_2^T$ calculation (columns: Operation, Comment, Complexity; first row: $As^T$, multiplication by sparse matrix).

The disadvantage of encoding using the RU method is that there is no exact programmable step-by-step algorithm. The multiple matrix calculations in this algorithm significantly limit the development of a rapid flexible encoder [28]. In addition, the RU method is subject to a long critical path and odd constraints, which could render the LDPC encoder non-systematic [19].

Proposed QC-LDPC Encoding Algorithm
This section presents a scheme developed in this study for the construction of efficient encoders for 5G NR QC-LDPC codes. The proposed encoding method is based on the special characteristics of 5G NR QC-LDPC codes, which are presented in Figure 1. The proposed architectures target low complexity while ensuring high throughput. As noted earlier, base graphs BG1 and BG2 have similar structures. In this paper, we focus our description on BG1 with a size of $m_b \times n_b$ ($m_b = 46$ and $n_b = 68$), which is the main 5G NR high-rate base graph.
Let the codeword $C = [s \; p_a \; p_c]$, where $s$ denotes the systematic portion, which is divided into 22 groups of $Z$ bits, since the base graph BG1 has $k_b = n_b - m_b = 22$ information bit columns. Moreover, $s = [s_1, s_2, \ldots, s_{k_b}]$, where each element of $s$ is a vector of length $Z$. The information messages received by the encoder are stored in registers that are organized as $k_b$ blocks, denoted by $s_i$ ($i = 1, 2, \ldots, k_b$), which correspond to the systematic blocks, each consisting of $Z$ bits. Given that the encoder was designed to read $Z$ bits per clock cycle, it requires $k_b$ cycles to store all the information blocks. Moreover, the parity sequence can be grouped into sets of $Z$ bits. Suppose that the parity portion $p$ of each message is split into two sub-components as follows: the first $g = 4$ parity blocks $p_a = [p_{a_1}, p_{a_2}, \ldots, p_{a_g}]$, and the remaining $(m_b - g) = 42$ parity blocks $p_c = [p_{c_1}, p_{c_2}, \ldots, p_{c_{m_b-g}}]$. More precisely, the encoded codeword can be expressed as:

$$C = [s_1, s_2, \ldots, s_{k_b}, p_{a_1}, p_{a_2}, \ldots, p_{a_g}, p_{c_1}, p_{c_2}, \ldots, p_{c_{m_b-g}}]. \tag{16}$$

The parity check matrix $H$ of 5G NR QC-LDPC codes can be partitioned into six matrices and presented in the following form:

$$H = \begin{bmatrix} A & B & O \\ C_1 & C_2 & I \end{bmatrix}, \tag{17}$$

where, counted in $Z \times Z$ blocks, $A$ is $g \times k_b$, $B$ is $g \times g$, $O$ is the $g \times (m_b - g)$ all-zero matrix, $C_1$ is $(m_b - g) \times k_b$, and $C_2$ is $(m_b - g) \times g$. Moreover, $I$ is an identity matrix with dimensions of $(m_b - g) \times (m_b - g)$. The encoding of LDPC codes is carried out using the following defining equation:

$$H C^T = 0^T. \tag{18}$$

Equation (18) can also be expressed as:

$$\begin{bmatrix} A & B & O \\ C_1 & C_2 & I \end{bmatrix} \begin{bmatrix} s^T \\ p_a^T \\ p_c^T \end{bmatrix} = 0^T. \tag{19}$$

Equation (19) is then naturally split into two equations, as follows:

$$A s^T + B p_a^T = 0^T, \tag{20}$$

$$C_1 s^T + C_2 p_a^T + p_c^T = 0^T. \tag{21}$$

The proposed algorithm is performed in two steps. In the first step, the parity bits in the first portion $p_a$ are computed by solving Equation (20). The second step computes the $p_c$ parity portion using Equation (21).
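The first step in the encoder implementation is the determination of the $p_a$ part. Initially, Equation (20) is rewritten in block form as follows:

$$\begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,k_b} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,k_b} \\ a_{3,1} & a_{3,2} & \cdots & a_{3,k_b} \\ a_{4,1} & a_{4,2} & \cdots & a_{4,k_b} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_{k_b} \end{bmatrix} + B \begin{bmatrix} p_{a_1} \\ p_{a_2} \\ p_{a_3} \\ p_{a_4} \end{bmatrix} = 0. \tag{22}$$

Using the dual-diagonal structure of $B$, this can then be expanded into the following set of equations:

$$\sum_{j=1}^{k_b} a_{1,j} s_j + p_{a_1}^{(1)} + p_{a_2} = 0, \tag{23}$$

$$\sum_{j=1}^{k_b} a_{2,j} s_j + p_{a_1} + p_{a_2} + p_{a_3} = 0, \tag{24}$$

$$\sum_{j=1}^{k_b} a_{3,j} s_j + p_{a_3} + p_{a_4} = 0, \tag{25}$$

$$\sum_{j=1}^{k_b} a_{4,j} s_j + p_{a_1}^{(1)} + p_{a_4} = 0, \tag{26}$$

where $p_{a_1}^{(\alpha)}$ denotes the $\alpha$-th (right) cyclic shifted version of $p_{a_1}$ for $0 \le \alpha \le Z$. By adding up all the above equations, the following is obtained:

$$p_{a_1} = \sum_{i=1}^{4} \sum_{j=1}^{k_b} a_{i,j} s_j. \tag{27}$$

It should be noted that a straightforward implementation of $a_{i,j} s_j$ can be done with $Z$-bit cyclic shifters. Since $a_{i,j} s_j$ is a circular right shift of $s_j$ with the shift coefficient defined by $a_{i,j}$, the hardware complexity is trivial. With the definition

$$\lambda_i = \sum_{j=1}^{k_b} a_{i,j} s_j, \quad i = 1, 2, 3, 4, \tag{28}$$

the following can be obtained:

$$p_{a_1} = \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4, \tag{29}$$

$$p_{a_2} = \lambda_1 + p_{a_1}^{(1)}, \tag{30}$$

$$p_{a_4} = \lambda_4 + p_{a_1}^{(1)}, \tag{31}$$

$$p_{a_3} = \lambda_3 + p_{a_4}. \tag{32}$$

From Equation (28), each $\lambda_i$ value is computed by accumulating all the $a_{i,j} s_j$ values. In modulo-2 arithmetic, $\lambda_i$ is obtained by carrying out XOR operations on all the $a_{i,j} s_j$ terms. One $\lambda_i$ value is computed per clock cycle, so all of them are available after $g = 4$ cycles. The first block of the parity bits $p_{a_1}$ is then calculated by accumulating all the $\lambda_i$ values. The remaining parity bits $p_{a_i}$ are obtained from Equations (30)-(32). This process takes two clock cycles, since there is a dependency between $p_{a_3}$ and $p_{a_4}$. All the parity bits $p_a$ in the first parity portion are stored in registers.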
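The first encoding step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the shift values in `a` are random placeholders instead of the BG1 coefficients, the core parity structure (first parity column with shifts 1, 0, none, 1 over the four core rows, followed by a dual diagonal) is the one commonly used for BG1, and the product of a circulant with a block is modelled with `np.roll` in the column-block convention. The algebra is the same for either shift direction as long as it is used consistently.

```python
import numpy as np

def shift(block, p):
    """Product Q(p) * block for a Z-bit column block; p = -1 selects the zero block."""
    if p < 0:
        return np.zeros_like(block)
    return np.roll(block, -p)

def encode_core(s, a):
    """s: (k_b, Z) message bits, a: (4, k_b) shift values. Returns (lam, [p1, p2, p3, p4])."""
    lam = []
    for row in a:                       # one lambda_i per core check row
        acc = np.zeros_like(s[0])
        for s_j, a_ij in zip(s, row):   # XOR-accumulate the shifted message blocks
            acc ^= shift(s_j, a_ij)
        lam.append(acc)
    p1 = lam[0] ^ lam[1] ^ lam[2] ^ lam[3]
    p1_shifted = shift(p1, 1)
    p2 = lam[0] ^ p1_shifted
    p4 = lam[3] ^ p1_shifted
    p3 = lam[2] ^ p4                    # depends on p4, hence the extra cycle
    return lam, [p1, p2, p3, p4]

rng = np.random.default_rng(0)
Z, k_b = 8, 22                                        # toy Z; k_b as in BG1
a = rng.integers(-1, Z, size=(4, k_b))                # hypothetical shift values
s = rng.integers(0, 2, size=(k_b, Z), dtype=np.uint8)
lam, (p1, p2, p3, p4) = encode_core(s, a)

# Residuals of the four core parity check rows; all must be zero blocks.
residuals = [lam[0] ^ shift(p1, 1) ^ p2,
             lam[1] ^ p1 ^ p2 ^ p3,
             lam[2] ^ p3 ^ p4,
             lam[3] ^ shift(p1, 1) ^ p4]
```

No matrix inverse appears anywhere in this step: summing the four rows cancels every parity block except the first, which is what lets the encoder avoid the inversion required by the RU method.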
In the second step, the $p_c$ portion can be easily determined based on Equation (21), where matrices $C_1$ and $C_2$ are given by

$$C_1 = \begin{bmatrix} c_{1,1} & c_{1,2} & \cdots & c_{1,k_b} \\ c_{2,1} & c_{2,2} & \cdots & c_{2,k_b} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m_b-g,1} & c_{m_b-g,2} & \cdots & c_{m_b-g,k_b} \end{bmatrix}, \quad C_2 = \begin{bmatrix} c_{1,k_b+1} & c_{1,k_b+2} & \cdots & c_{1,k_b+g} \\ c_{2,k_b+1} & c_{2,k_b+2} & \cdots & c_{2,k_b+g} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m_b-g,k_b+1} & c_{m_b-g,k_b+2} & \cdots & c_{m_b-g,k_b+g} \end{bmatrix}. \tag{33}$$

Upon the application of Equation (21), the elements of $p_c$ can be computed using the following equations:

$$p_{c_i} = \sum_{j=1}^{k_b} c_{i,j} s_j + \sum_{j=1}^{g} c_{i,k_b+j} p_{a_j}, \quad i = 1, 2, \ldots, m_b - g. \tag{34}$$

Similarly, $c_{i,j} s_j$ represents a circular shift of $s_j$ with the shift coefficient defined by $c_{i,j}$, and $c_{i,k_b+j} p_{a_j}$ represents a circular shift of $p_{a_j}$ with the shift coefficient defined by $c_{i,k_b+j}$. As soon as $c_{i,j} s_j$ and $c_{i,k_b+j} p_{a_j}$ have been obtained, they can be used to determine the value of the corresponding parity bits in the second parity portion $p_c$. Each block $p_{c_i}$ can be computed in a single clock cycle. Hence, all the $p_c$ parity bits can be acquired in $(m_b - g)$ clock cycles. The encoded codeword is then a combination of the original message $s$ and the two calculated parity portions $p_a$ and $p_c$.
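The second encoding step can be cross-checked against an explicitly expanded matrix. The sketch below uses toy dimensions and random shift values, all of which are assumptions for illustration; `p_a` is filled with arbitrary bits because Equation (21) determines p_c for any given p_a. The helper names are hypothetical.

```python
import numpy as np

def shift(block, p):
    """Product Q(p) * block for a Z-bit column block; p = -1 selects the zero block."""
    if p < 0:
        return np.zeros_like(block)
    return np.roll(block, -p)

def circulant(Z, p):
    """Explicit Q(p), used only to cross-check the shifter-based computation."""
    if p < 0:
        return np.zeros((Z, Z), dtype=np.uint8)
    return np.roll(np.eye(Z, dtype=np.uint8), p, axis=1)

def encode_extension(s, p_a, c):
    """c: (m_ext, k_b + g) shift values of [C1 C2]. Returns p_c with shape (m_ext, Z)."""
    blocks = list(s) + list(p_a)
    p_c = []
    for row in c:                        # one extension check row per clock cycle
        acc = np.zeros_like(blocks[0])
        for blk, c_ij in zip(blocks, row):
            acc ^= shift(blk, c_ij)
        p_c.append(acc)
    return np.array(p_c)

rng = np.random.default_rng(1)
Z, k_b, g, m_ext = 5, 4, 2, 3                          # toy sizes (assumption)
c = rng.integers(-1, Z, size=(m_ext, k_b + g))         # hypothetical shift values
s = rng.integers(0, 2, size=(k_b, Z), dtype=np.uint8)
p_a = rng.integers(0, 2, size=(g, Z), dtype=np.uint8)
p_c = encode_extension(s, p_a, c)

# Expand the extension rows [C1 C2 I] and verify that the syndrome vanishes.
H_ext = np.block([[circulant(Z, p) for p in row]
                  + [circulant(Z, 0 if k == i else -1) for k in range(m_ext)]
                  for i, row in enumerate(c)])
codeword = np.concatenate([s.ravel(), p_a.ravel(), p_c.ravel()])
syndrome = (H_ext @ codeword) % 2
```

Because the last block column of the extension rows is an identity, each p_c block depends only on s and p_a, so the rows can be processed independently, one per clock cycle.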

Performance Analysis and Comparison
This section reports the implementation results of the proposed LDPC encoder architecture, as well as a detailed comparison between the proposed method and other encoder implementations in terms of area and speed for the 5G NR standard. First, the design characteristics of different LDPC encoders are discussed. Thereafter, an analysis of the proposed LDPC encoder with respect to its ASIC implementation is presented.
Table 5 presents a comparison between the area and speed of the proposed encoding method and those of other state-of-the-art approaches. As shown in Table 5, the matrix size was used to determine the ROM storage, and the Hamming weights of the matrices were used to compute the gate count. Since all the systematic bits and the parity check bits in the first parity portion are stored in registers, the number of flip-flops required was estimated from the bit sizes of $K$ and $p_a$. In Table 5, the time interval between input frames was used to compare the processing speed of the different encoding methods. The time between two consecutive input frames is the total number of clock cycles from the arrival of the first $Z$ bits of a frame up to the cycle in which the encoder is ready to receive another frame.
As a concrete example, a target LDPC code with base graph BG1 and submatrix size Z = 16 is considered in Table 6. As can be observed from Table 6, the proposed encoder achieves a significant reduction in storage overhead. In the Gaussian elimination method, the entire generator matrix G is stored in the memory. In the RU method, the location of the edges (ones) of each row is stored, with an extra bit indicating the end of a row. By storing only the values of the shift coefficients for each submatrix, the proposed method reduces the ROM size by 98.2% and 88.9% when compared with the G matrix method and RU method, respectively. Moreover, the proposed encoder reduces the XOR logic gate count by a factor of 1.65 compared with the RU method. This leads to a significant reduction in the hardware complexity of the proposed encoder, as these components are the main contributors to the logic resources in the encoder architecture. Hence, the proposed encoding structure shows a significant advantage over other LDPC encoding methods with respect to hardware complexity. As can be seen from Table 6, the Gaussian elimination approach requires only 23 clock cycles to generate the encoded codeword for a given LDPC code. However, this method suffers from a significant storage overhead, which makes it less than ideal for implementation. The analysis of the RU design shows that it requires 471 clock cycles per codeword. This is significantly higher than that of the proposed encoder, which requires only 70 clock cycles. From Table 6, it can be observed that the number of clock cycles required per codeword by the proposed encoder design is 14.8% of that of the RU method.

The ASIC post-synthesis implementation results on TSMC 65-nm CMOS technology are shown in Table 7 for various QC-LDPC encoders with expansion factors Z = 30, 64, 96, 144, and 352, which are indicated in the table as BG1-Z30, BG1-Z64, BG1-Z96, BG1-Z144, and BG1-Z352, respectively. In Table 7, q size denotes the word length required to store the shift sizes, while CPC stands for the number of clock cycles required per codeword for encoding. Note that all input data bits were assumed to be available for encoding, and the serialization factors are not included in the results. In the proposed design, the CPC is equal to the maximum number of clock cycles required for the calculation of the $p_a$ and $p_c$ parity check bits. The computation of $p_a$ requires $(g + 2)$ clock cycles, in which $g$ clock cycles are used to compute all the $\lambda$ values and $p_{a_1}$, and two extra clock cycles are required for the computation of the remaining parity bits in the $p_a$ portion. The computation of $p_c$ requires $(m_b - g)$ clock cycles. Hence, this method requires $(m_b + 2)$ clock cycles in total. The information throughput reported in Table 7 is derived from $f_{max}$, the maximum operating frequency (post synthesis). For different submatrix sizes, the throughput varied from 22.1 to 202.4 Gbps. In Table 7, the occupied areas are also reported.
It should be noted that there is a significant increase in the core area when processing larger submatrix sizes. Since an encoder architecture with a larger submatrix size Z requires a larger q size, additional memory and hardware components are required. It is shown that the encoding complexity of the proposed design is linearly proportional to the submatrix size Z of the code. To compare throughput on an equal basis, the throughput-to-area ratio metric was further defined as TAR = Throughput/Area (Gbps/mm²). For all the considered submatrix sizes in Table 7, the TAR ranged from 520 to 597 Gbps/mm². Based on the implementation results presented above, it is clear that the design methodology is applicable to different submatrix sizes and offers a high area efficiency and a high information throughput, which is more than enough to satisfy the throughput requirement of the 5G NR standard.
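As a worked instance of the cycle count for BG1, with $m_b = 46$ and $g = 4$:

$$\text{CPC} = \underbrace{(g + 2)}_{p_a} + \underbrace{(m_b - g)}_{p_c} = m_b + 2 = 46 + 2 = 48.$$

If the $k_b = 22$ cycles needed to load the information blocks are added, the total is $22 + 48 = 70$ cycles. This reading, which is an interpretation rather than a statement from the tables, is consistent with the 70 clock cycles quoted for the proposed encoder in the Table 6 comparison, where the interval between input frames is counted from the arrival of the first $Z$ bits.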

Conclusions
In this paper, a novel low-complexity high-throughput encoder approach for the 5G NR standard is proposed. Based on the proposed encoding algorithm, five encoder architectures with different submatrix sizes were implemented. The derived architecture exhibited a significantly lower hardware complexity, as it decreased the memory and logic component requirements. The proposed design demonstrates a superior performance to the alternative methods. Moreover, the synthesis results revealed that the proposed design is appropriate for the high throughput 5G standard.

Figure 1. Sketch of base parity check structure for the 5G NR QC-LDPC codes.

Figure 2. Shortening by zero padding and puncturing of standard 5G QC-LDPC codes.

Figure 4 details the overall block diagram for the proposed low complexity 5G NR QC-LDPC code encoder. The hardware architectures were designed to conduct the encoding process through steps defined in Equations (29)-(32) and (34). In the first step, the computation of the parity bits in the first portion $p_a$ is carried out. From Equation (28), each $\lambda_i$ value is computed by accumulating all the cyclic shift results of $s_j$. Since the information message $s$ consists of $k_b$ blocks of $Z$ bits, a total of $k_b = 22$ barrel shifters of size $Z$, which are denoted by $CS_j$, $j = 1, 2, \ldots, k_b$, are required for the

Table 2. Lifting size Z supported by standard 5G QC-LDPC codes.

Table 3. Complexity analysis of $p_1^T$ calculation.

Table 5. Comparison between Gaussian method, RU method, and proposed method.

Table 6. Comparison between Gaussian method, RU method, and proposed method for submatrix size Z = 16.