High Area-Efficient Parallel Encoder with Compatible Architecture for 5G LDPC Codes

Abstract: This paper presents a novel low-complexity parallel quasi-cyclic low-density parity-check (QC-LDPC) encoding algorithm that is compatible with the 5th generation (5G) new radio (NR). Based on the algorithm, we propose a highly area-efficient parallel encoder with a compatible architecture. The proposed encoder has the advantages of parallel encoding and pipelined operations. Furthermore, it is designed as a configurable encoding structure that is fully compatible with the different base graphs of 5G LDPC, so the encoder architecture adapts flexibly to various 5G LDPC codes. The proposed encoder was synthesized in a 65 nm CMOS technology. Following the encoder architecture, we implemented nine encoders for distributed lifting sizes of the two base graphs. The experimental results show that the encoder has high performance and significant area efficiency, which is better than related prior art. This work includes a complete encoding algorithm and the compatible encoders, which are fully compatible with the different base graphs of 5G LDPC codes. Therefore, it adapts flexibly to various 5G application scenarios.


Introduction
Low-density parity-check (LDPC) codes have been recognized for their excellent error correction abilities near the Shannon limit [1], and LDPC codes are advantageous in hardware implementation [2]. At present, many communication systems have taken these codes as standards, such as the Digital Video Broadcasting Satellite (DVB-S2/S2X, Europe) [3], the Consultative Committee for Space Data Systems (CCSDS) [4], Wireless Local Area Network (WLAN, IEEE 802.11n) [5], the China digital radio standard [6], and the 5th Generation Mobile Communication Technology (5G) [7,8].
Currently, research on mobile communication systems has entered the 5G phase [9]. Channel encoding is one of the 5G core technologies; it is mainly used to ensure the correct transmission of channel information and to improve communication quality [10]. The Third Generation Partnership Project (3GPP) organization has finally decided to take the LDPC code as the data channel coding scheme for 5G enhanced Mobile Broadband (eMBB) [11,12]. Compared to 4G, 5G has higher requirements in terms of the data transmission rate and information transmission reliability [13,14]. Therefore, it has important significance and application value for exploring a novel LDPC encoding scheme and implementing 5G LDPC codes [15,16]. To achieve scalable data transmission and flexibility, 3GPP has decided to take two kinds of base graphs for 5G channel encoding-BG1 and BG2 [17].
Presently, some research initiatives have focused on 5G LDPC codes [18,19]. 5G New Radio (NR) has higher performance demands on channel coding solutions [20]; one study discusses the design concept of the new quasi-cyclic low-density parity-check (QC-LDPC) codes that have different structural characteristics and meet the multiple requirements of 5G NR channel coding [21]. Designed as structured LDPC codes, QC-LDPC codes have been research hotspots in the recent past. QC-LDPC codes possess obvious advantages in to the QC-LDPC encoding algorithm. Reference [30] describes that the throughput of QC-LDPC codes could be improved by trimming the full-base matrix into the requested matrix size.
To meet the high-performance and compatibility requirements of 5G LDPC encoding, this work presents a highly area-efficient parallel QC-LDPC encoder core with a compatible architecture that conforms to the latest 5G standard. It has high encoding performance and a low hardware cost.
The remaining sections of this paper are organized as follows: Section 2 briefly analyzes the characteristics of 5G LDPC codes. Section 3 proposes a highly parallel LDPC encoding algorithm compatible with 5G LDPC codes. Section 4 presents a highly area-efficient parallel QC-LDPC encoder with a compatible architecture. Section 5 gives experimental results and a comparative analysis. Section 6 summarizes our work and provides the conclusions.

Analysis of 5G LDPC Codes
LDPC codes are adopted as the data encoding scheme of 5G because their higher encoding throughput and lower latency are better suited to the data transmission of high-speed services. The main content of the 5G LDPC standards is analyzed step by step as follows.
The LDPC codes in the 5G standards are QC-LDPC codes. For a QC-LDPC code, the structural characteristics of the check matrix can be denoted by a base graph (BG) or a base graph matrix (HBG), as illustrated by the check matrix in Equation (1).
In the matrix of Equation (1), each 1 in HBG represents a 4 × 4 binary circulant permutation matrix (CPM), and each 0 represents a 4 × 4 zero matrix; that is, the size Z of the submatrices in Equation (1) is 4. HBG can represent the structural characteristics of the check matrix of a QC-LDPC code. However, it cannot reflect the cyclic shift value of each CPM. Therefore, it is essential to define a cyclic shift coefficient matrix P (exponent matrix) to represent the cyclic shift values for the corresponding base graph matrix.
Each entry Pm,n takes one of two kinds of values. When 0 ≤ Pm,n < Z, it denotes the circulant permutation matrix obtained by cyclically right-shifting the Z × Z identity matrix by Pm,n positions. When Pm,n = −1, it denotes a Z × Z zero matrix.
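As a minimal sketch of this representation, the following Python snippet expands an exponent matrix P into the full binary check matrix. The function names `cpm` and `expand` are illustrative assumptions and do not come from the 5G specification or from this paper.

```python
def cpm(p: int, z: int) -> list[list[int]]:
    """Return the Z x Z block for exponent p.

    p == -1 gives the zero block; 0 <= p < Z gives the identity matrix
    cyclically right-shifted by p columns (row r has its 1 in column (r + p) mod Z).
    """
    if p == -1:
        return [[0] * z for _ in range(z)]
    if not 0 <= p < z:
        raise ValueError("exponent must be -1 or lie in [0, Z)")
    return [[1 if c == (r + p) % z else 0 for c in range(z)] for r in range(z)]


def expand(exponents: list[list[int]], z: int) -> list[list[int]]:
    """Expand an exponent matrix P into the full binary check matrix H."""
    h = []
    for row in exponents:
        blocks = [cpm(p, z) for p in row]
        for r in range(z):
            # Concatenate row r of every block in this block-row.
            h.append([bit for block in blocks for bit in block[r]])
    return h
```

For example, `expand([[0, -1], [1, 0]], 2)` produces a 4 × 4 binary matrix whose upper-right 2 × 2 block is zero.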
Equation (1) corresponds to a cyclic shift coefficient matrix P. Therefore, the check matrix is unique if the lifting size Z and the exponent matrix P of a QC-LDPC code are determined. The description of 5G LDPC codes often adopts this representation method.
In the 5G standard, LDPC codes have two types of base graphs, named BG1 and BG2. The check matrices of BG1 and BG2 both have the characteristic structure shown in Figure 1. These check matrices are termed H matrices.
The 5G LDPC code has two types of base matrices, namely, HBG1 and HBG2. Their comparison is shown in Table 1. BG1 has a total of 316 elements equal to 1, while BG2 has a total of 197. An element 1 indicates that the corresponding submatrix is an identity matrix or a cyclically right-shifted identity matrix. An element 0 indicates that the corresponding submatrix is a zero matrix. The cyclic shift coefficients of the submatrices in H are stored in the corresponding coefficient matrix, which has the same size as the HBG matrix. In the cyclic shift coefficient matrix, a non-negative element i corresponds to an element 1 in HBG, indicating that the submatrix is obtained by cyclically right-shifting an identity matrix by i bits. An element −1 corresponds to an element 0 in HBG, indicating that the corresponding submatrix is a zero matrix.
The size Z of the submatrices in H is not fixed. The 5G standards specify the values of Z. For BG1 and BG2, the value ranges of Z are the same, as shown in Table 2.
In the 5G standards, the cyclic shift coefficients of the submatrices vary in different situations. First, BG1 and BG2 have different coefficient matrices; furthermore, different values of 'a' in Table 2 result in different coefficient matrices even for the same base matrix.
As shown in Table 2, the Z values in each row share the same cyclic shift coefficient matrix, so the Z values can be divided into 8 sets, each of which shares a common coefficient matrix.
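The division into 8 sets can be sketched as follows. The sketch assumes the rule from 3GPP TS 38.212 that each lifting size has the form Z = a × 2^j with a ∈ {2, 3, 5, 7, 9, 11, 13, 15} and Z ≤ 384; these constants are taken from the standard rather than from the tables of this paper, and the function name is hypothetical.

```python
MAX_Z = 384  # assumed largest lifting size (3GPP TS 38.212)
A_VALUES = (2, 3, 5, 7, 9, 11, 13, 15)  # assumed values of 'a', one per set


def lifting_size_sets(max_z: int = MAX_Z) -> dict[int, list[int]]:
    """Group the lifting sizes Z = a * 2**j (Z <= max_z) by their value of a."""
    sets = {}
    for a in A_VALUES:
        sizes = []
        z = a
        while z <= max_z:
            sizes.append(z)
            z *= 2  # next power-of-two multiple of a
        sets[a] = sizes
    return sets
```

Every Z in one list shares the coefficient matrix of that set, so an encoder only needs to select the set from the value of a.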
The Z values after division are shown in Table 3. When a Z value in one set is taken as the size of each submatrix, the coefficient matrix corresponding to that set can be used to represent the entire H matrix. Equation (5) gives the final cyclic shift coefficients corresponding to different Z values in the same set, where Vij denotes the element in the i-th row and the j-th column of the coefficient matrix corresponding to one set, and Pij denotes the actual cyclic shift coefficient of the submatrix corresponding to the element in the i-th row and the j-th column of HBG for a selected Z in that set.
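The mapping from Vij to Pij can be illustrated with a short sketch. It assumes the modular rule of 3GPP TS 38.212, in which Pij = −1 when Vij = −1 and Pij = Vij mod Z otherwise; the function names are hypothetical.

```python
def shift_coefficient(v: int, z: int) -> int:
    """Map a set-level coefficient V_ij to the actual shift P_ij for lifting size Z."""
    # -1 marks a zero submatrix and is passed through unchanged.
    return -1 if v == -1 else v % z


def shift_matrix(v_matrix: list[list[int]], z: int) -> list[list[int]]:
    """Apply the rule to every entry of a coefficient matrix."""
    return [[shift_coefficient(v, z) for v in row] for row in v_matrix]
```

With this rule, one stored coefficient matrix per set serves every Z in that set, which is what allows the coefficient memory to stay small.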

Parallel LDPC Encoding Algorithm Compatible with 5G LDPC Codes
This paper proposes a highly parallel QC-LDPC encoding algorithm that is compatible with the 5G LDPC standards. Based on this algorithm, a novel encoder architecture for LDPC codes is designed to satisfy the requirements of 5G LDPC codes mentioned above. There are two base graphs for 5G LDPC codes, BG1 and BG2. These base graphs have different structures, as shown in Figures 2 and 3. Because our research is compatible with both BG1 and BG2, this work has wide applicability to the new 5G LDPC standards. Furthermore, we present an integrated solution comprising the parallel LDPC encoding algorithm and the area-efficient compatible encoder architecture.
The encoding of 5G LDPC codes is defined by H × C^T = O^T, which can be expressed as Equation (6).
(1) First, Equation (6) is decomposed into the equation set of Equation (7).
Since the lifting size of HBG is Z, the codeword C is uniformly divided into blocks of size Z to match the base matrix HBG. According to the sub-block structure of HBG, C can be denoted as C = [S1, ..., Skb, Pa1, ..., Pa4, Pb1, ..., Pb(Mb−4)]. The number of columns of the information sequence S is the same as that of block A and block C. The number of columns of the check sequence Pa is the same as that of block B and block D. The number of columns of the check sequence Pb is the same as that of block O and block I.
(2) Then, P a and P b are calculated as follows.
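Written out from the block structure described above (blocks A, B, and O in the upper rows of H, and blocks C, D, and I in the lower rows), Equations (6) to (8) take the following form. This is a reconstruction from the surrounding definitions, assuming that O is the all-zero block and I is the identity block; all arithmetic is over GF(2), so addition and subtraction coincide.

```latex
\begin{gather*}
H \cdot C^{T} =
\begin{bmatrix} A & B & O \\ C & D & I \end{bmatrix}
\begin{bmatrix} S^{T} \\ P_a^{T} \\ P_b^{T} \end{bmatrix} = O^{T} \tag{6} \\
A \cdot S^{T} + B \cdot P_a^{T} = 0, \qquad
C \cdot S^{T} + D \cdot P_a^{T} + P_b^{T} = 0 \tag{7} \\
P_a^{T} = B^{-1} \cdot A \cdot S^{T}, \qquad
P_b^{T} = C \cdot S^{T} + D \cdot P_a^{T} \tag{8}
\end{gather*}
```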
Equation (8) shows that the calculation of the check bits Pa and Pb involves factors of the form α·β^T, such as A·S^T, C·S^T, and D·Pa^T. However, the calculation of Pa requires an additional matrix multiplication between B^−1 and A·S^T. Considering the characteristics of the cyclic shift coefficients in the B block, let Var = A·S^T, where A is the submatrix of H with known coefficients and S is the input information bit sequence.
For BG1 and BG2, there are four different cases of cyclic shift coefficients corresponding to the B submatrices of the HBG matrices, which are shown in Figure 4.
In Equation (7), A·S^T + B·Pa^T = 0, so [var1, var2, var3, var4] and [Pa1, Pa2, Pa3, Pa4] satisfy a set of linear relationships determined by B.
Each of the four B submatrices, namely the left and right submatrices of Figure 4a and the left and right submatrices of Figure 4b, has its own computational process relating Pa and Var.
Finally, the check bits Pa and Pb are obtained, and the encoded codeword C is output.
For the check matrix H with a special structure in 5G LDPC codes, our scheme avoids the matrix inversion operations in encoding. The scheme directly uses the linear relationship between Var and Pa to obtain the check sequence Pa by computing the intermediate variable Var (which represents A·S^T), which simplifies the encoding process. Through the above equations, the scheme uses var1, var2, var3, and var4 to solve for Pa1, Pa2, Pa3, and Pa4. Once Pa has been solved, Pb can be obtained by Pb^T = C·S^T + D·Pa^T, because C and D are both submatrices with known coefficients in HBG and S is the known information sequence. Finally, we obtain the encoded codeword C = [S, Pa, Pb] by combining S, Pa, and Pb. With this LDPC encoding scheme, the hardware implementation of the presented encoder consists mainly of α·β^T operation units. This greatly reduces the hardware complexity of the encoder architecture and lays a foundation for the high area-efficiency encoder in this paper.
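The inversion-free flow can be illustrated with a small software sketch. The dual-diagonal matrix `B_TOY` below is a hypothetical B submatrix chosen only for illustration; it is not read from the standard's tables or from Figure 4, and the back-substitution in `encode` is specific to it. All function names are assumptions as well.

```python
def rot(v, p):
    """Multiply a Z-bit block by the CPM with exponent p (a cyclic shift)."""
    return v[p:] + v[:p]


def xor(a, b):
    """Bitwise GF(2) addition of two Z-bit blocks."""
    return [x ^ y for x, y in zip(a, b)]


def block_mac(row, blocks, z):
    """One alpha*beta^T unit: XOR-accumulate shifted blocks along one row."""
    acc = [0] * z
    for p, blk in zip(row, blocks):
        if p != -1:  # -1 marks a zero submatrix
            acc = xor(acc, rot(blk, p % z))
    return acc


# Hypothetical dual-diagonal B (exponents), used only for this sketch.
B_TOY = [[1, 0, -1, -1],
         [0, 0, 0, -1],
         [-1, -1, 0, 0],
         [1, -1, -1, 0]]


def encode(A, C, D, s_blocks, z):
    """Return (Pa, Pb) for exponent matrices A, C, D and the fixed B_TOY."""
    var = [block_mac(row, s_blocks, z) for row in A]  # Var = A*S^T
    pa1 = [0] * z
    for v in var:
        pa1 = xor(pa1, v)  # summing the rows of B_TOY leaves only Pa1
    pa2 = xor(var[0], rot(pa1, 1))    # row 1: rot1(Pa1) + Pa2 = var1
    pa3 = xor(var[1], xor(pa1, pa2))  # row 2: Pa1 + Pa2 + Pa3 = var2
    pa4 = xor(var[2], pa3)            # row 3: Pa3 + Pa4 = var3
    pa = [pa1, pa2, pa3, pa4]
    # Pb^T = C*S^T + D*Pa^T, one unit per row of [C, D]
    pb = [xor(block_mac(c, s_blocks, z), block_mac(d, pa, z))
          for c, d in zip(C, D)]
    return pa, pb
```

No inverse of B is ever formed: Pa follows from XOR and shift operations on Var alone, and the fourth row of `B_TOY` is then satisfied automatically.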

Area-Efficient Parallel Pipelined QC-LDPC Encoder with Compatible Architecture
Based on the encoding scheme of this paper, α·β^T units are the main operation units of the proposed encoder. A·S^T, C·S^T, and D·Pa^T are all operations of the form α·β^T. Owing to the quasi-cyclic characteristics of 5G QC-LDPC codes, one operation unit can process a Z-bit α·β^T operation in parallel, that is, complete the computation of one Z-bit sequence in Var^T or Pb^T. For HBG1, Pa^T and Pb^T have up to 46 column sequences, and each sequence is Z bits, so HBG1 needs 46 α·β^T operation units. For HBG2, Pa^T and Pb^T have up to 42 column sequences, so HBG2 needs 42 α·β^T operation units. To make the proposed encoder fully support 5G LDPC codes, the encoder provides 46 operation units so as to be compatible with both HBG1 and HBG2, and the operation units are distributed between the Var generation module and the Pb generation module. Furthermore, the cyclic shift operation and the XOR operation are combined to replace the α·β^T computation, avoiding the complicated multiplication-accumulation operations required by a direct α·β^T computation. This design not only significantly reduces the complexity and hardware cost of the encoder, but also greatly improves the computing efficiency.
Based on the above analysis, this work designs a highly area-efficient parallel QC-LDPC encoder with a compatible architecture, which is shown in Figure 5. The encoder mainly consists of a serial-to-parallel information input buffer, a Var generation module, a configurable Pa generation module, a parallel Pb generation module, a cyclic shift coefficient memory module, and an encoding controller.

The upper and middle encoding modules of the encoder (the Var module and the Pa module) correspond to the high-code-rate part of LDPC encoding and are used to generate the Pa check bits. The Pb encoding module corresponds to the extended matrix region in H and generates the extended check bits. By selecting the number of enabled Pb operation units, the length of the Pb check bits can be adjusted to determine the code rate of the encoded codeword. Thus, the encoder architecture can adapt to different code rates of LDPC encoding.

Information Input Buffer
The information sequence S is input into the buffer, and the buffer sends each Z-bit information sequence to the Var operation units in parallel. The input buffer is implemented by a register set that contains two Z-bit registers. Under the controller signal, while one register supplies the Z-bit Si sequence to the Var generation units in parallel, the other register can preload the next sequence Si+1. This structure enables the encoder to read in the next information frame during the encoding of the present frame, which saves reading time and improves the throughput of the encoder.
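The two-register arrangement behaves like a ping-pong buffer. The following behavioral sketch is a software model under that assumption, not the register-transfer design of the encoder; the class and method names are hypothetical.

```python
class PingPongBuffer:
    """Two Z-bit registers: one drives the Var units while the other preloads."""

    def __init__(self, z: int):
        self.z = z
        self.regs = [[0] * z, [0] * z]
        self.active = 0  # index of the register currently feeding the Var units

    def current(self) -> list[int]:
        """Block currently supplied to the Var generation units."""
        return self.regs[self.active]

    def preload(self, block: list[int]) -> None:
        """Load the next block into the idle register without disturbing the active one."""
        if len(block) != self.z:
            raise ValueError("block must be exactly Z bits")
        self.regs[1 - self.active] = list(block)

    def swap(self) -> list[int]:
        """Switch roles on the controller signal and return the new active block."""
        self.active = 1 - self.active
        return self.regs[self.active]
```

Because `preload` only touches the idle register, loading Si+1 overlaps with the processing of Si.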

Cyclic Shift Coefficient Memory Module
Based on the structure characteristics of H matrices for 5G LDPC codes, the proposed encoder only needs to store the cyclic shift coefficient values corresponding to the A, C, and D submatrices in the Flash ROM. The memory module for other submatrices can be omitted. The A, C and D submatrices correspond to A_Block, C_Block, and D_Block, respectively. Moreover, the proposed encoder does not need to store the specific content of the H matrix.
The structure of the H BG1 is shown in Figure 2.

The structure of HBG2 is shown in Figure 3. The sizes of the A, C, and D submatrices are 4 × 10, 38 × 10, and 38 × 4, respectively. The coefficients of each row in A or in Set [C, D] are stored in a separate ROM block, so a total of 42 ROM blocks are required as the memory module. Among them, the coefficients corresponding to the A submatrix require 4 ROM blocks, each storing 10 coefficients. The coefficients corresponding to Set [C, D] require 38 ROM blocks, each storing 14 coefficient values; the first 10 values are the coefficients of the C submatrix, while the last 4 values are the coefficients of the D submatrix, as shown in Figure 6.
In summary, each coefficient matrix of HBG1 requires 46 ROM blocks for the coefficient memory, and each coefficient matrix of HBG2 requires 42 ROM blocks. The coefficient memory of the encoder is composed of 46 ROM blocks so that it is compatible with both HBG1 and HBG2. The bit width of the coefficients in the ROM blocks is determined by the coefficient values of the known encoding algorithm.
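The ROM block counts can be checked with a short calculation. The sketch below assumes one ROM block per row of A and per row of Set [C, D], as described above; the BG1 dimensions (kb = 22, Mb = 46) are inferred from the 42 × 22 size of C and the 46 ROM blocks stated in the text, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RomLayout:
    a_blocks: int   # ROM blocks holding the rows of A
    a_depth: int    # coefficients stored per A block
    cd_blocks: int  # ROM blocks holding the rows of [C, D]
    cd_depth: int   # coefficients stored per [C, D] block

    @property
    def total_blocks(self) -> int:
        return self.a_blocks + self.cd_blocks

    @property
    def total_coefficients(self) -> int:
        return self.a_blocks * self.a_depth + self.cd_blocks * self.cd_depth


def rom_layout(kb: int, mb: int) -> RomLayout:
    """Layout for a base graph with kb information columns and mb rows."""
    # A is 4 x kb; C is (mb - 4) x kb; D is (mb - 4) x 4.
    return RomLayout(a_blocks=4, a_depth=kb, cd_blocks=mb - 4, cd_depth=kb + 4)


BG1 = rom_layout(kb=22, mb=46)
BG2 = rom_layout(kb=10, mb=42)
# The shared memory must cover the larger of the two base graphs.
ENCODER_ROM_BLOCKS = max(BG1.total_blocks, BG2.total_blocks)
```

Sizing the memory for the larger base graph is what lets a single coefficient memory serve both HBG1 and HBG2.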

Encoding Operation Unit
As the core operation unit of the encoder, the encoding operation unit is mainly used to realize operations of the form α·β^T. The Var generation module, which contains α·β^T encoding units, mainly executes the A·S^T operation, while the Pb generation module, which also contains α·β^T units, mainly executes [C D]·[S Pa]^T. The circuit structure of the encoding operation unit is shown in Figure 7. The operation unit consists of a Z-bit barrel shift register, a row of XOR gates, and a Z-bit state register.
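A behavioral model of one operation unit may help to fix ideas. It is a sketch that assumes one Z-bit block is processed per clock and that a coefficient of −1 leaves the state unchanged; the class and method names are hypothetical and do not describe the circuit of Figure 7 in detail.

```python
class OperationUnit:
    """Barrel shifter + row of XOR gates + Z-bit state register."""

    def __init__(self, z: int):
        self.z = z
        self.state = [0] * z  # state register starts at zero

    def reset(self) -> None:
        """Clear the state register before a new codeword."""
        self.state = [0] * self.z

    def step(self, data: list[int], shift: int) -> list[int]:
        """One clock: shift the input block, then XOR it into the state register."""
        if shift != -1:  # -1 marks a zero submatrix, so nothing is accumulated
            s = shift % self.z
            shifted = data[s:] + data[:s]  # barrel shift = multiplication by the CPM
            self.state = [a ^ b for a, b in zip(self.state, shifted)]
        return self.state
```

Calling `step` once per column of a base-graph row accumulates the sum of the shifted blocks for that row, which is the α·β^T result without any multiplier.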


Var Generation Module
The module integrates four encoding units, which are used to realize the operation of A·S^T in Equation (9). The four encoding units are four α·β^T operation units (Z-bit granularity), which correspond to var1, var2, var3, and var4 in turn. When the Var generation module receives a Z-bit data sequence from the information input buffer, the sequence is delivered synchronously, bit for bit, to the Z-bit barrel shift registers in var1, var2, var3, and var4. At the same time, the barrel shift registers read the coefficients at the corresponding positions of the A_ROM, and each barrel shift register then shifts the data sequence by the corresponding number of bits. This completes the Aij·Sj^T operation for one column of Z-bit data, that is, the four operations A1j·Sj^T, A2j·Sj^T, A3j·Sj^T, and A4j·Sj^T are computed in parallel (i denotes the row of the A submatrix; j denotes the column of the A submatrix). Each result of the Aij·Sj^T operation is XORed with the current value of the Z-bit state register (equivalent to binary addition), and each var unit replaces the value in its state register with the new XOR result, which amounts to executing Aij·Sj^T + Ai(j+1)·S(j+1)^T. The four operation units var1, var2, var3, and var4 execute these operations in parallel; for example, the four expressions A11·S1^T + A12·S2^T, A21·S1^T + A22·S2^T, A31·S1^T + A32·S2^T, and A41·S1^T + A42·S2^T are obtained in parallel, with the initial value of each state register set to 0. The Var generation module repeats the above process until the four-way ∑Aij·Sj^T computation is complete, which realizes the four sums ∑A1j·Sj^T, ∑A2j·Sj^T, ∑A3j·Sj^T, and ∑A4j·Sj^T.

Configurable Pa Generation Module
As shown in Figure 5, the Var generation module generates the four output results var1, var2, var3, and var4. When these results are fed to the corresponding interfaces of the configurable Pa computation network, the network generates four check information blocks, that is, Pa1, Pa2, Pa3, and Pa4.
The new 5G LDPC standards correspond to two base graphs, BG1 and BG2. As shown in Figure 4, each base matrix (HBG) has two kinds of B submatrices. In other words, the HBG of the 5G standards can be further divided into four base matrices. The Pa computation network is designed as a configurable circuit structure in line with the specific parameters of the four kinds of B submatrices, so that the proposed encoder is fully compatible with the four base matrices. The four submatrices correspond to four different Pa computation processes. To implement the compatible encoder, the configurable Pa computation network was designed after a detailed analysis of the computational processes and path characteristics. The computation network consists of XOR units, configurable circular shift registers, data multiplexers, and a configurable circuit network. The circuit structure is shown in Figure 2 above. It is compatible with the computational requirements of the four B submatrices, which means that the computation network can flexibly adapt to BG1 and BG2. Based on the configurable Pa computation network and the intermediate result Var, this encoder can flexibly implement the four different computation processes to obtain the Pa sequence.
The four computation processes have been listed in the proposed algorithm and are not repeated here. The circuit paths of the Pa computation network cover the same four cases: the left and right submatrices of Figure 4a and the left and right submatrices of Figure 4b.
Pa sequence register (Pa SR): Pa SR accepts the Pa check blocks from the configurable Pa computation network in parallel and stores them in a register set composed of 4 dual-port RAMs. According to the output signal, Pa SR outputs the four check information blocks (Pa1, Pa2, Pa3, and Pa4) to the output port of the encoder; they are then stored in the corresponding positions of the encoding memory, which belongs to the peripheral main system.

Parallel Pb Generation Module
The Pb generation module is mainly composed of (Mb − 4) encoding operation units, where Mb represents the number of rows of HBG. The module implements Pb^T = C·S^T + D·Pa^T in Equation (18) using (Mb − 4) α·β^T units with Z-bit granularity. The structure of the encoding units in the Pb generation module is similar to that of the units in the Var generation module. During the generation of the Pb check sequences, the computing process of the Pb generation module can be divided into two main steps.
(1) The first step completes the computation of C·S^T, which is executed synchronously with the computation of the Var generation module. The Pb generation module receives the Z-bit data from the information input buffer and transmits it to the (Mb − 4) operation units in parallel. At the same time, under the control signal, the cyclic shift coefficients are read from the corresponding positions in the C_ROM and transmitted to the Pb generation module. The (Mb − 4) operation units process the data sequence in parallel with the corresponding coefficients to complete Cij·Sj^T (i denotes the row of the C submatrix, and j denotes the column of the C submatrix); that is, the units complete the parallel computation of C1j·Sj^T, C2j·Sj^T, C3j·Sj^T, ..., C(Mb−4)j·Sj^T. Each Cij·Sj^T value is XORed with the current value of the state register in the corresponding operation unit, which accumulates the Cij·Sj^T values. The (Mb − 4) operation units execute the operation of Equation (23) in parallel, and the computation process of Equation (24) is likewise executed in parallel. Taking BG1 as an example, the size of the submatrix C is 42 × 22. Lastly, the computational results of the (Mb − 4) operation units are obtained.
In this way, the C·S^T operation on a data sequence S = {S1, S2, S3, ···, S22} is completed in parallel. The length of a data unit Sj is Z bits.
(2) The second step executes the D·Pa^T operation and completes the computation of Pb^T = C·S^T + D·Pa^T. The Pb operation units from 1 to (Mb − 4) receive the check bits Pa(j) generated by the Pa generation module in parallel. At the same time, the cyclic shift coefficients at the corresponding positions of the D_ROM are sent to the Pb operation units. The cyclic shift coefficients of one column in the D_ROM are read each time (the number of coefficients in one column equals the number of Pb operation units), and the coefficients of the column are sent to the Pb operation units synchronously. The operation units immediately apply the cyclic shifts to the Z-bit Pa check bits. These shifts replace the multiplication in D·Pa^T and yield D1j·Pa(j), D2j·Pa(j), ···, D(Mb−4)j·Pa(j). Then, the (Mb − 4) sequence values are XORed in parallel with the current values of the state registers in the Pb operation units. Completing Pb^T = C·S^T + D·Pa^T requires only 4 clock cycles after the operations of the first step are finished.
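The two-step schedule of one Pb operation unit can be sketched as a list of operands. The sketch assumes that one Z-bit block is consumed per clock cycle, which is an assumption about timing rather than a statement from the text; the function name is hypothetical.

```python
def pb_unit_schedule(kb: int, num_pa_blocks: int = 4) -> list[tuple[str, int]]:
    """Operands consumed by one Pb unit, in order, one per assumed clock cycle."""
    step1 = [("S", j) for j in range(1, kb + 1)]               # C*S^T, shifts from C_ROM
    step2 = [("Pa", j) for j in range(1, num_pa_blocks + 1)]   # D*Pa^T, shifts from D_ROM
    return step1 + step2
```

For BG2 (kb = 10) the schedule has 14 entries, matching the 14 coefficients stored per [C, D] ROM block, and the last four entries are the Pa blocks processed in the second step.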
Finally, all check bits of P_b, namely the check sequences P_b = {P_b1, P_b2, P_b3, ..., P_b(M_b−4)}, are obtained in parallel. The P_b generation module sends P_b to the output port, from which it is written to the encoding memory of the peripheral main system.
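The two steps above can be illustrated with a minimal software sketch. It is not the authors' hardware design: the function names, the use of −1 to mark an all-zero submatrix, and the rotation direction (a coefficient s maps bit r of the output to bit (r + s) mod Z of the input) are assumptions made for illustration. The sketch models each operation unit as a Z-bit state register that accumulates cyclically shifted blocks by XOR, first over the columns of C and then over the columns of D.

```python
def cyclic_shift(block, shift):
    """Apply one circulant coefficient to a Z-bit block (list of 0/1).

    Assumption: shift < 0 marks an all-zero submatrix; otherwise the block
    is rotated so that output bit r equals input bit (r + shift) mod Z.
    """
    z = len(block)
    if shift < 0:
        return [0] * z
    s = shift % z
    return block[s:] + block[:s]


def xor_blocks(a, b):
    """Bitwise XOR of two equal-length blocks (addition over GF(2))."""
    return [x ^ y for x, y in zip(a, b)]


def generate_pb(C, D, S, Pa, z):
    """Sketch of P_b^T = C*S^T + D*P_a^T with shift-and-XOR units.

    C and D are lists of rows of shift coefficients (one row per unit),
    S and Pa are lists of Z-bit blocks.
    """
    rows = len(C)
    state = [[0] * z for _ in range(rows)]  # one state register per unit

    # Step 1: one information block S_j per cycle, shared by all units.
    for j, s_j in enumerate(S):
        for i in range(rows):
            state[i] = xor_blocks(state[i], cyclic_shift(s_j, C[i][j]))

    # Step 2: one column of D per cycle, applied to the block P_a(j).
    for j, pa_j in enumerate(Pa):
        for i in range(rows):
            state[i] = xor_blocks(state[i], cyclic_shift(pa_j, D[i][j]))

    return state


# Toy example with hypothetical coefficients (Z = 4, one operation unit).
C_toy = [[1, -1]]
D_toy = [[0]]
S_toy = [[1, 0, 0, 0], [1, 1, 1, 1]]
Pa_toy = [[0, 1, 0, 0]]
print(generate_pb(C_toy, D_toy, S_toy, Pa_toy, 4))  # [[0, 1, 0, 1]]
```

In hardware the inner loop over i runs concurrently across the (M_b − 4) units, so the outer loops correspond to clock cycles; the second loop has four iterations when P_a consists of four blocks.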
In the encoding memory, an encoded codeword consists of the information bits (S) and the check bits (P_a and P_b). Codewords are transmitted in batches in the form {S, P_a, P_b}.

Controller Module
The controller module is responsible for the control function of the encoder. It generates the signals that direct the function modules of the encoder to carry out the encoding operations correctly. Its main signals are the encoding control signal, the memory control signal, the input/output control signal, and the circuit configuration signal.

Comparison of Encoding Methods
The proposed encoder architecture is fully compatible with BG1 and BG2 and can adapt to LDPC codes with different lifting sizes Z in these two base graphs. This work implements BG1 and BG2 with two sets of lifting sizes Z, which verifies the performance indicators of the new encoder architecture as well as the area efficiency of each encoder. The ASIC post-synthesis implementation results in 65 nm CMOS technology are shown in Section 5.2.
First, the proposed encoders are compared with other LDPC encoding schemes. Table 4 quantitatively compares the proposed encoder with related prior art. We normalized the processes of the ASIC implementations to 65 nm, and the comparative results for the ASIC implementations are based on the normalized data.
Reference [4] introduces a new structure for multiplying a bit vector by a dense QC matrix, which is the basic operation in LDPC encoding. Using an improved schedule, the encoding architecture exploits the parallelism of the LDPC codes by processing multiple bits concurrently. Based on this design, it proposes an encoder architecture for CCSDS codes with the applicable encoding methods. Compatibility across different CCSDS codes is left for further research, which is expected to improve flexibility and reduce cost. Compared with that architecture, this work achieves about 2-30 times the throughput. Its implementation occupies 8945 LUTs and 12,420 flip-flops of the FPGA (Virtex-7). In terms of occupied resources, its resource efficiency is also lower than that of this work.
Reference [6] introduces an encoder scheme based on the LDPC generator matrix for frequency modulation China digital radio (CDR). It uses the features of the parity matrix to parallelize the encoding operations over rows and columns. It also introduces an approach to memory control that can be applied to codes of several rates to improve the utilization of circuit resources. Its implementation occupies 32,479 LUTs, 32,313 registers, and 36 block RAMs of the FPGA (Spartan-6). Compared with that encoder, this work shows a clear improvement in throughput and resource efficiency.
Although Reference [26] proposes a QC-LDPC encoder structure for 5G NR, its encoding scheme only considers the case of one submatrix B in a single base graph. The scheme does not address the encoding compatibility required by practical applications of 5G LDPC. For 5G and B5G communication, however, 3GPP and major corporations have focused on compatibility across diverse applications and have built it into the 5G LDPC standard to serve various scenarios. Two kinds of base graphs are involved in 5G LDPC codes; H_BG1 and H_BG2 differ significantly, and they can be further divided into four base matrices according to four different submatrices B. This work includes a complete, highly compatible algorithm, and the proposed encoder is fully compatible with the different base graphs of 5G LDPC codes. Thus, this work adapts more flexibly to various 5G application scenarios. In addition, this work offers higher performance and higher area efficiency. Compared with that scheme, this work achieves about 1.6-2.8 times the throughput and 1.4-2.5 times the area efficiency, depending on the size of the implemented encoder. These advantages translate into lower latency and lower application cost.
Reference [27] proposes two kinds of encoding hardware designs for Irregular Repeat Accumulate (IRA) LDPC codes, which can be used in communication and memory systems. The first is a reconfigurable architecture, which suits applications that frequently switch among a finite set of codes. The second exploits the sparsity of the parity-check matrices and reduces the circuit cost by storing the matrix in memory in sparse form. Compared with these architectures, this work achieves more than 9 times the throughput and clearly better resource efficiency.
Reference [31] uses Gaussian elimination to design the check matrix; the encoded codeword is obtained by matrix multiplication based on the generator matrix and the codeword. It proposes a regular LDPC encoder with a pipelined structure to obtain a compact encoding process. However, matrix multiplication is relatively complex even for small block sizes, and the memory overhead rises rapidly for larger blocks and for multiple data frames. Compared with that method, this work achieves about 10-100 times the throughput and 4-8 times the area efficiency (Throughput/Area) when implementing the LDPC encoders.
An LDPC encoder for CMMB based on the RU algorithm is presented in [32]. The RU method applies a modified greedy algorithm to bring the sparse matrix into approximate triangular form, which reduces the encoding complexity. However, implementing the RU method requires a series of calculations whose steps have data dependences, which limits the parallelism. In addition, the method has a long critical path, which makes the encoder implementation unsuitable for high-performance scenarios. The LDPC encoder is implemented on a Stratix II FPGA and consumes 60% of the memory resources and 4% of the logic resources of the chip. Compared with that design, this work has significant advantages in throughput and resource efficiency.
Reference [33] proposes a nonbinary QC-LDPC encoding architecture and introduces two methods that use the finite Fourier transform to reduce the hardware complexity. In that paper, a GF(2^2) QC-LDPC encoder is implemented, and the relevant parameters of the result are normalized to a 65 nm process. Compared with that scheme, this work achieves more than 12.6 times the throughput and 1.8-3.7 times the area efficiency.

Implementation Comparison of BG1 and BG2
As the core of an LDPC code, the base graph (BG) determines the macro characteristics and encoding performance of the code. The 5G NR standard defines two base graphs, BG1 and BG2. To satisfy the needs of different communication scenarios, 5G LDPC codes should flexibly support different encoding parameters. Considering the diversity of future communication scenarios, new encoders should be compatible with both BG1 and BG2.
In Tables 5 and 6, Length denotes the word length in bits required to match the lifting size Z. ECC denotes the number of clock cycles required by an encoder to complete the encoding of a codeword. By default, all data bits of the input information need to be encoded. For the proposed encoder, ECC is equal to the total number of clock cycles required to generate the P_a and P_b check bits corresponding to a codeword. Based on the parallel pipelined encoding architecture, the proposed encoder needs a total of k_b + 4 clock cycles to generate the check sequences (P_a and P_b). In addition, it needs another 2 clock cycles to input the information sequence and output the encoded codeword. Therefore, the proposed encoder needs a total of k_b + 6 clock cycles to complete the encoding of an information sequence.
The throughput rate of check bits is an important index of encoder performance. The throughput rates for different Z sizes are shown in Tables 5 and 6. Since the actual output sequences of the encoder are check sequences, the throughputs of check sequences are recorded as T-P in the tables, and the throughputs of the corresponding information sequences are recorded as T-S. For the proposed encoder, T-P is computed as follows:
T-P = (M_b × Z × f) / ECC
where M_b denotes the number of rows of the base matrix, Z denotes the expansion size, and f denotes the working frequency of the encoder. ECC denotes the clock cycles required to complete the encoding of a codeword, and ECC = k_b + 6. For BG1, T-P ranges from 34.3 Gbps to 362.7 Gbps for different Z sizes. For BG2, T-P ranges from 121.8 Gbps to 541.5 Gbps for different Z sizes. The comparison shows that when the Z sizes are similar, T-P(BG2) is significantly higher than T-P(BG1). This is mainly because in the case of BG1, the encoders require k_b + 6 = 28 clock cycles to complete the encoding of a codeword.
In the case of BG2, the encoder needs only k_b + 6 = 16 clock cycles to complete the encoding of a codeword; that is, ECC(BG1) is greater than ECC(BG2). Therefore, for the same Z size, T-P(BG2) is higher than T-P(BG1). The throughput of the corresponding information data is computed as follows:
T-S = (k_b × Z × f) / ECC
where k_b is equal to the number of columns in the submatrix A, and the submatrix A corresponds to the S sequences. Z and f have the same meanings as in the definition of T-P. For BG1, T-S ranges from 16.4 Gbps to 173.5 Gbps for different Z sizes. For BG2, T-S ranges from 29.0 Gbps to 128.9 Gbps for different Z sizes. The comparison shows that when the Z sizes are similar, T-S(BG1) is larger than T-S(BG2). This is because k_b(BG1) is significantly larger than k_b(BG2), which results in a gap in information encoding performance between them. This is also in line with the reason why the 5G LDPC standard defines two base graphs: BG1 is mainly used for scenarios that require high data throughput, and BG2 is mainly used for scenarios with lower requirements for data throughput.
The two tables also show the synthesized areas of the encoder architecture for different Z sizes. The encoder area grows as Z increases. This is because the codeword length grows with Z, so an encoder needs larger memory modules and more logic units. The comparison shows that the complexity of the proposed encoder is positively related to the lifting size Z. In order to compare the area efficiency of the proposed encoder fairly across different Z sizes, Tables 5 and 6 use AE to represent the area efficiency, which is expressed as follows:
AE = Throughput / Area
Here, AE-P denotes the area efficiency of the check bits generated by the encoder, and AE-S denotes the area efficiency of the information data bits encoded by the encoder. For BG1, AE-P ranges from 879 Gbps/mm² to 710 Gbps/mm².
For BG2, AE-P ranges from 1450 Gbps/mm² to 1245 Gbps/mm². When the Z sizes are similar, AE-P(BG2) is higher than AE-P(BG1), mainly because T-P(BG2) is higher than T-P(BG1). For BG1, AE-S ranges from 421 Gbps/mm² to 339 Gbps/mm². For BG2, AE-S ranges from 345 Gbps/mm² to 296 Gbps/mm². AE-S(BG1) is higher than AE-S(BG2) because T-S(BG1) is higher than T-S(BG2).
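The relations among ECC, T-P, T-S, and AE discussed above can be checked with a short numerical sketch. It assumes T-P = M_b·Z·f/ECC and T-S = k_b·Z·f/ECC, which follows from the symbol definitions in the text and is consistent with the reported ranges (for BG1, 362.7/173.5 ≈ 46/22), and AE = Throughput/Area. The function names, the 500 MHz clock, and the 0.4 mm² area are hypothetical values chosen only for illustration; M_b = 46 and k_b = 22 for BG1 follow from the text, while M_b = 42 for BG2 is taken from the 5G NR base graph dimensions.

```python
def ecc_cycles(k_b):
    """Clock cycles per codeword: k_b + 4 for parity generation plus 2 for I/O."""
    return k_b + 6


def throughput_parity(m_b, k_b, z, f_hz):
    """Assumed form of T-P: parity bits per codeword times codewords per second."""
    return m_b * z * f_hz / ecc_cycles(k_b)


def throughput_info(k_b, z, f_hz):
    """Assumed form of T-S: information bits per codeword times codewords per second."""
    return k_b * z * f_hz / ecc_cycles(k_b)


def area_efficiency(throughput_bps, area_mm2):
    """AE = Throughput / Area."""
    return throughput_bps / area_mm2


# Base-graph parameters (rows M_b and information columns k_b).
BASE_GRAPHS = {"BG1": (46, 22), "BG2": (42, 10)}

Z = 384          # largest 5G lifting size, used here as an example
F_HZ = 500e6     # hypothetical clock frequency, not a value from the paper
AREA_MM2 = 0.4   # hypothetical synthesized area, not a value from the paper

for name, (m_b, k_b) in BASE_GRAPHS.items():
    t_p = throughput_parity(m_b, k_b, Z, F_HZ)
    t_s = throughput_info(k_b, Z, F_HZ)
    print(name,
          "ECC =", ecc_cycles(k_b),
          "T-P = %.1f Gbps" % (t_p / 1e9),
          "T-S = %.1f Gbps" % (t_s / 1e9),
          "AE-P = %.1f Gbps/mm^2" % (area_efficiency(t_p, AREA_MM2) / 1e9))
```

Under these assumptions the ratio T-P/T-S equals M_b/k_b regardless of Z, f, or area, which explains why BG2 leads in T-P and AE-P while BG1 leads in T-S and AE-S at similar Z.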
The experimental data show that the new encoder architecture is compatible with both base graphs, BG1 and BG2. The implemented encoders adapt flexibly to submatrix sizes of various granularities, and they achieve high performance and area efficiency. The encoder architecture therefore not only meets the encoding requirements of 5G NR but also delivers higher encoding performance.

Conclusions
This paper presents a parallel LDPC encoding algorithm that is compatible with the 5G LDPC standard. Based on the proposed algorithm, this work implements a high area-efficient parallel encoder with a compatible architecture for 5G LDPC codes. The proposed encoder has the advantages of parallel encoding and pipelined operation, and it uses a configurable encoding structure. Therefore, the encoder architecture adapts flexibly to 5G LDPC codes. Based on the encoder architecture, we implemented nine encoders for different Z sizes distributed over the two base graphs. The experimental results show that the proposed encoder has high performance and significant area efficiency, and that it outperforms the related prior art. These results indicate that the encoding scheme can satisfy the requirements of current 5G LDPC codes and can be further applied to future communication scenarios with higher encoding requirements.