Low-Complexity Multi-Size Circular-Shift Network for 5G New Radio LDPC Decoders

This paper presents a low-complexity multi-size circular-shift network (MCSN) structure for 5th-generation (5G) New Radio (NR) quasi-cyclic low-density parity-check (QC-LDPC) decoders. In particular, a fine-coarse approach-based multi-size cyclic shift network, which decomposes the cyclic shift size into fine part and coarse part, is introduced. The proposed MCSN structure is composed of a pre-rotator performing the fine part of the cyclic shift, and a main rotator executing the coarse part of the cyclic shift. In addition, a forward routing circular-shift (FRCS) network, which is based on the barrel shifter and the forward routing process is presented. The proposed switch network is able to support all 51 different submatrix sizes as defined in the 5G NR standard through an efficient forward routing switch network and help reduce the hardware complexity using a cyclic shift size decomposition method. The proposed MCSN is analyzed, and indicates a substantial reduction in the hardware complexity. The experimental results on TSMC 65-nm CMOS technology show that the proposed MCSN structure for 5G NR LDPC decoder offers an area saving up to 56.75% compared to related works in the literature.


Introduction
Low-density parity-check (LDPC) codes [1], which were first proposed by R. Gallager in the early 1960s and rediscovered by MacKay and Neal [2] in 1996, have received significant attention thanks to their powerful error correction capability. Recently, LDPC codes have been adopted in the enhanced Mobile Broad Band (eMBB) as the error correction coding scheme for the fifth generation (5G) data channel [3]. The 3rd Generation Partnership Project (3GPP) introduced two rate-compatible base graphs, BG1 and BG2 [4], to support the rate compatible and scalable data transmission. Accordingly, several studies have been conducted on the 5G New Radio (NR) LDPC codes [5][6][7][8]. In [5], an adaptation of the MinSum decoders for the LDPC used in the eMBB scenario in the 5G mobile networks is proposed. In [6], a hardware-friendly LDPC decoding schedules for 5G hybrid automatic repeat request applications is presented. An algebra-assisted method for constructing 5G LDPC codes can be found in [7]. An efficient encoding method and a high-throughput low-complexity encoder architecture for the 5G NR standard is presented in [8].
In recent years, research on LDPC codes has centered around structured LDPC codes known as the quasi-cyclic low-density parity-check (QC-LDPC) codes [9][10][11][12][13], which have demonstrated great advantages over other types of LDPC codes in terms of hardware implementation and excellent error performance over noisy channels. In QC-LDPC codes, the parity-check matrix consists of either cyclic permutation submatrixes or zero matrixes of the same size, determining the interconnection between the check node processing units (CNUs) and variable node processing units (VNUs). The interconnection network for the QC-LDPC codes is highly structured and can be characterized by the submatrix size and the circular-shift value of each cyclic permutation submatrix. The decoding process of a QC-LDPC code requires circular-shift network (CSN) because of its cyclic characteristic. However, a QC-LDPC decoder for emerging wireless communication systems must support multiple code rates, various block lengths, and different submatrix sizes. The QC-LDPC codes adopted by the WiMAX standard (IEEE 802.16e) [14] supports 19 different submatrix sizes with four distinct code rates and six different class codes (distinct distributions of the number of variable nodes per column of check nodes per row). The submatrix sizes vary from 24 to 96. In the 5G NR LDPC code specification, 51 different lifting sizes are defined, and there are eight different permutation matrix designs per base graph [4]. For this reason, the switch network between CNUs and VNUs should be reconfigurable to support any submatrix sizes defined in the standards of applications. A switch network with varying sizes is called a multi-size circular-shift network (MCSN).
Over the past few years, various works have been introduced in order to improve the signal congestion and hardware complexity of the MCSN for QC-LDPC decoders [15][16][17][18][19]. Benes network [15], which is capable of routing any Z × Z permutation where Z is a power of 2, is a widely used structure. Benes network has been proven to be an optimal non-blocking permutation network. The main drawback of this method is its long critical path and high complexity, which makes it less efficient for implementation with the large size of submatrixes. One of the simplest approaches is assigning a multiple-input multiplexer (MUX) to each output. A MUX is assigned to an output, and each MUX has as many inputs as the number of input data items of the MCSN. This method suffers from high complexity since very wide-input MUXs is required. To overcome the overhead incurred by wide-input MUXs, Rovini [16] proposed a pre-rotator structure, in which small pre-rotators reduce the input width of the MUXs considerably. Furthermore, many other structures have been presented such as: rotator-in-series (RIS) structure [17], rotator-in-parallel (RIP) structure [16,18], and fine-coarse (FC) structure [19]. However, the switch network structures in [15][16][17][18][19] are not suitable for 5G NR LDPC decoders where the submatrix sizes cover a very wide range from 2 to 384. To overcome the limitations of those approaches, this paper considers the design of low-complexity multi-size circular-shift network for 5G NR LDPC decoders, which requires significantly low hardware complexity compared to the conventional architectures. To the best of our knowledge, this is the first work proposing a low-complexity MCSN for the 5G NR LDPC decoders in the literature. This paper targets the design of a novel low-complexity MCSN structure for 5G NR LDPC decoders using the fine-coarse approach. The proposed MCSN architecture can support all the cyclic shifts for all submatrix sizes defined in the 5G NR standard. The simulation results show that the area of the proposed structure is reduced more than 18% compared to the existing works. Therefore, our study provides a promising multi-size cyclic-shifter design for 5G NR LDPC decoders.
The rest of this paper is organized as follows. Section 2 gives a brief overview of the characteristics of 5G NR LDPC codes and outlines the conventional LDPC multi-size cyclic-shifter structures in the literature. A novel low-complexity MCSN architecture is described in Section 3. Section 4 presents the implementation results and comparison. Finally, conclusions are drawn in Section 5.

5G NR LDPC Characteristics
QC-LDPC codes play a crucial role in 5G communications. In the 3GPP standard meeting, QC-LDPC codes were accepted as the channel coding scheme for 5G eMBB data channel. The brief structure of NR LDPC base graph is illustrated in [8]. In particular, the columns of the base graph are partitioned into three parts: information columns, core parity columns, and extension parity columns. The rows of the base graph are divided into two parts: core check rows, and extension check rows. Specifically, the base matrix consists of five submatrixes : (1) submatrix A is reserved for systematic bits; (2) submatrix B, a square matrix with dual-diagonal structure, corresponds to the first set of parity bits; (3) submatrix O is an all-zero matrix; (4) submatrix C corresponds to single parity-check (SPC) rows; and (5) an identity matrix I corresponds to the second set of parity bits. The combination of submatrix A and submatrix B is defined as the kernel, and three submatrixes O, I and C are called extensions. This code structure is similar to the Raptor-like extension as described in [20].
The 3GPP has agreed to consider two rate-compatible base graphs, BG1 and BG2, which have a similar structure for the channel coding. While BG1 is designed for larger block lengths (500 ≤ K ≤ 8448) and higher rates (1/3 ≤ R ≤ 8/9), BG2 is targeted for smaller block lengths (40 ≤ K ≤ 2560) and lower rates (1/5 ≤ R ≤ 2/3). The exact usage of base graphs and the definition of the two matrixes can be found in the 5G NR standard specification [4].

Multi-Size Circular-Shift Network
In addition to computation units, the switch network is an essential element which occupies significant portion of the LDPC decoder area. In QC-LDPC codes, the parity-check matrix is composed of cyclic permutation submatrixes that determine the interconnection network between the CNUs and VNUs. Accordingly, the decoder needs to circularly shift an array of Z messages where Z being the size of the submatrix, also referred to as block size. QC-LDPC codes simplify the switch network by a great deal as the requirement of all random shifting (or permutation) is eliminated and only cyclic shifting is required.
In many modern communication standards, the LDPC codes are defined for different codeword lengths and block sizes Z k , with 0 ≤ k ≤ K, where K is the number of admissible sizes. Therefore, an MSCS network, which is capable of circular shift over an arbitrary number of data, is mandatory. An MSCS network block rotates a specified amount of data. If P w-bit data (d 0 , d 1 , d 2 , . . . , d P−1 ) are inputted with a shift size Z and circular-shift amount c, the first Z input data are rotated by c to produce the output . The output data beyond the specified size of Z are redundant. The sizes and shift amounts are determined by the adopted QC-LDPC codes.
Generally, the Benes network is a very well-known non-blocking multistage interconnection network. A P × P Benes network is capable of performing any desired permutation over P data units, where the input size P of the Benes network is a power of 2. The network comprises of (2log 2 P − 1) stages of 2 × 2 switches. Each switch element consists of two 2-input MUXs and can operate either in the BAR or in the CROSS state, as depicted in Figure 1a. An example of a Benes network with P = 8 is shown in Figure 1b. The switch network designed for QC-LDPC decoders is generally optimized for a given size of permutation matrix with the Benes network. To properly control all the switch units and realize the desired permutation result, the control signals for all 2 × 2 switches must be provided. Those control signals are pre-computed and stored in a dedicated LUT. However, as the submatrix size increases, the required size of LUT becomes a dominant area consuming factor. Therefore, the Benes network with a large size of submatrix is inefficient as it suffers from significant hardware overhead. Furthermore, Benes network offers a limited application, as the number of inputs should be a power of 2.
For the QC-LDPC decoders, switch networks are only required to implement the circular shifts. It was shown that only the first or last log 2 P stages of the P input Benes network are needed to perform all cyclic shifts for P inputs [21]. Therefore, the implementation of the half or reduced Benes network referred to as Banyan network [22] in Figure 1c, is a preferable switching network structure. When the submatrix size is not a power of 2, it is possible to use an P × P Banyan network where P = 2 log 2 P . However, an additional multiplexing stage must be carried out either at the switch network input or output. In recent years, many techniques were proposed to modify the Benes network to make it more efficient for the reconfigurable QC-LDPC decoders [15]. The MSCS network structures can be implemented with typical barrel rotators. The structures exploit two barrel rotators which can be placed in parallel [23] or in series [17], as shown in Figure 2a,b, respectively. The output data of the MSCS network are combined from the outputs of two barrel rotators. In the Rotator-In-Parallel (RIP) structure [23], the left rotator performs the left shift by c to the input data, and the right rotator performs the right shift by P − c to the input data. After that, the obtained results from the barrel rotators are combined by MUXs to become the output data. In the Rotator-In-Series (RIS) structure, the first barrel rotator, rotates the input data by c. The result obtained from the first barrel rotator is then rotated by the second barrel rotator in the same direction by P − Z [17]. In addition, the output data are selected from the outputs of the barrel rotators and then combined by MUXs to become the output data. One of the most straightforward ways to employ an MSCS network block would be using a MUX network. An P-to-1 (w-bit) MUX can be used to connect P inputs to an output since the output data can be selected from P input data. The main drawback of this structure is that many wide-input MUXs are required. These wide-input MUXs are normally implemented by cascading a lot of 2-to-1 MUXs. To improve the performance of the MUX network structure, Rovini [16] introduced a pre-rotator structure exploiting the size-common-divisor property, as shown in Figure 3. The input data are rotated by pre-rotators so that the amount of input data is the greatest common divisor (GCD) of all block sizes before being processed by a MUX network. More precisely, with the maximum block size P, the MSCS network is composed of P g g-input pre-rotators and additional stage of P P g -to-1 MUXs, also referred to as adaptation network (AN), where g is the GCD and P = g × P g . This structure substantially reduces the MUX input width from P to g. However, the MUX network still occupies a significant area. Specifically, the equivalent number of 2-to-1 MUXs is log 2 g × P + (P g − 1) × P , where the first term and the second term are for the pre-rotators and the MUX network, respectively. In [19,24], the MUX network in the AN is substituted with the Benes network and the RIP structure, respectively, to enhance the complexity of the MSCS network block.
The previous MSCS network structures are proposed to support multiple submatrix sizes in the QC-LDPC decoders. However, previously proposed structures are not suitable for 5G NR LDPC decoders where the submatrix size varies largely from 2 to 384. The considerable variation in submatrix sizes would result in the high inefficiency in hardware use and the high complexity in the interconnecting network in conventional switching network structures. To overcome the limitations of existing MSCS network structures, the design considered in this paper, referred to as low-complexity multi-size circularshift network for 5G NR LDPC decoders, requires significantly low hardware complexity compared to conventional structures.   Figure 4a. The set of lifting sizes Z supported for the base graph are all values of the form Z = a × 2 j for a = 2, 3,5,7,9,11,13,15 where 0 ≤ j ≤ 7 and includes all such Z distributed over the range from 2 to 384. The lifting sizes Z can be organized into eight sets, one for each value of a. To make it more convenient for our proposed MSCS network structure, the lifting size table is efficiently modified as shown in Figure 4b. Precisely, the two rows for j = 6 and 7 are removed from the modified table. Instead, three new columns for a = 8, 10, and 12 are created in the table. Accordingly, the value of lifting sizes Z = 128, 192, 256, 320, and 384 are sufficiently reallocated into new positions in the modified table.

Structure of Forward Routing Network
In this section, an efficient reconfigurable message passing network based on the barrel shifter and an additional forward routing stage is presented. Please note that the maximum input size that can be supported by the circular-shift network is P, the block size of the input data for the circular shift is Z, and the value of the required circular shift is c, where 0 ≤ c < Z ≤ P. The structure of the Forward Routing Circular-Shift (FRCS) network is illustrated in Figure 5a, which includes 3 stages. The first stage is the barrel shifter, the second stage is the forward routing module, and the final stage is the selection scheme. The barrel shifter with input size P in the first stage can be easily constructed from log 2 P stages of 2-to-1 MUXs. The barrel shifter circularly shifts the input message shi f t_in according to the specific shift amount c. The control signals for the barrel shift network can be easily determined from the value of the shift amount c. In the next step, the forward routing module simply takes the value of the (P − Z)th to (P − 1)th shifted message BS_out as input data and fetches into next stage. The shifted message from the barrel shifter BS_out and the forward routing message f w_out is fed into the final stage to generate the expected output. The selection stage is in charge of selecting a signal from the barrel shifter or the forward routing network and routing a proper signal to the output. In the selection stage, the expected output message shi f t_out is determined by a MUX network where the control signal FRCS ctrl is defined as follows:   According to the controller design, the final output messages shi f t_out[i] comes from the forward routing module if 0 ≤ i < Z − c and from the barrel shifter if Z − c ≤ i < Z. An output block shi f t_out[i] with i > Z can have any value because its value is ignored in the following blocks. Figure 5b illustrates the operation of the FRCS network with Z = 8, P = 12 and c = 3. The control signal FRCS ctrl is defined as follows:

Fine-Coarse Structure for Circular-Shift Network
The fine-coarse structure for circular-shift network is designed to support different block sizes, in which the maximum block size is defined by P. Let Z p denotes the GCD between all block sizes, and Z m = P/Z p . To support multi circular shifts with low complexity, the proposed low-complexity MSCS network for 5G NR LDPC decoders is decomposed into two parts: the fine part and the coarse part. More precisely, the MSCS network architecture comprises of Z m Z p -input pre-rotators, plus an additional stage of Z p Z m -input main rotators. Let Z denotes the block size of the circular-shift message and c denotes the circularshift value which is distributed over the range from 0 to Z − 1. The fine-coarse structure decomposes the circular shift into a pre-circular shift c p and a main circular shift c m . The circular-shift amount c can be described by the sum of two terms as c = c p + c m × Z p , where 0 ≤ c p < Z p . In the fine-coarse structure, the output data are generated in two steps, which are listed below: • Step 1. Pre-rotator: Each sub-block of Z p input data are processed by a pre-rotator subnetwork to generate the pre-circular shift by performing c p shift operation. • Step 2. Main rotator: The k-th ouput datum of each pre-rotator network is distributed into the k-th main rotator subnetwork, where 0 ≤ k < Z p . The main circular shift is generated by performing c m shift operation on each main rotator subnetwork.
To overcome the inside-group rotation at the pre-rotator network, the cross-group movement is performed by the c m + 1 circular shift at the k-th main rotator subnetwork where k + c p ≥ Z p . As shown in Figure 6, the fine-coarse structure has two components, i.e., the pre-rotator network, and the main rotator network, which corresponds to step 1, and 2 described above. In this structure, the support of multiple sizes is achieved by rearranging the system into smaller subnetwork with fixed block sizes.

Proposed Low-Complexity Multi-Size Circular-Shift Network
The principal of our proposed low-complexity MSCN structure for 5G NR LDPC decoders is based on the fine-coarse structure and the forward routing network. The design target of the proposed architecture is focused on multi-size message-passing network with lower complexity. In general, the fine-coarse structure decomposes the system into smaller subnetworks with fixed block sizes. The block size of the subnetworks is determined by the GCD of all supported block sizes. For 5G NR LDPC codes, the block sizes vary over a very large range from 2 to 384. In this case, the MSCN structure would be made of 192 2-input pre-rotators, plus an additional stage of 2 192-input main rotators as the GCD is 2. When the size of the submatrix is 384, this approach reduces the hardware complexity compared with the direct implementation using 384 384-to-1 MUXs. However, it still suffers from significant hardware overhead. Therefore, the same technique is not directly applicable in the case of 5G NR LDPC codes. There is a need to accommodate all 51 block sizes while achieving lower complexity. To overcome the drawback, a novel MSCS network architecture based on fine-coarse structure, in which the subnetworks are reconfigurable, is presented in this section. Figure 7 illustrates the proposed low-complexity MSCS network structure. As shown in Figure 7b, the proposed structure can be constructed from two stages of the sub-MSCS network in which FRCS structure is conducted. First, denote the maximum block sizes of the MSCS network is P, the size of the submatrix for the circular shift is Z, and the value of the required circular shift is c. For this fine-coarse structure, (P p , Z p , c p ) and (P m , Z m , c m ) which are the set of maximum block size, the size of the submatrix for the circular shift, and the circular-shift value for the sub-MSCS network in the pre-rotator and main rotator stage, respectively, are further defined.  Based on the characteristics of the modified lifting sizes Z table in Figure 4b, the pre rotator block can be constructed from 32 FRCSs, named as Pre[i], in which the first 16 FRCSs (for 0 ≤ i < 16) are with input size Z p = 15 (FRCS15) and the remaining 16 FRCSs for 16 ≤ i < 32 are with input size Z p = 12 (FRCS12). Similarly, the main rotator is decomposed into 15 FRCSs, denoted as Main [j]. The first to the 12th FRCS (FRCS32) (for 0 ≤ j < 12) have the same number of Z m = 32 input blocks while the remaining 3 FRCSs (FRCS16) only comprise of Z m = 16 input blocks for 12 ≤ j < 15. In this structure, all the control signals can be implemented with small area and determined on-the-fly. The control signal for the proposed MSCS network would be calculated from the values of P, c and a, as illustrated in Figure 7b. Please note that the values of a in the modified table in Figure 4b are handled instead of the original ones. The MSCS network controller consists of two parts: a circular-shift sizes generator and a circular-shift amount generator. The values of the circular shift sizes for pre rotator and main rotator subnetworks are defined as follows: In the fine-coarse structure, the circular-shift amount c can be described by the sum of two terms as c = c p + c m × Z p , where 0 ≤ c p < Z p as mentioned in previous section. Therefore, the circular-shift values for the two rotators can be determined as follows: The MSCS network concurrently routes all messages through the two-stage network. As first, the pre-circular shift with a size of c p is performed on the Z messages, and then the main circular shift with a size of c m is finally performed on the result of the pre-circular shift. First, the input data of size Z are evenly distributed among the first Z m pre rotators of the fine part. The rest of the pre rotator stage is ignored. For instance, the first pre rotator Pre[0] takes the first Z p signals from the input messages, the second pre rotator Pre [1] selects the next set of Z p signals, and the last set of Z p signals will be fed into the final pre rotator Pre[Z m − 1]. Each pre rotator block carries out the circular shift of Z p messages by c p [i]. Before the main circular shift is exploited, the proper outputs from the pre rotator stage are assigned to the inputs of each main rotator block. In this stage, the jth (0 ≤ j < Z p ) main rotator block takes all the jth output signal of each pre rotator in previous stage. Similarly, each main rotator block performs the circular shift on the selected Z m -message with the circular-shift value c m [i], which is defined in Equation (5). To achieve the final circular shift message, each Z m -message from the first to the Z p th main rotator block is sequentially routed to the output of the MSCS network. To confirm the efficiency of our solution, we have compared the conventional RIP structure [23], the fine-coarse architecture with MUXs network [16], the fine-coarse structure with RIP network [24] with our proposed MCSN network for multi-size message passing network of the 5G LDPC decoder where the full lifting set Z is given by {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 20, 22, 24, 26, 28, 30, 32, 36, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96,  104, 112, 120, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384}. The proposed MCSN network structures are synthesized using the TSMC 65-nm CMOS technology. The ASIC post synthesis implementation results on are reported in Table 2.

Implementation Results and Comparison
As can be seen from Table 2, the proposed MCSN structure occupies an area of 0.301 mm 2 and works at a high frequency of 580 MHz. Since the proposed work and other architectures are implemented with different parameters such as maximum shift size P, CMOS technology, and the quantization bits, it is necessary to apply performance normalization to 65-nm CMOS technology for a fair comparison. Specifically, the normalization method presented in [25] is used. We also consider the maximum shift size P and the quantization bits in the normalization. From the normalized results, it is confirmed that the proposed solution helps reduce the complexity in comparison to the works introduced in [16,23,24]. Specifically, the area reduction from the RIP structure [23], the fine-coarse architecture with MUXs network [16], and the fine-coarse structure with RIP network [24] are 56.75%, 18.87%, and 49.75%, respectively. The proposed MCSN can be well operated at the high frequency of 580 MHz. Therefore, the proposed MCSN structure is a promising solution to reduce the complexity of the QC-LDPC decoders for 5G NR.

Conclusions
In this paper, a novel reconfigurable message-passing network architecture has been proposed for 5G NR QC-LDPC decoders. The proposed MCSN structure using finecoarse approach with low complexity is capable of supporting 51 different submatrix sizes of the 5G NR standard and exhibits low complexity. Simulation results prove that the proposed MCSN structure requires small area over its predecessors. Therefore, the proposed design is appropriate for applications with multimode QC-LDPC decoders in 5G standard requirements.

Conflicts of Interest:
The authors declare no conflict of interest.