LDPC Decoder Design Using Compensation Scheme of Group Comparison for 5G Communication Systems

: This paper presents a dual-mode low-density parity-check (LDPC) decoding architecture that has excellent error-correcting capability and a high parallelism design for ﬁfth-generation (5G) new-radio (NR) applications. We adopted a high parallelism design using a layered decoding schedule to meet the high throughput requirement of 5G NR systems. Although the increase in parallelism can efﬁciently enhance the throughput, the hardware implementation required to support high parallelism is a signiﬁcant hardware burden. To efﬁciently reduce the hardware burden, we used a grouping search rather than a sorter, which was used in the minimum ﬁnder with decoding performance loss. Additionally, we proposed a compensation scheme to improve the decoding performance loss by revising the probabilistic second minimum of a grouping search. The post-layout implementation of the proposed dual-mode LDPC decoder is based on the Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm complementary metal-oxide-semiconductor (CMOS) technology, using a compensation scheme of grouping comparison for 5G communication systems with a working frequency of 294.1 MHz. The decoding throughput achieved was at least 10.86 Gb/s without evaluating early termination, and the decoding power consumption was 313.3 mW.


Introduction
The market for fifth-generation (5G) communication systems is rapidly growing. 5G technology requires low-power and low-cost hardware equipment support because it considers the application requirements of enhanced mobile broadband (eMBB) and massive machine-type communication (mMTC). Immediate applications, such as in telemedicine and autonomous driving, will require ultra-reliable and low-latency communication (URLLC) support. Additionally, the corresponding specifications of the polar code and low-density parity-check (LDPC) code were proposed in the 15th edition of the 5G specification [1] published in 2018.
The error correction capability is of great importance when considering the advances in these techniques. The LDPC code [2], which comprises a forced error-correcting capability and fully parallel decoding architecture, was proposed by Gallager in 1962. Currently, LDPC codes have been introduced in different communication standards, such as DVB-S2 [3], IEEE 802.11n [4], IEEE 802.16e [5], and 5G new-radio (NR).
LDPC is valued by researchers because it has powerful error-correction capabilities that are extremely close to the Shannon limit. However, LDPC has been difficult to implement in previous generations due to the complex calculations. Subsequently, process technology has evolved, and LDPC has been discussed and recently applied to most of the popular 5G communication fields. The requirement to achieve a throughput of 10 Gb/s for the 5G NR standard will also be a significant challenge for LDPC implementation.
LDPC is a type of simultaneous equation that consists of a sparse matrix H. Each row of the H matrix represents a check node (CN) and each column represents a variable node (VN). The 1s elements in the H matrix are dependent on the CNs and VNs. The 0s elements in the H matrix are ignored. The data transmission relationship between the CNs and VNs can be expressed using the Tanner graph [12].

Quasi-Cyclic (QC) LDPC Codes
Quasi-cyclic (QC) LDPC codes [13] are more efficient in the hardware implementations of LDPC codes. In the (N, M) QC-LDPC H matrix, N represents the number of base columns, and M represents the number of base rows. The maximum number of VNs connected to a row is defined as w r (row weight), and the maximum number of CNs connected to a column is defined as w c (column weight). Each element of the H matrix can be treated as an identity matrix, which can be expanded to provide a z × z sub-matrix with a right shift. Therefore, the H matrix is a simplified version of the (N, M) QC-LDPC. The code rate (R) is the proportion of real information (N − M) in the transmission data (N), and is regarded as R = N−M N . Each element in the H matrix represents the number of right shifts. In the H matrix, all elements in a single layer are independent, enabling the operations in a single layer to run synchronously. The maximum parallelism of the hardware design can be up to a maximum value of z. This efficiently speeds up the overall operations and improves the overall throughput. The QC-LDPC is composed of a cyclic-shifted identity matrix; thus, a memory address can be obtained using the cyclic-shifted characteristic. Only the starting point of the memory address is required to make the design of the reading memory address more flexible.

Parity Check Matrices for 5G NR Standard
The H matrix of the LDPC code in the 5G NR specification conforms to the QC-LDPC code, so we designed the dual-mode LDPC decoder according to the characteristics of the QC-LDPC code. The 5G NR specification is typically divided into BGs 1 and 2. It is subdivided into eight different lifting sizes for each BG. We adopted two different specifications for H, achieved without rate-matching. The first matrix comes from index 1 of the BG1. As shown in Figure 1a, the base matrix is a 4 × 26 QC-LDPC code with a code rate of 11/13, w r = 19, and w c = 4. The second matrix comes from index 1 of the BG 2. As shown in Figure 1b, the base matrix is a 4 × 14 QC-LDPC code with a code rate of 5/7, w r = 10, and w c = 3. The code lengths were 9984 and 5376 in BG 1 and BG 2, respectively. For the details of the 5G NR QC-LDPC encoder design, readers can refer to [14,15]. The base matrices of (a) rate-11/13 BG 1 and (b) rate-5/7 BG 2 in the 5G NR standard. The element * is a zero matrix and 0 is an identity matrix. The element of non-zero integer denotes a cyclic-shifted matrix with a shifted value.

LDPC Decoding Algorithms
The original LDPC decoding algorithm was the sum-product algorithm (SPA) [16]. The SPA can almost achieve the theoretical decoding ability, but it is difficult to implement because of the excessively complex calculations. Therefore, alternative algorithms have been published, such as the NMSA and OMSA. Both algorithms use approximate values adjusted by a normalized or offset coefficient to simplify the overall operation complexity and maintain the approximate decoding capability of the SPA. It is noteworthy that the NMSA was widely used for a decade in the hardware implementations of LDPC decoders. Furthermore, additional discussions and research on simplifying the hardware complexity of the NMSA have been published, such as the NPMSA, approximate extrinsic minima (rExMin) algorithm [17], second minimum approximation min-sum algorithm (SMA-MSA) [18], and single variable weight min-sum algorithm (svwMSA) [19]. These algorithms are dedicated to alleviating the problems of high hardware complexity and area cost, which are caused by the operations of comparators. Although these studies successfully alleviated the above problems, the decoding capability was weakened. Below, we introduce the details of the algorithms and their differences.

NMSA
The operation flow of the NMSA using a layered scheme can be divided into four steps. In the NMSA, we initially define y j as the received channel information, C j as the initial prior message, Q k i,j as prior messages, and R k i,j as the extrinsic message, where i is the index of the row of H, j is the index of the column of H, and k is the index of the decoding iteration. Subsequently, the core decoding processes are stated as follows.
In the first step, all parameters in the decoder must be initialized. The received channel information y j is then imported and stored in the memory for subsequent operations. The logarithm expression of the initial prior message C j for each VN is stated as: The prior messages Q 0 i,j are updated as C j , which is the log likelihood ratio (LLR), and the extrinsic messages R 0 i,j are zeros. In the second step, the prior message Q k i,j is updated by subtracting the R k−1 i,j from the LLR j to recover the decoding capability in the last iteration. The inclusion of the two conditions above in the updates is called a prior message process. The formula for the updated prior message is shown as: In the third step, the check nodes R k i,j are updated to find the bits that have the lowest reliability in every layer. The check node process contains the product of the total sign excluding the updated node, the first minimum excluding the updated node, and the scaling factor γ. The set of VNs corresponding to CN is denoted as N(c). The formula for updating the check nodes is expressed as: In the fourth step, the variable node LLR j is updated with Q k i,j added to an extrinsic message R k i,j to increase the reliability of the correct nodes and correct the bits that have lower reliability and a higher error rate in every layer. The formula for updating the variable nodes is expressed as: Subsequently, the NMSA iteratively processes the second to fourth steps until k reaches the maximum iterations or the early termination scheme is satisfied.

NPMSA
Generally, the critical path of the decoder hardware design employing the NMSA is the operation of the check node process. This is because numerous comparator operations cause severe operational delays. Therefore, the NPMSA was proposed to speed up the overall operation and relieve the hardware burden of the NMSA, which includes area and routing costs. The NPMSA adopts a grouping search that divides the prior message Q k i,j into G groups, to replace the sorter used by the NMSA. In each group of the NPMSA, the architecture only needs to be implemented using 2-to-1 comparators [20]. A significant reduction in the architecture overhead is noted when compared to the sorter using 2-to-2 comparators. The gap of the hardware burden between the grouping search and sorter increases as the number of prior messages Q k i,j increases. However, in alleviating the effects of hardware burden, a grouping search cannot accurately determine the correct second minimum. This is because the correct second minimum is typically lost after every 2-to-1 comparator operation in the hardware design. This causes the accuracy of the algorithm to decrease and directly affects the decoding capability. The formula for updating the check nodes with a grouping search is expressed as: where: and the probabilistic second minimum is: The index j index denotes the position of the first minimum. G L and G R represent the left and the right groups of the prior message Q k i,j , respectively, when there are two groups. Table 1 lists a comparison of the check node process executed using the NMSA, the split-row threshold algorithm (SRTA) [21], and NPMSA. The two-input comparison was used as the basis for comparison. From an algorithmic perspective, the NPMSA can reduce the number of two-input comparisons by 49.1%, with a 0.05 dB loss in the error-rate performance, compared to the NMSA. Compared to the SRTA, the NPMSA can reduce the number of two-input comparisons by 35.4%, with a 0.15 dB enhancement in error-rate performance. It is noteworthy that the probabilistic second minimum value in Equation (7) is equal to the correct second minimum value in Equation (3) when the correct second minimum value is not in the same group of the correct first minimum value. This means that the correct second minimum value can be easily achieved when the number of groups becomes larger. However, a large number of groups intensifies the hardware burden because of the increment of 2-to-2 comparators. Additionally, the occurrence of differences between the correct and probabilistic second minimum values results in a decline in the decoding capability. The NPMSA suffers from a high error floor region, which was revealed in [18]. Thus, the difference between the correct and probabilistic second minimum values must be reduced when the difference occurs.

rExMin Algorithm
Based on the NPMSA, the rExMin algorithm was proposed to revise the problem caused by a decline in the decoding capability due to a difference between the second minimum and the probabilistic second minimum. The rExMin algorithm uses a negative factor θ to reduce the value of the probabilistic second minimum, which reduces the difference and recovers the decoding capability. In addition, the negative factor θ may cause overcorrection when the new second minimum is smaller than the first minimum. Therefore, a comparator was added to determine whether overcorrection occurred. The new second minimum is the same as the probabilistic second minimum when overcorrection occurs. The new second minimum is the sum of the probabilistic second minimum and the negative factor θ. The formula for updating the check nodes with a grouping search is expressed as:

SMA-MSA
The concept of SMA-MSA is the same as that of the rExMin algorithm. The SMA-MSA treats the probabilistic second minimum as the upper bound and the first minimum as the lower bound. Combining the first minimum and probabilistic second minimum, the SMA-MSA obtains the new second minimum, as shown below: where both ρ 2 and σ are used to weigh the first minimum and probabilistic second minimum for the new second minimum. The values of ρ 2 and σ are set as 2 −x or (1 − 2 −x ), depending on the code, where x is an integer. The author of SMA-MSA noted that the new second minimum auto-adjusted the correction value applied to the first minimum by a factor ρ. The definition of ρ is expressed as: The formula of the check nodes updated with the grouping search is shown below:

Proposed Compensation Scheme for NPMSA
The NPMSA is adopted in a LDPC decoder design to reduce the hardware area cost, to relieve the complexity of routing, and to accelerate the operation of numerous comparators. However, the accuracy of the algorithm is decreased compared to the NMSA. Therefore, the balance between hardware complexity, overall calculation speed, and decoding capability is important. To resolve the algorithm accuracy problem, the rExMin algorithm [17] and the SMA-MSA [18] were proposed as the alternative methods for the NPMSA. Their concepts are based on reducing the gap between the accurate second minimum and the probabilistic second minimum to recover the decoding capability. We continue these concepts and modify the probabilistic second minimum using a flexible adjustment.

Difference in Extrinsic Messages
The NMSA derived from the simplified SPA leads to larger values due to the simplified process. Larger values cause the accuracy of the algorithm to decrease and directly affect the decoding capability. Therefore, NMSA combines the larger values with the normalized scaling factor γ to reduce the difference in the check node process. Similarly, the NPMSA derived from the simplified NMSA leads to a loss in accuracy of the accurate second minimum due to the adoption of a grouping search. In a grouping search, 2-to-1 comparators are used to search for the first minimum and the probabilistic second minimum. Because the probabilistic second minimum in the NPMSA would be greater than the second minimum in the NMSA, the gap between the first and second minimum in the NPMSA is greater than that gap in the NMSA. This gap in the NPMSA causes the accuracy of the algorithm to decrease and directly affects the decoding capability. Thus, the basic idea of the proposed improved normalized probabilistic min-sum algorithm (INPMSA) [10] is to reduce the gap or close the gap between the first minimum and the second minimum in the NMSA, thus enabling the decoding capability to be recovered. With regard to hardware implementation, the INPMSA maintains its advantages by reducing the area cost and complexity of routing, and accelerates the overall operation speed.

INPMSA with Compensation Scheme
To reduce the hardware area cost and the complexity of routing, the proposed method adopts a grouping search in check node updating. Additionally, we focus on the gap between the first minimum and the probabilistic second minimum to recover the decoding capability. Because the grouping search can still find the accurate first minimum in check node updating, we used the first minimum as the basis for adjusting the value of the probabilistic second minimum. An increase was noted in the reliability of the check nodes as the number of iterations increased. To deal with the change in reliability, INPMSA adopts proportional adjustment to improve the probabilistic second minimum. Additionally, the first minimum and the improved probabilistic second minimum need to be multiplied by the scaling factor to reduce the difference from the NMSA. We took α and β to represent the proportions of the first minimum and the probabilistic second minimum, respectively. The new second minimum is expressed as: The decoder loses error correction capability when the new second minimum is less than the first minimum, or greater than the probabilistic second minimum. We call this the reversal problem, which results in overcorrection or correction failure. To avoid these problems, we define the upper and lower bounds as the probabilistic second minimum and first minimum, respectively. Initially, we set the scaling factor γ = 1 for Equations (3), (5) and (12) to simplify the following deduction. We then derive the range of α and β using the upper (Case 1) and lower (Case 2) bounds, which are shown below: Case 1-The new second minimum must be less than or equal to the probabilistic second minimum, as shown below: Case 2-The new second minimum must be greater than or equal to the first minimum, as shown below: new min 2nd ≥ min 1st .
Combining the two cases above, we observed that the relationship between α and β is related to the ratio of the first minimum and the probabilistic second minima. Then we can have: Because the probabilistic second minimum must be greater than the first minimum, the ratio of the first minimum and the probabilistic second minima must be less than 1. Then we can have: By combining Equations (19) and (20), we can conclude that this case is only valid when the first minimum is equal to the probabilistic second minimum. Subsequently, we can achieve: where α and β must be greater than or equal to 0 to avoid changing the original gap between the first minimum and the probabilistic second minimum.
The number of groups directly affects the error probability of the probabilistic second minimum and the values of α and β. The relationship between the error probability of the probabilistic second minimum and the number of groups is calculated as: where Q denotes the total number of Q k i,j in a layer (i.e., the number of row weights in a layer), and G represents the number of groups. Table 2 lists the calculated error probabilities of the probabilistic second minimum for different groups and inputs. We can observe that the higher the number of groups, the lower the error probability of the probabilistic second minimum. The basic idea of the proposed INPMSA is that an increase in the error probability of the probabilistic second minimum is due to an increase in the proportion of the first minimum used for improvement. For hardware implementation, we defined α as 2 −x and β as (1 − 2 −x ), where x is a positive integer. In this case, α is the error probability of the probabilistic second minimum or smaller.

Simulation Results
The proposed methods implemented the floating-point bit-error rate (BER) performance on NMSA and NPMSA using improved methods based on layer decoding under the rate-5/6 (1944, 324) QC-LDPC code; with early termination and different improving algorithms, row weight w r = 20, scaling factor γ = 0.75, and maximum iteration = 20 for the IEEE 802.11n, as shown in Figure 2. The NPMSA-2 and NPMSA-4 represent NPMSAs with 2 Group and 4 Group searches, respectively. The BER performance of INPMSA-4 with α = 0.5 and β = 0.5 is worse than that of the NPMSA-2. However, the INPMSA-4 with α = 0.125 and β = 0.875 significantly improves the BER performance when α is selected using a smaller value of error probability listed in Table 2. The BER performance between the INPMSA-4 with α = 0.125 and β = 0.875, and the rExMin algorithm using the 4 Group search with θ = 0.5, are approximately the same. There is a similarity in the decoding performance between SNR = 4.0 dB and SNR = 4.5 dB, showing that the INPMSA-4 with α = 0.125 and β = 0.875 can achieve better decoding performance than the NPMSA-4, without improvement in the SNR. In comparison to the SMA-MSA, the INPMSA-4 can achieve a coding gain of about 0.15 dB when the BER = 10 −5 . This is because the SMA-MSA does not consider the reversal problem, which causes the degradation of the decoding performance due to the use of the inappropriate new second minimum.
In the NAND flash channel model, the BER performance is based on the rate-0.89 (9728, 1024) LDPC code, which uses early termination and different improvement algorithms, a scaling factor γ = 0.75, row weight w r = 36, and maximum iteration = 8. The dashed lines in Figure 3 denote the results of the lower page. The solid lines denote the results of the upper page. The proposed decoder [9] suggests that dividing Q k i,j into four groups can achieve the best decoding performance and hardware implementation. Therefore, we adopted four groups of minima finders. There is a reversal problem in the performance of the SMA-MSA, which caused the decoding performance to decline. The performance of the INPMSA is similar to that of the rExMin algorithm when Q k i,j is divided into four groups. Compared to the NPMSA, the INPMSA can reach approximately 0.028 and 0.02 dB in the upper and lower pages, respectively, when the BER = 10 −5 . When the BER performance drops below 10 −6 , the decoding capability starts to converge, with the exception of the NMSA. From the above simulation results, the INPMSA is capable of avoiding the reversal problem and efficiently achieves the correction effect to improve the decoding capability.  In addition, we applied the INPMSA to models implemented in related studies using the NPMSA to observe the practicability of the NAND flash channel model. We referred to the partially stopped probabilistic min-sum algorithm (PS-PMSA) [22], which discards the unnecessary operations in check node updating using error-correcting degradation, as shown in Figure 4. Comparing the NPMSA and the PS-PMSA shows a 0.05 dB coding gain when the BER = 10 −5 . After combining the PS-PMSA with the INPMSA, the improved PS-PMSA can further achieve decoding capabilities consistent with the NPMSA, and still discard the unnecessary operations in check node updating. For the design of the hardware to the 5G NR standard, we simulated and analyzed the decoding performance using the 5G NR BGs 1 and 2. In BG 1, the analysis of the BER performance was based on the rate-11/13 (9984, 1536) LDPC codes that have early termination and different improving algorithms, a scaling factor γ = 0.75, row weight w r = 19, and maximum iteration = 6. Figure 5a shows that the performance between the NPMSA-2 and NPMSA-4 has a 0.38 dB coding gain when the BER = 10 −6 . Figure 5a also shows that the INPMSA-2 significantly improves the BER performance when α is selected using a smaller value of error probability listed in Table 2. However, the BER performance of INPMSA-4 with α = 0.5 and β = 0.5 is worse than that of the NPMSA-4. The coding gain of the INPMSA-4 with α = 0.25 and β = 0.75 can be achieved. Figure 5a indicates that the decoding capability of the NPMSA starts to converge after a SNR = 4.5 dB, and this characteristic ensures the decoding capability of the INPMSA with α = 0.25 and β = 0.75, SMA-MSA, and the rExMin algorithm, is the same.
In BG 2, the BER performance was based on the rate-5/7 (5376, 1536) LDPC codes, which have early termination and different improving algorithms, a scaling factor γ = 0.75, a row weight w r = 10, and a maximum iteration = 6. Figure 5b shows that the performance between the NPMSA-2 and NPMSA-4 has a 0.4 dB coding gain when the BER = 10 −5 . Figure 5b also shows that the INPMSA-2 significantly improves the BER performance. The BER performance of the INPMSA-4 with α = 0.5 and β = 0.5 is worse than that of the NPMSA-2. However, the INPMSA-4 with α = 0.25 and β = 0.75 improves the BER performance when α is selected using a smaller value of error probability listed in Table 2.

VLSI Implementation of Dual-Mode LDPC Decoder for 5G NR Systems
The proposed dual-mode LDPC decoder is designed to support two different 5G NR matrices with the same extension factor of 384. To support two completely different matrix designs, we adopted two controllers to manage the data flow and achieve shared memory. In the calculation unit, multiplexers are used to support different matrix operations. The details of the VLSI implementation are introduced in this section. The proposed LDPC decoder uses a 4 × 26 base matrix in BG 1 index 1, with a code rate = 11/13, and w r = 19. The 4 × 14 base matrix in BG 2 index 1 has a code rate = 5/7 and w r = 10. Both are QC-LDPC codes and the expending factor is equal to 384.
The total bits that contain one sign bit, integer bits, and fraction bits for the posterior and extrinsic messages of the proposed decoder are extracted from fixed-point simulations for the 5G NR BGs 1 and 2. First, we use six to nine bits for the posterior and extrinsic messages in 5G NR BG 1. The simulation shows that eight bits are sufficient for BG 1. In BG 2, the simulation shows that seven or eight bits are sufficient. Thus, we decided to use eight total bits for both the posterior and extrinsic messages. According to Equations (3) and (4), the posterior message is summed by an extrinsic message and a prior message. Hence, we can reduce the unnecessary integer bits of the extrinsic message to relieve the hardware burden based on the integer bits of the posterior message. In addition, we can add fraction bits of extrinsic messages to enhance the accuracy and improve the decoding capability. Four to seven integer bits, and one fraction bit of extrinsic messages, can be performed very close to the floating-point simulation when the SNR = 4.5 dB. Eventually, we chose eight total bits for posterior messages and five total bits for extrinsic messages.

Architecture of Dual-Mode LDPC Decoder
The proposed overall decoding architecture with layered decoding is shown in Figure  6. This architecture is composed of an input buffer, an output buffer, a posterior message memory, an extrinsic message memory, a network, a de-network, an early termination, and 64 arithmetic units (AUs). The AU contains a prior message unit (PMU), a check node unit (CNU), a variable node unit (VNU), and first in first out (FIFO) buffer. The posterior message memory and extrinsic message memory are composed of a two-port registerbased memory. For the posterior message memory, the quantization of posterior messages is eight bits (one sign bit plus seven total bits) and the maximum number of posterior messages is 384 × 26 = 9984. Hence, the total number of bits of the posterior message memory is 8 × 9984 = 79, 872. Because the decoder only reads 64 consecutive posterior messages from a single block at a time, we divided the posterior message memory into 26 memory blocks and 64 memory banks for each memory block. For the extrinsic message memory, the quantization of the extrinsic message is five bits. To reduce the bandwidth of the extrinsic message memory, the extrinsic messages are stored as individual signs in a layer: min 1st , min 2nd , and index min . As shown in Figure 7, we divided individual signs into nine bits and ten bits because w r = 19 and 10 in BGs 1 and 2, respectively. The min 1st , min 2nd , and absolute value of the extrinsic message are scaled so the data length is 4 bits. According to the maximum w r = 19, the index min is stored as five bits. The maximum number of layers is 4 × 384 = 1536, but there are 32 bits to store extrinsic messages for each layer, so the total number of bits of the extrinsic message memory is 32 × 1536 = 49, 152. Because the parallelism is 64, we combine 64 consecutive extrinsic messages into a string. Because the base columns and w r in BG 2 are smaller than those in BG 1, some posterior and extrinsic memories are unused. Hence, the gated clock technique is adopted to reduce the power of the unused memory.   Figure 8a illustrates the architecture of the CNU and Figure 8b illustrates the architecture of the min and index finder, and the scaler, in the proposed INPMSA. First, the prior message is stored in the D-flip-flop for the pipeline after receiving prior messages from the PMU. The prior message is then transmitted to a sign bit and absolute value. The individual sign bit for each row weight is then generated by the "XOR sign" module after which the Index min , min 1st , and min 2nd are generated by the Min & Index Finder. In the scaler, the min 1st and min 2nd are multiplied by the scaling factor (γ = 0.75) and improved factor (γα = γβ = 0.375), respectively. Finally, this data is combined with an extrinsic message and transferred to the VNU. Additionally, Table 2 listed in [10] reveals that the scaler in the INPMSA is simpler and faster than that in the rExMin by approximately one comparator. It demonstrated that the hardware implementation result of the CNU used 36 inputs and divided them into four groups: the NPMSA, the INPMSA, and the rExMin algorithm. The INPMSA has a critical path that is increased by approximately 0.3 ns due to the addition of a compensation scheme. Compared to the rExMin, the INPMSA is approximately 0.08 ns faster. By comparing the comprehensive decoding ability and hardware operation speed, we found that the INPMSA can maintain the same decoding capability as the rExMin, and has more advantages during hardware implementation.

Post-Layout Implementation Results
Using the ASIC design flow, the decoder is synthesized and implemented using the Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm complementary metaloxide-semiconductor (CMOS) process with a working frequency of 294.1 MHz. There are three major costs located in the posterior and extrinsic memory, with a cell area of approximately 42.2%; arithmetic units with 64 parallelisms and a cell area of about 25.2%; and the network/de-network in pipelines with a cell area of approximately 22.7%. The proposed dual-mode LDPC decoder operates with two different code lengths, and the highest throughput meets the throughput requirement of the 5G NR standard, which is higher than 10 Gb/s. To evaluate the throughput of the proposed dual-mode LDPC decoder, the throughput estimation equations used for the decoder are as follows: where CL and f denote the code length and frequency, respectively; z and P are the expending factor and number of parallelisms, respectively; and N pipe , N row , and N iter represent the number of pipeline stages, the base row, and the maximum iterations, respectively. Without employing the early termination, the throughput reaches at least 10.86 and 5.84 Gb/s in BG 1 and BG 2, respectively. The highest throughput of the decoder in BG 1 mode can exceed the throughput requirement of 10 Gb/s for the 5G NR systems. Figure 9 illustrates the chip layout and Table 3 lists the summary of post-layout results of the proposed dual-mode LDPC decoder. Table 3 also lists a comparison of recent studies for the 5G NR systems. The studies in [23,24] achieved post-layout and post-synthesis (prelayout) implementations, respectively. Compared to the post-synthesis implementation, the post-layout implementation contains more hardware information, including the logic gate placements, clock tree synthesis, and wire routing. It is worth noting that the studies in [23,24] focused on the low code-rate (high correcting ability) LDPC decoding and this study focused on the high code-rate (high transmission rate) LDPC decoding for the 5G NR systems. Thus, it is hard to compare these application-specific integrated circuits (ASICs) because their design parameters are very district. To achieve a fair evaluation of the LDPC decoder ASICs using the distinct CMOS processes, however, the normalized area efficiency (NAE) and energy efficiency (NEE) are widely used [25] and evaluated in terms of the 40 nm CMOS process in this study. The NAE denotes how many decoded bits per unit area can be achieved in the decoder. The NEE denotes how much energy per decoded bit is consumed in the decoder. In early 2021, the study in [23] proposed the LDPC decoder ASIC for the 5G NR standard using a layered combination min-sun algorithm (CMSA) decoding to achieve the features of low complexity and high throughput. It supports a low rate-1/3 BG 1 with a code length of 3808. Using the TSMC 65 nm CMOS process, it achieves a maximum throughput of 3.04 Gb/s with a NAE of 7.18 bits/mm 2 and a NEE of 42.48 pJ/bit. To further support the low rate-1/3 BG 1 with the maximum code length of 25,344 and low rate-1/2 BG 2 with a maximum code length of 19,200 in the 5G NR standard, the study in [24] proposed a reconfigurable LDPC decoder ASIC using a layered NMSA decoding accompanied with instruction-level reordering and data-level rescheduling. Because it achieves 33.2 Gb/s/iteration using the TSMC 28 nm CMOS process, the throughput rates are 6.64 Gb/s at average iterations of 5, and 7.92 Gb/s at average iterations of 4.92, for the rate-1/3 BG 1 and rate-1/2 BG2 LDPC decoding, respectively. Compared to [23,24], the proposed dual-mode decoder using the layered INPMSA decoding achieves a high throughput of 10.86 Gb/s, which meets the requirement of 5G NR application with a high NAE of 11.40 bits/mm 2 and low NEE of 28.85 pJ/bit. This reveals that the INPMSA using the proposed compensation scheme achieves a low silicon area with low energy consumption.  & Normalized area efficiency (NAE) = Throughput/(Area × Frequency) × Normalized area factor (=(S/40) 2 ). $ Normalized energy efficiency (NEE) = (Power/Throughput) × Normalized energy factor (=(40/S) × (0.9/U) 2 ). % Average iteration is achieved using the early termination scheme.

Conclusions
In this paper, we proposed a dual-mode LDPC decoder for 5G NR systems. This design can support the rate-11/13 BG 1 and rate-5/7 BG 2 matrices. We adopted the layered decoding schedule with 64 parallelisms for hardware implementation to reach the throughput requirement of the 5G standard. Additionally, the sorter in the CNU was replaced by a grouping search to reduce the critical burden of the CNU. This saves half of the comparator demand at the same calculation speed, with a decline in the decoding capability. To alleviate the decline in the decoding capability, we proposed a compensation scheme to revise the probabilistic second minimum using the first minimum and proportion fixing. Compared to the SMA-MSA, the proposed compensation scheme is more complete and exhibits a better decoding performance. The proposed INPMSA also achieves the same decoding capability as the rExMin algorithm and accelerates operations at the comparator stage. Finally, the post-layout implementation of the proposed dual-mode LDPC decoder was achieved using the TSMC 40 nm CMOS process at an operating frequency of 294.1 MHz, a core area of 3.24 mm 2 , and power consumption of 313.3 mW. The decoding throughput can reach at least 10.86 and 5.84 Gb/s when the mode is BG 1 and BG 2, respectively. The throughput does not consider early termination, so the higher throughput can be achieved at a higher SNR.