Energy-Efficient Partial LDPC Decoding for NAND Flash-Based Storage Systems

Abstract: A new decoding method for low-density parity-check (LDPC) codes is presented to lower the energy consumption of LDPC decoders for NAND flash-based storage systems. Since the channel condition of NAND flash memory is reliable for most of its lifetime, it is inefficient to apply the maximum-effort decoding with the full parity-check matrix (H-matrix) from the beginning of the lifespan. As the energy consumption and the decoding latency are proportional to the size of the H-matrix used in decoding, the proposed algorithm starts the decoding with a partial H-matrix selected by considering the channel condition. In addition, the proposed partial decoding provides various error-correcting capabilities by adjusting the partial H-matrix. Based on the proposed partial decoding algorithm, a prototype decoder is implemented in a 65 nm CMOS process to decode a 4 KB LDPC code. The proposed decoder reduces energy consumption by up to 93% compared to the conventional LDPC decoding architecture.


Introduction
NAND flash memory is extensively used in many storage solutions, such as solid-state drives (SSDs) and secure digital (SD) cards, due to its fast accessibility, low power consumption, and compact size [1,2]. Recently, advanced structures such as 3D-stacked NAND flash, which store more information in a limited area, have been widely employed, and they provide a more error-prone environment [3-5]. In storage systems built with NAND flash memories, error-correction codes (ECCs) are commonly applied to ensure data reliability. Algebraic codes such as BCH and RS codes have been widely employed because of their guaranteed performance and moderate hardware complexity, but these codes are not adequate when the NAND flash channel worsens. For that reason, the LDPC code has recently been employed in many NAND flash-based storage systems, as its error-correcting capability resulting from iterative belief propagation is far superior to that of the algebraic codes. However, LDPC decoding necessitates high computational complexity and frequent memory accesses, and consumes considerably more energy than BCH and RS decoding [6,7].
Since the NAND flash channel is reliable for most of its lifetime, the maximum-effort decoding with the full parity-check matrix (H-matrix) is inefficient when the channel is reliable. Providing multiple error-correcting capabilities may be a solution, since the error-correcting capability can be adjusted depending on the channel condition. As a matter of fact, multi-rate LDPC codes are commonly used to provide various error-correcting capabilities in wireless communication systems [8,9]. The use of multi-rate codes is effective only when the channel condition at the time of encoding is consistent with that at the time of decoding, which means that the traditional multi-rate codes are not suitable for storage systems in which data writes and reads can occur far apart.
A new partial decoding is proposed to provide various error-correcting capabilities with a single H-matrix. The decoding strength is adjusted by changing the column degree of the partial H-matrix, which is verified through intensive simulations over the additive white Gaussian noise (AWGN) channel. Based on partial decoding, we present a novel energy-efficient decoding algorithm. The proposed algorithm starts the decoding with a partial H-matrix selected by considering the channel condition at the time of decoding. When the decoding with the partial H-matrix fails, the proposed algorithm increases the size of the partial H-matrix to enhance the decoding strength and tries the decoding again. Since the energy consumption of an LDPC decoder is mainly related to the size of the H-matrix used for decoding, the proposed algorithm can reduce the energy consumed in the decoding process. In addition, it is effective in reducing the decoding latency and enhancing the decoding throughput.
The rest of this paper is organized as follows. Section 3 introduces the proposed partial decoding of the LDPC codes and Section 4 analyzes simulation results of the proposed energy-efficient decoding algorithm. Theoretical analysis is explained in Section 5. The details of the hardware design and the implementation results are presented in Sections 6 and 7, respectively, and conclusions are made in Section 8.

Backgrounds
This section provides an overview of LDPC decoding algorithms, including an in-depth explanation of the Sum-Product algorithm (SPA) [10] and the Min-Sum algorithm (MSA) [11]. Moreover, the structure of the Quasi-Cyclic (QC) LDPC H-matrix is introduced to explain the proposed algorithms.

LDPC Decoding Algorithms
The SPA and the MSA are two prominent methods for decoding LDPC codes, which are essential for error correction in NAND flash-based storage systems. They are also known as belief propagation (BP) algorithms for LDPC codes. They operate by passing probabilistic messages along the edges of a Tanner graph to estimate the likelihood of bit values. Both algorithms iteratively update the likelihood messages until they converge to a stable solution or reach a predefined number of iterations. In each iteration, the SPA combines the messages from neighboring nodes using a product operation, followed by a normalization process to update the beliefs of each bit's value. The SPA typically requires floating-point precision and involves hyperbolic functions, making it computationally intensive. In contrast, the MSA approximates these probabilities by taking the minimum magnitude of the incoming messages, hence the name. This method, while an approximation, significantly reduces the need for complex calculations without drastically affecting decoding accuracy.
The advantage of using the MSA lies in its simplicity, as it can be implemented using integer arithmetic and simple comparison operations, making it suitable for hardware with limited processing capabilities. Both algorithms benefit from the inherent error detection and correction capabilities of LDPC codes, which feature a redundant structure enabling the identification and rectification of errors in data transmission. The practical implementation of these algorithms also considers factors such as channel noise characteristics and the required level of error correction. Tailoring the algorithm to specific needs can result in various modifications and optimizations, such as the normalized MSA and the offset MSA, which aim to bridge the performance gap with the SPA.
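The difference between the two check-node updates can be illustrated with a minimal Python sketch. The function names and the optional `scale` factor (a stand-in for the normalized MSA) are illustrative assumptions, not taken from the prototype decoder.

```python
import math


def spa_check_update(v2c):
    """Sum-product C2V messages: 2*atanh of the product of tanh(v/2) over the other edges."""
    out = []
    for i in range(len(v2c)):
        prod = 1.0
        for j, v in enumerate(v2c):
            if j != i:
                prod *= math.tanh(v / 2.0)
        # Clamp so that atanh stays finite when the messages saturate.
        prod = max(min(prod, 1.0 - 1e-12), -1.0 + 1e-12)
        out.append(2.0 * math.atanh(prod))
    return out


def min_sum_check_update(v2c, scale=1.0):
    """Min-sum C2V messages: sign product times the minimum magnitude of the other edges."""
    out = []
    for i in range(len(v2c)):
        others = [v for j, v in enumerate(v2c) if j != i]
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        out.append(scale * sign * min(abs(v) for v in others))
    return out
```

For the same inputs the two updates agree in sign, and the min-sum magnitude is never smaller than the sum-product magnitude, which is the overestimation that the normalized and offset variants try to compensate.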

Quasi-Cyclic LDPC Codes
The array LDPC code is suitable for adjusting the column degree of the H-matrix, as it is a regular LDPC code with fixed column and row degrees. Moreover, it is a quasi-cyclic (QC) LDPC code composed of shifted identity matrices of the same size [12]. Therefore, the number of check nodes can be controlled easily by eliminating some block-rows, each of which has as many rows as the identity matrix. Three parameters, $w_c$, $w_r$, and $p$, define an array LDPC code, where $p$ is a prime number denoting the size of the identity matrix, and $w_c$ and $w_r$ represent the column and row degrees of the H-matrix, respectively. The H-matrix of the $(p, w_r, w_c)$ array LDPC code is
$$H = \begin{bmatrix} I & I & I & \cdots & I \\ I & A & A^{2} & \cdots & A^{w_r-1} \\ I & A^{2} & A^{4} & \cdots & A^{2(w_r-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I & A^{w_c-1} & A^{2(w_c-1)} & \cdots & A^{(w_c-1)(w_r-1)} \end{bmatrix} \tag{1}$$
where $I$ is the $p \times p$ identity matrix and $A$ is a matrix obtained by shifting every row of $I$ cyclically by one. When $p = 3$, for example, the corresponding matrix $A$ is
$$A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \tag{2}$$
Based on (2), $A^{2}$ is calculated as
$$A^{2} = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \tag{3}$$
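As an illustration of this structure, the following Python sketch builds the H-matrix of a $(p, w_r, w_c)$ array LDPC code as a list of 0/1 rows. The function names are hypothetical, and the shift direction of $A$ (each row of $I$ moved one position to the right) is an assumption; the opposite direction gives an equivalent construction.

```python
def shifted_identity(p, s):
    """Return A^s: the p x p identity with every row cyclically shifted right by s."""
    return [[1 if c == (r + s) % p else 0 for c in range(p)] for r in range(p)]


def array_ldpc_h(p, w_r, w_c):
    """Build the full H-matrix; block (i, j) is A^(i*j) for i < w_c and j < w_r."""
    rows = []
    for i in range(w_c):
        blocks = [shifted_identity(p, (i * j) % p) for j in range(w_r)]
        for r in range(p):
            # Concatenate row r of every block in this block-row.
            rows.append([bit for blk in blocks for bit in blk[r]])
    return rows
```

Because every block is a permutation matrix, each row of the result has weight $w_r$ and each column has weight $w_c$, which is the regularity the partial decoding relies on.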

Proposed Partial Decoding of LDPC Codes
The partial decoding of an LDPC code is newly introduced to provide various error-correcting capabilities, which can be adaptively applied according to the channel condition. The decoding strength is adjusted by changing the number of check nodes used for decoding. The number of check nodes connected to a variable node is called the column degree of the H-matrix. Since each variable node collects the local messages coming from its connected check nodes, the LDPC decoding still works with some check nodes removed. Therefore, the error-correcting capability can be adjusted by changing the column degree.

Construction of a Partial H-Matrix
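The H-matrix shown in (1) can be decomposed into $w_c$ sub-matrices, $h_1$ to $h_{w_c}$, where $h_i$ is a $p \times w_r p$ sub-matrix denoting the $i$-th block-row,
$$h_i = \begin{bmatrix} I & A^{i-1} & A^{2(i-1)} & \cdots & A^{(w_r-1)(i-1)} \end{bmatrix}.$$
To support various error-correcting capabilities, a partial H-matrix is organized by including some of the above sub-matrices, $h_1$ to $h_x$. This set of sub-matrices is denoted as $H_x$, where $x$ is an integer ranging from 2 to $w_c$, since the column degree of a partial H-matrix should be at least 2 in order to decode an LDPC code. When $w_r = 4$, for example, the partial H-matrix $H_3$ is constructed as
$$H_3 = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = \begin{bmatrix} I & I & I & I \\ I & A & A^{2} & A^{3} \\ I & A^{2} & A^{4} & A^{6} \end{bmatrix}.$$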
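In software terms, selecting a partial H-matrix amounts to keeping the leading block-rows of the full matrix. A minimal Python sketch, assuming the matrix is stored as a list of rows and using the hypothetical name `partial_h`:

```python
def partial_h(h_full, p, x):
    """Return H_x: the first x block-rows (x * p rows) of the full H-matrix."""
    w_c = len(h_full) // p
    if not 2 <= x <= w_c:
        # A column degree below 2 cannot be decoded; above w_c does not exist.
        raise ValueError("x must be between 2 and w_c")
    return [row[:] for row in h_full[: x * p]]
```

The column degree of the returned matrix is $x$, so the same stored H-matrix serves every decoding strength from $H_2$ up to the full matrix.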

Decoding of a Partial H-Matrix
The message is encoded with the full H-matrix of the LDPC code, while the received codeword is decoded by using a partial H-matrix in the partial decoding. Iterative decoding algorithms such as the SPA or MSA can be used to update variable nodes and check nodes based on the partial H-matrix. Before starting a decoding iteration using the partial H-matrix, the syndromes of the updated codeword are checked with respect to the full H-matrix. If the syndromes are all zeros, the codeword is correct and the decoding process is finished. Otherwise, the decoding iteration is repeated until the number of maximally allowed iterations (MAI) is reached. The detailed procedure of the partial decoding is described in Algorithm 1.

Algorithm 1
1: Initialization: load the initial LLR values to each variable node.
2: Iterative Decoding: perform the following steps in accordance with the SPA or MSA.
   for all check nodes included in the full H do ▷ Syndrome check

The error-correcting capability resulting from a partial H-matrix is investigated based on a (149, 61, 6) array LDPC code that is designed to protect a message of 1 KB. The SPA is employed to decode the received codeword with the MAI set to 30. Figure 1 shows how the error-correcting capability changes over the channel SNR. The uncorrected bit-error rate (BER) performances resulting from $H_2$ and $H_6$ correspond to the weakest and strongest error-correcting capabilities, respectively. The decoding strength becomes stronger as the partial H-matrix becomes larger. Therefore, it is possible to support diverse error-correcting capabilities by constructing several partial H-matrices from a single H-matrix. Though $H_2$ shows the weakest decoding strength, it removes two thirds of the memory accesses compared to the full H-matrix. This enables a tradeoff between decoding capability and energy consumption, since the number of memory accesses dominates the energy consumption of an LDPC decoder [13].
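The partial decoding procedure (syndrome check against the full H-matrix, message updates only on the partial H-matrix) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the decoder used in the paper: it uses a flooding min-sum schedule, represents each check as a list of variable indices, and the names `syndrome_ok` and `partial_decode` are hypothetical.

```python
def syndrome_ok(checks, bits):
    """True when every parity check (a list of variable indices) is satisfied."""
    return all(sum(bits[n] for n in row) % 2 == 0 for row in checks)


def partial_decode(llr0, full_checks, num_partial, mai):
    """Min-sum decoding on the first num_partial checks, verified on all checks.

    llr0: channel LLRs (positive means bit 0). Returns (success, bits, iterations).
    """
    part = full_checks[:num_partial]
    c2v = [[0.0] * len(row) for row in part]
    llr = list(llr0)
    for it in range(mai + 1):
        bits = [1 if v < 0 else 0 for v in llr]
        if syndrome_ok(full_checks, bits):       # syndrome check on the full H
            return True, bits, it
        if it == mai:
            break
        new_c2v = []
        for m, row in enumerate(part):           # check-node update on the partial H
            v2c = [llr[n] - c2v[m][k] for k, n in enumerate(row)]
            msgs = []
            for k in range(len(row)):
                others = v2c[:k] + v2c[k + 1:]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                msgs.append(sign * min(abs(v) for v in others))
            new_c2v.append(msgs)
        c2v = new_c2v
        llr = list(llr0)                         # variable-node update
        for m, row in enumerate(part):
            for k, n in enumerate(row):
                llr[n] += c2v[m][k]
    return False, [1 if v < 0 else 0 for v in llr], mai
```

With a clean input the function returns after the initial syndrome check without running any iteration, which is the case the dedicated syndrome check is meant to make cheap.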

Proposed Energy-Efficient Decoding of a Partial H-Matrix
The proposed decoding algorithm increases energy efficiency in the LDPC decoding, and is effective in reducing the energy consumption of storage systems built with NAND flash memory, since the NAND flash is reliable in the beginning stage. Applying high voltages to a cell repeatedly to program or erase the cell decreases the SNR of the flash channel monotonically [3,4]. As the wear-leveling technique makes the SNR of a page almost the same as that of the other pages [5], the NAND flash channel is reliable for a considerable amount of time. Since the NAND flash channel in the beginning does not induce many erroneous bits, the maximum-effort decoding with the full H-matrix is inefficient. Therefore, the proposed algorithm selects a proper partial H-matrix depending on the channel condition.
Considering the channel SNR, the proposed algorithm selects a specific partial matrix from a set of partial H-matrices defined as
$$S = \{H_2, H_3, \ldots, H_{w_c}\}.$$
The selected partial H-matrix is the initial partial H-matrix that is first used for decoding. The initial partial H-matrix for a specific SNR can be determined in advance by conducting simulations over the flash channel or by analyzing the decoding algorithm. The proposed energy-efficient decoding algorithm is described in Algorithm 2.
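The control flow of the energy-efficient decoding can be sketched in Python as below. The sketch assumes that each retry restarts from the channel LLRs with the next larger partial H-matrix, which the text does not specify; the SNR threshold table, the `decode_with` callback, and both function names are hypothetical.

```python
def initial_index(snr_db, thresholds, w_c):
    """Map a channel SNR to the starting index x of H_x.

    thresholds: (min_snr_db, x) pairs sorted by decreasing min_snr_db.
    Falls back to the full H-matrix (x = w_c) when no threshold is met.
    """
    for min_snr_db, x in thresholds:
        if snr_db >= min_snr_db:
            return x
    return w_c


def energy_efficient_decode(llr0, snr_db, thresholds, w_c, mai, decode_with):
    """Try H_x, H_(x+1), ..., H_(w_c) until one attempt succeeds.

    decode_with(llr0, x, mai) is a caller-supplied partial decoder that
    returns (success, bits). Returns (success, bits, last_index_tried).
    """
    x = initial_index(snr_db, thresholds, w_c)
    bits = None
    while x <= w_c:
        ok, bits = decode_with(llr0, x, mai)
        if ok:
            return True, bits, x
        x += 1                      # enlarge the partial H-matrix and retry
    return False, bits, w_c
```

In practice the threshold table would be filled from channel simulations or from the theoretical iteration counts, so that the first attempt succeeds in the common case and the larger matrices are touched only rarely.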

Simulation Results
The (149, 61, 6) array LDPC code is used to validate the proposed energy-efficient LDPC decoding algorithm. The average number of iterations required to decode a codeword is shown in Figure 2, which is obtained by applying the SPA with the MAI set to 30. In the simulation, the flash memory is regarded as an AWGN channel. The SNR is defined as $\sigma^2/N$, where $N$ is the noise power and $\sigma^2$ is the signal power. It is assumed that the distribution for a single-level cell is similar to that of binary phase shift keying (BPSK). The error rate was evaluated under the assumption that the all-zero codeword is transmitted as '1', so that any non-zero decoded bit is counted as an error. For a specific SNR, there are partial H-matrices that provide almost the same decoding performance as that of the full H-matrix. For an SNR of 6 dB, for example, the decoding with $H_3$ leads to almost the same number of iterations as that of the full H-matrix. Based on the simulation results, the proposed algorithm selects an initial partial H-matrix with which the decoding starts. Decoding may continue with $H_2$, but if the average number of iterations begins to increase, it can switch to decoding with $H_3$. This approach is supported by the simulation results, and its implementation is feasible through an SSD controller that tracks the number of iterations at the end of the previous decoding process.
The energy consumption of an LDPC decoder is mainly dominated by memory accesses resulting from frequent updates of internal messages exchanged between variable and check nodes [13]. Reducing the number of memory accesses is therefore highly effective in lowering the overall energy consumption. Moreover, it decreases the decoding latency and increases the decoding throughput. As the number of memory accesses is proportional to the size of the H-matrix used in decoding, reducing its size lessens the energy consumed in the LDPC decoder. The average numbers of memory accesses resulting from the proposed partial decoding algorithm and from the conventional one that decodes with the full H-matrix are compared in Figure 3. It is clear that the proposed algorithm considerably reduces the number of memory accesses when the SNR is not small. Since a large number of memory accesses leads to high energy consumption, the proposed decoding algorithm significantly reduces the energy consumed in the high SNR region. For the (149, 61, 6) array LDPC code, the energy consumption caused by memory accesses is reduced to 33.1% even when compared to the conventional decoding algorithm that employs the early stopping method [14]. As the memory accesses are mainly required to calculate V2C and C2V messages, the computational operations are also reduced in proportion to the reduction ratio of memory accesses, which means that the energy consumption of the LDPC decoder can be reduced by the reduction ratio of memory accesses.
In addition, both the decoding latency and the decoding throughput are improved. The normalized latency of the proposed partial decoding algorithm is compared to the conventional one in Figure 4. Since the number of variable nodes connected to each check node is constant, the number of clock cycles taken to process a check node is constant for all partial H-matrices. Therefore, the number of check node operations determines the decoding latency. As the number of check node operations is proportional to the size of the partial H-matrix, the decoding latency can be effectively reduced by reducing the size. In Figure 4, the decoding latency is reduced to as little as 35.5% of that of the conventional architecture [15]. Since the decoding throughput is inversely proportional to the decoding latency, the proposed partial decoding algorithm can boost the decoding throughput significantly in the beginning stage.
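The proportionality between memory accesses and the size of the H-matrix can be made concrete with a short Python sketch. It assumes, for illustration only, that every nonzero entry (edge) of the matrix in use costs a fixed number of message accesses per iteration; the function names are hypothetical.

```python
def edges(p, w_r, x):
    """Number of nonzero entries in H_x of a (p, w_r, w_c) array LDPC code."""
    return x * w_r * p


def relative_accesses(x, iters_partial, w_c, iters_full):
    """Accesses of H_x decoding relative to full-matrix decoding."""
    return (x * iters_partial) / (w_c * iters_full)
```

For the (149, 61, 6) code, `edges(149, 61, 6)` gives 54534 edges for the full matrix and `edges(149, 61, 2)` gives 18178 for $H_2$, so at equal iteration counts $H_2$ needs one third of the accesses, consistent with the two-thirds saving stated earlier.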

Theoretical Analysis
The proposed decoding algorithm is theoretically analyzed to explain the existence of a partial H-matrix that results in almost the same decoding performance as the full H-matrix. It will be shown that the theoretical prediction of the required number of iterations is consistent with the simulation results. The partial H-matrix can be determined by looking into the number of iterations. To calculate the number of iterations required for a specific SNR theoretically, we estimate how the BER of the decoded outputs changes according to decoding iterations. The LLR distribution obtained by the internal message tracking technique, which is called density evolution in [16], is used to estimate the BER of the decoded outputs. The distribution of the LLR values over all variable nodes is investigated in each iteration. The SPA is assumed for this analysis, as the internal steps of the algorithm can be described in closed form.

Calculation of the LLR Distribution
The LLR distribution in the $l$-th iteration is analyzed by using the closed-form expressions of the SPA. For an H-matrix $H$, the set of variable nodes connected to the $m$-th check node is denoted as
$$N_m = \{\, n : h_{mn} = 1 \,\},$$
where $h_{mn}$ represents the element of the H-matrix on the $m$-th row and $n$-th column.
Similarly, the set of check nodes connected to the $n$-th variable node is
$$M_n = \{\, m : h_{mn} = 1 \,\}.$$
If a regular LDPC code is considered in the analysis, the numbers of elements in $N_m$ and $M_n$ are $w_r$ and $w_c$, respectively. The set that excludes element $n$ from $N_m$ is denoted as $N_m \setminus n$, and the set excluding $m$ from $M_n$ is similarly denoted as $M_n \setminus m$. The LLR value of the $n$-th variable node after $l$ iterations is denoted as $L_n^{(l)}$, and similarly the C2V message of the $m$-th check node after $l$ iterations is represented as $C_{m \to n}^{(l)}$. The means of $L_n^{(l)}$ and $C_{m \to n}^{(l)}$ for all $n$ and $m$ are denoted as $\lambda^{(l)}$ and $\mu^{(l)}$, respectively.
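In previous works [16,17], the distribution of $L_n^{(l)}$ over all $n$ is known to be Gaussian, $\mathcal{N}(\lambda^{(l)}, 2\lambda^{(l)})$ [16], where $\mathcal{N}(\mu, \sigma^2)$ represents the Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $C_{m \to n}^{(l)}$ is also Gaussian distributed [17]. Therefore, the LLR distribution can be obtained by tracking $\lambda^{(l)}$ in each iteration. The equations that update variable and check nodes are used to track the mean of the LLR distribution. In the SPA, the variable node update is expressed as
$$L_n^{(l)} = L_n^{(0)} + \sum_{m \in M_n} C_{m \to n}^{(l-1)}, \tag{11}$$
where $L_n^{(0)}$ is the initial LLR. The corresponding C2V message for the $l$-th iteration is
$$C_{m \to n}^{(l)} = 2 \tanh^{-1}\!\left( \prod_{n' \in N_m \setminus n} \tanh\!\left( \frac{L_{n'}^{(l)} - C_{m \to n'}^{(l-1)}}{2} \right) \right). \tag{12}$$
For convenience, Equation (12) is rewritten as
$$\tanh\!\left( \frac{C_{m \to n}^{(l)}}{2} \right) = \prod_{n' \in N_m \setminus n} \tanh\!\left( \frac{L_{n'}^{(l)} - C_{m \to n'}^{(l-1)}}{2} \right). \tag{13}$$
Taking the expectations for both sides,
$$E\!\left[ \tanh\!\left( \frac{C_{m \to n}^{(l)}}{2} \right) \right] = \prod_{n' \in N_m \setminus n} E\!\left[ \tanh\!\left( \frac{L_{n'}^{(l)} - C_{m \to n'}^{(l-1)}}{2} \right) \right]. \tag{14}$$
For the sake of simple expression, $\Psi(x)$ is defined as
$$\Psi(x) = E\!\left[ \tanh\!\left( \frac{y}{2} \right) \right], \tag{15}$$
where $y \sim \mathcal{N}(x, 2x)$. Equation (14) can be rewritten as
$$\Psi\!\left( \mu^{(l)} \right) = \left[ \Psi\!\left( \lambda^{(l)} - \mu^{(l-1)} \right) \right]^{w_r - 1}. \tag{16}$$
Taking the expectations for both sides of (11), we obtain
$$\lambda^{(l)} = \frac{2E_c}{\sigma^2} + w_c\, \mu^{(l-1)}, \tag{17}$$
where $E_c$ is the energy consumed to transmit a bit of a codeword and $\sigma$ is the standard deviation of the AWGN channel. A bit of zero or one transmitted over the AWGN channel is mapped to $+\sqrt{E_c}$ or $-\sqrt{E_c}$, respectively, and the all-zero codeword is assumed to be sent. By substituting (17) into (16), we have
$$\Psi\!\left( \mu^{(l)} \right) = \left[ \Psi\!\left( \frac{2E_c}{\sigma^2} + (w_c - 1)\, \mu^{(l-1)} \right) \right]^{w_r - 1}, \tag{18}$$
and it is rewritten as
$$\mu^{(l)} = \Psi^{-1}\!\left( \left[ \Psi\!\left( \frac{2E_c}{\sigma^2} + (w_c - 1)\, \mu^{(l-1)} \right) \right]^{w_r - 1} \right). \tag{19}$$
By substituting Equation (19) into Equation (17), we finally have the mean of the LLR distribution,
$$\lambda^{(l)} = \frac{2E_c}{\sigma^2} + w_c\, \Psi^{-1}\!\left( \left[ \Psi\!\left( \frac{2E_c}{\sigma^2} + (w_c - 1)\, \mu^{(l-2)} \right) \right]^{w_r - 1} \right), \tag{20}$$
where $\mu^{(l-2)}$ can be recursively calculated from (19) with the initial condition of $\mu^{(0)} = 0$. The mean of the LLR distribution $\lambda^{(l)}$ is determined only by the column degree $w_c$, the row degree $w_r$, and the channel SNR $E_c/\sigma^2$. Therefore, the LLR distribution after $l$ iterations can be estimated from the mean expressed in (20).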
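The mean-tracking recursion can be evaluated numerically. The Python sketch below is a hedged illustration of the standard Gaussian-approximation recursion, assuming that the mean of a V2C message is the channel mean $m_0 = 2E_c/\sigma^2$ plus $(w_c - 1)$ times the previous C2V mean, with $\Psi(x) = E[\tanh(y/2)]$ for $y \sim \mathcal{N}(x, 2x)$. The function names, the Simpson integration, and the bisection range are choices made for this example.

```python
import math


def psi(x, n=400):
    """Psi(x) = E[tanh(y/2)], y ~ N(x, 2x), by Simpson's rule over +/- 10 std devs."""
    if x <= 0.0:
        return 0.0
    s = math.sqrt(2.0 * x)
    lo, hi = x - 10.0 * s, x + 10.0 * s
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        y = lo + k * h
        w = 1.0 if k in (0, n) else (4.0 if k % 2 else 2.0)
        pdf = math.exp(-(y - x) ** 2 / (4.0 * x)) / math.sqrt(4.0 * math.pi * x)
        total += w * math.tanh(y / 2.0) * pdf
    return total * h / 3.0


def psi_inv(t, hi=200.0):
    """Invert psi by bisection on [0, hi]; psi is increasing on that range."""
    if t <= 0.0:
        return 0.0
    if t >= psi(hi):
        return hi
    lo = 0.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if psi(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mean_llr(m0, w_c, w_r, iters):
    """Return [lambda(1), ..., lambda(iters)], starting from mu(0) = 0."""
    mu, lams = 0.0, []
    for _ in range(iters):
        lams.append(m0 + w_c * mu)                           # posterior LLR mean
        mu = psi_inv(psi(m0 + (w_c - 1) * mu) ** (w_r - 1))  # next C2V mean
    return lams
```

Calling `mean_llr` with a smaller `w_c` models a partial H-matrix: the posterior mean grows more slowly, which is how the analysis predicts more iterations for smaller partial matrices at the same SNR.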

Calculation of the Number of Iterations
For a specific SNR, the LLR values of all variable nodes follow the Gaussian distribution $\mathcal{N}(\lambda^{(l)}, 2\lambda^{(l)})$ [16]. The mean of the LLR distribution $\lambda^{(l)}$ is obtained from (20) by adjusting the column degree $w_c$ according to the size of the partial H-matrix. To decide the success or failure of the decoding, the BER is estimated from the calculated LLR distribution.
Assuming that the transmitted codeword is all zeros, the correctly decoded codeword has positive LLR values for all bit positions, but the uncorrected codeword has some negative LLR values. Therefore, the ratio of the negative area to the total area of the distribution can be considered as the uncorrected BER for a specific number of iterations. Since the LLR values are Gaussian distributed, the BER after $l$ iterations is calculated as
$$\mathrm{BER}^{(l)} = Q\!\left( \frac{\lambda^{(l)}}{\sqrt{2\lambda^{(l)}}} \right) = Q\!\left( \sqrt{\frac{\lambda^{(l)}}{2}} \right), \tag{21}$$
where $Q(x)$ is the Q-function of the standard Gaussian distribution,
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt. \tag{22}$$
The LLR distribution and the estimated BER for the (149, 61, 6) array LDPC code with an SNR of 5 dB are shown in Figure 5. The full H-matrix is used for decoding, which means that $w_c$ is 6. As the number of iterations increases, the mean of the LLR distribution moves to a higher value, leading to a reduced BER. The estimated BER is used to compute the number of iterations needed to achieve successful decoding. It is assumed that the left tail of the LLR distribution in Figure 5, which falls into the negative region, represents the proportion of errors relative to the total number of cases. The area of that tail is calculated using the Q-function, as described in (21), to determine the BER value. When the calculated BER is less than $10^{-15}$ in a certain iteration, which is a criterion widely accepted in the storage market, the decoding is considered to be successful in that iteration. Therefore, we analyze the theoretical number of iterations needed to achieve successful decoding for a range of SNRs. The numbers are depicted in Figure 6. Since the graphs look similar to the simulation results shown in Figure 2, the proposed decoding algorithm is consistent with the theoretical analysis. In addition, the existence of an initial partial H-matrix that provides the same error-correcting performance as the full H-matrix for a specific SNR is explained theoretically.
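The BER estimate and the resulting iteration count can be sketched in Python. The sketch takes the tail probability of $\mathcal{N}(\lambda, 2\lambda)$ below zero, which equals $Q(\sqrt{\lambda/2})$, and uses the identity $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$; the function names are hypothetical and the list of means is assumed to come from a separate mean-tracking step.

```python
import math


def ber_from_mean(lam):
    """P(L < 0) for L ~ N(lam, 2*lam), i.e. Q(sqrt(lam / 2))."""
    if lam <= 0.0:
        return 0.5
    # Q(x) = 0.5 * erfc(x / sqrt(2)) with x = sqrt(lam / 2).
    return 0.5 * math.erfc(math.sqrt(lam) / 2.0)


def iterations_needed(lams, target=1e-15):
    """First 1-based iteration whose estimated BER is below target, else None."""
    for l, lam in enumerate(lams, start=1):
        if ber_from_mean(lam) < target:
            return l
    return None
```

Repeating this for each column degree from 2 to $w_c$ and each SNR gives a table of predicted iteration counts that can be compared against the simulated averages.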

Hardware Architecture
A simple modification of the existing decoder hardware allows decoding with the proposed algorithm. While maintaining the basic structure of the existing architecture, the added capability to dynamically select the optimal partial H-matrix based on the channel state significantly reduces energy consumption while maintaining decoding accuracy. Through such a simple modification, the proposed decoding method can be easily integrated into existing systems, offering improved performance and energy efficiency.

Dedicated Syndrome Check Module
LDPC decoders that utilize soft information are generally required to perform the first decoding iteration. This approach is adopted because generating the soft information itself incurs significant latency, making it more advantageous in several aspects to proceed with an initial decoding iteration rather than performing a separate syndrome check. However, the proposed decoding algorithm, which also uses soft information, requires decoding with partial H-matrices of various sizes. To accommodate this, a separate syndrome check module is incorporated. Employing an independent syndrome check module can significantly reduce decoding latency, especially in good channel conditions.
Typically, a full H-matrix is not necessary for syndrome checking to verify the integrity of a codeword; it only needs to cover the entire message. Therefore, the dedicated syndrome check module can be very compact and implemented with minimal effort. Table 1 shows the gate count for the syndrome check logic of LDPC codes of various sizes in a 65 nm CMOS process. For a commonly used 4 KB LDPC code with a rate of 0.9, it requires only 22 k equivalent gates, which is about 1% of the total decoder area. Therefore, incorporating this logic into an existing decoder incurs minimal overhead and can be easily applied to any decoding architecture.

Decoding Architecture
A block diagram of the proposed decoding architecture is shown in Figure 7. Except for the dedicated syndrome checking (SYN) unit, the decoding architecture is identical to the conventional layered min-sum decoder [18]. Each decoding function unit (DFU) performs an independent check node operation in parallel, and the corresponding LLR values and the intermediate C2V values are stored in the LMEM and C2V memories, respectively. The detailed architecture of the DFU is shown in Figure 8. Through the shuffle network, the appropriate LLR and C2V values are obtained, followed by number system conversion, addition, and subtraction operations. For a fair evaluation of the implementation, the most efficient method among the existing approaches has been applied for the minimum search logic [19,20].
The shuffle and de-shuffle networks align the LLR and C2V values. In the conventional architecture, all syndromes are checked in each DFU operation since the conventional decoding algorithm always uses the full H-matrix. However, the proposed partial LDPC decoding uses the partial H-matrices instead of the full H-matrix when the channel is reliable. Since decoding with the partial H-matrices does not compute all check node equations and syndromes, the dedicated SYN unit, which checks the remaining syndromes, is additionally applied. As a result, the proposed partial LDPC decoding can be applied with ease by adding a simple SYN unit to any existing structure.
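The role of the SYN unit can be modeled in software as a parity check over the block-rows that the partial H-matrix leaves out. The Python sketch below assumes an array LDPC code whose block in block-row $i$ and block-column $j$ (both counted from zero) is the identity shifted right by $i \cdot j$ positions; the shift direction and the function name are assumptions for illustration, and the hardware unit itself is not described at this level in the text.

```python
def remaining_syndromes_zero(bits, p, w_r, x, w_c):
    """Check block-rows x .. w_c-1 (0-based) against the hard-decision bits."""
    for i in range(x, w_c):
        for r in range(p):
            parity = 0
            for j in range(w_r):
                # Row r of block (i, j) has its single 1 in column (r + i*j) mod p.
                parity ^= bits[j * p + (r + i * j) % p]
            if parity:
                return False
    return True
```

Because each block is a shifted identity, every syndrome bit is an XOR of exactly $w_r$ codeword bits at computable offsets, which is consistent with the small gate count reported for the syndrome check logic.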

Conclusions
This paper has presented a new energy-efficient LDPC decoding method called partial LDPC decoding by taking into account the characteristics of the NAND flash channel. The proposed algorithm decodes by using a portion of the full H-matrix in order to save the energy consumed in the decoding. The partial decoding can provide a range of error-correcting capabilities by adjusting the size of the partial H-matrix, enabling a trade-off between energy consumption and error-correcting capability. The existence of a partial H-matrix that achieves almost the same decoding performance as that of the full H-matrix for a specific SNR has been analyzed theoretically and confirmed by intensive simulations. A prototype decoder implementing the proposed algorithm has been developed for 4 KB LDPC codes using a 65 nm CMOS process. The proposed decoder reduces energy consumption by up to 93% compared to recent LDPC decoding architectures.

Figure 2. The average number of iterations simulated for various partial H-matrices of the (149, 61, 6) array LDPC code.

Figure 3. The comparison of memory accesses resulting from the conventional and proposed decoding algorithms for the (149, 61, 6) array LDPC code.

Figure 4. The decoding latency of the proposed algorithm normalized by that of the conventional one for the (149, 61, 6) array LDPC code.

Figure 5. The probability distribution of LLR values and the estimated BERs of the (149, 61, 6) array LDPC code when the SNR is 5 dB.

Figure 6. The theoretically calculated number of iterations for the various partial H-matrices of the (149, 61, 6) array LDPC code.

Figure 7. The prototype decoder for the proposed energy-efficient partial LDPC decoding algorithm for the (607, 60, 6) array LDPC code.

1: Input: $S = \{H_2, H_3, \ldots, H_{w_c}\}$, MAI, and channel SNR
2: j = index of a partial H corresponding to the channel SNR

Table 1. Areas of the syndrome checking logic for various sizes of LDPC codes in a 65 nm CMOS process.