Article

How to Use Redundancy for Memory Reliability: Replace or Code?

1 Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
2 Samsung Electronics, Hwaseong 18448, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(9), 1812; https://doi.org/10.3390/electronics14091812
Submission received: 10 March 2025 / Revised: 19 April 2025 / Accepted: 25 April 2025 / Published: 29 April 2025
(This article belongs to the Section Circuit and Signal Processing)

Abstract

Modern digital systems rely on DRAM as main memory and flash-based SSDs for storage, forming the backbone of today’s computing infrastructure. As demands for faster processing and larger data services increase, memory subsystems have become denser, pushing technologies to their physical limits and increasing susceptibility to faults. To ensure data integrity, two complementary approaches are employed: replacement-based techniques, which map defective cells to redundant areas, and error-correcting code (ECC) methods, which dynamically detect and correct errors. This paper theoretically investigates the most efficient use of redundancy for DRAM reliability by categorizing defects into hard faults and soft errors. Each scenario is evaluated in terms of required redundancy and residual error rate, using finite-length channel coding capacity. The ideal coding performance is compared with that of BCH codes, which are widely favored in on-die ECC applications due to their low latency and decoding complexity.

1. Introduction

In modern digital systems, DRAM is commonly employed as the main memory, while flash-based SSDs serve as the primary storage solution [1,2,3]. Both DRAM and flash memory technologies have achieved significant success, becoming key pillars of today’s computing infrastructure. As demands for faster processing and large-scale data services continue to escalate, the industry has prioritized creating larger, faster, and more energy-efficient memory subsystems. Consequently, the densification of memory cells in silicon dies has accelerated to unprecedented levels, pushing the technology toward its fundamental physical limits [2,4]. However, this extreme density renders modern memory devices more susceptible to various faults and errors [5,6]. To address these vulnerabilities, advanced error protection schemes are increasingly imperative for ensuring data integrity and reliability in cutting-edge computing environments.
As DRAM technology continues to scale, ensuring reliability has become a critical challenge due to process variations, retention failures, and inherent defects. Enhancements in DRAM reliability are achieved through two key approaches: replacement-based techniques and error-correcting code (ECC)-based techniques. Replacement-based methods mitigate faults by mapping defective memory cells to redundant, functional regions, while ECC-based methods detect and correct errors dynamically to maintain data integrity. These approaches complement each other, with replacement techniques addressing permanent faults and ECC handling transient and soft errors.
First, replacement-based techniques generally map faulty memory cells to redundant, functional regions. Redundant row/column sparing dynamically replaces defective rows or columns with pre-allocated spares, mitigating defects with minimal overhead [7]. Laser fusing permanently disables faulty cells during fabrication and redirects access to spare units, improving yield [8]. Post-packaging repair (PPR) enables EEPROM-based remapping, allowing in-field correction of memory faults and thereby extending DRAM lifespan [9]. These replacement-based strategies collectively enhance DRAM reliability across both the manufacturing and operational phases.
On the other hand, ECC technologies have become increasingly important for DRAM [10,11]. ECC techniques are applied in various ways to enhance reliability and data integrity. Rank-level ECC provides robust error protection by applying ECC at the DRAM rank level, ensuring resilience against multi-bit errors [12]. In-DRAM ECC detects and corrects errors within the DRAM chip itself, reducing the dependency on external memory controllers and improving efficiency [13]. Multi-tier ECC optimizes performance by separating error detection (ED) and error correction (EC), allowing for more flexible and efficient error handling.
These diverse ECC strategies collectively strengthen DRAM reliability across different failure scenarios. Despite these various DRAM reliability solutions, there has been no theoretical study on the joint optimization of the two approaches. In this paper, we investigate, from a theoretical standpoint, how to use redundancy most efficiently for DRAM reliability. To expose the theoretical limits, we set aside some practical considerations (implementation friendliness, etc.).
More specifically, we classify cell defects as hard faults and soft errors and categorize the resulting scenarios. For each scenario, error protection methods are evaluated in terms of required redundancy or residual error rate. To evaluate the capacity of ECC schemes, the finite-length capacity of channel coding [14] is used and compared with BCH codes [15], since BCH codes and their variants are the most preferred in on-die ECC applications due to their low decoding complexity and short decoding latency.
In Section 2, the system model and notation are introduced. In Section 3, replacement and ECC techniques are separately evaluated in terms of the error-free rate. In Section 4, we study the ECC performance behavior for soft errors and a unified approach for the case where both hard faults and soft errors coexist. Conclusions are given in Section 5.

2. System Model

In order to develop a simplified model, we cast the DRAM redundancy techniques as a resource allocation problem. Although DRAM errors occur in various patterns due to different causes, we consider only two classes of errors: ‘hard faults’ and ‘soft errors’. Hard faults are permanent errors, normally originating from the fabrication or manufacturing process. Soft errors are non-permanent errors that occur occasionally during operation. For the basic analysis, we assume soft errors arise independently.
We assume the number of available cells is $N$; among them, $D$ cells are used for data, and $R$ cells are allocated as redundancy for the DRAM reliability mechanism. Consequently,
$$N = D + R.$$
Let $\rho = R/D$ be called the redundancy rate. As every cell cannot be handled separately, we consider block-wise operation. Let $k_B$ be the block size, which is a system parameter; then, the number of cell blocks in the system is $N_B = N/k_B$. We can then count the data and redundancy sizes in terms of blocks: $N_D = D/k_B$ is the number of data blocks and $N_R = R/k_B$ the number of redundant blocks.
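As a quick numerical illustration of this bookkeeping, the following Python sketch computes the cell and block counts for an assumed configuration; the chosen values of $N$, $\rho$, and $k_B$ are illustrative and not tied to a specific experiment in this paper.

```python
# Resource accounting of the system model: N = D + R, rho = R/D,
# and block counts N_B, N_D, N_R for a block size k_B.
N = 2 ** 33        # total cells (the chip size used in the figures)
rho = 0.1          # redundancy rate rho = R / D
k_B = 256          # processing block size (system parameter)

D = round(N / (1 + rho))   # data cells, from N = D + R with R = rho * D
R = N - D                  # redundant cells
N_B = N // k_B             # total number of blocks
N_D = D // k_B             # number of data blocks
N_R = R // k_B             # number of redundant blocks
print(f"D={D}, R={R}, N_B={N_B}, N_D={N_D}, N_R={N_R}")
```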
Because the goal of this study is to provide a comprehensive guide for the efficient use of redundancy, or overhead, in memory systems, we assume that the coding parameters can be freely chosen. This assumption is not feasible in practice; however, our focus here is the theoretical limit of coding. More practical considerations can be layered on after the theoretical bound is established, and practical solutions can be evaluated by how far they fall from the limit.
As mentioned, there are two types of errors to address: hard faults and soft errors. Hard faults are conventionally handled by the test-and-replacement method: if a cell is detected as faulty, the bad cell’s data block (a ‘faulty block’) can be replaced with one from the spare cells. However, soft errors cannot be prevented by replacement. Since they occur in a random fashion, a proper error protection technique, namely ECC, should be applied.
Replacement is achieved by allocating a new data block from the spare cells to the memory address of a faulty block, as demonstrated in Figure 1a. The device or controller must memorize the list of newly allocated addresses. For soft errors, ECC techniques are applied to protect the data blocks from randomly occurring errors; the error protection capability varies with the coding parameters, such as block length and overhead (or code rate). Here, we consider two types of coding performance: the theoretical finite-length capacity [14] and that of practical binary block codes preferred in memory applications. The ECC method is visualized in Figure 1b.

3. Error Protection from Hard Faults

In this section, we investigate how to deal with hard faults in memory systems. Hard faults can be tackled by two types of error protection mechanisms: replacement and error correction coding. We investigate the behavior when each method is applied exclusively.

3.1. Replacement Approach

In the manufacturing and fabrication process, defects may occur that must be repaired to preserve the integrity of the memory system [16]. Under the assumption that a memory block can be tested to determine whether it contains a fault, the logical replacement of the memory block is the best way to enhance the yield rate. Replacement is also called ‘repair’.
Redundant cells must be added to the memory chip for the replacement process, and their proper allocation is crucial to the economics of memory production. By replacing faults, both the yield rate and the lifetime of the memory chip can be enhanced. Here, we investigate the (virtual) yield rate behavior according to a target fault rate, which can be measured in practice.
We study this problem under the assumption of uniformly random hard faults, which may not precisely reflect practice. However, the results at least provide a lower (or upper) limit on practical metrics, and the analysis can be refined with more accurate or practical channel modeling. For the mathematical model, some quantities must be defined.
Let $p_f$ be the probability of a fault occurring in an individual cell; faults are assumed to be distributed randomly and independently. Let $p_{FB}$ denote the probability of a faulty block, which is a function of $p_f$ and the block size $k_B$. The probability that the entire system (or a chip) is fault-free is denoted by $p_{\mathrm{free}}$, and we have $p_{\mathrm{free}} = 1 - p_{\mathrm{fail}}$, where $p_{\mathrm{fail}}$ is the probability of system failure.
The replacement method is very simple: a faulty cell block is bypassed by redirecting its address to a redundant cell block. We consider the chip alive if all faulty blocks can be replaced by sound redundant blocks. The probability of a faulty block is
$$p_{FB} = 1 - (1 - p_f)^{k_B}, \qquad (1)$$
where $p_f$ is the probability of a faulty cell. This simple formula in (1) will be used to analyze the $p_{\mathrm{free}}$ of the replacement scheme.
For reference, if there is no redundancy and replacement cannot be applied, the system survives only when all data blocks, or equivalently all cells, are good. Thus,
$$p_{\mathrm{free}} = (1 - p_f)^N = (1 - p_f)^{N_D k_B}, \qquad (2)$$
where $N_D$ is the number of data blocks. In this case, $p_{\mathrm{free}}$ decreases quickly as $N_D$ becomes larger. Figure 2 exhibits $p_{\mathrm{free}}$ versus $p_f$: the yield rate diminishes rapidly while $p_f$ is still very low, with no notable dependence on the block size $k_B$. Let us see what changes if we apply the replacement (or repair) technique.
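For concreteness, a minimal numerical sketch of this no-protection baseline follows; it simply evaluates $(1 - p_f)^N$ in the log domain to avoid underflow. The sampled $p_f$ values are illustrative.

```python
import math

def p_free_unprotected(p_f: float, N: int) -> float:
    """P[all N cells are fault-free] = (1 - p_f)^N, i.e., Eq. (2) with no
    redundancy; computed via log1p for numerical stability at large N."""
    return math.exp(N * math.log1p(-p_f))

# For N = 2^33 cells, p_free collapses once p_f approaches 1/N ~ 1.2e-10.
for p_f in (1e-12, 1e-10, 1e-9, 1e-8):
    print(f"p_f={p_f:.0e}  p_free={p_free_unprotected(p_f, 2**33):.3e}")
```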
Now, let us incorporate the replacement scheme by allocating a redundancy portion. We consider a typical value of the redundancy rate, $\rho = R/D = 0.1$, corresponding to a 10% overhead. Let $N_f$ be the number of faulty blocks in the system, where $N_{f,D}$ and $N_{f,R}$ are the numbers of faulty blocks in the data and redundancy parts, respectively. Based on the replacement policy, the failure of chip integrity occurs if
$$N_{f,D} > N_R - N_{f,R},$$
the event that the number of faulty data blocks exceeds the number of healthy remaining redundant blocks. This condition is equivalent to
$$N_f > N_R,$$
which we call an ‘outage’ event.
The quantity $N_f$ is binomially distributed and, by the central limit theorem, can be approximated by a Gaussian random variable with mean $\mu = N_B\, p_{FB}$ and variance $\sigma^2 = N_B\, p_{FB}(1 - p_{FB})$. The density is
$$f_{N_f}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
The outage probability $p_{\mathrm{fail}}$ can then be evaluated simply by the Q function as
$$p_{\mathrm{fail}} = \Pr[N_f > N_R] = \int_{N_R}^{\infty} f_{N_f}(x)\, dx = Q\left( \frac{N_R - \mu}{\sigma} \right).$$
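A minimal sketch of this outage evaluation is given below, implementing the Gaussian approximation with the Q function expressed via the complementary error function. The parameters are illustrative; this is not the simulation code behind the figures.

```python
import math

def q_func(x: float) -> float:
    """Gaussian tail Q(x) = P[Z > x], via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def p_fail_replacement(p_f: float, N: int, rho: float, k_B: int) -> float:
    """Outage probability Pr[N_f > N_R] under the CLT approximation."""
    D = round(N / (1 + rho))
    N_B = N // k_B                       # total blocks
    N_R = (N - D) // k_B                 # redundant blocks
    p_FB = 1.0 - (1.0 - p_f) ** k_B      # faulty-block probability, Eq. (1)
    mu = N_B * p_FB
    sigma = math.sqrt(N_B * p_FB * (1.0 - p_FB))
    return q_func((N_R - mu) / sigma)

# The threshold behavior of Figure 3: p_free jumps from ~0 to ~1 as p_f drops.
for p_f in (1e-3, 4e-4, 3e-4, 1e-4):
    p_free = 1.0 - p_fail_replacement(p_f, 2**33, 0.1, 256)
    print(f"p_f={p_f:.0e}  p_free={p_free:.4f}")
```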
Figure 3 shows $p_{\mathrm{free}} = 1 - p_{\mathrm{fail}}$ versus $p_f$ under the basic setting. A threshold behavior is observed: $p_{\mathrm{free}}$ surges from zero to one at a critical threshold as $p_f$ decreases. This phenomenon results from the law of large numbers associated with the large $N_B$. It is observed that as the block length $k_B$ increases, the threshold on $p_f$ decreases. The same mathematical result is drawn in terms of the redundancy rate $\rho$ in Figure 4, where a similar threshold behavior is again observed: as the available repair resources increase, $p_{\mathrm{free}}$ surges at a threshold, and when the block size is larger, more redundancy is required to achieve $p_{\mathrm{free}} \approx 1$.
By applying replacement, it is shown that $p_{\mathrm{free}} \approx 1$ is achieved at a much higher raw fault rate $p_f$. For a fixed $p_f$, $p_{\mathrm{free}}$ jumps as the block length becomes smaller; for a fixed redundancy rate, $p_{\mathrm{free}}$ likewise shows a sharp transition from zero to one as the block length decreases. In simple terms, a smaller block length is more advantageous for the replacement scheme if we disregard the control overhead. It is worth noting that the block length plays a crucial role in the performance of replacement schemes, whereas it does not affect the probability of a fault-free system when no protection is applied.

3.2. Error Correction Coding Approach

Here, we consider using ECC schemes to handle hard faults; for now, soft errors are not taken into account. Although ECCs are designed for random errors, they can still be effective in improving the probability of a fault-free system $p_{\mathrm{free}}$. We assume that ECC is applied as a blind fault protection mechanism against hard faults, so faulty blocks need not be detected in advance.
The performance of ECCs depends on the block length [14,17], and the finite-length capacity of ECCs has been analyzed theoretically over various channels [14]. We consider capacity-achieving finite-length codes, whose performance is far superior to existing memory codes such as single-error-correcting codes [18,19] or BCH codes [15], to reveal the theoretical limit of this approach.
In the evaluation of the ECC approach, $p_{\mathrm{free}}$ is the performance metric. Because an ECC can correct a number of hard faults within a block, a faulty block with fewer errors than the error correction capability can be considered error-free. Therefore, the probability of a fault-free system, $p_{\mathrm{free}}$, can be evaluated in terms of the error correction capability.
Let us first discuss the setting. We assume the entire redundancy part is used for ECC parity. Overall, $D$ data cells are encoded with $R$ redundant cells, resulting in a code rate of $r = D/(D + R)$. Since data are handled in $k_B$-bit blocks, the redundancy part should also be partitioned into $k_P$ $(= k_B R/D)$-bit blocks; a data block of $k_B$ bits and a parity block of $k_P$ bits form a codeword. The block error rate (BLER) of BCH codes (after decoding) is directly related to their error-correcting capability $t$. Let $p_{B,\mathrm{BCH}}$ denote the BLER of BCH codes, which is given as
$$p_{B,\mathrm{BCH}} = \Pr(\text{number of errors} > t).$$
This can be formulated using the binomial distribution parameterized by the bit error probability, here the fault probability $p_f$:
$$p_{B,\mathrm{BCH}} = \sum_{i=t+1}^{n} \binom{n}{i} p_f^{\,i} (1 - p_f)^{n-i},$$
the probability that more than $t$ errors occur in a block of length $n$. Hence, the corresponding fault-free probability is given by
$$p_{\mathrm{free,BCH}} = (1 - p_{B,\mathrm{BCH}})^{N_D}, \qquad (3)$$
following the form of Equations (1) and (2).
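Under the stated assumptions, this blind-ECC evaluation reduces to a binomial tail, as in the short sketch below (using SciPy’s survival function). The code parameters in the example call are assumed for illustration, not a specific BCH code from the figures.

```python
from scipy.stats import binom

def p_free_blind_ecc(p_f: float, n: int, t: int, N_D: int) -> float:
    """Fault-free probability of a chip protected only by blind t-error-
    correcting codes: a codeword of length n fails when more than t of its
    cells are faulty, and all N_D codewords must decode (Eq. (3))."""
    p_B = binom.sf(t, n, p_f)            # P[#errors > t], the binomial tail
    return (1.0 - p_B) ** N_D

# Illustrative call: a (283, 256) code with t = 3 (an assumed shortened-BCH-
# like parameter set) over the data blocks of a 2^33-cell, rho = 0.1 chip.
N, rho, k_B = 2 ** 33, 0.1, 256
N_D = round(N / (1 + rho)) // k_B
print(p_free_blind_ecc(1e-5, 283, 3, N_D))
```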
Once the block length is determined, the BLER can also be evaluated by finite-length analysis. Let $l_B$ be the length of a codeword, which is $k_B N / D$. Then, the BLER is given by the normal approximation:
$$p_{B,\mathrm{NA}} = Q\left( \frac{1 - h(p_f) - r + \frac{\log_2 l_B}{2 l_B}}{\sqrt{p_f (1 - p_f)/l_B}\; \log_2 \frac{1 - p_f}{p_f}} \right), \qquad (4)$$
where
$$h(x) = -(1 - x)\log_2(1 - x) - x \log_2 x.$$
Because there are $N_D$ codewords,
$$p_{\mathrm{free,NA}} = (1 - p_{B,\mathrm{NA}})^{N_D}. \qquad (5)$$
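A minimal Python transcription of (4) and (5) follows; it is our own transcription assuming a binary symmetric channel as in [14], not the authors’ code.

```python
import math

def bler_normal_approx(p: float, r: float, l_B: int) -> float:
    """BLER of a rate-r code of length l_B on a binary symmetric channel
    with crossover p, per the normal approximation of [14] (Eq. (4))."""
    h = -(1.0 - p) * math.log2(1.0 - p) - p * math.log2(p)   # binary entropy
    num = 1.0 - h - r + math.log2(l_B) / (2.0 * l_B)
    den = math.sqrt(p * (1.0 - p) / l_B) * math.log2((1.0 - p) / p)
    return 0.5 * math.erfc(num / den / math.sqrt(2.0))       # Q(num / den)

def p_free_na(p_f: float, r: float, l_B: int, N_D: int) -> float:
    """Chip-level fault-free probability of Eq. (5)."""
    return (1.0 - bler_normal_approx(p_f, r, l_B)) ** N_D
```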
In Figure 5, $p_{\mathrm{free}}$ (from Equations (2), (3) and (5)) is plotted versus $p_f$ for various error protection configurations, where the faults are recovered by ECC in a blind manner. For the ‘Repair $(n, k_B)$’ curves, $(n, k_B)$ indicates that a block of $k_B$ data cells is encoded and $(n - k_B)$ parity bits are generated. For the BCH curves, the parameters indicate $(n, k_B, t)$ of the BCH codes. For the normal approximation of finite-length codes, the parameter indicates $k_B$.
The BCH results demonstrate that increasing the block length improves the error correction performance, gradually approaching the theoretical finite-length limit predicted by the normal approximation. This aligns with the expected behavior of ECCs, where longer codes yield higher decoding reliability. In contrast, the repair approach exhibits better $p_{\mathrm{free}}$ performance with shorter block lengths, as smaller blocks reduce the likelihood of multiple faults occurring within a single block. As the block length increases, the benefit of simple replacement diminishes. These results highlight the distinct tradeoffs between the repair and coding approaches.
As $p_f$ increases, $p_{\mathrm{free}}$ drops gradually from one to zero, with a modest slope. Compared with the replacement method, the BCH codes perform poorly for small to moderate block lengths (because the ECC method is a blind protection). As the block length increases, the gap between BCH coding and replacement narrows, since the performance of BCH codes improves with the block length while the number of blocks decreases. For optimal finite-length codes, however, this tendency appears at smaller block lengths: among our tested block lengths, the replacement method beats the optimal coding only at the smallest length, $k_B = 128$.
Note that we evaluate the error protection performance against randomly deployed hard faults in the manufacturing process of a chip. Although ignored in this work, the local coherence in cell quality over a chip may be relevant. More precise models should be developed for better analysis. However, the overall trends may not change much.
The $p_{\mathrm{free}}$ improves as the block length increases because long codes have better capacity in general and the number of blocks in a chip decreases. No threshold behavior occurs; instead, a gradual enhancement is shown. It is interesting to see that for large block lengths, the ECC approach can outperform the replacement approach.
Note that although an ECC method can overcome hard faults to some extent, the error tolerance of ECC schemes is generally not as good as that of the replacement method. Moreover, blocks with hard faults are more prone to soft errors than other blocks.

4. Error Protection from Soft Errors

In this section, we incorporate soft errors into the problem. First, the soft-error-only scenario is considered, and then a mixed case is addressed. We assume soft errors occur randomly and independently of location, so every cell suffers the same soft error probability. It is worth noting that in reality, some cells degrade earlier than others and thus produce more errors. Soft errors cannot be removed by a replacement method because testing cannot locate their source; they must instead be corrected by ordinary ECC methods. We assume that no hard faults remain, as they are removed by the repair process during production.

4.1. Soft Error Only Scenario

In this section, we consider the soft-error-only scenario and discuss how to handle soft errors when there is no hard fault. All redundancy bits are used for ECC parity. Let $R'$ be the number of remaining redundant cells; if we ignore pre-processing, we can take $R' = R$ for this problem. Hard faults are now assumed to be nonexistent or all repaired in the pre-processing step. The code rate $r$ is then
$$r = D/(D + R'),$$
consistent with the definition in Section 2.
This scenario is evaluated by $p_B$, the BLER. As a practical coding scheme for DRAM or main memory, BCH codes are considered in this paper. Although several codes, such as Hsiao codes [19], Dutta codes, and Petro codes, have been developed, they are basically variants of the single-error-correcting code, focused more on improving the implementation aspects of the code. Since we consider only error correction performance in this paper, the family of multiple-error-correcting binary BCH codes is sufficient as a practical reference.
Let $p_e$ be the probability of a soft error, occurring independently and uniformly at random. The BLER of a practical $t$-error-correcting code is computed as
$$p_B = 1 - \sum_{i=0}^{t} \binom{l_B}{i} p_e^{\,i} (1 - p_e)^{l_B - i}.$$
On the other hand, for ideal finite-length codes, the BLER is computed by (4). Let us compare the BLER performances of the two schemes. Figure 6 exhibits the BLER of the two code families, with the BCH parameters chosen closest to the basic $k_B$ and $l_B$ settings; note that the BLER here is for a single individual block. As expected, the performance improves as the block length increases, but the BCH codes are severely outperformed by the optimal codes, and their improvement with block length is marginal compared with the optimal coding performance.
The optimal finite-length performance is thus much better than that of BCH codes. As the block length increases, $p_B$ decreases in general for both code families, so an increase in block length is desirable as long as circumstances such as complexity or latency allow. There remains a large gap between the existing solutions (individual BCH coding) and the theoretical limit (optimal finite-length coding), as the sketch below illustrates numerically.
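The code parameters below are assumptions chosen near the basic setting (a $t = 3$ code of length 283 at rate about 0.9), not the exact BCH codes of Figure 6.

```python
import math
from scipy.stats import binom

def bler_t_correcting(p_e: float, l_B: int, t: int) -> float:
    """BLER of a t-error-correcting code of length l_B: P[#errors > t]."""
    return binom.sf(t, l_B, p_e)

def bler_normal_approx(p_e: float, r: float, l_B: int) -> float:
    """Normal-approximation BLER (Eq. (4)) at code rate r and length l_B."""
    h = -(1.0 - p_e) * math.log2(1.0 - p_e) - p_e * math.log2(p_e)
    num = 1.0 - h - r + math.log2(l_B) / (2.0 * l_B)
    den = math.sqrt(p_e * (1.0 - p_e) / l_B) * math.log2((1.0 - p_e) / p_e)
    return 0.5 * math.erfc(num / den / math.sqrt(2.0))

# At matched length and rate, the finite-length limit sits orders of
# magnitude below the t-error-correcting (BCH-like) tail.
p_e, l_B, k_B, t = 1e-3, 283, 256, 3
print("t-correcting BLER:   ", bler_t_correcting(p_e, l_B, t))
print("finite-length limit: ", bler_normal_approx(p_e, k_B / l_B, l_B))
```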

4.2. How to Handle the Mixture of Hard Faults and Soft Errors

We now consider a case where both hard faults and soft errors occur. The amount of redundant cells is given in advance, and an optimization problem is formed. First, it is clear that hard faults are better handled by replacement; therefore, a two-step approach (replace faults first, then correct soft errors) is very reasonable based on the insights obtained in the previous sections. While the redundancy devoted to replacement is determined directly, the block length still needs to be chosen, and it affects both the replacement and error correction efficiencies. The block length optimization problem is addressed in this section under a fixed total redundancy constraint.
Here, the problem is organized:
  • Assign replacement cells fitted to the target $p_f$. (We assume aging can increase $p_f$, and the repair process can be conducted after production.)
  • The block length can be optimized in terms of the final BLER.
Under this scheme, all data blocks with hard faults are replaced first, or redundant space for replacement is reserved for future use at the target hard fault rate. ECC therefore runs without hard faults and only has to deal with soft errors. The block length controls both the replacement efficacy and the ECC protection performance, so it is worth finding the optimal block length. One might argue that the block lengths for replacement and coding could differ; however, we assume they are the same, since different lengths would cause a large overhead. A minimal sketch of the resulting block-length sweep is given after this paragraph.
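In the sketch, the soft error rate $p_e$ and the replacement provisioning rule (reserving the mean number of faulty blocks plus a six-sigma margin) are our own assumptions for illustration; the exact settings behind Figure 7 are not reproduced.

```python
import math

def q_func(x: float) -> float:
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bler_two_step(k_B: int, N: int = 2**33, rho: float = 0.1,
                  p_f: float = 1e-5, p_e: float = 1e-3) -> float:
    """Two-step scheme: reserve redundant blocks for hard-fault replacement
    (mean + 6 sigma, an assumed provisioning rule), then spend the remaining
    redundancy as ECC parity against soft errors, scored by the
    normal-approximation BLER of Eq. (4)."""
    D = round(N / (1 + rho))
    N_B = N // k_B
    R_blocks = (N - D) // k_B
    p_FB = 1.0 - (1.0 - p_f) ** k_B
    mu = N_B * p_FB
    sigma = math.sqrt(N_B * p_FB * (1.0 - p_FB))
    reserve = math.ceil(mu + 6.0 * sigma)            # replacement reserve
    parity_cells = (R_blocks - reserve) * k_B        # left over for ECC
    if parity_cells <= 0:
        return 1.0                                   # no parity remains
    r = D / (D + parity_cells)                       # effective code rate
    l_B = round(k_B / r)                             # codeword length
    h = -(1.0 - p_e) * math.log2(1.0 - p_e) - p_e * math.log2(p_e)
    num = 1.0 - h - r + math.log2(l_B) / (2.0 * l_B)
    den = math.sqrt(p_e * (1.0 - p_e) / l_B) * math.log2((1.0 - p_e) / p_e)
    return q_func(num / den)

# Sweep k_B: replacement waste grows with k_B while finite-length coding
# gain improves with k_B, yielding an interior optimum (a convex curve).
for k_B in (256, 512, 1024, 2048, 4096, 8192):
    print(k_B, bler_two_step(k_B))
```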
In Figure 7a, the BLER is shown versus the processing block length $k_B$ for $N = 2^{33}$, $\rho \in \{0.1, 0.143\}$, and $p_f = 10^{-5}$. Both the practical BCH codes and ideal finite-length codes are considered. Because of the tradeoff between the replacement efficiency and the soft error correction efficiency, we observe a clean convex curve for the optimal codes; the BLER is minimized at a block length of 3600 for optimal codes at $p_f = 10^{-5}$. A similar behavior is shown for BCH codes. Note that for shortened BCH codes, the error correction capability does not improve gradually with the block length: $t$ can remain the same even as $k_B$ increases (as $k_B$ continues to increase, $t$ increases stepwise). For the parameter setting, given $k_B$ and the code rate, we searched all possible shortened BCH code candidates and selected the one with the largest $t$ whose rate is closest to the target. The curve nevertheless shows a convex shape from which a minimum can be found.
The BLER is minimized at a block length of 3683 for optimal codes where $\rho = 0.1$ and $p_f = 10^{-5}$. The tradeoff curve and the optimal value of $k_B$ may vary with the redundancy rate $\rho$. The results in Figure 7 confirm that the optimal $k_B$ increases as $\rho$ increases, but it consistently lies in the range of several thousand bits for reasonable $\rho$ values. The optimal length grows with $\rho$ because more redundancy leaves a larger repair margin and, through the remaining redundancy, a lower ECC BLER.
For the redundancy rate $\rho$ corresponding to the code rate of a typical DRAM system, the optimal $k_B$ is on the order of several thousand bits, as shown in Figure 7. However, such a long $k_B$ is infeasible due to hardware constraints such as the target decoding latency and the power/thermal budget, so today’s technology uses a small $k_B$. As the analysis curves suggest, it is better to increase the block length where possible.
Figure 8 exhibits the case where the information block size $k_B$ is an integer multiple of the replacement block length (say, $k_R$). For a given asymmetry, we still obtain an optimal block length that achieves the lowest BLER, and as the asymmetry increases, the optimal BLER performance improves.

5. Conclusions

In this paper, we investigated the efficient use of redundancy for memory reliability. We examined two error protection methods, replacement and ECC, and analyzed their performance against hard faults, soft errors, and mixed cases. In particular, we considered practical codes such as BCH codes alongside ideal finite-length codes to quantify the gap between theoretical limits and practice. Based on our analysis, hard faults are best addressed by the replacement method, while soft errors are best protected by ECC schemes. There is a tradeoff between hard fault repair and soft error protection in terms of the processing block length: the fundamental tension lies between the inefficiency of replacement at large block lengths and the improved error correction capability of ECC with longer blocks. The optimal processing block length can be found from this relation. Practical coding schemes operate with block lengths far from the theoretically optimal value, which motivates multi-level coding, which can encode much larger data blocks with low processing complexity using short component codes. However, such multi-level coding schemes should be designed carefully so that structured error patterns do not coincide with the weaknesses of the multi-level scheme.
For future work, more rigorous optimization can be studied for mixed cases; for instance, only some hard faults may be repaired, with ECC handling the remaining hard faults together with soft errors. Non-uniform fault or error probability scenarios can be investigated, as can correlated and structured faults or soft errors and their countermeasures, such as multi-level coding.

Author Contributions

Conceptualization, S.-H.K.; Methodology, S.-H.K.; Software, H.J. and D.-H.K.; Validation, S.-H.K.; Formal analysis, K.L., M.-K.L. and S.C.; Investigation, H.J. and D.-H.K.; Resources, H.J. and D.-H.K.; Data curation, D.-H.K.; Writing (original draft), H.J., D.-H.K., K.L., M.-K.L., S.C. and S.-H.K.; Writing (review and editing), H.J. and S.-H.K.; Visualization, H.J.; Supervision, S.-H.K.; Project administration, S.-H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by Samsung Electronics Co., Ltd. (IO201209-07889-01), the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government (MOTIE) (RS-2022-KP002703), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (RS-2024-00343913).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Dong-Hyun Kong, Kijun Lee, Myung-Kyu Lee, and Sunghye Cho were employed by the company Samsung Electronics. This work was partly supported by Samsung Electronics Co., Ltd. (IO201209-07889-01). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CLT   central limit theorem
EC    error correction
ECC   error-correcting code
ED    error detection
IECC  in-DRAM error correction code
PPR   post-packaging repair

References

  1. Schroeder, B.; Pinheiro, E.; Weber, W.D. DRAM errors in the wild: A large-scale field study. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Seattle, WA, USA, 15–19 June 2009; pp. 193–204. [Google Scholar]
  2. Grupp, L.M.; Davis, J.D.; Swanson, S. The bleak future of NAND flash memory. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST), San Jose, CA, USA, 15–17 February 2012; p. 2. [Google Scholar]
  3. Mielke, N.; Marquart, T.; Wu, N.; Kessenich, J.; Cubert, H.; Cadien, K.; Hankinson, J.; Nevill, R. Bit error rate in NAND flash memories. In Proceedings of the 2008 IEEE International Reliability Physics Symposium, Phoenix, AZ, USA, 27 April–1 May 2008; pp. 9–19. [Google Scholar]
  4. Kim, Y.; Daly, R.; Kim, J.; Fallin, C.; Lee, J.H.; Lee, D.; Wilkerson, C.; Lai, K.; Mutlu, O. Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In Proceedings of the 41st International Symposium on Computer Architecture (ISCA), Minneapolis, MN, USA, 14–18 June 2014; pp. 361–372. [Google Scholar]
  5. Patel, M.; Kim, J.; Mutlu, O. The Reach Profiler (REAPER): Enabling the mitigation of DRAM retention failures via profiling at aggressive conditions. In Proceedings of the 44th International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 255–266. [Google Scholar]
  6. Masashi, H.; Itoh, K. Nanoscale Memory Repair; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  7. Hou, C.S.; Chen, Y.X.; Li, J.F.; Lo, C.Y.; Kwai, D.M.; Chou, Y.F. A built-in self-repair scheme for DRAMs with spare rows, columns, and bits. In Proceedings of the 2016 IEEE International Test Conference (ITC), Fort Worth, TX, USA, 15–17 November 2016; pp. 1–7. [Google Scholar]
  8. Gu, B.; Coughlin, T.; Maxwell, B.; Griffith, J.; Lee, J.; Cordingley, J.; Johnson, S.; Karaginiannis, E.; Ehmann, J. Challenges and future directions of laser fuse processing in memory repair. In Proceedings of the Semicon China, Shanghai, China, 12 March 2003. [Google Scholar]
  9. Kim, D.-H.; Milor, L.S. ECC-ASPIRIN: An ECC-assisted post-package repair scheme for aging errors in DRAMs. In Proceedings of the 2016 IEEE 34th VLSI Test Symposium (VTS), Las Vegas, NV, USA, 25–27 April 2016; pp. 1–6. [Google Scholar]
  10. Jung, G.; Na, H.J.; Kim, S.H.; Kim, J. Dual-axis ECC: Vertical and horizontal error correction for storage and transfer errors. In Proceedings of the 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 18–20 November 2024. [Google Scholar]
  11. Lee, D.; Cho, E.; Kim, S.H. On the performance of SEC and SEC-DED-DAEC codes over burst error channels. In Proceedings of the 2021 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 20–22 October 2021. [Google Scholar]
  12. Gong, S.L.; Kim, J.; Lym, S.; Sullivan, M.; David, H.; Erez, M. DUO: Exposing On-Chip Redundancy to Rank-Level ECC for High Reliability. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, Austria, 24–28 February 2018; pp. 683–695. [Google Scholar]
  13. Kwon, S.; Son, Y.H.; Ahn, J.H. Understanding DDR4 in pursuit of In-DRAM ECC. In Proceedings of the 2014 International SoC Design Conference (ISOCC), Jeju, Republic of Korea, 3–6 November 2014; pp. 276–277. [Google Scholar]
  14. Polyanskiy, Y.; Poor, H.V.; Verdú, S. Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory 2010, 56, 2307–2359. [Google Scholar] [CrossRef]
  15. Bose, R.C.; Ray-Chaudhuri, D.K. On A Class of Error Correcting Binary Group Codes. Inf. Control 1960, 3, 68–79. [Google Scholar] [CrossRef]
  16. Cha, S.; Shin, S.O.H.; Hwang, S.; Park, K.; Jang, S.J.; Choi, J.S.; Jin, G.Y.; Son, Y.H.; Cho, H.; Ahn, J.H.; et al. Defect analysis and cost-effective resilience architecture for future DRAM devices. In Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, USA, 4–8 February 2017. [Google Scholar]
  17. Ju, H.; Park, J.; Lee, D.; Jang, M.; Lee, J.; Kim, S.-H. On improving the design of parity-check polar codes. IEEE Open J. Commun. Soc. (OJCOMS) 2024, 5, 5552–5566. [Google Scholar] [CrossRef]
  18. Hamming, R.W. Error correcting and error detecting codes. Bell Syst. Tech. J. 1950, 29, 147–160. [Google Scholar] [CrossRef]
  19. Hsiao, M.Y. A class of optimal minimum odd-weight-column SECDED codes. IBM J. Res. Develop. 1970, 14, 301–395. [Google Scholar] [CrossRef]
Figure 1. Comparison of replacement and ECC techniques. (a) Replacement. (b) ECC.
Figure 2. The probability of survival versus $p_f$ when there is no protection. There is no notable difference due to the value of $k_B$, although the curves are not exactly equal. A chip size of $N = 2^{33}$ bits (approximately 1.07 GByte) is considered.
Figure 3. The probability of a fault-free system $p_{\mathrm{free}}$ versus $p_f$, where $N = 2^{33}$ and $\rho = 0.1$.
Figure 4. The probability of a fault-free system $p_{\mathrm{free}}$ versus $\rho$, where $N = 2^{33}$ and $p_f = 10^{-3}$.
Figure 5. $p_{\mathrm{free}}$ versus $p_f$: ECC (both BCH coding and finite-length analysis) and replacement (repair) methods are compared. Data block lengths $k_B = 64, 128, 256$ are considered.
Figure 6. BLER performances of practical and ideal coding schemes. BCH codes and ideal finite-length codes (via their normal approximation [14]) are compared.
Figure 7. Block error rate versus processing block length $k_B$. Practical BCH codes and ideal finite-length codes are compared: (a) $N = 2^{33}$, $\rho = 0.1$, and $p_f = 10^{-5}$. (b) $N = 2^{33}$, $\rho = 1/7 \approx 0.143$, and $p_f = 10^{-5}$.
Figure 8. Block error rate versus processing block length ($p_f = 10^{-5}$; $k_B$: ECC information length, $k_R$: repair block length).