Pre-Emphasis Pulse Design for Random-Access Memory

This paper describes how pre-emphasis (PE) pulses can reduce the memory access time even in non-volatile random-access memory. Optimum PE pulse widths and the resultant minimum word-line (WL) delay times are investigated as a function of column address. The impact of process variation in the WL time constant, of the cell current, and of the resistance of the decoding path on optimum PE pulses is discussed. Optimum PE pulse widths and the resultant minimum WL delay times are modeled with fitting curves as a function of the column address of the accessed memory cell, which lets designers set the optimum timing for WL and bit-line (BL) operations and thereby reduce the average memory access time.


Introduction
Nonvolatile random-access memory (NVRAM), or storage class memory, is bridging the gap between volatile main memory (DRAM) and nonvolatile NAND flash memory in the memory hierarchy in terms of memory access time to improve memory performance [1,2]. In addition to having a much faster access time than NAND, NVRAM costs much less than DRAM, helping to keep the computer system cost effective. The 3D cross-point memory structure has emerged as a solution to cost scaling in more advanced nonvolatile memory technology by increasing the number of nonvolatile memory layers [3][4][5][6][7]. A design guideline was proposed in [8] for 3D cross-point memory to have a sufficient operation margin for read and write.
Pre-emphasis (PE) pulsing is a design technique used to reduce access-line delay, especially in large arrays such as 3D NAND [9] and large flat-panel displays [10]. By driving a line with a large RC delay using a pulse whose initial period is at a voltage higher than the target voltage, the overall delay time, defined at the farthest point of the line, can be reduced significantly. In [9,10], two calibration methods were proposed, since a precise PE pulse is required even with process variation in the RC time constant. In [11,12], a circuit analysis was presented for designing the PE pulse to minimize the delay time. Based on the circuit analysis, a PE pulse generator with feedback was proposed in [13].
In this paper, PE pulse design is discussed for NVRAM, where the delay time depends on the column address. Hence, the optimum width of the PE pulse can vary according to the position of the selected memory cell along a selected word-line (WL). In Section 2, the optimum PE pulse width and the minimum delay time are identified as a function of the position on WL for an ideal case with no process variation and for an actual case with process variation. The impact of the cell current and of the resistance of the decoding transistors is also investigated. In Section 3, the simulated data is compared with measured data for validation. In Section 4, the WL behavior with PE pulses is extended to a three-line model. Fitting curves with a limited number of parameters are presented for the optimum PE pulse width and the minimum delay time across WL to design the pulses. A column-address-dependent PE pulse width is proposed and applied to a memory system.
Figure 1a illustrates a memory array including four positions N1-N4 along WL, located at x = 1/4, 1/2, 3/4, and 1, respectively. Figure 1b shows simulated waveforms when WL is driven by a PE pulse with an emphasis α of 1.5. As shown in Figure 1c, when the delay time is defined with a voltage window β of 10%, the cell at N2 has the shortest delay time among the four points because the nearest cell at N1 has an overshoot over 10%. Thus, there should be an optimum pulse width per position. An α of 1.5 and a β of 10% are used as the nominal conditions in this paper unless otherwise specified. All time values in this paper scale with the RC time constant of WL; thus, arbitrary units are used for time-related parameters. To see how the PE pulse width TPRE affects the WL delay time TDLY, TPRE is skewed as shown in Figure 2a-d, again with an α of 1.5 and a β of 10%.
When TPRE is shorter than optimum, WL at the target N2 does not reach 90% of the target voltage, resulting in a longer TDLY than the minimum, as shown in Figure 2a. When TPRE is longer than optimum, WL at the target N2 overshoots, resulting in a longer TDLY than the minimum, as shown in Figure 2b. Figure 2c shows the minimum TDLY. As one can imagine, there is a window in TPRE to have the minimum TDLY, as shown in Figure 2d.
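The dependence of TDLY on TPRE described above can be reproduced with a small numerical sketch. It is not the SPICE setup used in this paper: it assumes an ideal uniform RC line driven by a zero-resistance source with an open far end, uses the textbook series solution for the step response of such a line, and measures time in units of τ = 4RC/π², the slowest time constant of that line. The pulse width τ·ln(α/(α−1)) used in the example is only the value that cancels the slowest mode; it is chosen for illustration and is not claimed to be the TOPT of (1). All function names and numerical settings are hypothetical.

```python
import math

ALPHA = 1.5    # pre-emphasis ratio, nominal value in the text
BETA = 0.10    # settling window around the target voltage
DT = 0.002     # time step in units of tau (assumed)
T_MAX = 8.0    # simulated span in units of tau (assumed)
MODES = 60     # number of series terms kept (assumed)
N_STEPS = round(T_MAX / DT)


def step_response(x):
    """Unit-step response at normalized position x (0 < x <= 1), sampled every DT."""
    samples = [0.0]  # the line starts discharged
    for i in range(1, N_STEPS + 1):
        t = i * DT
        v = 1.0
        for n in range(MODES):
            k = 2 * n + 1
            v -= 4.0 / (k * math.pi) * math.sin(k * math.pi * x / 2.0) * math.exp(-k * k * t)
        samples.append(v)
    return samples


def pe_waveform(step, t_pre):
    """Response to alpha*E for t < t_pre and E afterwards, by superposition."""
    shift = round(t_pre / DT)
    wave = []
    for i, s in enumerate(step):
        delayed = step[i - shift] if i >= shift else 0.0
        wave.append(ALPHA * s - (ALPHA - 1.0) * delayed)
    return wave


def delay_time(wave):
    """Time after which the waveform stays within +/-BETA of the target (E = 1)."""
    for i in range(len(wave) - 1, -1, -1):
        if abs(wave[i] - 1.0) > BETA:
            return math.inf if i == len(wave) - 1 else (i + 1) * DT
    return 0.0


far_end = step_response(1.0)
t_plain = delay_time(pe_waveform(far_end, 0.0))
t_pe = delay_time(pe_waveform(far_end, math.log(ALPHA / (ALPHA - 1.0))))
print(f"plain step: {t_plain:.2f} tau, with PE: {t_pe:.2f} tau")
```

Calling `step_response` with a smaller x and sweeping `t_pre` gives the position-dependent behavior discussed in the following sections; with a plain step the far-end delay under these assumptions is τ·ln(40/π), about 2.54τ.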

Ideal Case with no Process Variation
When TPRE was varied, four patterns appeared in the WL waveform at different locations along WL, as summarized in Table 1. For an α of 1.5 and a β of 10%, those patterns are distributed as shown in Figure 3. The vertical axis is TPRE normalized by TOPT, as given by (1), which is the optimum TPRE in the case of NAND, where the delay time is determined by the farthest location on WL. τ is a time constant given by τ = 4RC/π², where R and C are the total resistance and capacitance of WL.
Two boundaries indicate that TDLY can be minimized at the location x when TPRE is set between them. As expected, the minimum can be realized with pattern 1 or 2. However, for x below 0.2, no TPRE realizes pattern 1 or 2, because an α of 1.5 is too large to realize either pattern with a β of 10%.
Figure 4 shows TDLY as a function of TPRE for x = 1/6, 1/3, 1/2, and 1. A dot indicates the optimum point for the case of NAND. All four curves pass through that point, which means that all positions have the same delay time when TPRE = TOPT. On the other hand, each position has a different optimum TPRE with its own minimum delay time. For example, the minimum TDLY at x = 1/6 is about 0.5τ, with a TPRE of 0.47τ, whereas the minimum TDLY at x = 1/2 is about 0.8τ, with a TPRE of 0.85τ-1.15τ.
Figure 5a shows which positions can have a shorter delay time when TPRE is set to a specific value. For x < 0.5, the range of optimum TPRE does not include TOPT, resulting in a significant difference in TDLY between the case of TPRE = TOPT and that of the optimum TPRE at each x. The bit-line (BL) delay time starts once WL is high. As a result, there is room to start BL access earlier for, e.g., x < 0.8. A memory system taking advantage of that feature is discussed later. Figure 5b compares TDLY with α = 1.5 and α = 1.2 when TPRE is determined to have the minimum TDLY at each x. A higher pre-emphasis pulse height significantly reduces TDLY at x > 1/3, but slightly increases TDLY at x < 1/3 because a larger α does not realize the fastest pattern 1 in Table 1 at the near end of WL.
Figure 6a shows a chip micrograph of the test circuits used to validate the SPICE simulation. The test circuits were fabricated in a 0.18 μm 3V CMOS [14]. The RC line is made of multiple units of RC elements, where R and C are given by the poly resistor and MIM capacitor, respectively.
Even when a different process technology is used, all the graphs in this paper remain valid because the performance parameters such as TPRE and TDLY are normalized by the RC time constant. Internal nodes can be measured with analog buffers, as shown in Figure 6b. Figure 6c shows measured waveforms at x = 1/3 and 1. TDLY is measured with TPRE varied at x = 1/6, 1/3, 1/2, and 1 to determine the optimum TPRE for minimizing TDLY at each x. Except for x = 1/6, TPRE has a window whose edge points are plotted in Figure 7a. With such an optimum TPRE at each x, TDLY is given as a function of x, as shown in Figure 7b. Table 2 summarizes the errors between the measurement and SPICE.
Figure 8 shows TDLY as a function of TPRE at x = 1/2 under corner conditions where RC varies by ±20%. The nominal corner shown by 0% is the same curve as the one in Figure 4. When RC increases, the nominal curve simply shifts toward the upper right at 45°. Therefore, the TDLY − TPRE region can be drawn as the grey area. As a result, the worst corner is determined by the curve in red. Below about 1 for normalized TPRE, the corner of +20% determines TDLY, whereas above about 1 for normalized TPRE, the corner of −20% determines TDLY.

Such worst-corner curves are collected for different locations x, as shown in Figure 9. No curve has a flat region in terms of TPRE. The vertical line marked as "TOPT" indicates the case when (1) is applied. The graph suggests that TDLY is reduced for x < 1 even with TPRE = TOPT. It also suggests that TDLY can be minimized if one sets TPRE to the lowest point of each curve. Figure 11 shows that the x-dependent optimum TPRE can reduce TDLY by up to 50%.
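The corner analysis above can be sketched numerically as follows. This is an illustrative model rather than the SPICE deck behind Figures 8 and 9: it assumes an ideal uniform RC line with a zero-resistance driver and an open far end, expresses time in units of the nominal τ = 4RC/π², and treats a ±20% change in RC as a pure stretch of the time axis, so that a pulse of fixed width TPRE looks TPRE/k wide on a line whose RC is scaled by k. Function names, the TPRE grid, and the numerical settings are hypothetical.

```python
import math

ALPHA, BETA = 1.5, 0.10               # nominal conditions in the text
DT, T_MAX, MODES = 0.002, 8.0, 60     # numerical settings (assumed)
CORNERS = (0.8, 1.0, 1.2)             # RC scale: -20 %, nominal, +20 %


def step_response(x):
    """Unit-step response at position x, sampled every DT (time in units of tau)."""
    out = [0.0]
    for i in range(1, round(T_MAX / DT) + 1):
        t = i * DT
        out.append(1.0 - sum(
            4.0 / (k * math.pi) * math.sin(k * math.pi * x / 2.0) * math.exp(-k * k * t)
            for k in range(1, 2 * MODES, 2)))
    return out


def settle_time(step, t_pre):
    """Normalized T_DLY for a PE pulse of width t_pre (both in units of tau)."""
    shift = round(t_pre / DT)
    for i in range(len(step) - 1, -1, -1):
        delayed = step[i - shift] if i >= shift else 0.0
        v = ALPHA * step[i] - (ALPHA - 1.0) * delayed
        if abs(v - 1.0) > BETA:
            return (i + 1) * DT
    return 0.0


def worst_corner(step, t_pre):
    """Return (RC scale, T_DLY) of the slowest corner for a fixed absolute t_pre."""
    # Scaling RC by k stretches the time axis by k, so the pulse looks t_pre / k wide.
    delays = {k: k * settle_time(step, t_pre / k) for k in CORNERS}
    k_worst = max(delays, key=delays.get)
    return k_worst, delays[k_worst]


mid = step_response(0.5)
grid = [0.5 + 0.02 * i for i in range(60)]
best = min(grid, key=lambda t: worst_corner(mid, t)[1])
print(f"x = 1/2: T_PRE = {best:.2f} tau, worst-case T_DLY = {worst_corner(mid, best)[1]:.2f} tau")
```

In this sketch a short pulse is limited by the +20% corner and a long pulse by the −20% corner, which mirrors the behavior described for Figure 8; the TPRE that minimizes the worst-case delay sits between the two regimes.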

Impact of Cell Current
3D cross-point memory has non-volatile memory cells, each of which conducts a cell current. The cell current depends on the stored data (0 or 1). When pre-emphasis pulses are used for such memory, the impact of the cell current needs to be evaluated. Figure 12 illustrates the WL model when cells conduct the cell current, which is modeled by Rcel. Let us introduce a parameter γ as Rcel = γ R. When β = 0.1, γ must be greater than 9. Otherwise, the WL voltage at x = 1 cannot reach 0.9E. As γ decreases, TDLY should increase. As shown in Figure 13, when γ = 10, TDLY increases by 1-7% across WL. However, for γ > 30, TDLY increases by only 1% at most. Such an analysis is needed to determine the maximum WL length. Once Rcel is determined, the WL length should meet the condition R < Rcel/30.
Figure 13. TDLY vs. γ with x-dependent optimum TPRE.
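The bound on γ can be checked with a short DC calculation. The sketch below assumes that the selected cell is a single resistor Rcel from the far end of WL (x = 1) to ground and that E denotes the target WL voltage, so that the settled far-end level is E·Rcel/(R + Rcel). Both the lumped placement and the function names are assumptions made for illustration; the model of Figure 12 may differ in detail.

```python
def far_end_dc_level(gamma):
    """Settled WL voltage at x = 1 as a fraction of E, for Rcel = gamma * R."""
    return gamma / (1.0 + gamma)


def min_gamma(beta):
    """Smallest gamma for which the far end can still settle to (1 - beta) * E."""
    return (1.0 - beta) / beta


def max_wl_resistance(r_cell, gamma_min=30.0):
    """Largest WL resistance R that keeps Rcel / R above gamma_min."""
    return r_cell / gamma_min


# With beta = 0.1 the far end reaches 0.9 E only if gamma is at least 9.
print(min_gamma(0.1), far_end_dc_level(9.0))
```

For a given Rcel, `max_wl_resistance` returns the WL resistance budget implied by the γ > 30 guideline; the cell resistance passed to it is whatever the designer extracts from the cell current at the read or write bias.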

Impact of Decoding Transistors
Another concern when designing pre-emphasis pulses for random-access memory is the impact of the driver resistance Rd including decoding transistors and wiring resistance on optimum TPRE and TDLY (see Figure 14). Let us introduce δ to define Rd by Rd = δ R.

Discussion
In this section, the three-line case, fitting curves for the optimum TPRE and TDLY, and an application to memory systems are discussed.

Three-Line Model
In this section, the general three-line model shown in Figure 16 is studied. The center line is a target delay line, while the next neighbor lines are grounded. The lines have grounded capacitors Cg and coupling capacitors Cc.
as shown in Figure 17.

Fitting Curve
To see whether the TDLY − x and TPRE − x curves for those two conditions on the capacitance ratio can each be fit by a single equation with a few fitting parameters, the following equations were investigated.
When one uses γ1 = 1.2, μ1 = 0.4, γ2 = 0.9, and μ2 = 0.8, the curves fit well, as shown in Figure 19. "up" and "low" indicate the upper and lower bounds of TPRE that give the minimum delay time. The fitting curve for TPRE vs. x stays within the upper and lower bounds for both 1:1 and 100:1. Therefore, one needs only two independent fitting parameters per specific α and β. The fit for TDLY vs. x is not as good as the one for TPRE vs. x, but it confirms that a moderate fit can be obtained with only two fitting parameters as well. Thus, such a behavioral model allows designers to set optimum PE pulse widths and the resultant delay times.

Application to Memory System
NVRAM is expected to have a much faster access time than NAND or NAND-based solid-state drives, and to have a moderate access time and low bit cost in comparison with DRAM. In such a situation, WL and BL delay times can be as long as multiple clock periods due to large memory arrays. Column-address-dependent memory access can reduce the WL latency when the memory cells located close to the WL decoder are accessed. Figure 20 illustrates a block diagram that realizes column-address-dependent memory access. The pre-emphasis pulse controller varies the PE pulse width depending on the column address. The subsequent operations, including BL access and I/O control, can start earlier than in the case where the memory cells located at the far side of WL are accessed. The memory controller and CPU can synchronize with it because they know the column address.
Let NWL be the number of clocks required for the WL rise, which depends on the column address, and let NREST be the number of clocks for the other delays, from the address input to the WL decoder and from the memory array to the output buffer. The total latency of the NVRAM is then given by NWL + NREST. Assuming NWL varies from 2 at the nearest cell access to 15 at the farthest cell access based on Figure 19b, one can draw the latency improvement expressed by 1 − (NWL_AVG + NREST)/(NWL_WORST + NREST) as a function of NREST, as shown in Figure 21, where NWL_WORST is the worst-case NWL when the farthest cell is accessed and NWL_AVG is the average of NWL for the farthest and nearest cell accesses. When NVRAM is designed to have an NREST of 5 to 20, the average latency can be improved by 20-30% with the proposed operation.
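The latency-improvement expression can be evaluated directly. The sketch below assumes that NWL_AVG is the midpoint of the nearest-cell and farthest-cell values (2 and 15 clocks, as quoted above); the function and parameter names are hypothetical.

```python
def latency_improvement(n_rest, n_wl_near=2, n_wl_far=15):
    """1 - (NWL_AVG + NREST) / (NWL_WORST + NREST), with NWL_AVG taken as the midpoint."""
    n_wl_avg = (n_wl_near + n_wl_far) / 2.0   # assumed definition of the average
    n_wl_worst = n_wl_far
    return 1.0 - (n_wl_avg + n_rest) / (n_wl_worst + n_rest)


for n_rest in (5, 10, 20):
    print(n_rest, round(latency_improvement(n_rest), 3))
```

Under this midpoint assumption the improvement is 0.325 for NREST = 5 and about 0.19 for NREST = 20, in line with the roughly 20-30% range quoted above; the benefit shrinks as NREST grows because the column-independent part of the latency dominates.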

Conclusions
PE pulsing was studied to assess whether one can reduce the memory access time with a PE pulse even when the memory is a random-access type. One can design a PE pulse whose width varies by column address to reduce the WL delay time, even under process variation. The impact of the cell current and the resistance in the decoding path on optimum PE pulse widths and the resultant WL delay times was also investigated. Fitting curves for the optimum PE pulse widths and the resultant WL delay times as a function of the column address were demonstrated using only two parameters for each in the case of α = 1.2, β = 0.1, and process variation in τ of ±20%. A block diagram was also proposed to allow column-dependent memory operations to have faster average access.