EXPRESS: Exploiting Energy–Accuracy Tradeoffs in 3D NAND Flash Memory for Energy-Efficient Storage

Abstract: The density and cost-effectiveness of flash memory chips continue to increase, driven by: (a) the continuous physical scaling of memory cells in a single layer; (b) the vertical stacking of multiple layers; and (c) logical scaling through storing multiple bits of information in a single memory cell. The physical properties of flash memories impose disproportionate latency and energy expenditures to ensure the high integrity of the data during flash memory writes. This paper experimentally explores this disproportionality on state-of-the-art commercial 3D NAND flash memories and introduces EXPRESS, a technique for increasing the energy efficiency of flash memory writes by exploiting the premature termination of the flash write operations. An experimental evaluation shows that EXPRESS reduces energy expenditures by 20-50%, relative to the traditional flash writes, at the cost of a minimal loss in the data integrity (<1%). In addition, we evaluate the effects of the page-to-page variability, program-erase cycling, and data retention on the implementation of EXPRESS, and we propose enhancements to counter these effects.


Introduction
Nonvolatile NAND flash memories are the basic building blocks of the data storage components found in a range of systems, from IoT and edge-computing platforms, wearable electronics, smartphones, self-driving cars, and drones to the solid-state drives (SSDs) used in personal computers and cloud computing infrastructures [1]. Energy efficiency is a key requirement for the data storage components used in emerging edge computing devices, as most of them are constrained by limited power sources [2][3][4]. The designers of modern flash storage systems, such as SSDs, focus exclusively on the long-term data integrity rather than on the energy efficiency. In light of the many emerging approximate computing applications, e.g., machine learning, data analytics, vision, object classification, and others as described in [5][6][7][8], where approximate and short-lived data are very common, new opportunities have arisen for developing energy-efficient approximate storage systems [9].
A typical flash-memory-based storage system consists of two discrete components: the flash storage media, with one or more flash memory devices, and a flash memory controller. Often the controller and the flash memory devices are made by different companies, and system integrators combine these components to design storage solutions tailored for specific applications. Flash memory manufacturers comply with a chip-interfacing specification defined by the Open NAND Flash Interface (ONFI) working group [10]. This specification offers a small set of application-agnostic storage functions, which are not tailored towards energy-efficient approximate storage applications. Thus, there remain several opportunities for the system integrators to design energy-efficient storage systems by utilizing the tradeoffs between the data accuracy and the energy efficiency that are inherent to the NAND flash memory technology.
This paper introduces EXPRESS, a technique that increases the energy efficiency of flash memory writes by prematurely terminating the write operations, without requiring any privileged flash operations or changes in the system design. An experimental evaluation shows that EXPRESS reduces energy expenditures by 20-50%, relative to the traditional flash writes, at the cost of a minimal loss in the data integrity (<1%). In addition, the paper experimentally explores the impact of the page-to-page variability and the program-erase cycling on the implementation of EXPRESS, and it offers strategies to cope with these undesired effects. Compared to the existing techniques, EXPRESS offers the following advantages: (a) It can be applied to both 2D and 3D flash memories; (b) It does not require any privileged operations; (c) It can be combined with, and is orthogonal to, other techniques (e.g., voltage scaling); and (d) It does not require any data preprocessing or special data encoding. Table 1 presents a comparative analysis of the major characteristics of the previously proposed related techniques and EXPRESS.
The following are the key contributions of the paper:
• We explore and quantify the disproportionate tradeoffs between the data accuracy and the energy efficiency of flash memory program operations, using COTS 3D NAND flash memory chips. We find that more than 20% of the energy and time is spent on improving less than 1% of the bit accuracy during the memory write operations. We shed more light on this phenomenon and identify the slow memory cells belonging to the tails of the state distributions as a main reason for the disproportionate energy-accuracy tradeoffs;
• We propose a novel technique called EXPRESS, which utilizes partial write operations to increase the energy efficiency at a minimal loss of accuracy. We characterize the NAND flash operations and experimentally explore the energy-accuracy tradeoffs as a function of the partial program time. On the basis of the results of the experimental evaluation, we propose an algorithm for choosing the partial program time that strikes an optimal balance between the energy efficiency and the data accuracy;
• We perform a detailed characterization of the page-to-page variability, the program-erase cycling effects, and the data retention effects on the effectiveness of EXPRESS. We propose several countermeasures that can be adopted to properly address these variability and reliability issues.
The rest of the paper is organized as follows: Section 2 presents the background by discussing the fundamentals of 3D NAND flash memories, the flash incremental pulse programming scheme, and the flash memory interfacing; Section 3 introduces the proposed technique; Section 4 explores the effectiveness of the proposed technique when applied to 3D flash memories operating in the SLC (single-level-cell) and MLC (multilevel-cell) modes. Section 4 also discusses the challenges due to the page-to-page variability, the program-erase cycling, and the data retention issues, and it offers enhancements to EXPRESS to address these challenges. Section 5 concludes the paper.

Fundamentals of 3D NAND Array
Traditional 2D NAND flash technology reached its fundamental scaling limits around 2015. In response, the flash memory industry has transitioned to 3D NAND flash memory technology. Continual advances in this technology have resulted in several generations of 3D flash memory chips, each featuring an increasing number of stacked layers, from early 32-layer to contemporary 128-layer designs. These advances promise to extend the incredible growth of the bit density over the next decade [21][22][23]. Figure 1a shows the device structure of a 3D NAND flash memory cell. It is essentially a floating-gate metal oxide semiconductor field effect transistor (MOSFET), with a gate-all-around cylindrical channel structure. In several 3D NAND flash memory implementations, the floating-gate (FG) layer, made of conductive polysilicon, is replaced with a charge-trap (CT) nitride layer, which acts as an insulator. The FG/CT layer is electrically insulated from the transistor's terminals by the channel and gate oxide layers, and it can trap the charge, thereby holding information even when the power is turned off. The trapped negative charge on the FG/CT effectively increases the transistor's threshold voltage ($V_t$), relative to the case when there is no charge trapped. Thus, a flash memory cell stores information in the form of charges (electrons). The cell is in a programmed state (logic '0') if there are enough electrons on the FG/CT so that $V_t > V_{REF}$ (the transistor is off), whereas it is in an erased state (logic '1') if there are no electrons on the FG/CT so that $V_t < V_{REF}$ (the transistor is on). To change the state of a cell, two operations are performed: program and erase. These operations require high voltages on the transistor terminals and are conducted through the oxides via the Fowler-Nordheim (FN) tunneling mechanism. The program operation charges the FG/CT with electrons, whereas the erase operation removes the charges from the FG/CT.
An erase operation has to be performed to change the state of the flash memory cells from the programmed state to the erased state. The program and erase operations wear out the oxide layers, thus limiting the lifetime of a flash memory cell to ~3000-100,000 program-erase cycles, depending on the type of flash memory.
Figure 1b shows the physical structure of the 3D NAND flash memory array. The green layers are the word lines (WL0-WL33), and the vertical pillars are the memory holes that contain the channel of the flash memory cells. Figure 1c shows the circuit diagram of the NAND flash memory array that corresponds to a single flash memory block. Each memory block consists of a fixed number of memory pages.
The cells in each memory page are electrically connected through a metal word line (WL) that acts as their control gate. Each column (or string) of cells in a block is connected to a bit line (BL). The memory read and program operations are performed at the page-level granularity, whereas the erase operations are performed at the block-level granularity. Any flash cell that is set to a logic '0' by a page program operation can only be set to a logic '1' by erasing the entire block.
A page read operation in the NAND array involves applying a read reference voltage ($V_{REF}$) on the selected page's WL, and then sensing the threshold voltage of the cells connected to that WL. The WLs of all the other pages in the selected block are set to a high voltage ($V_{PASS}$), which turns on all of the flash cells from the nonselected pages. In this way, the state of the selected page can be sensed through the bit lines. An erased cell conducts the current, and that is sensed as a logic '1', whereas a programmed cell does not conduct the current, and that is sensed as a logic '0'. The read reference voltage is set in between the erased state and the programmed state distributions to correctly identify the cell states. Traditional flash memory cells that store one bit of information are known as "single-level cells", or SLCs. The recent advances in controlling and sensing different levels of the charge on the floating gate have enabled modern flash memory cells that can store two bits of information (multilevel cell, or MLC), three bits (triple-level cell, or TLC), or even four bits (quad-level cell, or QLC).

ISPP Programming Scheme
A page program operation in the NAND array utilizes an incremental step pulse program (ISPP) scheme with multiple program cycles, as illustrated in Figure 2a. Each program cycle consists of a program pulse, followed by a verification phase. During a program pulse phase, a high voltage (~15-18 V) is applied to the corresponding WL to cause the injection of electrons into the FG/CTs of the memory cells that need to be programmed. The verification phase identifies the cells that have reached the required threshold voltage, $V_t$, by performing a page read operation, with a program verification voltage ($V^{ref}_{PVY}$) applied on the corresponding WL. Thus, $V^{ref}_{PVY}$ represents the minimum voltage of the program state distribution. The cells whose $V_t$ exceeds $V^{ref}_{PVY}$ are identified as "programmed", and they are subsequently locked out from further programming using a program inhibit scheme. The following program cycle starts with an incrementally higher voltage on the WL, which increases the chances that the cells that did not switch their state in the previous cycle become programmed [24]. This sequence of program and verification steps continues until most of the cells that are supposed to be programmed are, indeed, programmed. Figure 2a illustrates the different steps associated with a one-page program operation as a function of time. The steps include the high-voltage program pulse phase of duration $t_p$, a relatively lower voltage verification phase of duration $t_{vfy}$, and two setup time intervals: one for the program pulse, of duration $t^{su}_{p}$, and the other for the verification phase, of duration $t^{su}_{vfy}$.
Assuming that all of these times remain constant across all of the program cycles, the total page program time can be expressed as follows:

$$t_{prog} = t_{init} + n \cdot t_{pcy} \qquad (1)$$

Here, $t_{pcy} = t^{su}_{p} + t_{p} + t^{su}_{vfy} + t_{vfy}$ represents the time required for one full program cycle, $n$ stands for the total number of program cycles required for the page program operation, and $t_{init}$ is the initial time required by the NAND array to verify the page status before applying a series of program cycles. Please note that Equation (1) captures the common features of the NAND page program operations. However, it may need to be adjusted depending on the specific implementation of the on-chip control logic in a particular flash memory chip.
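As a worked illustration of Equation (1), the following minimal Python sketch adds up the per-cycle timing components and computes the total page program time. The class name, the function name, and all numeric values are hypothetical and chosen only for illustration; they are not measurements from the chip used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsppTiming:
    """Per-cycle ISPP timing components, all in microseconds (hypothetical values)."""
    t_su_p: float    # setup time before the program pulse
    t_p: float       # program pulse duration
    t_su_vfy: float  # setup time before the verification phase
    t_vfy: float     # verification phase duration

    @property
    def t_pcy(self) -> float:
        # Duration of one full program cycle: pulse plus verification, with setups.
        return self.t_su_p + self.t_p + self.t_su_vfy + self.t_vfy


def page_program_time(t_init: float, n: int, timing: IsppTiming) -> float:
    """Total page program time: initial time plus n identical program cycles."""
    return t_init + n * timing.t_pcy


# Illustrative numbers only (assumed, not taken from a datasheet).
timing = IsppTiming(t_su_p=10.0, t_p=40.0, t_su_vfy=10.0, t_vfy=30.0)
t_prog = page_program_time(t_init=20.0, n=5, timing=timing)
print(f"t_pcy = {timing.t_pcy} us, t_prog = {t_prog} us")
```

With these assumed values, one program cycle takes 90 µs and five cycles plus the initial time give 470 µs.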
The primary purpose of the ISPP scheme is to tighten the program state $V_t$ distribution, relative to the initial erase state $V_t$ distribution, which is typically wider because of the intrinsic cell-to-cell process variation. The evolution of the cell $V_t$ distribution with the ISPP scheme is further illustrated in Figure 2b. For simplicity, we consider an SLC memory, although the same principle holds for MLC and TLC flash memories. We choose a program operation with four program cycles to illustrate the ISPP scheme. In practice, the number of program cycles could be higher. The distribution depicted with the dashed line represents the right-shifted erase state $V_t$ distribution after each program cycle, if all the cells are programmed.
In practice, a certain number of cells that attain a $V_t$ exceeding the program verification voltage are locked out of (or inhibited from) further programming cycles. Thus, the ISPP scheme tightens the cell $V_t$ distribution by selectively providing fewer program pulses to the fast program cells, and more program pulses to the slow program cells. As a result, the final program state distribution becomes much tighter than in the erase state, as illustrated with the yellow distribution in Figure 2b. Since a tighter $V_t$ distribution is essential for ensuring data integrity, the ISPP scheme is invariably used in all NAND flash memories. Note that the $V_t$ distributions in Figure 2b may not follow a perfect Gaussian distribution. We used a Gaussian-like distribution for illustration purposes only, and, thus, it is not a faithful illustration of actual cell $V_t$ distributions.
The wider an erase-state $V_t$ distribution is, the larger the number of program cycles required to complete the write operation. Note that the slow program cells may require several additional ISPP cycles. The percentage of such cells, in practice, falls well below 1% of all the flash cells in a page. Thus, the ISPP scheme entails a disproportionate energy-accuracy tradeoff, where a significant fraction of the program time, and, thus, the energy, is spent programming a tiny fraction of memory cells. The energy-accuracy tradeoff is even more skewed for 3D NAND technology, which exhibits significant cell-to-cell variations because of the poly-Si channel material and the nonuniformity in the cell dimensions caused by the reactive ion etching process [25,26]. Thus, a long-tail erase-state distribution is a fundamental property of 3D NAND. Hence, the energy-accuracy tradeoffs in the ISPP programming scheme of the 3D NAND need to be evaluated carefully for energy-efficient storage applications.
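The tightening effect of ISPP and the cost of the slow tail cells can be illustrated with a toy simulation. The sketch below is a deliberately simplified model, not the on-chip algorithm: it assumes that the first pulse shifts every cell by a fixed amount, that each later pulse adds one fixed step, and that the erased-state $V_t$ values follow a Gaussian population with a 0.5% slow tail. All names and numeric values are hypothetical.

```python
import random
import statistics


def simulate_ispp(vt_erased, v_verify, first_shift, step, max_cycles=50):
    """Toy ISPP model: pulse all non-inhibited cells, then verify and lock out."""
    vt = list(vt_erased)
    locked = [False] * len(vt)
    passed_fraction = []  # cumulative fraction of cells locked after each cycle
    for cycle in range(max_cycles):
        shift = first_shift if cycle == 0 else step
        for i, is_locked in enumerate(locked):
            if not is_locked:
                vt[i] += shift          # program pulse on cells still being programmed
        for i, v in enumerate(vt):
            if v >= v_verify:
                locked[i] = True        # verification followed by program inhibit
        passed_fraction.append(sum(locked) / len(vt))
        if all(locked):
            break
    return vt, passed_fraction


rng = random.Random(0)
# Hypothetical erased-state V_t values in volts: main population plus a 0.5% slow tail.
erased = [rng.gauss(-2.0, 0.4) for _ in range(9950)]
erased += [rng.gauss(-3.0, 0.4) for _ in range(50)]

final, passed = simulate_ispp(erased, v_verify=1.0, first_shift=2.0, step=0.5)
cycles_total = len(passed)
cycles_99 = next(i + 1 for i, f in enumerate(passed) if f >= 0.99)

print(f"erased sigma = {statistics.pstdev(erased):.3f} V, "
      f"programmed sigma = {statistics.pstdev(final):.3f} V")
print(f"cycles to program 99% of cells: {cycles_99}, all cells: {cycles_total}")
```

In this model the programmed distribution ends up much narrower than the erased one, and the last cycles serve only the small fraction of cells in the slow tail, which is the disproportionality discussed above.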

Interfacing NAND Chip from the Host Controller
COTS flash memory chips use a standardized low-level interface, which was developed by the Open NAND Flash Interface (ONFI) working group, a consortium of flash memory manufacturers [10]. The ONFI specifications define: the standard physical interfaces; the chip identification mechanisms; a standard command set for reading, writing, and erasing the NAND flash; the timing requirements; and the data integrity features.
Depending on the chip package and the type of the interface, the number of bytes sent to, or received from, a device at a time can vary. In our case, both the commands and the data are carried through eight data lines (DQ0-DQ7). The control lines, CE# (Chip Enable, active low), CLE (Command Latch Enable), ALE (Address Latch Enable), RE# (Read Enable, active low), and WE# (Write Enable, active low), allow for the control of the functions and timing of the interface. As is shown in Figure 3, a command placed on the data lines by the host is written into the device's command register on the rising edge of WE# when CE# is low, ALE is low, CLE is high, and RE# is high. An address placed on the data lines by the host is written to the device's address register on the rising edge of WE# when CE# is low, ALE is high, CLE is low, and RE# is high. Data placed on the data lines by the host is written into the device's data register on the rising edge of WE# when CE# is low, ALE is low, CLE is low, and RE# is high. Data is output from the device if it is in a ready state. The data from the device's data register is output to the data lines on the falling edge of RE# when CE# is low, ALE is low, CLE is low, and WE# is high.
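The latching conditions above can be summarized as a small decoding function. The following Python sketch is only a restatement of the four cases described in the text; the function name and the string labels are hypothetical, and the model assumes that the non-strobing signal (RE# or WE#) is held high during the edge.

```python
def classify_bus_cycle(edge: str, ce_n: int, cle: int, ale: int) -> str:
    """Classify one asynchronous bus event from the control-line levels.

    edge is "we_rising" (rising edge of WE#, RE# assumed high) or
    "re_falling" (falling edge of RE#, WE# assumed high).
    """
    if ce_n != 0:
        return "ignored"            # chip not selected (CE# is active low)
    if edge == "we_rising":
        if cle == 1 and ale == 0:
            return "command"        # byte latched into the command register
        if cle == 0 and ale == 1:
            return "address"        # byte latched into the address register
        if cle == 0 and ale == 0:
            return "data_in"        # byte latched into the data register
    elif edge == "re_falling":
        if cle == 0 and ale == 0:
            return "data_out"       # device drives the data lines
    return "undefined"              # combination not described in the text


print(classify_bus_cycle("we_rising", ce_n=0, cle=1, ale=0))
print(classify_bus_cycle("re_falling", ce_n=0, cle=0, ale=0))
```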

Figure 3 illustrates a sequence of commands that carry out a page program operation. The operation is initiated by the host that sends the command (0x80) to the device through the data lines. Next, the host writes five address cycles (A0-A4), while keeping the ALE signal high. Next, the host controller sends the data to be written to the device's data register, byte by byte. Finally, the host sends the PAGE PROGRAM command (0x10) that initiates the write operation to the specified page of the flash memory array.
During the page program operation, the device's status control pin RB (Ready/Busy#) is low, indicating that the device is currently busy. Upon completion of the program operation, the RB signal is set high. Thus, the host can determine the page program time ($t_{prog}$) by monitoring the state of this pin after issuing the command sequence.
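The page program sequence and the RB-based timing can be sketched in a few lines of Python. This is a host-side model only: it builds the list of bus cycles and derives the busy time from sampled RB levels, without touching real hardware. The function names are hypothetical, and the split of the five address cycles into two column bytes followed by three row bytes is an assumption that should be checked against the datasheet of the specific chip.

```python
def page_program_cycles(column: int, row: int, data: bytes):
    """Build the bus-cycle list for one page program operation."""
    # Assumed address layout: A0-A1 carry the column, A2-A4 carry the row.
    address = [
        column & 0xFF, (column >> 8) & 0xFF,
        row & 0xFF, (row >> 8) & 0xFF, (row >> 16) & 0xFF,
    ]
    cycles = [("command", 0x80)]                    # first command cycle
    cycles += [("address", a) for a in address]     # five address cycles, ALE high
    cycles += [("data_in", b) for b in data]        # payload, byte by byte
    cycles.append(("command", 0x10))                # starts the array program
    return cycles


def busy_time_us(rb_samples):
    """Return the time RB stays low, given (timestamp_us, level) samples."""
    start = None
    for t, level in rb_samples:
        if start is None and level == 0:
            start = t               # RB went low: device busy
        elif start is not None and level == 1:
            return t - start        # RB back high: operation finished
    return None                     # no complete busy interval observed


cycles = page_program_cycles(column=0, row=0x012345, data=bytes([0x00, 0x00]))
print(cycles[0], cycles[-1], len(cycles))
print(busy_time_us([(0.0, 1), (1.0, 0), (200.0, 0), (401.0, 1)]))
```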

Proposed Technique-EXPRESS
The EXPRESS technique reduces the energy consumed during the flash program operations, at the cost of a negligible loss of accuracy. It relies on a partial page program operation to counter the disproportionate energy-accuracy tradeoff inherent in the ISPP scheme. Figure 4a illustrates the proposed EXPRESS technique. The solid black line represents the status of the RB pin during a regular page program operation. This pin goes low, indicating that the NAND array is busy for the duration of the program operation, $t_{prog}$. The $t_{prog}$ value lies in the range of 300-600 µs for an SLC memory page of the chip used in this study [27]. The program operation, however, can be terminated prematurely using a RESET command, such as the program suspend operation [28]. In this case, the state of the RB pin is illustrated with the red dashed line. The premature termination of the program operation results in a partial program operation. Although this operation may slightly increase the bit error rate (BER), it can significantly reduce the time and energy of the page program operations. The critical parameter that enables an exploration of the tradeoffs between the energy and the accuracy is the partial program time, $t_{pp}$. The following equation can be used to estimate $t_{pp}$:

$$t_{pp} = t_{init} + (n - n_{skip} - 1) \cdot t_{pcy} + t^{su}_{p} + t_{p} \qquad (2)$$
Here, $n_{skip}$ is the number of program cycles that can be skipped to achieve higher energy efficiency. Note that we have not included the verification phase of the last program cycle in Equation (2), as no additional bits are programmed during the verification phase. In general, Equation (2) can be used as a guideline for finding an optimal $t_{pp}$, which needs to be precharacterized on the basis of the properties of the particular family of flash memory chips.
Figure 4b sheds more light on the rationale behind EXPRESS by illustrating three different reference voltages, which correspond to three different memory operations. The erase operation ensures that the threshold voltages of all the erased cells in the block are below the reference voltage, $V^{ref}_{EVY}$. Similarly, $V^{ref}_{PVY}$ is the reference voltage used during the program verification phase of the page program operation. The ISPP scheme ensures that the threshold voltages of all the programmed cells are above the reference voltage, $V^{ref}_{PVY}$. Finally, a read reference voltage, $V^{ref}_{Read}$, is used to distinguish between the erase and program states of the cell during a page read operation. All NAND manufacturers keep a sufficient voltage margin between the read and program verification voltages in order to minimize read errors.
However, this margin can be exploited to increase the energy efficiency in all applications where the BER is sufficiently low and it can be corrected using error-correction techniques. In addition, EXPRESS can be used even when a somewhat higher BER can be tolerated, e.g., in applications where approximate short-lived data are common. For example, if we terminate the program operation prematurely, the resulting threshold voltage distribution will be mostly above the read reference voltage, as is shown with the dashed lines in Figure 4b. The resulting distribution may have some area below the read reference voltage and that will create errors, which we are trading off for the saved energy.
Since 3D NAND flash memory cells in the erased state exhibit long tails of the threshold voltage distribution, programming these cells may require several extra program pulses. Since left-tail cells usually represent less than 1% of the total page size, a premature termination of the program operation may leave fewer than 1% of the cells with threshold voltages below $V^{ref}_{PVY}$. Interestingly, not all of these tail bits will show up as error bits during a read operation, as there is a sufficient voltage margin between the read and the program verification voltages. Thus, one can improve the energy efficiency of flash memory program operations with very little, or no, sacrifice in the bit accuracy if the partial program time is chosen appropriately. However, such partial programming may lead to increased retention loss because of the reduced reliability margin. The following section presents the experimental evaluation of the energy-accuracy tradeoffs in the state-of-the-art 3D NAND flash memory. It formulates guidelines for choosing the appropriate partial program time.
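The estimate of the partial program time can be sketched numerically. The Python snippet below follows one reading of Equation (2), namely that the partial operation runs $n - n_{skip}$ program cycles and stops after the last program pulse, before its verification phase. This reading, the function names, and all timing values are assumptions made for illustration and are not measurements from this study.

```python
def full_program_time(t_init, n, t_su_p, t_p, t_su_vfy, t_vfy):
    """Regular page program time: initial time plus n full program cycles."""
    return t_init + n * (t_su_p + t_p + t_su_vfy + t_vfy)


def partial_program_time(t_init, n, n_skip, t_su_p, t_p, t_su_vfy, t_vfy):
    """Partial program time: last executed cycle ends after its program pulse."""
    if not 0 <= n_skip < n:
        raise ValueError("n_skip must satisfy 0 <= n_skip < n")
    t_pcy = t_su_p + t_p + t_su_vfy + t_vfy
    return t_init + (n - n_skip - 1) * t_pcy + t_su_p + t_p


# Hypothetical timing values in microseconds.
params = dict(t_su_p=10.0, t_p=40.0, t_su_vfy=10.0, t_vfy=30.0)
t_prog = full_program_time(t_init=20.0, n=5, **params)
t_pp = partial_program_time(t_init=20.0, n=5, n_skip=2, **params)
print(f"t_prog = {t_prog} us, t_pp = {t_pp} us, "
      f"time saved = {100 * (1 - t_pp / t_prog):.1f}%")
```

Note that the time saved is only a proxy for the energy saved; the actual saving depends on the current drawn during each phase and has to be measured.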

Experimental Evaluation
In our experimental evaluation, we use a 3D NAND flash memory chip that supports both the SLC and MLC modes of operation. Section 4.1 describes our experimental setup. Sections 4.2 and 4.3 describe the results of the experimental evaluation of EXPRESS for the SLC and MLC modes, respectively. Whereas EXPRESS promises energy savings at a negligible loss of accuracy, it is important to address any practical issues that can impact the efficacy of the proposed technique, including the page-to-page variability, the wear-out of the gate oxides, and the data retention. Hence, Section 4.4 discusses the effects of the page-to-page variability and the PE cycling on EXPRESS. Section 4.5 discusses the long-term effects of the EXPRESS mechanism on data retention. Finally, Section 4.6 puts everything together with a real-world example. Figure 5 shows our experimental setup, which consists of a TSOP-48 socket that holds a flash memory chip, an FT2232H mini module from Future Technology Devices International (FTDI), and a workstation. The FT2232H module acts as a bridge between the workstation and the device, implementing an asynchronous 8-bit parallel interface to the device, as described in Figure 3. A software package running on the workstation executes the ONFI commands for sending data to the flash memory chip, erasing a block, writing a page, reading a page, or retrieving the data from the device. This hardware setup allows us to access raw memory bits without any error correction. We used a logic analyzer and a Digilent Analog Discovery II multifunction instrument to measure the time and capture the voltage samples from a shunt resistor connected to the power line of the TSOP socket. 
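To make the measurement concrete, the energy of a program operation can be estimated from the sampled shunt voltage by integrating the instantaneous power over time. The sketch below is a generic post-processing example, not the authors' analysis script: it assumes a series shunt of known resistance on the supply rail, a constant supply voltage, and trapezoidal integration. The function name and the sample values are hypothetical.

```python
def program_energy_uj(times_us, v_shunt, r_shunt_ohm, v_supply):
    """Estimate the energy drawn by the chip, in microjoules.

    times_us : sample timestamps in microseconds
    v_shunt  : voltage across the shunt resistor at each timestamp, in volts
    """
    if len(times_us) != len(v_shunt) or len(times_us) < 2:
        raise ValueError("need at least two samples of equal length")
    # Current through the shunt equals the chip current; the chip sees the
    # supply voltage minus the drop across the shunt (assumed series shunt).
    power_w = [(v_supply - v) * (v / r_shunt_ohm) for v in v_shunt]
    energy = 0.0
    for k in range(1, len(times_us)):
        dt = times_us[k] - times_us[k - 1]
        energy += 0.5 * (power_w[k] + power_w[k - 1]) * dt   # trapezoidal rule
    return energy   # watts times microseconds gives microjoules


# Hypothetical capture: a constant 10 mV drop across a 1-ohm shunt for 100 us.
energy = program_energy_uj([0.0, 50.0, 100.0], [0.01, 0.01, 0.01],
                           r_shunt_ohm=1.0, v_supply=3.3)
print(f"{energy:.3f} uJ")
```

Restricting the integration window to the interval in which RB is low gives the energy of a single full or partial program operation.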
We performed the experimental evaluation on several 3D NAND MLC chips, with the following properties: the chip capacity is 256 Gbits; the number of blocks is 2192; each block contains 1024 pages; and each page contains 18,592 bytes of data (16,384 bytes of user data, with 2208 spare bytes reserved for storing out-of-band information, such as error correction codes). The chip was manufactured using 32-layer 3D technology.

Evaluation of the Proposed Technique on SLC Memory
We first validate EXPRESS by configuring a NAND chip to operate in the SLC mode. An all-zero data pattern is written using partial-page program operations while varying the partial program time, $t_{pp}$. Later, in Section 4.4, we perform a similar experiment with a random data pattern with an equal distribution among all the available flash cell states. Figure 6a shows the percentage of the programmed bits as a function of the partial program time. Each point in the plot represents the percentage of programmed bits collected from 10 experiments on the same page. Each partial program experiment is preceded by a full block erase operation. Figure 6b shows the current drawn by the NAND chip during a regular page program operation. The corresponding status of the RB pin during a regular page program operation is illustrated by a red dashed line. The current drawn increases notably during the program operation relative to the current drawn in the device's idle state. The current waveform reveals two distinct profiles, which repeat alternately. We hypothesize that these characteristic current profiles correspond to the program (blue-shaded regions) and verify (red-shaded regions) phases of the ISPP scheme used in the page program operation.

The plot in Figure 6a shows that the percentage of programmed bits resembles a step function. The flash memory cells are programmed only during program pulses. The transition points of the percentage of programmed bits align with the program pulse phases in Figure 6b. Furthermore, the percentage of programmed bits remains constant during the verification phases. This confirms our hypothesis that the ISPP scheme is used in a page program operation, and that the characteristic waveforms correspond to the program pulses and verify phases of the page program operation.

Furthermore, the results from Figure 6 support the following two observations:
1. Figure 6a shows that just three program cycles out of the five used in a regular program operation are sufficient to achieve a bit accuracy above 99.9%. The last two program pulses are mainly used to program a tiny fraction of bits located in the lower tail of the erase $V_t$ distribution, as illustrated in the inset of Figure 6a;
2. Figure 6b illustrates that there is periodicity in terms of the program and verification cycles, and that all program pulses and verification phases have similar duration and current profiles. Thus, Equation (2) can be used for determining a suitable partial program time.

As there is no tangible advantage in terminating the program operation in the middle of a verification or a program cycle, the optimal $t_{pp}$ should correspond to the end of a program pulse. The number of program pulses required to achieve the desired bit accuracy may be specific to a family of chips, the location of the page in the 3D structure, and its usage conditions. Still, all of these can be precharacterized and then used to inform a proper implementation of the partial program operations.
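The percentage of programmed bits used in this evaluation can be computed from a raw page read, as in the following minimal sketch. It assumes the usual NAND convention that an erased cell reads as 1 and a programmed cell reads as 0 in the SLC mode. The function name and the sample data are hypothetical.

```python
def programmed_bit_percentage(read_page: bytes, written_page: bytes) -> float:
    """Percentage of the bits written as 0 that also read back as 0."""
    if len(read_page) != len(written_page):
        raise ValueError("pages must have equal length")
    target = 0  # bits that should be programmed (0 in the written data)
    done = 0    # of those, bits that actually read back as 0
    for r, w in zip(read_page, written_page):
        want = (~w) & 0xFF        # 1s mark the bits that should be programmed
        got = want & (~r) & 0xFF  # 1s mark the bits that read back as 0
        target += bin(want).count("1")
        done += bin(got).count("1")
    return 100.0 * done / target if target else 100.0


# Hypothetical 4-byte "page": an all-zero pattern was written, and one bit
# still reads as erased (1) after the partial program operation.
written = bytes(4)
read_back = bytes([0x00, 0x00, 0x00, 0x01])
pct = programmed_bit_percentage(read_back, written)
print(pct)  # 31 of 32 target bits programmed
```

Repeating this measurement for increasing partial program times would trace a curve of the kind shown in Figure 6a.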

Evaluation of the Proposed Technique on MLC Memory
MLC flash memory cells store 2 bits of information, and, hence, there are two different types of logical pages sharing a single word line. These two bits correspond to four states of the flash memory cells, i.e., the information is encoded in the form of four threshold voltage distributions (Er-11, A-01, B-00, C-10), as illustrated in Figure 7. The most significant bit (MSB) of the logic states of all the memory cells connected to a given word line forms the MSB page. Similarly, the least significant bit (LSB) of the logic states of the memory cells from the same word line forms the logical LSB page. The LSB page programming involves raising the $V_t$ of certain cells from the erase state to the B state, as shown in Figure 7. The MSB page programming is performed after the LSB page programming is finished. During MSB page programming, certain memory cells from the Er state go to the A state, and certain cells from the B state go to the C state, as shown in Figure 7. Two read reference voltages are used to read the MSB page data, whereas only one read reference voltage is needed for reading the LSB page data.

Figure 8a shows the percentage of programmed bits as a function of the $t_{pp}$ for the MSB and LSB pages in the red and blue solid lines, respectively. The experiments are conducted as follows: Logical LSB and MSB pages are used from a freshly erased block. First, we partially program an LSB page, and then the corresponding MSB page, with an all-zero data pattern. Please note that the chip used in this study, when configured in the MLC mode, by default implements data scrambling, which ensures that all four states are uniformly utilized in a physical page, regardless of the input data pattern. Thus, writing all zeros in the LSB and MSB pages does not imply that all the cells are in the B state. Therefore, the data pattern does not impact the results of our experiments, as demonstrated later in Section 4.4, where we use random data patterns. After the partial program operation, we perform the page read operation for both the LSB and MSB pages, and we determine the percentage of programmed bits for each experiment.

The programmed bit percentage for the LSB pages looks quite similar to the one observed for the SLC mode of operation. Since writing on an LSB page involves only one programmed $V_t$ state (B state), its ISPP scheme is quite similar to the one used in the SLC mode. However, the programmed bit percentage for the MSB pages has distinctively different characteristics. There are two plateaus because two different $V_t$ states, A and C, are formed during MSB programming. The first plateau corresponds to the construction of the A state, as it has a lower $V_t$, and is thus formed first. The second plateau corresponds to the formation of the C state. The time to complete an MSB page program operation is significantly longer than the time needed to program the corresponding LSB page. As the programming of an MSB page involves transitioning the flash cells from the Er to the A state, and from the B to the C state, it requires more ISPP cycles and, consequently, more time to complete a program operation, relative to its LSB counterpart. Another distinctive feature of the MSB page programming is its verification phases, which are more complex than the LSB counterparts. An LSB page verification requires only one read to verify that the cell $V_t$ exceeds the lower bound of the B state ($V_{REF}^{LSB}$), whereas an MSB page verification requires two reads to check the lower bounds of both the A and C states. These hypotheses are confirmed by inspecting the current profiles, as discussed in the text below.

Figure 8b,c shows the current drawn by the chip during a page program operation for an LSB page and an MSB page, respectively. Similar to the SLC current profiles, we observe the periodic program pulses and verify phases in the current waveform. For example, the LSB page analyzed in Figure 8b requires nine ISPP cycles, with the total program time $t_{prog}^{LSB} \approx 1000$ µs. However, the bit accuracy reaches above 99% with only seven program pulses ($t_{pp}^{LSB} \approx 750$ µs), indicating a 25% energy saving with a <1% bit accuracy loss. Programming MSB pages generally requires more time than programming LSB pages. For example, the MSB page analyzed in Figure 8c requires $t_{prog}^{MSB} \approx 1500$ µs, or 11 ISPP cycles. In addition, the verification phases in the case of MSB program operations take more time than those that take place during LSB program operations. Still, we find that partial program operations can be utilized on MSB pages, offering more than 20% in energy savings, with a negligible (<1%) bit-accuracy loss. The optimal partial program time for MSB pages is $t_{pp}^{MSB} \approx 1150$ µs.

We observe considerable page-to-page variability in the bit accuracy (error bar in Figure 8a), even though the $t_{pp}$ was fixed. Such page-to-page variability may arise in the NAND memory because of the inherent process variations, physical organization, and the presence of program and read noise. In the next section, we elaborate further on the page-to-page timing variability and the possible countermeasures.
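The savings quoted above follow from the ratio of the partial program time to the nominal program time. The short sketch below reproduces that arithmetic under the simplifying assumption that the average power during a program operation is constant, so that the energy scales with time. The function name is hypothetical; the timing values are the approximate ones read from Figure 8b,c.

```python
def time_saving_fraction(t_prog_us: float, t_pp_us: float) -> float:
    """Fraction of the program time (and, assuming constant average power,
    of the program energy) avoided by terminating at t_pp instead of t_prog."""
    if not 0 < t_pp_us <= t_prog_us:
        raise ValueError("expected 0 < t_pp <= t_prog")
    return 1.0 - t_pp_us / t_prog_us


lsb_saving = time_saving_fraction(1000, 750)   # LSB page: 0.25
msb_saving = time_saving_fraction(1500, 1150)  # MSB page: a little over 0.23
print(f"LSB: {lsb_saving:.1%}, MSB: {msb_saving:.1%}")
```

The measured energy savings may deviate from these estimates because the program and verify phases draw different currents.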

Effects of Page-to-Page Variability
3D NAND flash memories exhibit page-to-page variations because of the unique nature of the array geometry and the intrinsic process variations within the array. Figure 9a shows the organization of a 3D NAND memory block configured in the SLC mode. The pages in a block are organized in rows that correspond to the physical vertical layers ($L_0, L_1, \ldots, L_{N-1}$) and columns that correspond to the sub-blocks ($0, 1, \ldots, M-1$). A page number within a block can be expressed as $P_i^{L_j}$, where $L_j$ represents the layer number, and $i = 0, 1, \ldots, M-1$. The chip under evaluation has 32 layers ($N = 32$), where each layer contains 16 logical pages ($M = 16$) of a given memory block. Thus, there are a total of 16 × 32 = 512 pages within a block. We performed a characterization of the page program times by sequentially programming all of the pages of a memory block using a random data pattern. Our characterization results are shown as a cumulative distribution plot in Figure 9b. The results indicate that the standard-page program time varies significantly among different pages within the same block. Since the implementation of EXPRESS requires an estimation of the $t_{pp}$ on the basis of the nominal page program time ($t_{prog}$), it needs to be adapted in order to account for the page-to-page variability.

To understand the precise nature of the page-to-page variability, we measure each page's nominal page program time in a block in SLC mode. Figure 10 shows the results of these measurements. We can make the following two observations from these results:
a. The first page to be programmed in a given layer takes more time to complete a program operation. We classify these pages as "slow" pages, shown in blue in Figure 10;
b. The $t_{prog}$ variability is minimal among memory pages located in the same vertical layer ($P_1^{L_j}$ to $P_{15}^{L_j}$) of the array. Consequently, we argue that the memory controller can learn the $t_{prog}$ value from the page $P_1^{L_j}$ (referred to as a "learning" page), and then apply EXPRESS when programming the remaining pages.

Figure 10. Classification of different SLC pages of the same memory block into slow pages, learning pages, and EXPRESS pages. The numbers represent measured $t_{prog}$ values in µs corresponding to the page location; the last column shows the median $t_{prog}$ of each layer.
To further illustrate the variability, we compute the median $t_{prog}$ as the last column in Figure 10. We find that the median $t_{prog}$ varies between different layers, but within the same layer, the $t_{prog}$ remains relatively unchanged (except for the first page of a layer). We exploit this observation and propose an adaptive learning algorithm to maximize the energy savings for the EXPRESS method.
To address the observed variabilities, we propose the following modification to EXPRESS. The nominal program time variation among slow pages (marked as blue boxes in Figure 10) is minimal. Consequently, the flash controller may apply EXPRESS on the slow pages by learning the corresponding $t_{prog}$ from the first page of the block ($P_0^{L_0}$). The remaining pages of the block are classified as learning pages (yellow) and EXPRESS pages (green). The nominal page program operations are performed on the learning pages to acquire the exact $t_{prog}$ value, and EXPRESS is applied on the remaining pages of the layer ($P_2^{L_j}$ to $P_{15}^{L_j}$) by estimating the corresponding $t_{pp}$ using Equation (2).

Next, we discuss the page-to-page variability for a flash block configured in the MLC mode. Figure 11a shows the cumulative distribution of $t_{prog}$ for the LSB and MSB pages. We find significant page-to-page variability for both the LSB (blue line) and MSB (red line) pages. LSB pages behave similarly to the SLC pages, where the first LSB page of a given layer requires a significantly longer $t_{prog}$ than the other LSB pages in the same layer. These slow LSB pages constitute the upper tail (~10%) of the cumulative distribution in Figure 11a. The average $t_{prog}$ for the MSB pages is distinctively higher than the average $t_{prog}$ for the LSB pages. Unlike the LSB pages, the $t_{prog}$ variability for the MSB pages is relatively small. The first MSB pages in a layer do not require a higher $t_{prog}$ than the other MSB pages in the given layer.

Since LSB pages behave similarly to SLC pages, the algorithm for implementing EXPRESS on LSB pages can mirror the algorithm proposed for the SLC pages, as described above. For the MSB pages, a slight modification of the algorithm is introduced by treating the first MSB page ($P_0^{L_j}$) in a given layer as the learning page.
Figure 11b shows the EXPRESS algorithm for MSB pages, where the $t_{prog}$ is learned from the first MSB page of a given layer, $P_0^{L_j}$. Equation (2) is used to estimate the $t_{pp}$ from the corresponding $t_{prog}$, and that value is applied to the remaining $P_n^{L_j}$ ($n = 1, 3, \ldots, 15$) pages. Next, we will discuss the energy benefits obtained with these algorithms when implemented in the 3D NAND chip under evaluation.
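The page classification behind the adaptive algorithm can be summarized in a minimal sketch. It only encodes which pages are programmed nominally (to learn $t_{prog}$) and which pages reuse a learned value; the estimation of $t_{pp}$ itself is left to Equation (2) and is not reproduced here. The function names, the `Action` type, and the indexing of MSB pages from 0 to 15 within a layer are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    NOMINAL_AND_LEARN = "nominal program, record t_prog"
    EXPRESS = "partial program using a learned t_prog"


def slc_page_plan(layer: int, index: int):
    """Plan for SLC (or LSB) page `index` of layer `layer`.

    Returns (action, source), where source is the (layer, index) of the page
    whose t_prog is reused, or None when the page itself is programmed nominally.
    """
    if index == 0:  # slow page: first page programmed in a layer
        if layer == 0:
            return Action.NOMINAL_AND_LEARN, None  # first page of the block
        return Action.EXPRESS, (0, 0)              # reuse t_prog of the first slow page
    if index == 1:  # learning page of this layer
        return Action.NOMINAL_AND_LEARN, None
    return Action.EXPRESS, (layer, 1)              # reuse t_prog of the learning page


def msb_page_plan(layer: int, index: int):
    """Plan for MSB pages: the first MSB page of a layer is the learning page."""
    if index == 0:
        return Action.NOMINAL_AND_LEARN, None
    return Action.EXPRESS, (layer, 0)


N_LAYERS, PAGES_PER_LAYER = 32, 16  # SLC block organization of the evaluated chip
plans = [slc_page_plan(j, i) for j in range(N_LAYERS) for i in range(PAGES_PER_LAYER)]
n_express = sum(1 for action, _ in plans if action is Action.EXPRESS)
print(n_express, "of", len(plans), "SLC pages use EXPRESS")
```

Under this classification, 479 of the 512 SLC pages in a block are written with EXPRESS, and the remaining 33 pages (one slow page and 32 learning pages) are written with the nominal program operation.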
The adaptive learning algorithm for EXPRESS widens the opportunity window for performance and energy enhancement. Table 2 summarizes the measured $t_{prog}$ (the nominal program time of a page) and the corresponding optimal $t_{pp}$ for pages in both the SLC and MLC configurations. The table also quantifies the effectiveness of EXPRESS by reporting the bit error rate and the average percentage of energy saved. The results are broken down on the basis of the page types, as discussed above. We calculate the number of program loops that can be skipped for EXPRESS while keeping the accuracy loss acceptable (<1%) for each page type. We find an optimal value of the parameter $n_{skip}$ in Equation (2): $n_{skip} = 1$ or 2 for SLC pages, depending on their type, and $n_{skip} = 2$ for MLC pages. For higher values of $n_{skip}$, the BER in the written data is found to be more than 1%. However, $n_{skip}$ needs to be precharacterized for each class of chips for an optimal EXPRESS implementation. Note that the table is prepared on the basis of data collected from 1024 pages of an MLC flash block, and from 512 pages of an SLC flash block. We find that EXPRESS can save an average of 20 to 50% of the write energy, depending on the page type. The exact figure for energy savings may differ for flash memory chips that have a different organization, or that are manufactured in different technology nodes. Nevertheless, the proposed technique applies to all of them because it exploits the accuracy-energy disproportionality that is common to all modern flash memory chips.

Effects of Program-Erase Cycling on EXPRESS
NAND flash memory exhibits limited endurance, which is typically specified by the maximum number of program-erase operations (or PE cycles) allowed on a memory block. The number of PE cycles may impact the nominal page program time, $t_{prog}$, and stressed pages with a high number of PE cycles may take more time to program [29][30][31]. Hence, an implementation of EXPRESS needs to consider the number of PE cycles. Figure 12a shows the cumulative distribution of the nominal page program time for SLC pages in a fresh flash memory block, and in a memory block that has been exposed to 10,000 PE cycles. Similarly, Figure 12b shows the cumulative distributions of the nominal page program times for the LSB and MSB pages in the MLC mode, for a fresh block and a block exposed to 5000 PE cycles. We find that the average $t_{prog}$ increases with PE cycling in the MLC mode, whereas a minimal change is observed in the SLC mode.
Even though the average $t_{prog}$ increases with an increase in the number of PE cycles in the MLC mode, the intralayer and interlayer $t_{prog}$ variations remain unchanged relative to the fresh memory blocks. Specifically, our observations (a) and (b) of Section 4.4 remain true, even on stressed memory blocks. Therefore, the algorithm proposed in Section 4.4 can be used unchanged because EXPRESS learns the $t_{prog}$ from the learning page, regardless of the PE cycles. Table 3 summarizes the updated $t_{prog}$ and the corresponding $t_{pp}$ on the PE-cycled memory blocks. We find that, for 10K PE cycles in the SLC memory block, the optimal value is $n_{skip} = 1$. Higher $n_{skip}$ values cause very high BERs in the written data. With $n_{skip} = 1$ in the SLC mode, we find that EXPRESS saves ~30% of the write energy for nominal SLC pages. Similarly, in the MLC mode of operation, we find that the optimal value is $n_{skip} = 2$, which ensures that the BER is <1%. Thus, the energy savings are found to be ~16% for the MSB pages, and ~46% for the LSB pages. Since the $t_{prog}$ values for the MSB pages are longer than those for the LSB pages, the percentage of energy savings is lower for the MSB pages for the same $n_{skip}$ value.

Data Retention Effects
Data retention is an essential consideration for nonvolatile flash memories. The charge stored on the FG/CT of the flash cells tends to leak out through the tunnel oxides at room temperature, lowering the cell threshold voltage over a period of time [31,32]. Hence, flash memory manufacturers keep wider voltage margins between the program $V_t$ and the read reference voltage in order to guarantee long-term data retention (~10 years for many products). Since EXPRESS trades off the voltage margin for improved energy efficiency, it is important to characterize the data retention time.

Figure 13 summarizes the results of an experiment that explores the effects of EXPRESS on the data retention for both the SLC and MLC modes of operation on PE-cycled blocks. It shows the bit error rate of the data written by EXPRESS (red bars), and the data written by the nominal program operation (blue bars). To accelerate the retention loss, we bake the chip at a higher temperature (120 °C) for 1, 2, or 3 h. Using the acceleration-factor-based calculation, we find that the 3 h of baking time corresponds to 5 years at room temperature, assuming the activation energy for the charge loss in the 3D NAND as $E_A = 1$ eV [33]. The results in Figure 13 show that the BERs for the EXPRESS write increase relative to the traditional programming, after the accelerated retention test. The temporary read error is a new reliability issue in 3D NAND Flash [34,35]. It is not considered in this case, as the BERs are <1% for all the types of pages and, hence, can be corrected using standard error-correction techniques [36][37][38].
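The acceleration-factor-based calculation can be sketched with the standard Arrhenius relation, $AF = \exp\left[\frac{E_A}{k}\left(\frac{1}{T_{use}} - \frac{1}{T_{bake}}\right)\right]$, with temperatures in kelvin. The sketch below is an illustration under stated assumptions: the room temperature of 25 °C and the function names are hypothetical choices, and the result is quite sensitive to the assumed room temperature.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(e_a_ev: float, t_use_c: float, t_bake_c: float) -> float:
    """Arrhenius acceleration factor between a use and a bake temperature."""
    t_use_k = t_use_c + 273.15
    t_bake_k = t_bake_c + 273.15
    return math.exp(e_a_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_bake_k))


def equivalent_years(bake_hours: float, e_a_ev: float,
                     t_use_c: float, t_bake_c: float) -> float:
    """Retention time at the use temperature equivalent to a given bake time."""
    factor = arrhenius_acceleration(e_a_ev, t_use_c, t_bake_c)
    return bake_hours * factor / (24 * 365)


# 3 h bake at 120 C with E_A = 1 eV; 25 C room temperature is an assumption.
years = equivalent_years(3, e_a_ev=1.0, t_use_c=25.0, t_bake_c=120.0)
print(f"{years:.1f} years")
```

With a 25 °C room temperature, this estimate comes out at roughly four years, the same order as the five-year figure quoted above; a slightly lower assumed room temperature yields a longer equivalent retention time.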

Validity of the Proposed Technique for Arbitrary Image Data
In this section, we verify that EXPRESS is applicable to any data pattern, with similar results. The chip under evaluation uses an internal data randomizer that randomizes the user data before writing them into the NAND array. The goal of such data randomization is to ensure the memory reliability by utilizing all four analog Vt states. In the absence of the data randomizer, all-zero data on both the LSB and MSB pages would lead to all cells being programmed into the B state. With data randomization, the exact Vt state of each cell is decided by the randomization key, which ensures an even distribution of Vt states among the memory cells. An even distribution improves the cell endurance and reliability. Thus, randomization is an integral feature of state-of-the-art NAND flash chips [39].
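The effect of randomization on the state distribution can be illustrated with a minimal sketch. The chip's internal randomizer is not documented here, so the seeded XOR keystream below is a hypothetical stand-in, and the names `scramble` and `state_histogram` are illustrative. The ER and B entries of the bit-to-state map follow the text; the A and C assignments are an assumed Gray mapping.

```python
import random
from collections import Counter

# (LSB bit, MSB bit) -> Vt state. ER and B follow the text; A and C are assumed.
STATE_OF_BITS = {(1, 1): "ER", (1, 0): "A", (0, 0): "B", (0, 1): "C"}


def scramble(bits, seed):
    """XOR bits with a seeded pseudorandom keystream (hypothetical randomizer).

    Applying the function twice with the same seed restores the input.
    """
    rng = random.Random(seed)
    return [b ^ rng.getrandbits(1) for b in bits]


def state_histogram(lsb_page, msb_page):
    """Count how many cells target each Vt state for a pair of pages."""
    return Counter(STATE_OF_BITS[pair] for pair in zip(lsb_page, msb_page))


if __name__ == "__main__":
    n_cells = 16384
    zeros = [0] * n_cells
    # Without randomization, every cell targets the B state.
    print(state_histogram(zeros, zeros))
    # With randomization, the targets spread roughly evenly over ER, A, B, and C.
    print(state_histogram(scramble(zeros, seed=1), scramble(zeros, seed=2)))
```

Reading the data back would apply the same keystream again, which is why the user never sees the scrambled pattern.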
To demonstrate EXPRESS on arbitrary data, we write an image of Einstein. Figure 14 summarizes the evaluation results for both the SLC and MLC modes of operation. We observe the same trend that we observed in Sections 4.2 and 4.3. The BER starts from 40% because the chosen image has 40% of the cells in the erase state at the beginning. Similar to the earlier results, the percentage of programmed bits exceeds 99% with n_skip = 2. It would also be interesting to study the performance gain of the EXPRESS method when it is used for error-tolerant image classification applications using neuromorphic computing systems, as demonstrated in previous works [40][41][42].
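For reference, the BER of a page can be computed by comparing the intended data with the data read back, bit by bit. The sketch below is a generic illustration; the function name `bit_error_rate` and the sample buffers are hypothetical and are not measured data.

```python
def bit_error_rate(expected: bytes, observed: bytes) -> float:
    """Fraction of bit positions at which observed differs from expected."""
    if len(expected) != len(observed):
        raise ValueError("buffers must have the same length")
    if not expected:
        raise ValueError("buffers must not be empty")
    # XOR marks the differing bits; count the set bits in each byte.
    wrong_bits = sum(bin(e ^ o).count("1") for e, o in zip(expected, observed))
    return wrong_bits / (8 * len(expected))


if __name__ == "__main__":
    written = bytes([0x0F, 0xA5, 0x00, 0xFF])
    read_back = bytes([0x0F, 0xA4, 0x00, 0xFF])  # one flipped bit out of 32
    print(bit_error_rate(written, read_back))
```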

Conclusions
In this paper, we experimentally demonstrate energy–accuracy disproportionality in 3D NAND flash memory chips. We propose EXPRESS, a new method for improving the energy efficiency of NAND write operations using a partial programming technique. We demonstrate EXPRESS on a 32-layer 3D NAND memory, operating it in both the SLC and MLC modes. We propose an adaptive algorithm for EXPRESS that accounts for the effects of the page-to-page variability, PE cycling, and data retention. We find that energy savings in the range of 20 to 50% are achievable with EXPRESS, depending on the page type, at the cost of less than a 1% loss in accuracy. We also find that the retention loss with EXPRESS is slightly higher than with the traditional write operation. The accelerated retention test shows that the BER with the EXPRESS write remains below 1% for five years of retention time. We demonstrate the robustness of EXPRESS using an arbitrary image as a test data pattern.