Kernel Mapping Methods of Convolutional Neural Network in 3D NAND Flash Architecture

A flash memory is a non-volatile memory that has a large memory window, high cell density, and reliable switching characteristics, and it can be used as a synaptic device in a neuromorphic system based on a 3D NAND flash architecture. We fabricated a TiN/Al2O3/Si3N4/SiO2/Si stack-based flash memory device with a polysilicon channel. The input signals and output values are binarized for accurate vector-matrix multiplication (VMM) operations in the hardware. In addition, we propose two kernel mapping methods for convolutional neural networks (CNNs) in the neuromorphic system. The VMM operations of the two mapping schemes are verified through SPICE simulation. Finally, off-chip learning in the CNN structure is performed using the Modified National Institute of Standards and Technology (MNIST) dataset. We compare the two schemes in terms of various parameters and determine the advantages and disadvantages of each.


Introduction
Convolutional neural networks (CNNs), a type of deep neural network (DNN), have been used for feature extraction in various applications, such as computer vision, pattern recognition, object detection, and medical image segmentation [1][2][3][4][5]. However, CNNs require considerable time owing to vector-matrix multiplication (VMM) operations based on kernel strides and the sequential computation of conventional processing units. Neuromorphic systems have been actively studied as candidates that can replace the von Neumann architecture owing to their low power consumption and fast computation through parallel computing [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. In a neuromorphic system mimicking a neural network, VMM operations are generally conducted through Ohm's law and Kirchhoff's current law, using the conductance states of the synaptic devices and the input signals. Since this operation can be distorted by hardware-intrinsic issues such as reliability and device-to-device variation, weight quantization, including binary neural networks (BNNs) and binary CNNs (BCNNs), can be adopted to obtain accurate analog weighted sums and realize hardware-friendly neural networks [21][22][23]. In particular, since a CNN requires a striding convolution operation for feature extraction, input signals can be entered sequentially into a fixed kernel array to improve memory cell efficiency when realizing a CNN in a neuromorphic system, but this method requires considerable time [24][25][26][27][28][29]. Alternatively, kernel strides can be conducted with multiple duplicated kernels, but a large number of cells is consumed for this operation [30].
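The weighted sum described above reduces to each cell contributing a current I = G·V (Ohm's law) and the currents on a shared line adding up (Kirchhoff's current law). A minimal NumPy sketch of this analog VMM; the conductance and voltage values are illustrative placeholders, not measured data:

```python
import numpy as np

def vmm_current(G, V):
    """Kirchhoff summation of per-cell Ohm's-law currents.

    G: (rows, cols) conductance matrix in siemens, one synapse per cell
    V: (rows,) input voltage vector applied along the rows
    Returns the total current collected on each column line.
    """
    return G.T @ V  # I_col = sum over rows of G[row, col] * V[row]

# Toy example: 3 inputs feeding 2 output lines, placeholder conductances.
G = np.array([[1e-6, 2e-6],
              [3e-6, 1e-6],
              [2e-6, 2e-6]])
V = np.array([1.0, 0.0, 1.0])  # binarized inputs: read voltage on or off
I = vmm_current(G, V)
print(I)  # column currents in amperes: [3.e-06 4.e-06]
```

The same summation is what the bit lines perform physically in the array, which is why the operation is parallel by construction.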
Various candidates can be utilized as synaptic devices, such as memristors, flash memory, and ferroelectric field-effect transistors (FeFETs) [31][32][33][34][35][36][37][38][39][40]. The memristor is one of the most widely studied candidates because it can be integrated into a cross-point structure that enables parallel computation; however, it requires additional selector devices such as diodes or transistors to suppress sneak-path currents, which adds overhead to the entire system. Transistor-based synaptic devices are relatively free from this issue thanks to their gate electrode, and flash memory stands out as a promising candidate for synaptic devices due to its advantages, including high cell density, reliable switching characteristics, excellent retention, and multi-level capability [41][42][43][44][45].
Flash memory can be organized into NAND and NOR structures based on the connection method. NAND flash offers excellent area efficiency but lacks parallel summation due to its serial connection structure. Moreover, conventional 2D NAND flash memory features a string structure in which cells are connected in series, making it challenging to apply in neural networks. Consequently, research groups are exploring the use of the 3D NAND flash structure for neuromorphic systems, employing vector-matrix multiplication operations through the bit lines [46][47][48][49]. To utilize 3D NAND flash in a neural network, the mapping method for the synaptic layer is crucial. Typically, a VMM operation involves summing the current through the bit line (BL) using a string select line (SSL) as the input, where one word line (WL) represents the synapses of a single synaptic layer [48]. The read voltage is imposed on the WLs sequentially along the synaptic string to perform inference in the neural network. In the case of small networks, many cells are wasted when each synaptic layer is mapped to one WL, so there is also a method of mapping multiple synaptic layers to one WL [40]. In this method, multiple layers within the same WL can be inferred sequentially by repeatedly applying the BL output voltage to the same SSL.
In addition, there are two kinds of learning methods for a hardware-based neural network: on-chip and off-chip learning. In on-chip learning, both training and inference operations are performed in the neuromorphic system; therefore, extra circuitry is required to obtain the gradient descent for the weight update of each synaptic device, and it is hard to adjust weight values precisely according to the backpropagation algorithm due to reliability issues such as device variation and limited switching cycles [50,51]. On the contrary, in off-chip learning, only inference operations are conducted with weight values pre-trained in a software-based manner, so additional systems for learning algorithms are unnecessary, and system performance can be made less sensitive to inaccurate device states through weight quantization [52,53].
In this paper, we demonstrate a CNN implementation with binary weights using TiN/Al2O3/Si3N4/SiO2/Si (TANOS) stack-based NAND flash memory in a 3D NAND architecture. First, the electrical characteristics of the fabricated NAND flash memory are verified, including transfer characteristics and switching properties. Then, the kernel mapping methods for the CNN are presented and compared in terms of utilized cells and striding operations. Finally, SPICE simulations are performed based on the measured data to confirm the VMM operation in the CNN for Modified National Institute of Standards and Technology (MNIST) recognition, and non-ideality effects, including device variations and stuck cells, are analyzed depending on the kernel mapping method. The rest of this paper is organized as follows: Section 2 describes the fabrication process of the flash memory cell and the electrical characteristics required for a synaptic device in a neural network. Section 3 presents the two convolution kernel mapping methods using the 3D NAND structure and the results of VMM calculations reflecting the measured devices; it also examines non-ideal characteristics in the hardware implementation for the two kernel mapping methods. Finally, Section 4 concludes this paper.

Materials and Methods
Figure 1a shows the device structure of a TANOS gate stack-based single NAND flash memory cell and a transmission electron microscopy (TEM) image of the fabricated device. The single NAND flash memory cell with the TANOS stack was fabricated through the process flow outlined in Figure 1b. First, a 300 nm-thick buried oxide layer was formed through a wet oxidation process. Following this, 100 nm-thick amorphous silicon was deposited as a channel layer at 550 °C through low-pressure chemical vapor deposition (LPCVD). To crystallize the deposited amorphous silicon, a drive-in process was conducted at 600 °C for 24 h. Subsequently, the gate stack, composed of TiN/Al2O3/Si3N4/SiO2, was sequentially deposited. Initially, SiO2 (4 nm) and Si3N4 (6 nm) were deposited as a tunneling oxide and a charge-trapping layer, respectively, using LPCVD. A 9 nm-thick layer of Al2O3 was then deposited as a blocking layer through atomic layer deposition (ALD), followed by the deposition of the gate electrode, TiN (50 nm), through metal-organic chemical vapor deposition (MOCVD). A 25 nm-thick SiO2 hard mask was deposited using plasma-enhanced chemical vapor deposition (PECVD) to enhance adhesion between the TiN gate and the photoresist. After that, gate patterning was performed, and the self-aligned source/drain was formed by ion implantation with arsenic ions (As+). Finally, the conventional back-end-of-line (BEOL) process was performed, followed by forming gas annealing (FGA) at 450 °C for 30 min. All the electrical characteristics of the fabricated NAND flash memory were measured using a semiconductor parameter analyzer (Keysight B1500A) with a source measure unit (B1517A, maximum slew rate of 0.2 V/µs), a pulse generator unit (PGU), and a waveform generator/fast measurement unit (B1530A).
Figure 1c shows the measured I_D−V_G transfer characteristics of the fabricated single device (gate length = 2 µm, channel width = 5 µm) with respect to different program voltages (V_PGM). During the read operation, the gate voltage (V_G) was swept from −1 V to 7 V with a drain voltage (V_D) of 0.1 V. V_PGM with a pulse width of 1 ms was applied, ranging from 9 V to 15 V at 1 V intervals. As the program voltage increased, the threshold voltage (V_T) gradually shifted in the positive direction. This shift occurs because electrons are trapped in the charge-trapping layer through Fowler-Nordheim (FN) tunneling, which, in turn, hinders the formation of the inversion layer. To verify the possibility of multi-level states in the flash memory, we investigated the changes in conductance in response to voltage pulses, as shown in Figure 1d. The program and erase pulses were each applied 25 times using the incremental step pulse programming (ISPP) method. For the program, the voltage pulse was increased in 0.1 V steps from 8 V to 10.5 V with a pulse width of 1 ms, and for the erase, it was decreased in −0.1 V steps from −12 V to −14.5 V with a pulse width of 5 ms. After each program and erase pulse, a read pulse was applied to verify the state. The conductance levels extracted from the device current after each voltage pulse show that the flash memory can operate at multiple levels. Although the goal is to implement a hardware BCNN, compatibility with these multi-level characteristics is still valuable [33]: multi-level operation implies the ability to fine-tune weights, enabling more precise refinement of the binary states, and, considering the energy consumption of the entire system, it allows various sets of binary states to be selected.
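The ISPP sequence above is simply a linear amplitude ramp applied pulse by pulse. A small sketch of the pulse amplitudes; the counts and step sizes mirror the values quoted in the text, while the exact endpoint depends on whether the first pulse sits at the start voltage (our assumption here):

```python
def ispp_amplitudes(start_v, step_v, n_pulses):
    """Amplitude of each ISPP pulse: a fixed voltage ramp from start_v."""
    return [round(start_v + i * step_v, 3) for i in range(n_pulses)]

# 25 program pulses, +0.1 V steps from 8 V (1 ms pulse width in the experiment)
program = ispp_amplitudes(8.0, 0.1, 25)
# 25 erase pulses, -0.1 V steps from -12 V (5 ms pulse width in the experiment)
erase = ispp_amplitudes(-12.0, -0.1, 25)
print(program[0], program[-1])  # 8.0 10.4
print(erase[0], erase[-1])      # -12.0 -14.4
```

In practice each amplitude step is followed by a read-verify pulse, as described in the text, so the ramp stops once the target state is reached.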
The measured I_D−V_G transfer characteristics of 30 fabricated cells in the programmed and erased states were extracted in order to assess device-to-device variation, as shown in Figure 1e. The cells were programmed using a V_PGM of 13 V with a 1 ms pulse width and erased with an erase voltage (V_ERS) of −15 V and a 5 ms pulse width. A memory window exceeding 3 V, which represents the V_T difference between the programmed and erased states, was achieved. To ensure a large current on/off ratio (>10^4) and minimal variation within the same state, the read voltage for the BCNN inference operation was set to 1.5 V. Figure 1f shows the threshold voltage distribution extracted from the 30 programmed and erased states. The threshold voltage was extracted at I_D = 10^−7 × W/L [54]. The dispersion in the programmed state is larger than that in the erased state, and the dispersion obtained through measurement was reflected in the VMM SPICE simulation.
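The constant-current V_T criterion quoted above (I_D = 10^−7 × W/L) can be applied to a measured sweep by interpolation. A sketch on a synthetic transfer curve; the curve shape and the extracted value are illustrative, not the paper's data:

```python
import numpy as np

def constant_current_vt(vg, id_, width_um, length_um, i0=1e-7):
    """Constant-current V_T: gate voltage where I_D crosses i0 * W/L.

    vg, id_: measured gate-voltage sweep and drain current (same length);
    id_ is assumed monotonically increasing over the sweep.
    """
    i_crit = i0 * width_um / length_um
    return float(np.interp(i_crit, id_, vg))

# Synthetic exponential transfer curve standing in for a measured sweep.
vg = np.linspace(-1, 7, 161)                 # -1 V to 7 V, 50 mV steps
id_ = 1e-12 * np.exp((vg - 1.0) / 0.1)       # subthreshold-like rise near 1 V
vt = constant_current_vt(vg, id_, width_um=5, length_um=2)
print(round(vt, 2))  # 2.24 for this synthetic curve
```

The same helper applied to all 30 measured sweeps would yield the V_T distributions plotted in Figure 1f.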
Since the read voltage (V_read) and pass voltage (V_pass) are frequently applied to the devices during inference operations, it is important that the device state is not disturbed by the read and pass bias conditions. Figure 2a,b show the read and pass disturbance characteristics of the fabricated device. The read voltage was 1.5 V with a sufficient on/off margin, the pass voltage was 7.5 V, which is high enough for an unselected cell to conduct like a wire, and the pulse width was 100 µs. It is confirmed that neither the programmed nor the erased state changed while V_read and V_pass pulses were repeatedly applied to the device over 10^5 times. In addition, the endurance characteristics were measured by repeatedly programming and erasing the cell, as shown in Figure 2c. The V_T window remained sufficiently large even after more than 10^4 cycles. Although the threshold voltage of the programmed state increased over the cycling endurance measurement, this should not cause significant issues: the device still maintained the off-state at the read voltage (1.5 V) with a current level of about 10^−11 A even with the increased threshold voltage, and the proposed system is based on off-chip training, where weight adjustment rarely occurs compared with on-chip learning-based hardware systems. For the retention characteristics, it is confirmed that the two states in the flash memory device were maintained for over 10^4 s at a high temperature of 85 °C.
The extended lines from the log plot suggest that the two states can be distinguished over ten years. This implies that the fabricated device can stably store binary weight values for neuromorphic computing operations, as depicted in Figure 2d.

Results
Figure 3a shows the 3D NAND architecture for performing VMM computations in a CNN with a comparator circuit for binary activations. Input voltage (V_input) signals encoded as +3 V/0 V are applied through the SSLs, while V_read and V_pass are applied to the selected WL and the unselected WLs, respectively. One WL turns on at a time, and the read operation proceeds sequentially for the inference of all layers. Each cell acts as a synaptic device and has a different weight depending on the training results. The output currents are added through the string lines, and two BL currents are compared for binary activations with both positive and negative weight values. The combined current flowing through each of the two BLs is converted into a voltage, and the voltages of the two lines are compared through a comparator to generate the final output, which then serves as the input of the next layer. When the current of BL+ is greater than that of BL−, the comparator output maintains V_dd, thereby serving a binary activation role. The weight is expressed using only the programmed-state conductance (G_PGM) and erased-state conductance (G_ERS) of the NAND flash device; binary weight values of +1 and −1 are expressed by (G_ERS − G_PGM) and (G_PGM − G_ERS), respectively.
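The differential weight encoding and comparator-based binary activation described above can be sketched as follows. The on/off conductances, the simplified gating of the read voltage by the binarized input, and the helper names are illustrative assumptions, not the paper's exact circuit:

```python
import numpy as np

G_ERS, G_PGM = 5e-6, 1e-10  # illustrative erased (on) / programmed (off) conductances, S

def map_weight(w):
    """+1 -> erased cell on BL+, programmed on BL-; -1 -> the reverse."""
    return (G_ERS, G_PGM) if w > 0 else (G_PGM, G_ERS)

def binary_activation(weights, inputs, v_read=1.5):
    """Sum currents on the BL+/BL- pair and compare them (the comparator)."""
    g_pos, g_neg = zip(*(map_weight(w) for w in weights))
    v = np.array(inputs, dtype=float) * v_read  # binarized inputs gate the read voltage
    i_pos, i_neg = np.dot(g_pos, v), np.dot(g_neg, v)
    return 1 if i_pos > i_neg else 0            # comparator: Vdd -> logical 1

out = binary_activation(weights=[+1, -1, +1], inputs=[1, 1, 1])
print(out)  # 1: the two +1 synapses outweigh the single -1
```

Because each weight occupies one cell on each of the two BLs, a +1 and a −1 weight produce equal and opposite current differences, which is exactly the (G_ERS − G_PGM) / (G_PGM − G_ERS) encoding in the text.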
In addition, two kernel mapping methods in the WL plane are illustrated for the convolution layers. In each kernel weight composed of two devices, the colored portion represents +1, and the white portion represents −1. Figure 3b illustrates the kernel mapping method (Scheme 1) for conducting striding calculations with fixed input signals and duplicated kernel maps. Thanks to the duplicated kernel maps, striding operations can be completed time-efficiently with one input pulse, at the expense of devices. Since the input is fixed in this method, there are non-kernel areas in the array; these are tuned to the programmed state so that they do not affect the VMM operation, while the kernel area is tuned to the programmed or erased state according to the weight values. In contrast, Figure 3c shows the kernel mapping method (Scheme 2) with input striding and fixed kernel maps, as in a software-based CNN. Because the kernel map is fixed, the input signal needs to be provided sequentially as many times as the striding number, which means that this method is time-consuming but area-efficient because the kernel map is not duplicated. In addition, Scheme 2 is more affected by the pass and read disturbances of the device because the same cells are used repeatedly for multiple kernel operations. Also, because the current must be read several times from the same BL, an additional circuit to store intermediate results might be needed [22,24], which can be an overhead for the overall system in return for cell efficiency.
We extracted the Berkeley short-channel IGFET model (BSIM4) based on the measured transfer curve characteristics of the programmed and erased cells in order to perform SPICE simulations. To simulate on-chip learning, models would need to be established for all program and erase operations [55]. However, since the proposed system is based on off-chip learning for hardware BNNs, models were extracted only for the two states. As shown in Figure 4a, the BSIM4 models exhibited a precise fit to the measurement data. The convolution operation results of the two schemes are compared using a 5 × 5 input image and a 3 × 3 kernel, as shown in Figure 4b. The input (+1/0) and weight (+1/−1) values
are binarized, and two BLs are used as a pair to generate output signals, as discussed in Figure 3a. As a result of the convolution operation, there are a total of 9 output values, given by (input − kernel + 1) × (input − kernel + 1), and the ideally calculated values are depicted. Figure 4c shows the corresponding BL current difference of the two schemes obtained through SPICE studies, which confirms that both methods provide a BL current difference in linear proportion to the output value, so the convolution operation can be performed accurately. In contrast to Scheme 1, in which the convolution operations can be conducted simultaneously from nine BLs with one read pulse, nine read pulses need to be sequentially applied to one BL in Scheme 2. In addition, the VMM operations are verified in the 3D NAND structure (16 BLs × 16 strings × 16 WLs) with randomly generated inputs and kernels and compared with the ideal case, as shown in Figure 4d. To consider device variation effects in the SPICE simulations, we incorporated the deviation in V_T from the 30 measured devices. As we increased the number of cells, we conducted the VMM operations 25 times for each case, and the results are presented as a box plot. With the increase in the number of cells, the current flowing through the bit line (BL) also increased. However, when the VMM was performed multiple times, it was not consistent due to device-to-device variations among erased-state devices, resulting in increased scatter as the number of cells increased. Nevertheless, the VMM results closely approximated the ideal line with minimal variation, indicating the feasibility of performing inference operations under our system conditions. In addition, the effect of device state variation on the recognition accuracy of the MNIST images is analyzed depending on the kernel mapping scheme, as shown in Figure 6a. The device state variation can occur through either process-induced variation during fabrication or programming-induced variation during the weight transfer procedure. The variation tests were repeated 20 times and summarized in box plots based on the conductance variation in terms of the standard deviation (σ)/average (µ) for both kernel
mapping schemes. The device variation (σ/µ) is applied to the conductance range, assuming that each programmed and erased state has a Gaussian distribution. The accuracy decreases as σ/µ increases for both schemes, but more severely for Scheme 2. In Scheme 2, the input is provided multiple times with the fixed kernel maps; therefore, each cell state is read several times, and the state variation in each cell can affect the overall output signals. It is noticeable that in Scheme 2, as the variation increases, the output becomes increasingly sensitive to each device's particular draw from the distribution, leading to a significantly broader spread. In contrast, in Scheme 1, where the kernel maps are duplicated for the striding operations, the device variation has less effect on the output signals because each BL is used only once for a whole convolution operation. Therefore, Scheme 2 is more vulnerable to the device variation effect, especially when the number of kernels is small and the number of input strides is accordingly high. The effect of stuck devices on system performance is also analyzed over 20 trials, depending on the kernel mapping method, as shown in Figure 6b. A stuck device is defined as a cell whose state is fixed off and is not switchable. As the stuck-at-off device ratio increases, the classification accuracy decreases, but it is more significantly degraded for Scheme 2.
This is also because of the input striding in Scheme 2 for convolution operations, which makes Scheme 2 more sensitive to stuck cells in return for using fewer cells. On the contrary, Scheme 1 is relatively robust against the effects of stuck cells thanks to the duplicated kernel maps. In summary, the two kernel mapping methods are compared in Table 1. Scheme 1 achieves time efficiency through parallel operation at the cost of using more cells. Scheme 2 is time-consuming because the input signals must be given sequentially for the striding operations, but it has good cell efficiency. When using 8 kernels, Scheme 1 only needs to apply 8 pulses per image, whereas Scheme 2 needs 576 pulses. Conversely, Scheme 1 exhibits cell efficiency lower by approximately 10^5 times due to kernel duplication. Nevertheless, given the speed of inference and the stress applied to the devices, Scheme 1 offers sufficient advantages. In terms of variation and stuck cells, Scheme 2, which uses fewer cells to represent the kernels, is more severely affected, so its system performance can be more degraded by non-ideal device issues. In particular, when the variation or stuck-cell ratio is high, Scheme 1 demonstrates better performance by up to 40%, even in the worst-case scenario.
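The pulse counts quoted above follow directly from the stride geometry: a 5 × 5 kernel sliding over a 28 × 28 MNIST image visits 24 × 24 = 576 positions. A quick sanity check of the time trade-off between the two schemes; the function name is ours, and stride 1 with no padding is assumed, as in the network described here:

```python
def stride_positions(input_size, kernel_size, stride=1):
    """Number of kernel positions when striding over a square input."""
    n = (input_size - kernel_size) // stride + 1
    return n * n

positions = stride_positions(28, 5)  # 24 * 24 = 576 positions on MNIST
kernels = 8

# Scheme 1: kernels duplicated over all stride positions -> one read per kernel map
pulses_scheme1 = kernels       # 8 pulses per image
# Scheme 2: fixed kernel maps -> one input pulse per stride position
pulses_scheme2 = positions     # 576 pulses per image

print(pulses_scheme1, pulses_scheme2)  # 8 576
```

This is the 8-versus-576 comparison in the text; the cell-count penalty of Scheme 1 scales with the same number of stride positions, since every kernel map is duplicated once per position.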

Conclusions
In this study, we proposed two kernel mapping methods for CNN neuromorphic computing in 3D NAND architectures. The flash memory devices were fabricated with the TANOS stack, and their electrical characteristics were measured. Based on the measured data, the VMM operation was verified for both methods through SPICE simulations. In addition, a device non-ideality test was performed for both mapping schemes through a binary CNN off-chip simulation. The results reveal that Scheme 1 is robust to variations and stuck cells. In implementing a very large CNN neuromorphic system utilizing a 3D NAND flash structure, with its high cell density but slow operation characteristics, Scheme 1, which is robust against non-idealities such as variations and stuck cells and offers faster speed, could be a suitable choice.

Figure 1. (a) Schematic view of a NAND flash structure with a TANOS stack and a TEM image of the gate stack. (b) Process flow of the fabricated NAND flash device. (c) I_D−V_WL curves according to different program voltages. (d) Conductance response according to voltage pulses. (e) I_D−V_WL transfer curve characteristics of programmed and erased states for 30 NAND flash cells. (f) Threshold voltage distribution of programmed and erased states.

Figure 3. (a) Schematic view of the 3D NAND flash architecture with binary input, weight, and activation. Kernel mapping methods in the 3D NAND flash architecture: (b) Scheme 1 and (c) Scheme 2.


Figure 4. (a) Measured and fitted transfer curves with the SPICE BSIM4 model. Kernel operation verification: (b) input, kernel, and ideal output values; (c) corresponding BL current difference depending on the two mapping schemes; (d) VMM results in the 3D NAND architecture.

Figure 5a shows a CNN configuration for MNIST recognition verification in the 3D NAND architecture using the off-chip learning method, together with its training results according to epochs. The CNN consists of one convolution layer with eight kernels of size 5 × 5 and one FC layer. The MNIST data, consisting of 60 k training images and 10 k test images, are binarized in black and white. The weights are initially set randomly with a Gaussian distribution, and training is performed using AdamW as the optimizer. For the BCNN training with the straight-through estimator (STE) method, real-valued weights are maintained for the gradient updates, while their binarized values are used in the forward pass.
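The STE scheme described above can be sketched in a few lines: the forward pass uses sign-binarized weights, while the gradient bypasses the non-differentiable sign and updates the real-valued shadow weights directly. Clipping the shadow weights to [−1, 1] is common practice in binary-network training and is our assumption here, as is the dummy gradient used for the demonstration:

```python
import numpy as np

def binarize(w):
    """Forward pass: sign(w) in {-1, +1} (zero mapped to +1)."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_update(w_real, grad_wrt_binary, lr=0.01):
    """Straight-through estimator: the gradient w.r.t. the binary weights
    is applied directly to the real-valued weights (identity backward)."""
    return np.clip(w_real - lr * grad_wrt_binary, -1.0, 1.0)

w = np.linspace(-0.5, 0.5, 25).reshape(5, 5)  # real-valued shadow weights
wb = binarize(w)                               # binary weights used for inference
w = ste_update(w, grad_wrt_binary=np.ones_like(w))  # dummy gradient for illustration
print(sorted(set(wb.flatten())))  # [-1.0, 1.0]
```

After training, only the binary weights wb would be transferred to the array as programmed/erased states; the real-valued weights exist only in software, which is exactly what makes the scheme compatible with off-chip learning.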

Figure 5. (a) BCNN structure and (b) training results using binary input, weight, and activation (output) for MNIST data.

Table 1. Comparison of kernel mapping methods in MNIST BCNN.