Exploring the Impact of Variability in Resistance Distributions of RRAM on the Prediction Accuracy of Deep Learning Neural Networks

In this work, we explore the use of the resistive random access memory (RRAM) device as a synapse for mimicking the trained weights linking neurons in a deep learning neural network (DNN) (AlexNet). The RRAM devices were fabricated in-house and subjected to 1000 bipolar read-write cycles to measure the resistances recorded for Logic-0 and Logic-1 (we demonstrate the feasibility of achieving eight discrete resistance states in the same device depending on the RESET stop voltage). DNN simulations were performed to compare the relative error between the output of AlexNet Layer 1 (convolution) implemented with the standard backpropagation (BP) algorithm-trained weights and the weights encoded using the measured resistance distributions from RRAM. The IMAGENET dataset is used here for classification. We focus only on the Layer 1 weights in the AlexNet framework, with the 11 × 11 × 96 filter values coded into a binary floating-point format and substituted with the RRAM resistance values corresponding to Logic-0 and Logic-1. The impact of variability in the low and high resistance states of the RRAM on the accuracy of image classification is studied by formulating a look-up table (LUT) for the RRAM (from measured I-V data) and comparing the convolution output of AlexNet Layer 1 with the standard outputs from the BP-based pre-trained weights. This is one of the first studies dedicated to exploring the impact of RRAM device resistance variability on the prediction accuracy of a convolutional neural network (CNN) on an AlexNet platform through a framework that requires limited actual device switching test data.


Introduction
The advent of convolutional neural networks (CNNs) has brought about a paradigm shift in the adoption of hardware-assisted deep learning for edge computing applications. Several non-volatile memory based device technologies have been explored to realize neuromorphic/deep-learning-based CNN systems, including resistive switching memory (RRAM) [1,2], spin-transfer torque RAM (STT-RAM) [3,4] and phase-change memory (PCM) [5,6], for synaptic storage of the weights of the network, which form the core of the computation. While many of these device technologies have been shown to be good candidates at an array level for hardware implementation, the issue of "variability" in device performance and its impact on the reduction in prediction accuracy (say, for image classification) of the neural network has not been specifically dealt with. In this study, we propose a framework to assess and "quantify" the impact of RRAM switching variability on the prediction accuracy of a CNN. The RRAM is chosen as the candidate device here as it has a simple structure [7], enables high integration density (small form factor) and can exhibit switching in the low-power regime, although the variability in the switching trends can be quite high [8][9][10].

Overview of Convolutional Neural Networks
Convolutional neural networks (CNNs) are a subset of the machine learning family. They are well known for their grid-like topology, in which each grid element is computed in parallel, making them well suited to a low-footprint, "digital signal processor"-like implementation. CNN computation can also be applied to time-series data by processing pixel data at regular time intervals. The CNN can be implemented effectively on a parallel processing system, in which the mathematical matrix operation of convolution is followed by a fully connected neural network implementation [11]. The mathematical operation of "convolution" is illustrated in Figure 1. The kernel, or trained weight, is applied across the entire image, as shown in Figure 1. The boxes with arrows show how an input tensor element is transformed into the output tensor using the trained weight or kernel. Following the convolution, the fully connected layers connect every neuron in one layer to every neuron in the next layer, similar to the traditional multi-layer perceptron (MLP) neural network. Every neuron in the network computes an output value by applying a function to the input values from the previous layer, as shown in Figure 2. The function applied to the input values is specified by trained weights. Learning in a neural network is the process of adjusting its weights (minimizing the cost function by iteration) [12].
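The kernel-sliding operation described above can be sketched as follows. This is a minimal pure-Python illustration (the paper's own implementation is in MATLAB®); the array sizes and values are made up for demonstration and are not taken from AlexNet:

```python
def convolve2d(image, kernel, stride=1):
    """Valid 2-D convolution as used in CNNs (i.e., cross-correlation):
    slide the kernel over the image and compute a sum of products
    at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += image[i * stride + u][j * stride + v] * kernel[u][v]
            out[i][j] = acc
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # tiny 2 x 2 example kernel
print(convolve2d(image, kernel))  # -> [[6.0, 8.0], [12.0, 14.0]]
```

Each output element is one "sum of products" (SOP) between the kernel and the window of the image it currently overlaps, which is exactly the per-window computation used later in the simulation framework.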

Use of RRAM for Neuromorphics/Deep Learning
Resistive switching random access memory (RRAM) devices are suitable for neuromorphic applications because the change of the weights in the neural network can be mimicked by the analog conductance change in the dielectric of the RRAM, which changes state due to defect (vacancy) generation, transport or annihilation. The application of an electrical current or voltage as the stimulus creates ionic motion or vacancy cluster formation at the nanoscale within a two-terminal device, leading to local redox phenomena; this, in turn, affects the device resistance, which remains non-volatile unless perturbed externally again by another voltage or current pulse. Devices exhibiting such properties are, in general, termed memristors. Figure 3 depicts the typical current-voltage (I-V) characteristics of an RRAM, which can be switched in either the bipolar mode or the unipolar mode (only the bipolar scheme is shown here for illustration). The transition from the high to the low resistance state is called SET, while the transition from the low to the high resistance state is referred to as RESET. The reasons RRAM devices are used in neuromorphic applications are their analog conductance changes (depending on the material stack and operating conditions), high scalability, ultrafast switching speed, non-volatility, large HRS (High Resistance State)/LRS (Low Resistance State) window, non-destructive read, simple structure with standard materials, 3D stackability and multilevel storage (the option to store more than one bit of information in one cell). Low write current (short electrical pulses able to change the resistive state), long retention time and endurance are other factors that favor the adoption of RRAM. Low operating energy per bit and excellent CMOS (Complementary Metal Oxide Semiconductor) process compatibility and manufacturability are additional advantages of using the RRAM as the primary element for neuromorphic computing [1].

Use of the AlexNet Platform for Implementation
AlexNet, one of the most popular CNNs, competed in the ImageNet Large Scale Visual Recognition Challenge in 2012 with an error rate of 15.3%. The entire model was computationally expensive and was implemented on graphics processing units (GPUs), enabling the network to perform well in real time. Figure 4 shows the schematic of AlexNet, containing eight layers, of which five are convolutional and three are fully connected. The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (the distance between consecutive computations). The second convolutional layer takes as input the output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully connected layers have 4096 neurons each [11]. While the combination of AlexNet and RRAM for deep learning implementation is not new and has been proposed in References [14][15][16][17][18], these works did not give any details on the simulation methodology and procedure. Our work here provides a full system flow chart that explains how the variability in RRAM was explored and exploited to represent the "binary" representation of the weights in the CNN.
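The spatial size of a convolutional output map follows the standard formula out = floor((W − K + 2P)/S) + 1. A quick Python sketch (the kernel size and stride are from the text; the effective 227-pixel input width, equivalent to padding the 224 input, is an assumption made so the numbers are consistent with the 3025-window count used later):

```python
def conv_out_size(w, k, stride, pad=0):
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * pad) // stride + 1

# Layer 1: 11 x 11 kernels, stride 4. An effective input width of 227
# gives a 55 x 55 output map, i.e., 55 * 55 = 3025 windows.
side = conv_out_size(227, 11, 4)
print(side, side * side)  # -> 55 3025
```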

Scope and Outline of Study
This paper is organized as follows. Section 3 explains the device fabrication of the RRAM and the simulation framework setup for the performance analysis of the CNN, where the weights in AlexNet Layer 1 are represented with the RRAM LUT (look-up table). Section 4 presents the electrical characterization results of the fabricated RRAM device, resulting in seven different HRS states depending on the magnitude of the RESET stop voltage pulse. In Section 5, we discuss the simulation results obtained from embedding RRAM conductance variations into the AlexNet Layer 1 weights.
Finally, we conclude our study in Section 6 revisiting the novelty of our study and proposing possible recommendations for further research along this line.

HfOx RRAM Device Fabrication
The HfOx-based RRAM (1R) devices were fabricated on an SiO2 substrate, as shown in Figure 5. The bottom electrodes were patterned by UV lithography with a positive-tone resist, followed by sputtering of the inert 10 nm Pt metal electrode under Ar ambient at room temperature and a lift-off process. The process was repeated to sputter a 15 nm HfOx layer and then the active top metal electrode of Ti (10 nm) to form 10 × 10 µm cross-point devices. Finally, another 10 nm of inert Pt was deposited on top of the Ti layer for passivation. The I-V curves and endurance cycles were characterized using the Keithley 4200A-SCS system. The SET operation was initiated using 3 V, 5 ms pulses with an external current compliance setting of 100 µA, while the RESET events were triggered by a stepwise increase in pulse amplitude of the opposite voltage polarity (from −1.2 V to −2.2 V in 0.2 V intervals) with a pulse period of 200 ns to achieve multilevel resistance states.

RRAM Based Synaptic Simulation Setup
The framework for incorporating the RRAM resistance distributions into AlexNet Layer 1 using a look-up table (LUT) approach is illustrated in Figure 6. The AlexNet Layer 1 convolution operation is implemented in MATLAB® with two pipelines: the first uses AlexNet's original extracted trained weights (96 filters, each of 11 × 11, for red-green-blue (RGB)), implemented with a MATLAB® general matrix, while the second constructs the AlexNet weights from the LUT. The LUT contains each algorithm-trained synaptic weight in a 32-bit binary format (24 bits for the mantissa, 7 bits for the exponent and 1 bit for the sign), as illustrated in Figure 7. Each algorithm-trained bit, Logic-0 or Logic-1, in the trained weight is substituted by a resistance value from the low or high resistance state of the switching RRAM device, encoded into a Logic-0 or Logic-1 based on a set resistance threshold level (RTH). The difference in the accuracy of prediction of the Layer-1 computed output between the two pipelines provides a quantitative estimate of the variability-induced error in the hardware implementation of the CNN, in comparison to its algorithmic counterpart.
The detailed step-by-step implementation of the RRAM based CNN setup is described below, along with the approach to quantify and compare the prediction errors.
Step 2 The RGB image (224 × 224 × 3) is copied multiple times with a window size of 11 × 11 each and with a 4-pixel difference between adjacent windows. The output from this stage is a matrix of size 11 × 11 × 3025 for each color (RGB).
Step 4 For each 11 × 11 window of the input image (total 3025), the matrix SOP (sum of products) is separately computed using the weights from the two pipelines. The difference in the computed output for the two pipelines is then recorded and defined as the "prediction error".
Step 5 To capture the stochastics of the error in computation, Steps 3 and 4 are iterated 1000 times using different randomly sampled resistance values for the HRS and LRS states of the RRAM device.
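The windowing in Step 2 and the per-window sum of products in Step 4 can be sketched as follows. This is an illustrative pure-Python version (the paper uses MATLAB®); a single channel and an effective 227-pixel input width are assumed so the window count matches the 3025 quoted in Step 2:

```python
def extract_windows(image, k=11, stride=4):
    """Step 2: slice the square input into k x k windows with the given
    stride, returning a list of flattened windows (im2col-style layout)."""
    n = len(image)
    windows = []
    for i in range(0, n - k + 1, stride):
        for j in range(0, n - k + 1, stride):
            windows.append([image[i + u][j + v]
                            for u in range(k) for v in range(k)])
    return windows

def sop(window, weights):
    """Step 4: sum of products between one window and one flattened filter."""
    return sum(x * w for x, w in zip(window, weights))

# A 227 x 227 single-channel image yields 55 * 55 = 3025 windows of 11 x 11.
image = [[0.0] * 227 for _ in range(227)]
print(len(extract_windows(image)))  # -> 3025
```

The "prediction error" of Step 4 is then simply the difference between `sop(window, software_weights)` and `sop(window, rram_encoded_weights)` for each of the 3025 windows.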

Generation of RRAM based LUT for AlexNet Weights
As illustrated in Figure 7, the software-trained weights are obtained using the stochastic gradient descent (SGD) optimization algorithm, optimized over one million images across 1000 categories. Each of these trained weights is represented in a 32-bit floating-point format comprising 24 bits of mantissa, 7 bits of exponent, and one sign bit. Every single matrix element within a filter within a given layer of the AlexNet framework would need to be represented using this 32-bit format. Figure 8 illustrates how the measured electrical resistance of the RRAM is mapped to the Logic-0/1 of every single bit of the 32-bit representation of the algorithm-trained weight in Figure 7. The DC I-V characteristics of the HfOx-based RRAM device, with abrupt SET and gradual RESET behavior, are measured, and the corresponding resistances in the LRS and HRS are extracted at a specific read voltage (VREAD ~150 mV in our case). The extracted resistances in the HRS and LRS states (RHRS and RLRS) are then fitted with a Lognormal distribution. As seen in Figure 8, the extracted resistance distributions are expected to overlap at the tail ends, which is where the bits in the floating point could be erroneously represented in the CNN hardware implementation.
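Converting a trained weight into its 32-bit string can be sketched with Python's standard library. Note that this sketch packs into the standard IEEE-754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits), shown purely for illustration; the 24/7 bit partition quoted in the text is an assumption of the paper's own format and would differ slightly:

```python
import struct

def float_to_bits(x):
    """Return the 32-bit binary string of a float packed as an
    IEEE-754 single-precision value (big-endian bit order)."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    return format(n, "032b")

# 0.15625 = 1.01b * 2**-3, exactly representable in binary32.
print(float_to_bits(0.15625))  # -> 00111110001000000000000000000000
```

Each character of this string is then the Logic-0/Logic-1 bit that the RRAM resistance sample must represent.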
To demarcate the "0" and "1" from the resistance distributions, a threshold resistance (RTH) is arbitrarily defined as the average of the log-scale values of the mean resistances in the HRS and LRS states:

log10(RTH) = [log10(mean RHRS) + log10(mean RLRS)] / 2

For any randomly sampled value (R) from the fitted HRS or LRS distribution, the encoding rule is:

If R < RTH, encode as Logic-0
Else (R ≥ RTH), encode as Logic-1

For every bit in the floating-point representation used for the error computation in Step 5 of Figure 6, 1000 values of RHRS and RLRS are sampled from their distributions and encoded to Logic-0/Logic-1 using this rule.
In this study, the LRS and HRS states are denoted as Logic-0 and Logic-1, respectively. Sampled values of RLRS from the LRS distribution that are larger than RTH will be wrongly classified as a "1". Similarly, sampled values of RHRS from the HRS distribution will be classified as a "0" if the value is lower than RTH. These incorrectly coded bits are referred to here as "False-0" and "False-1", as shown in Figure 8.
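The thresholding rule and the resulting False-0/False-1 counts can be sketched as follows. The lognormal parameters below are illustrative assumptions only, not the measured device values, and the paper's MATLAB® implementation is replaced here by a small Python example:

```python
import random

random.seed(0)

# Illustrative lognormal parameters in log10 space (NOT measured values):
MU_LRS, SIGMA_LRS = 3.0, 0.15   # LRS centered near 1 kOhm
MU_HRS, SIGMA_HRS = 5.0, 0.40   # HRS centered near 100 kOhm

# Threshold: average of the log-scale mean resistances (geometric mean).
R_TH = 10 ** ((MU_LRS + MU_HRS) / 2)

def sample_r(mu, sigma):
    """Draw one resistance from a lognormal (normal in log10 space)."""
    return 10 ** random.gauss(mu, sigma)

def encode(r):
    """R < R_TH -> Logic-0 (LRS); otherwise Logic-1 (HRS)."""
    return 0 if r < R_TH else 1

# Count miscoded bits over 1000 samples per state, as in Step 5.
n = 1000
false_1 = sum(encode(sample_r(MU_LRS, SIGMA_LRS)) == 1 for _ in range(n))
false_0 = sum(encode(sample_r(MU_HRS, SIGMA_HRS)) == 0 for _ in range(n))
print(f"R_TH = {R_TH:.0f} Ohm, False-1: {false_1}/{n}, False-0: {false_0}/{n}")
```

The wider the overlap between the two fitted distributions, the larger the False-0/False-1 counts become, which is precisely the variability effect the LUT propagates into the CNN.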
The above LUT generation procedure, implemented as a MATLAB® subroutine, is applied within four nested loops, as discussed in the following subsection.

Multiple Embedded Loops of CNN Simulation
To fully examine the impact of hardware device variability on the accuracy drop in the CNN implementation for a practical edge computing application, the following four loops of simulation are deemed necessary:
Loop 1 As illustrated in Figure 9, Loop-1 involves replacing a group of just 4 bits at a time from the 32-bit string representation of the algorithm-trained weight with the RRAM-encoded binary values. The binary values of the remaining 28 bits are intentionally left undisturbed. With this 4-bit perturbation moving from the least significant bits (LSB) to the most significant bits (MSB) (Bits 0~3, Bits 4~7, Bits 8~11, Bits 12~15, Bits 16~19, Bits 20~23, Bits 28~31), the impact of the bit position on the prediction error in a hardware CNN can be explicitly quantified.
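The 4-bit perturbation of Loop 1 can be sketched as a string operation (a Python illustration of the MATLAB® subroutine; treating bit 0 as the LSB, i.e., the rightmost character, is an assumption about the figure's convention):

```python
def perturb_bits(weight_bits, rram_bits, group_start):
    """Loop 1: replace one 4-bit group of the 32-bit software-trained
    weight string with RRAM-encoded bits; the other 28 bits are kept.
    group_start is the bit index of the group's LSB (0, 4, ..., 28)."""
    assert len(weight_bits) == 32 and len(rram_bits) == 4
    hi = 32 - group_start  # string index one past the group's MSB
    lo = hi - 4
    return weight_bits[:lo] + rram_bits + weight_bits[hi:]

w = "0" * 32
print(perturb_bits(w, "1111", 0))   # Bits 0~3 (LSB group) replaced
print(perturb_bits(w, "1111", 28))  # Bits 28~31 (sign/exponent group) replaced
```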

Electrical Characterization Results and Discussion
The I-V characteristics of the fabricated RRAM device are shown in Figure 13a for 100 switching cycles. It is apparent that while the bipolar switching is gradual and consistently observed, the I-V curves for the different HRS states show significant overlap. While such substantial overlap in the resistance values is undesirable from a CNN hardware implementation point of view, the experimental data sets are used here only to illustrate the methodology proposed in Section 3. With further optimization of the device stack, fabrication process, and operating conditions, it should be possible to realize more confined and distinguishable resistance patterns in the future. The current measured at a read voltage of VREAD ~150 mV for a single tested device is plotted in Figures 13b and 14a after every SET and RESET sweep, keeping the pulse period fixed at 200 ns. The corresponding cumulative density function plots of the 8 resistance states (in terms of conductance) are shown in Figure 14b. Multiple resistance states are attained here by gradually increasing the magnitude of the RESET pulses from −1.2 V to −2.2 V in steps of −0.1 V. The gradual RESET process in this device stack may enable the realization of a multi-bit cell in the future, if the stack is further optimized to ensure more spread-out and discrete patterns of resistance variation with minimal overlap.

Simulation Results and Discussion
This section examines the results of the simulation of the AlexNet CNN Layer-1, involving computation of the output matrix SOP for the two different pipelines shown in Figure 15 (reproduced from Figure 6). The value of the matrix SOP from the purely algorithm-trained weights is denoted as SCONV, while the corresponding value from the RRAM-encoded LUT is denoted as HCONV. We introduce a term called the relative error (RE), defined as the modulus of the difference between HCONV and SCONV, normalized by the value of SCONV.
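The RE metric defined above can be written directly (a one-line Python sketch; the example values are hypothetical, chosen only to contrast a small LSB-like perturbation with a large sign/exponent-like one):

```python
def relative_error(h_conv, s_conv):
    """RE = |HCONV - SCONV| / |SCONV|, as defined in the text."""
    return abs(h_conv - s_conv) / abs(s_conv)

# A flip in a high-order (sign/exponent) bit changes the decoded weight
# far more than an LSB flip, so HCONV diverges much more from SCONV.
print(relative_error(1.5, 1.0))    # LSB-like perturbation  -> 0.5
print(relative_error(-3.0, 1.0))   # sign-flip-like perturbation -> 4.0
```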
Figure 16 lists the values of RE for the R-G-B data considering different 4-bit positions along the 32-bit string and the different HRS levels (HRS-1 corresponding to a shallow reset, all the way to HRS-7 for a deep reset) attained by fine-tuning VRESET-STOP. When examining the impact of the bit positions where RRAM variability is incorporated, we see an increase of more than two orders of magnitude in the RE going from B3-B0 to B31-B28. This steep increase in error is expected, given the increasing numerical significance of the bits as we move from right to left in the 32-bit string. Note that the value of RE is substantially high for B31-B28, as the sign and exponent bits play a critical role there.
When comparing the impact of the different HRS levels, a notable decrease in RE when transitioning from HRS-1 to HRS-7 is observed only for B31-B28, and not for the other bit positions. Ideally, a similar reduction in RE should be observed irrespective of the bit positions analyzed. These expected trends are obscured by the large overlaps in the resistance distributions of the different HRS states in our current fabricated device. When a more robust device with distinct HRS states is demonstrated in the future, the reduction in RE should become apparent for all bit positions, though the magnitude of the reduction will vary.

Conclusions
This study presented a comprehensive methodology for assessing the impact of variability in the RRAM resistance distributions for the low and high resistance states on the image classification error, based on a synaptic weight representation using the 32-bit floating-point format, with 4 bits at a time taking values from the hardware implementation while the remaining 28 bits retain values from the pre-trained AlexNet CNN framework. A significant increase in the relative error (of the computed matrix SOP) from the least significant bits (LSB) to the most significant bits (MSB) is observed. The error value is particularly high for the MSB group, as it carries the exponent and the sign bit of the weight. It is also evident from the CNN simulations that the ability to reach deeper reset states more consistently enables a significant reduction in RE.
The proposed LUT-based analysis proves to be a useful technique when there is a need for quick validation of RRAM device performance and its impact on a large-scale CNN through simulation, rather than having to fabricate a large array of devices to quantify the actual loss in image classification accuracy. In general, device engineers fabricate only a handful of devices for a specific suite and combination of process parameters. They use these small arrays of RRAM devices to measure the electrical characteristics and switching performance over a few hundred to a thousand cycles. Such small arrays of RRAM samples are certainly not sufficient to construct a fully functioning deep learning (DL) hardware platform. Considering the expense and time required for full-fledged array-level fabrication and the needed characterization setup (more so at academic institutions with limited facilities), it is vital for device engineers to be able to adopt a "short cut" approach to assess and validate their device performance in a practical CNN-scale setting with minimal effort from a simulation framework point of view. This is where the proposed LUT-based framework comes in handy.
One can use a small array of RRAM device switching data, extract the switching resistance state distributions, fit them with a Lognormal model, generate a large LUT-based data set, and quickly test the performance of the device for a specific deep learning application by plugging the LUT into the MATLAB® simulation framework available for deep learning CNNs. These quick simulations allow one to effectively quantify the prediction accuracy that can be expected from the fabricated devices. Our framework serves as a good intermediary step for assessing whether to proceed with the existing process flow for a full array fabrication or to go back and optimize the RRAM process to attain better switching distributions. Our approach clearly enables one to quantify the impact of different memory windows and variations (and also a different number of states, e.g., 4-bit/8-bit) on the image classification error using a simple set of devices (it could even be an isolated set of MIM capacitors that were measured).
There are several possible improvements to be considered in subsequent studies. The current test setup arbitrarily defines a fixed threshold to classify any sampled resistance value as Logic-0 or Logic-1. It may be worth exploring different definitions of the threshold levels to examine the changes in the classification errors. As discussed earlier, the fabricated stack in this study shows large variations in resistance that result in significant overlap of the HRS and LRS distributions. The stack will be further optimized to improve the memory window, through fabrication, the choice of different switching voltages/pulse waveforms, and/or material engineering. Additionally, the impact of read disturb, random telegraph noise, and endurance/retention degradation on the HRS/LRS distributions, and their indirect influence on the classification accuracy of a real-scale CNN, will also be a subject of study in the near future.

Figure 1 .
Figure 1. Illustrating the procedure of convolution in a convolutional neural network (CNN).

Figure 2 .
Figure 2. Part of the neural network architecture with a fully connected layer, with the input layer denoted by the nodes x1~x5 and the output layer denoted by S1~S5. Here, as an example, with matrix multiplication, all the outputs are affected by x3 (shown in blue).

Figure 3 .
Figure 3. Typical I-V plot for the bipolar switching trends in a resistive switching memory device. The image here is adapted and modified from Reference [13], Copyright 2015, John Wiley & Sons.

Figure 5 .
Figure 5. (a) Top view of the cross-point device under an optical microscope, with a cross-sectional view of the schematic of the structure in the inset. The bottom electrode (BE) consists of an inert electrode, and the top electrode (TE) serves as the active electrode, which essentially functions as an oxygen reservoir to facilitate oxygen migration in and out of the oxide layer. This allows the switching behavior to be predominantly governed by oxygen ion movement instead of Joule heating. (b) Three-dimensional (3D) view of the cross-point region (see red rectangle) of the structure, taken using an atomic force microscope.

Step 3
There are two parallel pipelines here. The first (MATLAB® array) reads the weights (software-trained using backpropagation) from a local drive into the program cache for further processing. The second pipeline (RRAM LUT) represents these software-trained weights in a floating-point format and encodes each bit as a Logic-0 or Logic-1 using the measured resistance distributions of the RRAM in the low and high resistance states, based on a comparison of a randomly sampled resistance value from the best-fit LRS/HRS resistance distributions with a set threshold resistance value (RTH). The resulting floating-point representation of the RRAM-encoded weights is then passed on to the next stage. The output size of both pipelines is 11 × 11 × 96 (filters) × 3 (RGB). It is in the second pipeline that the physical variability (measured by the resistance of the low (LRS) and high resistance states (HRS)) inherent in the fabricated RRAM device is incorporated into the CNN.

Figure 6 .
Figure 6. Flowchart showing the implementation of the resistive random access memory (RRAM) based look-up table (LUT) for error quantification of a hardware implementation of the CNN using the AlexNet schema, with Layer 1 being the convolution layer where the LUT is incorporated.

Figure 8 .
Figure 8. RRAM resistance coded to digital Logic-0/1, with the false Logic-0/1 conditions explained on a normal pdf plot of the low resistance state (LRS) and high resistance state (HRS) resistance data.

Figure 9 .
Figure 9. The blue-shaded 4-bit string represents the bits where RRAM resistance values are encoded, and the remaining 28 bits represent the unchanged binary version of the algorithm-trained weight.

Figure 10 .
Figure 10. The coding scheme consisting of 7 different HRS states (one at a time) and 1 LRS state for the LUT generation. Attaining different reset conditions allows for the examination of the memory window impact on the prediction error, also accounting for the higher resistance variability at deeper reset conditions.

Figure 11 .
Figure 11. Total of 96 AlexNet Layer-1 filters, with 48 filters each from the upper and lower pipelines.

Figure 12 .
Figure 12. Iteration repeated for the red, green and blue data of the pixels in the classified image pattern.

Figure 13 .
Figure 13. (a) DC I-V characteristics of the HfOx-based RRAM device with abrupt SET and gradual RESET behavior. The grey curve indicates the forming process on the pristine device. (b) Endurance cycling performed with a single SET pulse and multiple RESET pulses (−1.2 V to −2.2 V in 0.2 V intervals) for each cycle.

Figure 14 .
Figure 14. (a) Closer look at the current jumps during the endurance test from the 700th to the 820th pulse cycle. (b) Cumulative probability of the resistance states (1 LRS and 7 HRS states) achieved by successively increasing the RESET voltage pulse amplitude (−1.2 V to −2.2 V) during the multi-step RESET process, based on the pulsing scheme shown in the inset of Figure 13b.

Figure 15 .
Figure 15. Definition of the relative error (RE) in the computed matrix sum of products for the software and hardware convolution pipelines. The software pipeline refers to the algorithm-trained weights, while the hardware pipeline refers to the binary data encoded from the measured RRAM resistance distributions.

Figure 16 .
Figure 16. Bitwise relative error for all 7 HRS values, listed for different 4-bit positions along the 32-bit binary representation of the floating-point value of the CNN weights. The error values are listed separately for the red, green and blue colors.