Study of Quantized Hardware Deep Neural Networks Based on Resistive Switching Devices, Conventional versus Convolutional Approaches

: A comprehensive analysis of two types of artiﬁcial neural networks (ANN) is performed to assess the inﬂuence of quantization on the synaptic weights. Conventional multilayer-perceptron (MLP) and convolutional neural networks (CNN) have been considered by changing their features in the training and inference contexts, such as number of levels in the quantization process, the number of hidden layers on the network topology, the number of neurons per hidden layer, the image databases, the number of convolutional layers, etc. A reference technology based on 1T1R structures with bipolar memristors including H fO 2 dielectrics was employed, accounting for different multilevel schemes and the corresponding conductance quantization algorithms. The accuracy of the image recognition processes was studied in depth. This type of studies are essential prior to hardware implementation of neural networks. The obtained results support the use of CNNs for image domains. This is linked to the role played by convolutional layers at extracting image features and reducing the data complexity. In this case, the number of synaptic weights can be reduced in comparison to MLPs.


Introduction
Resistive switching devices based on the conductivity modulation of a dielectric thin layer are a variety of a broader class of electron devices known as memristors [1]. They show great potential from the integration viewpoint since they can be easily scaled within a CMOS compatible technology framework; besides, they show good endurance, retention and low power operation [2][3][4]. These devices present an overwhelming potential for applications linked to non-volatile memories, physical unclonable function implementation and neuromorphic computing. The latter research field is gaining momentum due to the limitations of current computing paradigms affected by the slowing down of Moore's law device scaling, the effects connected to the physical separation of data processing units and memory (von Neumann bottleneck) and the steadily growing performance gap between memory and processors (known as memory wall) [5].
Within the most promising alternatives in neuromorphic computing, memristors constitute the key part of circuits designed to greatly accelerate the repetitive and energy costly operation behind artificial neural networks training and inference. Memristor features allow to mimic biological synapses, a key component to build an efficient native hardware platform for ANNs based on matrix-vector multiplication circuits [2]. The implementation of these circuits can be easily performed by means of memristor crossbar arrays, taking into consideration their non-volatility and scalability [2,3,[5][6][7][8][9][10]. Matrixvector multiplication is very convenient when dealing with large-scale data processing, as it is the case of deep neural networks (DNN).
It is known that, due to the intrinsic variability of resistive switching devices [2,[11][12][13][14][15], a well-conceived circuitry is needed for memristor multilevel operation. In this approach, multilevel device operation is provided as the basis for hardware quantized ANNs; i.e., networks with quantized synaptic weights and biases. This is the key difference between the ANN hardware implementation and their software counterparts. When using this hardware approach to reduce the power consumption produced by the separation of memory and processing units in CPUs and GPUs, quantized weights will come into play and consequently new adapted ANN architectures and training strategies will be needed [16]. In this context, at the device and circuit level, there have been contributions that paved the way presenting technological alternatives, e.g., a row-by-row parallel program-verify scheme on a fabricated 160-Kb RRAM array employing an incremental gate voltage programming methodology [17], a binary synaptic learning scheme that benefits from the set and reset voltage variability of CMOS integrated 1T1R structures [18], a multilevel cell scheme in H f O 2 -based resistive memory (RRAM) arrays [19], etc.
It is important to state that a determinant breakthrough in the field of machine learning took place in the 2000-2010 decade due to a step forward in the efficiency and generalized use of ANNs [16]. ANNs have been successfully used recently in many fields, such as smart cities [20][21][22], biology [23][24][25] and medicine [26][27][28]. The CPU performance increase, the GPU massive use in artificial intelligence and the availability of well-structured data for training and testing helped ANNs to take over in the machine learning realm. The power-hungry GPUs and the need for intensive use of on-chip buffers and off-chip DRAM in conventional ANN operation presents a high energy cost. In particular, on-chip buffers (×6) and external DRAM cells (×200) consume more energy than processors register files (×1, normalized energy cost in a typical neural network accelerator [16]). The von Neumann limitations reported above and this energy consumption problem pose important difficulties for the ANN medium-and long-term development. Neuromorphic computing can provide solutions, that is why great efforts are going on to provide hardware platforms to enhance ANNs [3,6,16,29,30] and develop the technology for edge computing linked to the Internet of Things era that will come with the deployment of 5G networks. In order to shed light on some of the issues raised above, we have analyzed the role of quantization on different types of hardware ANNs. For this purpose, we have considered resistive switching devices based on H f O 2 dielectrics as the reference technology. The devices show filamentary charge conduction and bipolar operation. We have employed several strategies to implement multilevel conductance modulation, accounting for different sets of conductance levels (2, 3, 4, 5 and 8 levels). The viability of this multilevel implementation led us to study quantization at the ANN level in order to assess the possibilities of this technology in the neuromorphic circuit context (we employed from 2 to 8 levels for the sake of completeness). We did so by studying different DNN architectures (by changing the number of hidden layers and neurons in each layer), both conventional multilayer perceptron and convolutional neural networks were employed. The influence of the number of quantized levels, the dataset employed under a supervised learning scheme, etc., on the ANN recognition accuracy, was analyzed. We worked at the inference level; i.e., quantization was taken into consideration after a conventional training process without synaptic weight discretization. In this context, it is important to highlight that it has been proven that only a 3-bit precision was needed to encode state-of-the-art networks [16], this corresponds to the higher number of levels analyzed here. From another viewpoint, quantization could be beneficial to deal with common problems in the ANN realm, such as overfitting.

Device Fabrication and Measurement Set-Up, a Multilevel Approach
The devices used to define experimentally the different sets of conductance levels (2, 3, 4, 5, and 8), in order to implement the quantized weight values for the artificial synapses included in the ANN architectures considered here, are 1T1R RRAM cells integrated in 4-kbit arrays [31]. These resistive switching cells are constituted by a NMOS transistor (manufactured in 0.25 µm CMOS technology) connected in series to a metal-insulatormetal (MIM) structure placed on the metal line 2 of the CMOS process (Figure 1). Such a MIM structure consists of a TiN/H f O 2 /Ti/TiN stack with 150 nm TiN top and bottom electrode layers deposited by magnetron sputtering, a 7 nm Ti layer (under the TiN top electrode) and a 8 nm H f O 2 layer grown by atomic layer deposition (ALD) [32]. The MIM structures have an area of about 0.4 µm 2 . The endurance and retention of these devices has been studied in order to show the viability of this technology both for memory and neuromorphic applications [11,33,34]. The algorithms employed to program the conductive levels on the devices are writeverify approaches. The cost in time of such approaches is justified due to the fact that the ANNs implemented aim to perform just the inference phase, where an accurate definition of the conductive levels is crucial. Most of the sets of conductive levels considered (in particular, 2, 3, 4, and 5) were defined by using the well-known Multilevel Incremental Step Pulse with Verify Algorithm (M-ISPVA) [35], which tunes the current target (I trg ) and gate voltage (V g ) parameters during set operations to achieve each conductive level within a specific set of conductive levels. To do so, we employ different pairs of parameters that allow the control of the redox chemical reaction kinetics that lead to the device conductance variation produced by the growth and destruction of conductive filaments that bridge the electrodes through the device dielectric. During the M-ISPVA programming, a sequence of increasing voltage amplitude pulses are applied either on the terminal connected to the top electrode of the MIM structure, during set operations, or on the source terminal of the transistor, during reset operations. In contrast, placing eight conductive levels in a read-out current range of about 50 µA requires a more advanced implementation. This algorithm, already tested in [36] and known as Incremental Gate Voltage and Verify Algorithm (IGVVA), keeps constant the pulse amplitude applied on the terminal connected to the top electrode of the MIM structure during set operations and increases step by step the V g value until I trg is achieved. The cumulative distribution functions (CDFs) of the readout currents measured on 128 RRAM devices for different sets of conductive levels, namely: 2, 3, 4, and 5; are shown in Figure 2. We perform the study of read-out current distributions since it directly characterizes the device conductance, which is the key magnitude to quantize in order to mimic biological synapses plasticity in neuromorphic circuits. Notice that the separation between the distributions is essential to avoid overlapping in the quantized levels. As it will be shown below, we will study the quantization in the synaptic weights of different neural networks to assess its influence on the network performance. Step Pulse with Verify Algorithm (M-ISPVA) algorithm [35]. LRS stands for low resistance state, this means that the memristors have a low internal resistance value, and HRS stands for high resistance state. These current levels, and therefore, the device conductance, are obtained by means of different sets of algorithm parameters. In this respect, the corresponding neural network weight quantization strategies can be linked to the algorithms chosen to electrically operate the devices.
By tuning the M-ISPVA parameters, the definition of several non-overlapping conductive levels, up to five, can be effectively performed, see Figure 2. In addition, Figure  3 shows the current CDFs corresponding to eight conductive levels defined by using the IGVVA.  . CDFs of the read-out currents measured for eight levels. The conductance levels were obtained by means of the Incremental Gate Voltage and Verify Algorithm (IGVVA) algorithm [36]. LRS stands for low resistance state, this means that the memristors has a low internal resistance value, and HRS stands for high resistance state.
A strong reduction of the device-to-device variability allows to define extremely narrow distributions and, therefore, eight non-overlapping conductive levels in a current range of less than 50 µA, see Figure 3. The exact determination of the device conductivity would require the detailed calculation of the current for a determined bias, this can be performed following different simulation schemes [37,38].

ANN Architecture Analysis, the Role of Quantization
As pointed out above, ANN have resurfaced and gained prominence due to several factors; among them, the emergence and fast development of DNNs. Because of this, the main features of DNN architectures have been characterized in-depth in the last few years [39][40][41][42]. Convolutional Neural Networks (CNN) are an alternative type of DNNs widely employed to tackle image recognition problems. Image recognition is a key activity in what we call perception; at this task, efficiency implies a fast and highly parallel data processing [43].

Convolutional Neural Networks
CNNs are based on the application of convolution functions: a mathematical operation on two functions ( f 1 and f 2 ) that produces a third function ( f 1 × f 2 ) expressing how the shape of one of them is modified by the other. It is defined as the integral of the product of the two functions after one is reversed and shifted. Then, the integral is evaluated for all the possible values of the shift producing the convolution function [44].
A CNN ( Figure 4) starts with a convolutional layer which requires a minimum of three parameters: stride, kernel and filter. Stride is the number of pixels shifts over the input matrix (e.g., when the stride is 2, then, we move the filters 2 pixel at a time). The kernel is a shared small weight matrix often used for pre-processing operations such as blurring, sharpening, embossing, or edge detection. Finally, several different filters can be used, thus having a multidimensional layer. Each filter is initialized differently, and can, therefore, find better or worse image features. After a convolutional layer, a pooling layer is generally applied to reduce the size of the image combining neighboring pixels of a certain image area into a single representative value. In our case, we used the maximum value of all the considered neighboring pixels, constructing a max-pooling layer. The backpropagation algorithm used in the CNN is in charge of reducing the value of inefficient filters while promoting those that are useful for the classification task [45]. As a final step in the CNN, a MLP is attached. To do this, the pooled features of the last pooling layer are flattened. The MLP can have its own architecture with various hidden layers.

… Flattened
Convolution 1 Convolution k Pooling 1 Pooling k Figure 4. CNN architecture. The input data are introduced into a convolutional layer (accounting for different convolutional filters to generate feature maps, that accentuate the unique features of the original image) followed by a pooling layer (that reduces the size of the image combining neighboring pixels of a certain image area into a single representative value). Multiple sets of convolutional + pooling layers can be applied as pre-processing. Finally, all outputs of the last pooling layer are flattened (from a matrix to an array) and used as inputs of a MLP.

Quantization Process
ANN synaptic weights were quantized to a fixed number of levels during inference. The quantization process is based on a quantile-based discretization function. All weights belonging to the same bin will have the same value: the average for the given bin. This process is coherent with the experimental multilevel scheme we have presented for the reference technology we are considering. The quantization process is used locally in each layer, that is, for the set of weights used as input in a given layer.

Experiments and Results
We designed several experiments divided into two main groups: fully dense multilayer perceptron, called MLP, and convolutional neural networks, called CNN, as detailed in Section 3. The neural networks used in this manuscript were trained using the default parameter values of the keras/tensorflow implementation (see Table 1 for a summary of the parameters used) without any regulation techniques or dropout method [44]. We selected categorical accuracy as the comparison metric between approaches. This metric calculates the frequency in which the prediction matches the labeled data [46,47]. For both sets of experiments, input data (both training and test) were first binarized, being 0 those values lower than a threshold of 0.1, and 1 otherwise. Although the experimental multilevel approach consisted of quantization on 2, 3, 4, 5, and 8 levels, we have considered all the possibilities in the 2-8 interval at the simulation level to sweep all the possibilities and enhance our study. We have interpolated the distance between levels accounting for the data we have for the multilevel schemes experimentally fulfilled.

MLP Architecture
We studied both deep and shallow fully dense architectures. The final dense layer of all neural networks has a so f tmax activation layer attached, while the hidden layers have a relu activation layer after each of them. We experimented with the customization of two parameters to create different architectures: the number of hidden layers (ranging from 1 to 4) and the number of neurons per layer (ranging from 8 to 128).

CNN Architecture
For the CNN experiments, we customized the CNN explained in Section 3.1: (1) number of convolutional layers (ranging from 1 to 3), (2) the number of filters used (ranging from 8 to 32), (3) size of the pooling matrix (ranging from 2 × 2 to 8 × 8), and (4) number of neurons in the hidden layer (ranging from 8 to 128). All convolutional layers are followed by an activation layer using the relu function. Only one hidden layer is explored in the MLP within the CNN since the previous layers are built to cope with the data set complexity. For all the experiments in this set we used a 3 × 3 kernel matrix. The CNN also uses a SGD optimizer with the same configuration mentioned for the MLP experiments.
The total number of synapses is, in general, higher for MLP architectures compared to CNN architectures, as seen in Figure 5. This is due to the convolutional and pooling layers incorporated in the architecture that preprocess the input data, reducing the complexity before the subsequent application of the dense layer, thus reducing the overall number of synapses.

Datasets
We used two datasets for our quantization study: MNIST [48] and Fashion MNIST [49]. The MNIST image dataset [48] is composed of 28 × 28 grayscale pixel images of 70,000 handwritten digits (labeled in the interval [0, 9]), divided into a training set (60,000 images) and a test set (10,000 images), see Figure 6. The Fashion MNIST image dataset [49] also consists of 70,000 28 × 28 grayscale images in 10 classes (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag and Ankle boot), divided into a training set (60,000 images) and a test set (10,000 images), see Figure 7. We chose to use the Fashion MNIST dataset since their authors claim that this dataset is more complex to learn than the original MNIST and better represent real-world tasks [49].

MLP Experimental Results
As introduced in Section 4.1, we experimented with different network configurations. For each configuration, we trained and tested with full floating-point precision. We also tested the neural network reducing the precision in the inference phase from 2 to 8 levels, to test its accuracy. Results are shown in Figure 8 for the MNIST dataset and in Figure 9 for the Fashion MNIST dataset.  Analyzing the experimentation results, we can infer that reducing the precision from float32 to low levels of precision (2 or 3 levels) highly impacts the overall accuracy in both datasets. When the number of quantization levels is higher (from 4 to 8 levels), the difference from the accuracy obtained using full float precision is reduced, particularly when the number of hidden units (neurons) used in each hidden layer is increased. It also seems that the minimum number of hidden units per hidden layer needs to be at least 64 in order to let a quantized MLP be comparable to the float32 precision accuracy in both training and test data.

CNN Experimental Results
As introduced in Section 4.2, we also experimented using convolutional neural networks. The results show in Figure 10 the 1-CNN (one set of convolutional layer + pooling layer) using the MNIST dataset, in Figure 11 the 2-CNN is shown (two sequential sets of convolutional layers + pooling layer) using the MNIST, in Figure 12 the 1-CNN is shown again making use of the Fashion MNIST dataset and in Figure 13, the 2-CNN results are shown using the Fashion MNIST dataset.     The obtained results suggest that when the number of conductance levels used is above 5, the CNNs share a common trend with the float 32 approach, but with a slight reduction in categorical accuracy. MNIST dataset, as opposed to the Fashion MNIST dataset, is greatly affected by the CNN architecture, particularly before the flattened and fully dense layers.
The Fashion MNIST dataset is harder to predict ( Figure 14) than the simpler MNIST dataset, as suggested by their authors [49]. Performing a Kruskal-Wallis rank-sum test to these data, we obtained a statistically significant difference (p-value = 1.02 × 10 −5 ) between the MLP and the CNN results in the test data for MNIST, supporting the idea that CNNs perform better in image domains. The significance obtained for its counterpart in the Fashion MNIST dataset (p-value = 0.147) is low, but not low enough to be statistically significant.
It is remarkable that for both databases, and in the two types of neural networks under consideration, reasonably similar results are obtained for quantizations with 6, 7 and 8 levels. This is not a minor issue since the reduction of the number of levels allows a less complex electronic circuitry to manage memristor quantization, and also permits a greater separation of the average values in the conductance interval employed to implement the synaptic weights. This latter consideration can lead to less overlapping of the conductance levels, and therefore, to a more precise network operation. It is also important to highlight that there is a long way to go in the study and optimization of the network size, since it is clear that for a low number of neurons in the hidden layers, and for quantization strategies with very few levels, the accuracy drops off significantly. All these considerations are reflected upon the size of crossbar arrays and the consequent matrix-vector multiplication circuits needed for the hardware implementation of these neural networks. As it is usual, a trade-off has to be considered.
Overfitting is a common problem in the field of neural networks and it is spotted also in our results ( Figure 14). Since our main goal is to compare the neural network behavior with different precision levels, we did not use any regulation techniques or dropout methods [44] to avoid it. However, due to the intrinsic variability of memristors linked in their stochastic operation, overfitting could be reduced significantly in comparison with software-based networks with continuous synaptic weights.
In the CNN realm, an efficient design of the convolutional layers might be the way to go to optimize the network hardware. The silicon area could be reduced if the hidden dense neural layer is optimized, due to the implications in the number of synapses that a shortening in the number of neurons can produce. In this case, even for the lower number of neurons in the hidden layer, the accuracy difference between the quantized and nonquantized cases is lower than in the other network type. Therefore, a quantized CNN approach to image recognition seems to be more appropriate (at least more flexible) from the hardware design viewpoint. Future work on this subject could be linked, among other issues, to the effects of variability on the memristive devices on the performance of the neural networks under study here.

Conclusions
An in-depth study of the influence of the number of conductance levels in quantized neural networks based on memristors has been performed. Different types of networks (multilayer-perceptron and convolutional) have been considered under a variety of features, such as the number of levels in the synaptic weight quantization process, the number of neurons in the hidden layers of the network topology, the image databases, the number of convolutional layers, etc. An exhaustive analysis was performed that led us to know that neural networks using a low number of synaptic weight levels require deeper and wider network architectures than those using higher precision programming algorithms. Different variables have to be taken into account in the network optimization since the number of quantized levels, the number of hidden layers and the number of neurons per hidden layers are closely related in determining the accuracy; therefore, a deep characterization is needed in each particular case prior to step in a hardware implementation process. The obtained results support the use of CNN for image domains, since convolutional layers perform a feature extraction process that reduces the complexity and the number of synaptic weights needed to achieve a neural network full potential. Thus, by reducing the number of synaptic weights needed, we can reach a comparable accuracy to their conventional multilayered-perceptron counterparts. Data Availability Statement: The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.