Resistive-RAM-Based In-Memory Computing for Neural Network: A Review

: Processing-in-memory (PIM) is a promising architecture to design various types of neural network accelerators as it ensures the efﬁciency of computation together with Resistive Random Access Memory (ReRAM). ReRAM has now become a promising solution to enhance computing efﬁciency due to its crossbar structure. In this paper, a ReRAM-based PIM neural network accelerator is addressed, and different kinds of methods and designs of various schemes are discussed. Various models and architectures implemented for a neural network accelerator are determined for research trends. Further, the limitations or challenges of ReRAM in a neural network are also addressed in this review.


Introduction
A major bottleneck in power consumption and performance in the Internet of Things is the data exchange between processor and memory [1].The commonly used von Neumann architecture separates the processing unit and memory, thus causing a large amount of data movement energy and data access latency [2].To overcome these limitations, memristor technology emerged, which is a type of Resistive Random Access Memory (ReRAM).Unlike commonly used Random Access Memory, it is able to provide nonvolatile storage.ReRAM also offers storage capabilities that are multi-state and have high endurance and high density with much faster access speed [3].With this, ReRAM is able to overcome the bottleneck faced by conventional CMOS (Complementary Metal Oxide Semiconductor)-based architecture [4].For example, SRAM (Static Random Access Memory), which is formed by cross-coupled CMOS inverters, has high energy leakage and is unable to efficiently support a wide range of operations in memory [5].
ReRAMs are also relatively small in size, making the integration of memory with computing circuits possible.This leads to an efficient architecture for learning algorithms and neural network applications [6].With the capabilities of ReRAM, it is a potential element for computing in-memory (CIM), also known as processing-in-memory (PIM).Many proposed ReRAM-based PIM architectures work in the direction of machine learning (ML) accelerators.This is because the ReRAM has the characteristics of an activationdependent dynamic passive analog device, making it ideal for modeling and updating synaptic weights [6].Moreover, ReRAM-based processors are able to significantly enhance the efficiency of Convolutional Neural Networks (CNNs) as they involve matrix vector multiplication (MVM), which is supported by ReRAM (Mittal, 2019).This enhancement is achieved because the ReRAM crossbar structure is able to efficiently improve the computational complexity of MVM from O n 2 to O(1) while eliminating the parameter fetching

Background and Related Work 2.1. Processing In-Memory Architecture
The idea of processing in-memory (PIM) was first introduced in the 1960s, where the first computer that used PIM architecture performed processing-in-memory by merging a collection of processing elements and Random Access Memory (RAM).Then, until the early 2000s, it was suggested to join this withthe Dynamic Random Access Memory (DRAM) structure to increase the capability of computation [12].However, the proposed idea did not gain much recognition in computer architecture as it failed to work practically due to the obstacle of poor technology development during that time.Currently, with the rapid development of technology, research dealing with PIM architectures has been vigorously conducted.
For a better understanding of the PIM architecture, we refer to von Neumann's architecture, in which memory and processing are two separate things.Data are fetched and stored between the CPU and memory, and instead of frequent memory accesses, the cache is introduced in the architecture [13].The PIM architecture applies the idea of reducing the access rate of memory, and computation is performed in memory rather than in the CPU to reduce the data movement [12].Due to the limits faced in the DRAM architecture, it became difficult to increase density and reduce energy consumption; therefore, two new approaches for memory system designs have been developed to carry out the PIM architecture, i.e., 3D-stack memory and byte-addressable resistive non-volatile memory (NVM) [12].
Han et al. [14] proposed a ReRAM-based PIM architecture named RPBFS to optimize the graph travelling, specifically on breath first search (BFS), which reduces the data movement on graph traversal.A segment of the memory bank was used to store the graph data, and there was a primary memory bank used to record the data and handle the graph travesal procedure.An algorithm applied on the RPBFS was developed by Han et al. [14] to describe the three phases: graph mapping in one bank, with the information updated in the primary bank; graph initialization, where the status bitmap was initialized; and BFS traversal, where both the graph bank and primary bank worked together.

Processing-In-Memory vs. Near-Memory Processing
The two approaches mentioned above can be further discussed with the terms in-memory and near-memory.As the name shows, near-memory processing (NMP) refers to the ability to perform computation near the data, thus achieving the aim of speeding up execution.NMP integrates computing cores or PIM logic such as accelerators close to the main memory [12,15].Three-dimensional stack memory, usually DRAM, contains multiple layers, which include a compute layer for the integration of computing cores [15,16].Conversely, PIM takes advantage of the original characteristics of the memory cells by interacting and sharing properties between cells themselves to perform the processing [12,15].Research works have been conducted on the PIM architecture in SRAM, DRAM, phase-change memory (PCM), Magnetic RAM and Resistive RAM, which is the main focus of this paper [12].Both PIM and NMP are used to reduce unnecessary data movements, but only simple operations such as arithmetic and Boolean logic operations are enabled in the PIM architecture; in addition, even NMP uses the instruction set architecture (ISA) on the computing cores, and the thermal issue of 3D-stack memory has also caused a problem [15].

Resistive RAM
Metal-oxide resistive RAM (ReRAM or RRAM) is a new memory technology that has been developed recently due to the disadvantages of SRAM and DRAM.Apart from ReRAM, PCM and Magnetic RAM are also being researched as new memory technologies.From [17][18][19], we have collated Table 1 to show the comparison between existing and modern types of technologies.We can see that ReRAM offers a fast write/read speed compared to other memory devices.Although SRAM has the fastest write/read speed, its cell area is considerably large (at least 25 times that of ReRAM).Compared with ReRAM, the Hard Disk Driver (HDD) has a higher endurance than ReRAM, and its cell area is smaller, but its write/read time is longer [17].In contrast, ReRAM also has a long duration and low programming voltage while maintaining the cell area [18].On the other hand, ReRAM has the disadvantage of high energy consumption compared to DRAM, SRAM, STTRAM and Flash, and its energy consumption per bit is tens or even hundreds of times that of other memory devices [19].However, overall, among all the emerging memory technology candidates, the performance of ReRAM is competitive.There are two main purposes of designing these memory technologies: (I) combining the advantages of SRAM and DRAM, which are the switching speed and storage density, respectively, and (II) the property of non-volatile flash memory [20].The fundamental structure of the ReRAM cell is generally a metal-insulator-metal form, which can be understood as a sandwich layer with the insulator in between two metal layers [20].It has various types of chemical compositions built on the three layers, and each layer can be formed with different electrodes for the top and bottom metal layers and oxide materials for the insulator layer [20].
As with other traditional electronic devices, the chosen type of electrode material has a huge influence on the performance of ReRAM.For example, ReRAM with a Cu electrode has a more stable resistive switching behavior than that with a Pt electrode [20].Generally, the electrode materials utilized by ReRAM can be divided into five different categories based on the composition, which are elementary substance electrodes, siliconbased electrodes, alloy electrodes, oxide electrodes and nitrite-based electrodes [21].The most frequently used materials are elementary substance electrodes, including Al, CU, Ag, Pt, etc. [22][23][24].
The state of the ReRAM cell switches with the rise of the conductive filament (CF) inside a dielectric.The CF is a nanoscale component connecting the top and bottom electrodes of the memory cell of ReRAM [25].According to the type of CF material, ReRAM can be grouped into two kinds: (i) metal ion filament-based ReRAM is referred to as conductive bridge random access memory (CBRAM), and (ii) ReRAM based on an oxygen vacancies filament is known as OxReRAM [25].
Due to the fundamental structure of ReRAM, there is a close relationship between ReRAM and the memristor, which was first introduced to ReRAM by HP in 2008 [26] and attracted a great deal of attention from electronics designers due to its nanometer size and special electrical properties [27].With the special properties of memristors, the burgeoning memristor-based ReRAM has competitive advantages of high density, low-power, goodscalability, and non-volatility [28].The concept of the memristor first came from memristor theory, which was put forward earlier in the 1970s.It describes the behaviors of components whose resistances change with the amount of charge flowing through [29], has dramatically influenced the design of ReRAM by providing further theoretical instructions of their behaviors and has been widely used by ReRAM scientists to develop devices and explain corresponding properties more conveniently [30,31].
Payam et al. [27] proposed a type of ReRAM based on the hybrid structure of the memristor and Complementary Metal-Oxide Semiconductor (CMOS).By storing data in the binary or non-binary logic, this design increases the quantities of stored data by increasing the volume of saved data per square area of a memory chip.In their experiment results, the proposed ReRAM design showed a comparable read time and a competitive power consumption with other RAMs [27].
The structure of ReRAM cells can be combined into a crossbar array without much difficulty and distributed into 3D architecture [32].Due to the crossbar array, ReRAM can perform computation and functions of memory [14].Though ReRAM has many advantages such as low energy consumption, high density and low operation voltage, it has its challenges as well, such as stability of performance and storage systems on devices with various types of materials [32].Chi et al. [33] proposed a PIM-based NN accelerator.The accelerator utilized the ReRAM crossbar arrays that can be used to implement Matrix-Vector Multiplication (MVM) in fully connected layers by preprogramming the weight matrix in ReRAM cells to achieve acceleration.The vector is represented by input signals on the worldlines (WLs), each element of the matrix is programmed into the cell conductance in crossbar arrays, and the result of the MVM can refer to the accumulation of the current flowing to the end of each bit line (BL).Because the synaptic weights have already been pre-programmed in ReRAM cells and do not require data to be fetched from the main memory during computation, this highly improves computing efficiency.
Qiu et al. [34] proposed a novel ReRAM crossbar-based (RCA) CNN accelerator, Re-siRCA, due to the difficulty of sensing-match latencies in energy-harvesting IoT nodes and the great enhancement of NN accelerator performance in ReRAM.ResiRCA enables the accelerator to fit the energy-harvesting scenario and performs multiplication-andaccumulation (MAC) computations.With the support of the design of lightweight hardware and low power, ResiRCA is allowed to perform scalable computations.Besides, Qiu et al. [34] proposed the ResiSchedule approach, which merges the functions of sequential and pipelined computation modes in order to realize the flexibility of computation scheduling.The approach uses three techniques to achieve high throughput: loop tiling-based computation, ReRAM duplication and inter-layer pipelining.

Neural Network
The Neural Network (NN) or so-called Artificial Neural Network (ANN) was developed based on the human brain's neural system.It is highly related to Artificial Intelligence (AI).AI is known for its anthropomorphic behavior and ability to learn.Studies on the biological neural system have provided a modeling algorithm for the NN, with the ability to undertake non-linear and parallel tasks such as learning, pattern recognition, memorizing, sensation and control [35].
As mentioned above, the NN architecture uses the idea of the human brain; hence, it is a directed acyclic graph that contains nodes and edges [36].Figure 1 shows the layers of the NN architecture.L n refers to the n th layer, while the I s are the inputs and Q s are the outputs or results.The layers between the input and output layers are the layers used to perform computations, which are considered as hidden layers or so-called black boxes [37].

Neural Network
The Neural Network (NN) or so-called Artificial Neural Network (ANN) was developed based on the human brain's neural system.It is highly related to Artificial Intelligence (AI).AI is known for its anthropomorphic behavior and ability to learn.Studies on the biological neural system have provided a modeling algorithm for the NN, with the ability to undertake non-linear and parallel tasks such as learning, pattern recognition, memorizing, sensation and control [35].
As mentioned above, the NN architecture uses the idea of the human brain; hence, it is a directed acyclic graph that contains nodes and edges [36].Figure 1 shows the layers of the NN architecture.Ln refers to the n th layer, while the Is are the inputs and Qs are the outputs or results.The layers between the input and output layers are the layers used to perform computations, which are considered as hidden layers or so-called black boxes [37].There are many existing research works on NN accelerators based on ReRAM that have utilized the memory wall itself and no extra components, greatly reducing the complication of hardware design [38].Although NN accelerators work well in ReRAM-based architectures with significant efficiency, NN accelerators were unable to fully utilize the advantages of ReRAM due to the inability to satisfy the huge requirements for PE communication, resulting in a communication bottleneck.Therefore, Ji et al. [39] proposed a ReRAM-based NN accelerator based on FPGA, a reconfigurable architecture, to solve the bottleneck problem.Ji et al. [39] found that the weight mapping was imbalanced, where 0.028% of PEs performed 12.5% of computations; therefore, the weight layers were adjusted by duplicating them.

ReRAM Based Processing-In-Memory Architecture vs. Memristor-Based Neural Network Circuit of Associative Memory
The mainstream opinion [40][41][42] is that it is better to combine the work on ReRAM and the memristor into one field, since plenty of designs of ReRAM are actual instantiations of the memristor [43].Hence, apart from the concept of ReRAM-based PIM architecture, other technology based on the memristor, such as the Neural Network circuit design of associative memory, have also attracted researchers' attention.In contrast to the idea of There are many existing research works on NN accelerators based on ReRAM that have utilized the memory wall itself and no extra components, greatly reducing the complication of hardware design [38].Although NN accelerators work well in ReRAM-based architectures with significant efficiency, NN accelerators were unable to fully utilize the advantages of ReRAM due to the inability to satisfy the huge requirements for PE communication, resulting in a communication bottleneck.Therefore, Ji et al. [39] proposed a ReRAM-based NN accelerator based on FPGA, a reconfigurable architecture, to solve the bottleneck problem.Ji et al. [39] found that the weight mapping was imbalanced, where 0.028% of PEs performed 12.5% of computations; therefore, the weight layers were adjusted by duplicating them.

ReRAM Based Processing-In-Memory Architecture vs. Memristor-Based Neural Network Circuit of Associative Memory
The mainstream opinion [40][41][42] is that it is better to combine the work on ReRAM and the memristor into one field, since plenty of designs of ReRAM are actual instantiations of the memristor [43].Hence, apart from the concept of ReRAM-based PIM architecture, other technology based on the memristor, such as the Neural Network circuit design of associative memory, have also attracted researchers' attention.In contrast to the idea of ReRAM-based PIM architecture for reducing the access rate of memory and accomplishing all principal computational tasks in memory [12], the memristor-based Neural Network circuit of associative memory has mainly tended to store the intermediate state data for Artificial Neural Networks [44].It provides auxiliary functions to simulate human brain neurons to construct more complex Neural Network computing circuits [45,46].Sun et al. [47] proposed a novel design of the memristor-based Neural Network circuit of Pavlov associative memory, which is a composite system based on a memristor that provides significant biological synaptic functions.This technique simulates learning functions in Artificial Neural Networks by mimicking more realistic situations of memory.In [48,49], the authors further implemented the simulation of the forgetting process among neurons by realizing multiple generalization and differentiation in the memristor-based Neural Network circuit of associative memory.

Research Methodology
This paper aims to perform a review of ReRAM-based in-memory processing or computing architecture for neural networks.Challenges and recent research trends regarding Neural Networks using ReRAMs are presented, and various articles and journals relating to the topic are thoroughly studied.The materials were carefully selected based on some criteria listed below.

Search Strategies
To ensure the review did not lose focus, potential articles and journals were selected based on related keywords.The search string used was as follows: ("ReRAM" OR "Resistive RAM" OR| "RRAM") AND ("neural network architecture" OR "NN" OR "neural network") AND ("in-memory computing" OR "processing-inmemory" OR "in-memory processing") AND "accelerator".
IEEEXplore and ACM (Association for Computing Machinery) Digital Library were the main sources of our selections, as they are able to provide large amounts of quality literature and research papers.However, some articles outside of these sources were also considered with given reliable and quality content.

Inclusion Criteria
Any articles or research papers with topics regarding ReRAM and Neural Networks were included, specifically those on ReRAM-based accelerators for Neural Networks and also their limitations.Studies on improving Neural Network accelerators using ReRAM or overcoming ReRAM's challenges were emphasized when reviewing various articles.

Exclusion Criteria
Articles and journals unrelated to Neural Networks or ReRAM were excluded.This was done to focus on the topic and prevent side tracking.In addition, any outdated articles and journals were not included.

Study Design
The study design of this research is a qualitative systematic literature review.This method was chosen as the purpose of this research is to review the direction of recent studies on NN, specifically on ReRAM-based PIM architecture.This method was conducted in a few steps.Firstly, in the planning stage, research purposes were identified, and a review protocol was drafted and developed.In the conducting stage, research papers for review were selected according to the inclusion and exclusion criteria.Data were extracted by following the guidance in section F (data extraction), and then the quality of the studies was assessed as stated in section E (assessment of quality).Lastly, in the execution stage, data and information were analyzed and distributed.

Assessment of Quality
The quality of the journals or research papers was evaluated by two reviewers.Each reviewer had to check the titles, abstracts and publication dates to meet the inclusion criteria.To ensure the latest information in the research, journals and research articles earlier than 2015 were rejected.

Data Extraction
As mentioned, with the aim of reviewing trends and challenges, the following information was extracted: 1.
Publication details: The names of the authors and publication year; 2.
Model: An overview of the novel model proposed in existing research;

3.
Method: The detailed steps and frameworks used for the models to setup experiments; 4.
Experiment results: Statistical output of each model for comparison and discussion on the effects; and 5.
Future work and challenges: The difficulties faced and flaws detected for future improvements.

Outcome
The outcome of this review paper is to determine the recent research trends of Neural Networks based on ReRAM-based PIM architecture and the existing limitations of ReRAMbased PIM architecture.

ReRAM-Based System Architecture
Long, Na and Mukhopadhyay [2] proposed an RNN accelerator design using ReRAMbased PIM architecture that is different from previous studies where the accelerators were designed for CNN.The purpose of the design of the RNN accelerator was the different features between the CNN and RNN.In the CNN, the inputs are computed independently, and the outputs are based only on its current inputs.However, the outputs are based not only on current inputs but also on previous outputs for RNN.In a previous study [50], the accelerator was only supported for the basic RNN; then, they further proposed an accelerator that supports long short-term memory (LSTM) and the gated recurrent unit (GRU).
The architecture was designed in a similar way to the previous PIM architecture.The memory bank was divided into three sections, which are the memory, buffer and processing subarrays.To make the PIM architecture suitable for the RNN, the processing subarrays were then further divided into three sections again: (1) a ReRAM crossbar subarray for matrix-vector multiplication (MVM), (2) special function unit (SFU) subarrays for non-linear functions, and (3) multiplier subarrays for element-wise operations.In contrast to other works, Long, Na and Mukhopadhyay [2] focused on circuit design: a two-stage Op-amp was used to reduce power consumption, and 16-cycle DAC was used to compute matrix-vector multiplication since the 16-bit fixed point number was divided into 16 individual bits.
The experiment was conducted to evaluate the performance of two RNN applications: Natural Language Processing (NLP) and Human Activity Recognition (HAR).The performance was evaluated on RNN, LSTM, and GRU bases, and the whole experiment was performed on an NVIDIA GPU, with the baseline given.The evaluation of computing was compared with the GPU baseline.As a result, the computing efficiency of this system improved by an average of 79 times, and there was a slight reduction in the accuracy when the standard deviation was less than 0.2.
There are a few challenges when using ReRAM with the RNN accelerator.It is not practical to compute matrix-vector multiplication (MVM) and the element-wise operations with the ReRAM crossbar, as it is the most expensive operation.Furthermore, unnecessary time was spent to wait for the previous output to be calculated from the hidden layers due to the time dependency [2].

PSB-RNN
Challapalle et al. [51] proposed a ReRAM-based PIM architecture for an RNN accelerator using the block circulant compression approach.The approach achieved the benefits of reducing the computational complexity and space complexity by decomposing the operations into Fourier transforms and point-wise operations.However, the accuracy was reduced slightly.The decomposed operations then were formulated into in-situ Multiply and Accumulations (MAC) operations to provide high throughput.Apart from the block circulant compression, systolic dataflow was employed for communication, improving the energy efficiency and avoiding global data flow.Due to the complex operations, the weight matrices of the block circulant were not able to be mapped directly onto the ReRAM crossbars.Therefore, the mapping was conducted based on the in-situ MAC operations.
The main operation used in this study was matrix-vector multiplication (MVM).The input vectors and previous hidden state vectors in MVM were multiplied with the weight matrices of the block circulant to obtain the output vectors.As mentioned before, MVM was decomposed into discrete Fourier transform and point-wise operations by the fast Fourier transform (FFT) algorithm.The input vector and hidden state vectors were required to be converted into the frequency domain, where its values were pre-calculated and mapped to the crossbar arrays.The procedure for MVM operations was first to compute the input and previous hidden state vectors using FFT and then perform the block-wise element multiplication.Next, the partial products were accumulated, and lastly, the inverse FFT was performed to obtain the MVM result.
Challapalle et al. [51] conducted their experiment by evaluating the computational efficiency.Tensorflow was employed for the implementation of baseline LSTM and GRU networks.In this experiment, CPU and GRU power consumption and the performance evaluation of PSB-RNN were compared with other accelerators, specifically C-LSTM, a Field-Programmable Gate Array (FPGA)-based block circulant and a custom ReRAM crossbar as the baseline architectures.The performance was evaluated on NLP, HAR and TIMIT.PSB-RNN achieved a 44-fold improvement on average in computational efficiency compared to the FPGA-based method and a 17-fold improvement compared to the ReRAM crossbar-based method.

ReRAM Memory Wall Accelerator
Chi et al. [33] proposed a PIM-based NN accelerator by simply utilizing the ReRAM crossbar arrays, namely PRIME.Instead of adding an extra processing unit for computation, PRIME takes part of the memory arrays to serve as the accelerator.It fully utilizes the advantages of ReRAM and the PIM architecture in terms of computation capability and accelerates the processing of data.There are other types of memory technologies that can be used for such research, e.g., 3D-stack memory.However, PRIME uses ReRAM instead of 3D-stack memory due to the ease of manufacturing as well as uncomplicated logic of ReRAM.Moreover, the high cost and thermal issues of 3D-stack memory came into consideration.Some other works made use of non-volatile memory technologies to build ternary content addressable memory (TCAM), in which it is necessary to redesign the memory cells to a larger size to support the search operation, which is costly.Therefore, for PRIME, it was decided not to retain the design but to be able to support more advanced computations.
PRIME divided ReRAM into three sections, i.e., memory arrays that are only capable of storage, full function (FF) arrays that contain both computation and storage capabilities and buffer arrays that work with FF arrays as a data buffer.There was a total of 66 subarrays for 2 FF subarrays and 1 buffer subarray per bank by taking 8 levels and 2 levels of input voltage for computation and memory, respectively.By following the input and synapse composting schemes, for computation, 6 bits of dynamic fixed points were given to the input and output, and 8 bits were used for weights.The experiment was conducted by comparing PRIME with pNPU-co and pNPU-pim, using the parallel NPU as a secondary processor and NPU with 3D-stack memory instead of ReRAM.
In addition, four aspects were compared, i.e., speedup performance, execution time breakdown, energy saving and energy breakdown.PRIME had the highest speedup, achieving a 4.1-fold increase compared to pNPU-pim-x86.In contrast, pNPU-co had the lowest result.In the evaluation of execution time breakdown, Chi et al. [33] separated the results into memory and computation.pNPU-pim reduced most of the memory access time from pNPU-co previously, whereas PRIME even reduced it to 0, thus reducing the processing time.Due to the reductions in accessing the memory, PRIME saved a great deal of energy.On the other hand, the energy consumption breakdown of pNPU-co reached almost 100%; however, PRIME's consumption was estimated to be not more than 2%.Generally, PRIME improved the overall performance ~2360-fold and showed an a ~895-fold reduction in energy.

CNN Accelerator Reusing Data at Multiple Levels
Convolutional Neural Networks (CNN) have contributed greatly towards artificial intelligence applications such as image processing and image classification or image recognition.In addition, ReRAM has great potential for the development of in-memory processing techniques, e.g., CNN accelerators.However, writing to ReRAM takes substantial amount of time compared to reading from ReRAM.Therefore, it is better to map weights from CNN filters to ReRAM and keep it constant until there are any further updates to the weights [52].This has also caused dataflow control methods and weight mapping schemes to become significant factors in improving processing efficiency.
Luo et al.
[52] came up with a ReRAM CNN accelerator called FullReuse.This method aims to achieve the maximum utilization of reusability of weights, input and output data.CNN computing operations are nearly all Multiply and Accumulations (MACs).Out of the total computation in convolutional layers, 90% of them are MACs [53].This results in massive movements of data due to the large number of weight filters, which also causes large energy consumption.Hence, it is more efficient if data such as input data can be reused instead of reading data from off-chip memory and sending it to ReRAM arrays.
There are a few ways to reuse input data such as convolutional reuse, filter reuse and Input Feature Map (IFM) reuse.Convolutional reuse is where each IFM pixel, during convolution with a filter, is reused except for pixels on the edges.As for filter reuse, when the convolution of multiple IFMs takes place on a filter, the filter weights are able to be reused.While IFM reuse indicates the reuse of each IFM pixel when the particular IFM is convoluted with various different filters, FullReuse uses all data reuse types mentioned, also implementing horizontal and vertical convolutional reuse.To maximize the reuse of data, 3 × 3 processing elements contain the mappings of filter weights.This shows that this system can simultaneously execute three convolution operations.
The evaluation of FullReuse is done on a VGG-8 network using a 32 nm DNN+NeuroSim framework.To determine the performance and energy efficiency of the proposed method, CIFAR-10 images were used and processed for FullReuse.Then, the approach is compared to a baseline architecture that adopts a filter mapping scheme where data in the PE on the same row can be reused [54].The results show that in the layer with the most data movements, FullReuse is able to reduce energy consumption in the buffer and interconnect by 75% and 38%, respectively.As for processing speedup, FullReuse is 1.6 times faster than the baseline architecture.
6. Generative Adversarial Network 6.1.ReRAM-Based Accelerator for Deconvolutional Computation Z. Li et al. [55] proposed an improved accelerator, known as RED, to improve the execution and energy of deconvolution.Deconvolution is also known as transpose convolution, which is a new type typically used NNs, generative adversarial networks (GANs) and fully convolution networks (FCNs).Deconvolution involves two algorithms that are not seen in traditional convolution: (1) zero padding, which performs padding by adding extra zeros first and then performs convolution; and (2) padding-free, which contains three steps: multiplication, summation and cropping.However, there were some limitations found by Z. Li et al. [55] that led to ineffective computation.With that, Z. Li et al. [55] proposed a novel accelerator using the ReRAM crossbar for computation and optimization.
Figure 2 shows the inefficiency that appeared in deconvolution algorithms.Zero padding inserted extra zeros for the input, causing unnecessary and redundant operations.The padding-free approach based on CMOS can achieve high performance, while in the ReRAM-based PIM, extra storage and significant efforts were needed for circuitry adjustment.Moreover, overlap matrices were generated due to the multiplication of the input and kernel, causing extra calculations and circuits to be needed to obtain the final output.Therefore, the pixel-wise mapping scheme and zero-skipping data flow were suggested and also taken into action by Z. Li et al. [55].The function of the pixel-wise mapping scheme is to remove unnecessary zeros and provide the ability to execute the computations in parallel.It was designed to match the location of the pixels and support parallel processing.Furthermore, zero-skipping data flow was further expanded to skip the positioning of zeros and maximize the data reuse throughout the processing by distributing the original input to the processing engine (PE) buffers.
The padding-free approach based on CMOS can achieve high performance, while in the ReRAM-based PIM, extra storage and significant efforts were needed for circuitry adjustment.Moreover, overlap matrices were generated due to the multiplication of the input and kernel, causing extra calculations and circuits to be needed to obtain the final output.Therefore, the pixel-wise mapping scheme and zero-skipping data flow were suggested and also taken into action by Z. Li et al. [55].The function of the pixel-wise mapping scheme is to remove unnecessary zeros and provide the ability to execute the computations in parallel.It was designed to match the location of the pixels and support parallel processing.Furthermore, zero-skipping data flow was further expanded to skip the positioning of zeros and maximize the data reuse throughout the processing by distributing the original input to the processing engine (PE) buffers.The experiment was performed by evaluating the execution performance and energy consumption on GAN and FCN models under the zero-padding and padding-free design The experiment was performed by evaluating the execution performance and energy consumption on GAN and FCN models under the zero-padding and padding-free design scenarios.Z. Li et al. [55] executed the RED model by deconvolutional layers; a single layer required multiple processing cycles to complete, and multiple processing modes were executed simultaneously in one cycle.The executing stages were performed as follows: (1) fetch inputs from global memory to the input buffer, (2) process data that passed to Pes, (3) store intermediate output data in the output buffer and (4) transfer back the data to global memory, hence completing the processing cycle.The evaluation was performed by using the modified NeuroSim+, adjusting the clock frequency to 2 GHz and using 2 V of voltage.
RED achieved a 4-fold to 56.16-fold speedup compared to zero-padding, and voc_FCN8s had the highest speedup among all models tested by Z. Li et al. [55].However, RED only showed slight improvements compared to the padding-free approach.In terms of the execution time breakdown, the data show that RED had the same results for the array and peripheral latency as the padding-free design; the only difference was that the padding-free approach had different output buffer latencies, which can cost an extra 29% to 104% in execution time, but RED did not have this issue in output buffer latency.In terms of energy consumption, RED saved energy 1.05-fold to 18.17-fold compared to the zero-padding design, achieving an 18% to 49% increase in energy efficiency.

LerGAN
Mao et al. [56] proposed a novel GAN accelerator based on the PIM architecture known as LerGAN.The purpose of LerGAN is to solve two main issues that GAN encountered, i.e., zero-insertion that loads on storage and the complicated data transfer pattern causing the system bottleneck.The problems found in previous studies about GAN accelerators were the unnecessary zero-related works and poor I/O connection.Therefore, LerGAN involved two novel schemes to solve the problems.Mao et al. [56] proposed the zero free data reshaping (ZFDR) scheme and reconfigurable 3D connection scheme to tackle the above problems accordingly.The ZFDR scheme was supported by the memory controller and responsible for solving the zero-related task, whereas the 3D interconnection supports the complicated dataflow.
The design of the ZFDR scheme involves T-CONV ZFDR and W-CONV-S ZFDR components for the purpose of reshaping and reusing weight matrices.The types of each component can be further divided into CornerReshape, where no reuse of reshaped weights occurs; EdgeReshape, where more reuses occur; and InsideReshape, where less reuse occurs.The imbalance of reuse causes an instability of runtime; therefore, EdgeReshape and InsideReshape were replicated to a suitable amount.In the 3D data wire connection unit (3DCU), the wires were used to connect the nodes between different layers.Furthermore, switches were added somewhere between connections due to the bandwidth limitation, and adders were attached to each node, which were Smode and Cmode in 3DCU for memory read/write and computing, respectively.
Considering the two design architectures as a whole, the mapping of reshaped data from ZFDR was basically conducted by the compiler, while 3DCU was used as the memory controller.The two interfaces in ZFDR did not directly reshape the data, but dataflows and place holders were created to remove the zeros.There were mapping generators with both T-CONV and S-CONV and a mapping discriminator with only T-CONV.The data mappings were then recorded by the memory controller for management, and the switches were arranged to update the finite states.The experiment for evaluating the performance was conducted by comparing LerGAN with the FPGA-based GAN, a GAN running on a GPU and a GAN running on an improved ReRAM-based NN accelerator.The hardware used for LerGAN was 4 bits of a ReRAM cell, 16 bits of input, weight and output and a ReRAM array of size 128,128.The benchmark used was a state-of-the-art GAN.
ZFDR improved the overall parallelism by inserting more space for the weights.The transfer inputs was achieved by duplicating the weights after only using up 75% more space.In contrast, 3DUC achieved a reduction of data movements by aligning each part of the phase vertically.Besides, there were two possible results from the two methods: (1) split weights resulted in more spaces saved but less parallelism, and (2) duplicate weights caused more parallelism but used more spaces.The results of the experiment showed the improvements and achievements of LerGAN, where it achieved 47.2-fold, 21.42-fold and 7.46-fold speedup performance compared with the three other architectures, respectively.However, in terms of energy-saving performance, LerGAN consumed 1.04 times more energy than the FPGA-based GAN, but it still resulted in 9.75-fold and 7.68-fold energy savings compared to the other two architectures.

ReRAM-Based Accelerator for Sparse Neural Network
Lin et al. [7] proposed a mapping and pruning scheme for sparse NNs.Current NNs often bring immense amounts of pressure for storage and computation, as NNs usually contain large numbers of parameters.Hence, sparsity through weight pruning can be used as a powerful approach for compressing the NN model.The method of pruning is able to reduce the redundant parameters, bringing some relief from the storage and computation pressure.Pruning weights can be done by setting individual parameters to zero, which causes the NN to become sparse [57].However, a sparse matrix still has to be stored in the same way as a dense matrix as the crossbar structure has to retain the proper order of columns and rows.This causes the efforts of weights pruning to be wasted, as the zeros in NN cannot be eliminated.Therefore, Lin et al. [7] came up with a k-means clustering scheme for sparse NN mapping, as shown in Figure 3 which has been adapted from [7], and a crossbar-grained pruning algorithm.
In order to retain the computational complexity of O(1) for matrix-vector multiplication in ReRAM crossbar computing, the whole matrix has to be mapped into the crossbar, even though it includes a large number of zeroes.With k-means clustering, Lin et al. [7] is able to make non-zero elements more concentrated by using column exchanging.However, it is difficult to cluster a very large matrix.Therefore, the matrix goes through a pre-split before performing k-means clustering.The proposed system flow resembles PRIME's system flow, only with extra indexing units added due to the matrix having disordered columns and rows.
Lin et al. [7] proposed a mapping and pruning scheme for sparse NNs.Current NNs often bring immense amounts of pressure for storage and computation, as NNs usually contain large numbers of parameters.Hence, sparsity through weight pruning can be used as a powerful approach for compressing the NN model.The method of pruning is able to reduce the redundant parameters, bringing some relief from the storage and computation pressure.Pruning weights can be done by setting individual parameters to zero, which causes the NN to become sparse [57].However, a sparse matrix still has to be stored in the same way as a dense matrix as the crossbar structure has to retain the proper order of columns and rows.This causes the efforts of weights pruning to be wasted, as the zeros in NN cannot be eliminated.Therefore, Lin et al. [7] came up with a k-means clustering scheme for sparse NN mapping, as shown in Figure 3 which has been adapted from [7], and a crossbar-grained pruning algorithm.In order to retain the computational complexity of  1 for matrix-vector multiplication in ReRAM crossbar computing, the whole matrix has to be mapped into the crossbar, even though it includes a large number of zeroes.With k-means clustering, Lin et al. [7] is able to make non-zero elements more concentrated by using column exchanging.However, it is difficult to cluster a very large matrix.Therefore, the matrix goes through a pre-split before performing k-means clustering.The proposed system flow resembles PRIME's system flow, only with extra indexing units added due to the matrix having disordered columns and rows.
After concentrating the non-zero elements as closely as possible, the sparse NN is further compressed using crossbar-grained NN pruning.This allows ReRAM crossbars with only a few parameters to be discarded to minimize hardware resources.Then, the After concentrating the non-zero elements as closely as possible, the sparse NN is further compressed using crossbar-grained NN pruning.This allows ReRAM crossbars with only a few parameters to be discarded to minimize hardware resources.Then, the NN is fine-tuned to minimize accuracy loss after weights pruning.Using 8-bit precision, the degrading accuracy with deeper pruning represents less than 1% of accuracy loss.
Different NN structures were included in the stimulation, including CNNs such as ResNet-18 and RNNs such as LSTM-5 (long short-term memory).Based on the results, there was a 3 to 5-fold energy efficiency improvement after crossbar-grained pruning.The results are better for large layered NNs, as the layers occupy a large amount of the energy cost.This results in the NN ending up quite sparse after pruning.The system proposed by Lin et al. [7] also achieved more than a 2.5 to 6-fold speedup for most NNs.However, the result for convolutional layers, e.g., ResNet, is not as good.This is due to it being difficult to compress with pruning and map sparse irregular weights.

Sparse ReRAM Engine
Several proposed sparse NN/accelerators methods use very ideal architectures of ReRAM such as the PRIME design.The scheme proposed by Lin et al. [7] also uses idealized ReRAM architectures as it is similar to the PRIME system flow, where these optimal or ideal designs assume that the entire crossbar array can be triggered in one cycle.However, this idealized assumption causes a degradation of accuracy towards the inference by NN [58].This is due to the fact that simultaneously activating a large number of wordlines in the crossbar array causes the analog-to-digital converter (ADC) to have trouble accurately determining the accumulated values on the bitline.For practical applications of ReRAM Neural Network accelerators, a single cycle only allows a limited number of wordlines and bitlines to be turned on.Hence, in a cycle, the accelerator can only carry out a portion of computations.
Yang et al. [58] proposed a method to exploit sparsity in DNNs called Sparse ReRAM Engine (SRE).The maximum number of dot-product computations in one cycle is defined as one Operation Unit (OU).Instead of assuming that the whole of the crossbar array is activated in a single cycle, only a small section of the crossbar or OU is activated in one cycle.With sections of the crossbar turning on independently, this gives some form of advantage in terms of compressing weights and the activation of NNs.The proposed method uses both row compression and activation compression.
Row compression removes the row vectors containing all zero elements in a single OU.Then, the rest of the rows can be shifted up, as shown in Figure 4, which has been adapted from [58].Column compression is also similar to row compression except that it only involves column operations.However, column compression alters the output mapping of the bitline.Hence, row compression is more suitable as it does not change the output mapping and does not need any output indexing.As for activation compression, a method of Dynamic OU Formation is used, where computation is ignored when the inputs of the OU are all zeros.
Yang et al. [58] proposed a method to exploit sparsity in DNNs called Sparse ReRAM Engine (SRE).The maximum number of dot-product computations in one cycle is defined as one Operation Unit (OU).Instead of assuming that the whole of the crossbar array is activated in a single cycle, only a small section of the crossbar or OU is activated in one cycle.With sections of the crossbar turning on independently, this gives some form of advantage in terms of compressing weights and the activation of NNs.The proposed method uses both row compression and activation compression.
Row compression removes the row vectors containing all zero elements in a single OU.Then, the rest of the rows can be shifted up, as shown in Figure 4, which has been adapted from [58].Column compression is also similar to row compression except that it only involves column operations.However, column compression alters the output mapping of the bitline.Hence, row compression is more suitable as it does not change the output mapping and does not need any output indexing.As for activation compression, a method of Dynamic OU Formation is used, where computation is ignored when the inputs of the OU are all zeros.Comparing SRE with an idealized design (i.e., ISAAC [59]), SRE is able to provide better performance for NN models such as CIFAR-10, CaffeNet and GoogLeNet.In addition, SRE can save an average of 67.0% of energy over the ISAAC design.As for evaluation with other neural network models, SRE is capable of providing an average speedup of 13.1 times and an average energy saving of 85.3%.[8] proposed an in-memory processing design for multi-modal Deep Neural Network (DNN) acceleration using ReRAM called HitM.Unlike typical unimodal NN applications (e.g., applications dealing with image recognition and natural language processing), the multi-modal DNN is able to learn from heterogeneous data fused from various sensors.For example, videos with auto-captions are some of the data in the web that combine visual and linguistic information [60].Hence, combining Neural Network architectures such as recurrent and convolutional NNs will result in a multi-modal neural network.This has also caused the utilization of ReRAM-based PIM to be very challenging on multi-modal DNNs.Mainly because they consist of convolutional layers and linear layers inside CNNs and RNNs, respectively, both layers have to be considered to perform optimization on a multi-modal DNN.
HitM mainly consists of two stages, i.e., static analysis and adaptive optimization.The first stage involves static analysis, which mainly extracts structural information of the multi-modal DNN description.It parses the user input neural network description file and generates information of the ReRAM-based PIM design.This information includes the types of NN layers, layer dimension, number of ReRAM crossbars, etc., which is used in the optimization stage.In the adaptive optimization stage, a scheme to perform filter replication is produced with optimized throughput.The rate of production of a network or throughput is evaluated recursively using an estimation model.The next objective is to determine the highest throughput while being under resource constraints, specifically in terms of hardware resources.This can be done by precisely assigning the number of filter replications to each neural layer.Hence, the outputs including the optimized throughput and list of replication numbers can be mapped onto the ReRAM-based PIM.
For the evaluation of hardware and throughput utilization, HitM is compared with ReRAM-based PIM designs without filter replications and an ideal-optimal throughput design.HitM is able to achieve an optimal throughput and hardware utilization of 78.01% and 64.52%, respectively.While comparing with other similar in-memory processing models, HitM is able to improve power efficiency and computation efficiency by 3.80-10.34times and 4.21-13.08times respectively.It is expected that the research will extend in the direction of activation in multi-modal NNs and the utilization of weights' bit-widths.

Re2PIM: Reconfigurable ReRAM-Based Deep Neural Network Accelerator
Deep Neural Network (DNN) computing requires a massive amount of data movement; hence, it demands a large amount of energy and also causes long processing intervals.To overcome the limitations, in-memory processing techniques have emerged to act as accelerators of DNNs.Accelerators using ReRAMs show great capabilities as they are energy efficient and have a high density of computations.However, current ReRAM accelerators are still unable to reach their full potential in terms of computational efficiency.For example, Shafiee et al. [59] determined that the peak computational efficiency for a ReRAM crossbar with a size of 128 × 128 can go up to 1707 GOPS/s× mm 2 for 16-bit operations.However, due to peripheral circuits (e.g., buffers, buses and ADCs), an ISAAC architecture can only achieve a computational efficiency of 479 GOPS/s× mm 2 .Other works such as CASCADE and TIMELY architectures are also unable to achieve efficiencies anywhere close to the peak efficiency [61,62].Hence, overheads in peripheral circuits greatly impact the execution of ReRAM accelerators.
In a DNN accelerator, a ReRAM crossbar, also known as a computing unit (CU), is able to efficiently execute a matrix vector multiplication.However, the size of the CU can affect the energy efficiency and throughput of the accelerators [63].A small CU size will divide DNN weight matrices into small blocks, resulting in a high overhead of peripheral circuits.For a large CU size, the weight matrices will be divided into larger but fewer blocks, decreasing the overhead.However, a large CU size will limit the throughput of the system due to coarse granular division.
Zhao et al. [63] proposed an accelerator design with a CU size that is variable to adapt to weight matrices with different sizes called Re2PIM.The size of the CU can be changed by reconstructing the reconfigurable units (Rus).Re2PIM is compared to typical ReRAM in-memory architectures, including PRIME, ISAAC and TIMELY.Under constant chip areas, the proposed design of Zhao et al. [63] is able to achieve 27-fold, 34-fold and 1.5-fold average energy efficiency improvements over PRIME, ISAAC and TIMELY, respectively.As for improvements in throughput, Re2PIM is on average 5.7, 17 and 8.2 times better than PRIME, ISAAC and TIMELY, respectively.In terms of crossbar utilization, Re2PIM is slightly worse than ISAAC, as ISAAC has smaller granularity division.However, this causes ISAAC to have high overheads in the peripheral circuit, while Re2PIM benefits from low overhead and saves more energy.

Faults in ReRAM Based Neural Networks
Despite the outstanding characteristics of ReRAM over the conventional CMOS architecture, ReRAM still faces some challenges in terms of its reliability.Due to inexperienced chip fabrication and the fundamental nature of ReRAM itself, ReRAM faults will occur [64].These faults will lead to a degradation of performance and cause errors in the system.ReRAM faults mainly consist of hard faults and soft faults.
Hard faults, which are also known as permanent faults, are where the resistance of ReRAM is stuck or permanently fixed in a state.This is due to a defective wordline or switch state becoming permanent, which means that the ReRAM cell cannot be accessed.This will also lead to the cell being fixed in a permanent state.For example, the cell is mistaken to have high resistance while it is in the low resistance state.This is also known as stuck-at-one [65].These faulty wordlines or switches can be caused during chip fabrication or due to the low endurance of ReRAM [64].
In contrast, soft faults cause the saved state of ReRAM to be altered, where the actual resistance of the cell differs from the expected resistance, hence causing a degradation of accuracy in the system.ReRAM uses oxygen vacancies and oxygen ions to form conductive filaments.Soft faults are formed when oxygen vacancies are diffused out of the filaments [66].This can be further triggered by the repeated switching of ReRAM and the generated heat, thus increasing soft faults and causing disturbance in the read and write process.
To alleviate the effects of ReRAM faults, several methods are proposed.For example, Xia et al. [10] proposed a fault-tolerant training that also includes fault detection.The fault-tolerant training will be able to improve the lifetime of ReRAM by reducing the number of write operations and also updating the ReRAM status.When faults are detected through voltage comparison, the next training phase will be able to tolerate the faults.
W. Li et al. [64] presented RRAMedy, which is able to detect ReRAM faults and recover from them.The proposed method uses a detection method in which it can detect subtle faults that may be expensive to detect using traditional methods.This prevents small subtle faults from affecting the accuracy of the system.Moreover, if irreversible faults are detected, the model can be retrained by a cloud server to update the current parameters.
A faulty ReRAM cell can have a major influence on the system if not handled properly.This is because it affects the output map, and this will propagate through the layers of the NN, hence affecting the system as a whole and causing error.The aftereffect of these errors can be disastrous, as it may involve critical safety functions [67]; for example, the classification of objects in self-driving vehicles.Hence, it is important to minimize the effects of ReRAM faults for a system to have higher reliability.

No
Citation NN Type Architecture Summary

[2] RNN
ReRAM-based PIM RNN accelerator whereby the processing engine was divided into three subarrays for different functions, and a lower bit-precision number was utilized to enhance computing efficiency.

[7]
Sparse NN Sparse NN mapping approach using k-means clustering and pruning algorithm to minimize low utilization crossbars with minimal impact on accuracy.

[8]
Multimodal-DNN HitM: Combined NN accelerator model of CNN and RNN by extracting layer-wise information and determining optimized throughput.

[33]
MLP and CNN PRIME: PIM-based architecture using ReRAM crossbar arrays that be configured as the accelerator, with an input and synapse composing scheme.

6
[52] CNN FullReuse: CNN accelerator that aims for maximum utilization of data reusability, including input data, output data and weights.

[55]
GAN and FCN RED: ReRAM-based approach with pixel-wise mapping scheme and zero-skipping data flow for deconvolutional computation.This approach has solved the massive zero insertion and maximizes the reuse of input data.

[72] CNN
ReRAM-based network processing unit and training method with mixed precision.This scheme enables low power edge CNN with instant precision tuning.17 [73] Spiking NN A robust ReRAM-based PIM spiking NN accelerator with frequency-dependent stochastic spike-timing-dependent-plasticity (STDP), which achieved better accuracy under noisy input conditions.

[74] DNN
3D-ReG: ReRAM-based PIM architecture and Graphics Processing Unit (GPU) with three-dimensional integration.This design exploits the diversity of 3D-ReG to determine the optimal balance between efficiency and accuracy.

[75] DCNN
Deep Convolutional NN (DCNN) accelerator by avoiding expensive ADC and DAC processes.This is done by moving the computation of network layers to a CMOS peripheral circuit.

[76] RNN
ERA-LSTM: Tiled ReRAM-based PIM with an elaborate mapping scheme that enables the concatenation of tiles for large-scale long short-term memory (LSTM) and inter-tile pipeline.This approach supports LSTM pruning method and saves half of the ReRAM crossbar overhead.

[77] DCN
RECOIN: Deformable Convolution Network (DCN) with low power ReRAM-based PIM architecture by addressing irregular DCN data access.In addition, the authorsproposed a bilinear interpolation (BLI) processing engine to avoid high write latency.

Conclusions and Future Studies
Memristor or ReRAM technology has great potential to overcome the limitations faced by current commercial von Neuman architectures using CMOS, including faster access speed and multi-state storage power.The capabilities of ReRAM have also attracted the interest of many researchers in using ReRAM for in-memory computing architectures and building architectures for Neural Networks (NNs).In this paper, we have addressed different approaches and schemes for ReRAM-based NN accelerators.Different types of NNs are categorized into RNN, CNN, GAN, sparse NN and DNN.
The proposed schemes and methods of ReRAM-based NN accelerators mostly aim for higher computation speed, higher energy efficiency and maximum utilization.The accuracy of the ReRAM-based NN is also a significant factor, especially in sparse NN.This is due to sparse NN involving many zero parameters and requiring pruning or compression to reduce storage and computation pressure.However, the challenge comes when the degradation of accuracy must be kept at minimum while increasing efficiency.
Other than accuracy issues caused by compression, ReRAM still poses some risks of resulting in faulty outcomes, as ReRAM has faults such as hard faults and soft faults.These ReRAM faults will affect the resistance stored in the device, leading it to be inaccurate, resulting in bigger errors in the NN.Hence, faults in ReRAM must be addressed properly to their effects on the system.
ReRAM-based NN architectures should be explored more given their powerful capabilities.With this, a large scale of reliable and precise simulators, modeling tools and real prototypes will be needed.However, there is a lack of open-source or non-proprietary modeling tools.This causes many researchers to use existing tools or estimation from other works to determine ReRAM parameters, running the risk of being inaccurate [78].Hence, the development of non-proprietary tools for ReRAM-based NN testing and prototyping will bring improvements in integrating and adopting ReRAMs in real systems.

Figure 1 .
Figure 1.Illustration of basic NN architecture.

Figure 1 .
Figure 1.Illustration of basic NN architecture.

Table 2
summarizes the different types of ReRAM-based Neural Network accelerators, including some mentioned in previous sections and other additional recently presented ReRAM-based PIM schemes.

Table 2 .
Cont.ReRAM-based PIM accelerators addressing the bottlenecks of the high energy cost of data movement and ADC/DAC processes.This approach adopts analoe local buggers, time-domain interfaces and an only-once input read mapping scheme.ReRAM-based PIM accelerator that aims to support deep learning applications.Improved pipelined architecture contributed to enabling the continuous flow of data in consecutive cycles.Spike-based data scheme was used to eliminate overhead of DAC/ADC.Mobile PIM accelerator that is capable of quality of service (QoS), including high accuracy, low latency and low energy consumption.15[71]GANReRAM-based PIM GAN accelerator with a novel computation deformation technique that completely eliminates zero-insertion.Spatial parallelism and computation sharing has been proposed to improve the training efficiency.