In-Memory Computing with Resistive Memory Circuits: Status and Outlook

Abstract: In-memory computing (IMC) refers to non-von Neumann architectures where data are processed in situ within the memory by taking advantage of physical laws. Among the memory devices that have been considered for IMC, the resistive switching memory (RRAM), also known as memristor, is one of the most promising technologies due to its relatively easy integration and scaling. RRAM devices have been explored for both memory and IMC applications, such as neural network accelerators and neuromorphic processors. This work presents the status and outlook of RRAM for analog computing, where the precision of the encoded coefficients, such as the synaptic weights of a neural network, is one of the key requirements. We present an experimental study of the cycle-to-cycle variation of the set and reset processes for HfO2-based RRAM, which indicates that gate-controlled pulses offer the least variation in conductance. Assuming a constant conductance variation σ_G, we then evaluate and compare various mapping schemes, including multilevel, binary, unary, redundant and slicing techniques. We present analytical formulas for the standard deviation of the conductance and the maximum number of bits that still satisfies a given maximum error. Finally, we discuss RRAM performance for various analog computing tasks in comparison with other computational memory devices. RRAM appears as one of the most promising devices in terms of scaling, accuracy and low-current operation.


Introduction
With Moore's law slowing down, novel computing concepts are being researched to mitigate the memory bottleneck typical of von Neumann architectures. Among these new concepts, in-memory computing (IMC) has attracted increasing interest due to its ability to execute computing tasks directly within the memory array [1][2][3]. Various IMC circuits have been proposed so far, including digital gates [4][5][6][7], physical unclonable functions (PUFs) [8][9][10][11] and neuromorphic neurons and synapses [12][13][14][15][16]. In these circuits, the computational function stems from a physical property of the memory device and circuit, such as the set/reset dynamics of resistive switching memory (RRAM) for synaptic potentiation/depression and neuron integrate-and-fire. As a result of the physics-based computation, most IMC circuits operate in the analog domain and on a continuous time scale. A typical example of an analog IMC primitive is the matrix vector multiplication (MVM), which can be executed in one step in a crosspoint memory array [17][18][19]. The parallel and analog IMC operation allows the MVM to be significantly sped up with respect to the conventional multiply-accumulate (MAC) algorithm of digital computers [20,21]. Crosspoint-based MVM has been adopted in a number of computing applications, ranging from deep neural networks (DNNs) [22][23][24][25] to linear algebra accelerators [26][27][28][29].
Figure 1 illustrates the hierarchical design approach for IMC accelerators. Computation is enabled at the device level by transport phenomena, e.g., Ohm's law enabling the multiplication of voltage and conductance, or threshold switching enabling the comparison between voltages [30]. Devices are connected within circuits, allowing the parallel flow of input and output signals, Kirchhoff's current summation and feedback connections, usually in the analog domain. Circuit primitives are organized within novel architectures to harness the full potential of the computing cores and accelerate data-intensive workloads such as neural networks. Based on such a hierarchical structure, it is clear that the proper optimization of analog accelerators requires a co-design approach from materials to applications, taking into account device characteristics, circuit/device non-idealities and possible architecture limitations. Algorithms such as neural network training can also be used to optimize precision in view of device and circuit non-idealities. In this regard, a key point is the precision of the physical representation, or mapping, of the computational coefficients and of the overall computation, which might be affected by noise, instability and parasitic elements in the crosspoint array circuit [31]. This work presents an overview of analog IMC with RRAM devices. We first address the programming characteristics of typical HfO2 RRAM devices to highlight the intrinsic conductance variations for various programming algorithms. Then we focus on the various options for mapping computational parameters, such as the synaptic weights of a DNN, discussing the trade-off between precision and memory area. Finally, we extend our study to various memory parameters, including low-current operation, programming speed and cycling endurance, discussing their importance for various computational applications and the memory devices that maximize these performance metrics.

RRAM Device Structure
Various nanoscale devices have been proposed as a new class of non-volatile memory, where information is stored as the physical configuration of an active material, resulting in different conductance values [1]. For example, information can be stored and retrieved as the device spin magnetization in a magnetic random access memory (MRAM) [32], as the phase structure of the material in phase change memories [33,34] or as the atomic configuration of conductive defects in resistive switching memories, or RRAM [35]. The latter has attracted particular interest thanks to its simple structure, compatibility with the CMOS process, fast operation and high density [36,37]. Figure 2a shows the typical structure and operation of a RRAM device, consisting of an insulating metal-oxide layer interposed between a metallic top electrode (TE) and a metallic bottom electrode (BE). The insulating layer typically shows a high resistance after fabrication. A forming process, consisting of the application of a relatively high positive voltage pulse on the TE, induces a local modification of the material composition with the growth of a metallic filament, resulting in the low resistance state (LRS). A high resistance state (HRS) can be recovered by means of a negative voltage applied to the TE, which results in the creation of a depletion region across the conductive filament, thus resulting in a decrease of conductance. In addition to the LRS and the HRS, intermediate states can be programmed. This is possible, e.g., by controlling the filament size via the maximum current flowing across the device during the set operation, i.e., the compliance current I_C, which is relatively straightforward in the 1-transistor/1-resistor (1T1R) structure shown in the inset of Figure 2b, or by controlling the maximum voltage applied during the reset operation, which results in a different thickness of the depleted gap. For instance, Figure 2b shows typical current-voltage (I-V) curves of a RRAM device in 1T1R configuration for increasing I_C, controlled with the transistor gate voltage V_G [38], demonstrating the possibility of analog programming. The multilevel cell (MLC) capability of Figure 2b is interesting not only for increasing the bit density of the memory, but also for enabling computing applications [39]. In fact, by arranging RRAM devices in crosspoint configuration as shown in Figure 3, it is possible to directly encode an arbitrary matrix entry A_ij into the conductance value G_ij [1,31]. Then, by applying a voltage vector V as input on the columns and collecting the current I_j flowing in each row, it is possible to compute the matrix vector multiplication (MVM) according to:

I_j = Σ_{i=1..N} G_ij V_i, (1)

where N is the dimension of the input vector, thus G is a N × N matrix. MVM is the backbone of a variety of data-intensive applications that can be accelerated by IMC, such as neural network training and inference [40][41][42][43], neuromorphic computing [15,16,44,45], image processing [18,46], optimization problems [47][48][49][50] and the iterative solution of linear equations [26,27]. Circuits capable of solving matrix equations without iteration thanks to analog feedback have also been demonstrated [28,29,38]. All these applications have in common the need for a reliable analog memory, which still appears a challenge due to the inherent stochasticity of the switching behavior in RRAM devices. (Figure 3: by programming the RRAM conductances to the matrix entries of A and applying a voltage vector V on the columns, the current flowing in each row j tied to ground is given by Kirchhoff's law as in Equation (1). Adapted from [29].)
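As a minimal numerical sketch of Equation (1) (with arbitrary example values for G and V, not taken from any measured device), the one-step MVM reduces to a matrix-vector product:

```python
import numpy as np

# Illustrative 4x4 conductance matrix G (siemens) and input voltage vector V (volts).
# The values are arbitrary; in hardware each G[j, i] is a programmed RRAM conductance.
G = np.array([[100e-6,  50e-6,   0e-6,  25e-6],
              [  0e-6,  75e-6, 100e-6,  50e-6],
              [ 25e-6,   0e-6,  50e-6, 100e-6],
              [ 50e-6, 100e-6,  25e-6,   0e-6]])
V = np.array([0.1, 0.2, 0.05, 0.1])

# Kirchhoff's current law: the current collected at row j is I_j = sum_i G[j, i] * V[i].
# The crosspoint array evaluates all rows in one parallel analog step.
I = G @ V

# Explicit-loop reference, for comparison with the one-step product.
I_ref = np.array([sum(G[j, i] * V[i] for i in range(4)) for j in range(4)])
```

The parallelism is what the hardware exploits: all N² multiplications and the N current summations happen simultaneously in the analog domain.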

Analog Memory Programming Techniques and Variations
Programming the same device several times under the same conditions (initial state, programming pulse width and amplitude) results in different conductance values, due to the stochastic process of ionic migration during the set/reset operation, which is usually referred to as cycle-to-cycle variability [35]. Figure 4a shows an example of a programming algorithm during the set operation, namely incremental gate pulse programming (IGPP), where the TE voltage V_TE is kept constant while the gate voltage V_G is increased at each time step. This process was repeated 1200 times and the conductance was read with a relatively low voltage after each cycle [51]. Figure 4b shows that the single traces (grey) suffer from relatively large variations, while the median value (blue) grows approximately linearly with pulse number (i.e., with V_G). The cumulative distribution of the conductance for each cycle is reported in Figure 4c for increasing V_G. The standard deviation of the conductance σ_G as a function of the median conductance is reported in Figure 4d, indicating a fairly constant value σ_G = 3.8 µS. This indicates a linear increase in the relative resistance variation σ_R/R with resistance R, since R = 1/G and, to a first approximation, variations obey the same relationship between differential quantities, hence σ_R/R = σ_G/G [35]. The linear increase in σ_R with R is generally observed in variability measurements of RRAM [52,53] and has been attributed to variations in the shape of the conductive filament. Other variability data have been reported indicating an increase in σ_R/R as R^0.5, which can be interpreted in terms of a Poissonian variation of the number of defects in the conductive filament [54][55][56]. Analog programming of RRAM is also possible in the opposite polarity, namely by increasing the negative reset voltage V_TE applied on the TE for a fixed V_G, an algorithm referred to as incremental reset pulse programming (IRPP), as shown in Figure 5a. In this case, by first initializing the conductance into the LRS, it is possible to characterize the variability of analog programming with reset voltage by applying IRPP several times (i.e., 1200 times, as in Figure 5) [51]. Figure 5b again evidences that the single traces (grey), corresponding to the conductance read after each pulse of a single reset ramp, show high variability and fluctuation, while the median value (blue) shows a gradual decrease with pulse number. The cumulative distributions of the conductance are reported in Figure 5c for increasing applied TE voltage |V_STOP|. The resulting standard deviation of conductance σ_G as a function of the median conductance G is shown in Figure 5d (red): the comparison with the IGPP results (blue) shows that σ_G is generally larger for the reset process. We can conclude that gradual set programming is more convenient than gradual reset for accurate tuning of the resistance; however, accurate program/verify (PV) algorithms are needed to reduce the error to acceptable levels (i.e., to have σ_G < 1 µS).

Program-Verify Algorithms and Device-to-Device Variations
So far, only the cycle-to-cycle variation of a single device has been considered. To address device-to-device variation, the conductance is typically characterized in a relatively large array (e.g., >1 kB), which allows one to study the main variability features with a relatively simple circuit and a short experimental time. Figure 6a illustrates a conceptual schematic of a PV algorithm for 1T1R RRAM, namely the incremental-step program-verify algorithm (ISPVA) [43], applied on a 4 kB array. For a given V_G (or I_C), the TE voltage is incremented until the value of G read after programming reaches the target value G = L_i. Figure 6b shows conductance traces as a function of V_TE for increasing V_G, obtained with ISPVA for target levels L_2 to L_5. The experiment was repeated on all the devices in the array, and the read current probability distributions are shown in Figure 6d [43]. The final average standard deviation is σ_G = 7.5 µS, which is larger than in the case of Figure 4 despite the PV algorithm, where the number of pulses is adapted to reach a certain conductance. The larger σ_G can be understood as the superposition of cycle-to-cycle variation and device-to-device variation, the latter giving the larger contribution to variability within the array.
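The logic of such a program-verify loop can be sketched as follows; the device model here is a deliberately crude assumption (a fixed mean conductance gain per pulse plus Gaussian cycle-to-cycle noise), not the actual RRAM physics behind the ISPVA of [43]:

```python
import numpy as np

rng = np.random.default_rng(0)

def ispva(target_g, g_step=5.0, sigma_g=2.0, v_start=0.5, v_step=0.05, v_max=3.0):
    """Toy incremental-step program-verify loop.

    Illustrative device model: each pulse raises the conductance (in uS) by
    g_step plus Gaussian cycle-to-cycle noise. The loop increments the TE
    voltage and stops once the verified conductance reaches the target, or
    when the maximum programming voltage is reached.
    """
    v, g = v_start, 0.0
    while g < target_g and v <= v_max:
        g = max(0.0, g + g_step + rng.normal(0.0, sigma_g))  # program pulse
        v += v_step                                          # increment V_TE
    return g, v

g_final, v_final = ispva(target_g=50.0)
```

The verify step after each pulse is what distinguishes this from the open-loop IGPP/IRPP ramps of the previous section: the number of pulses adapts to the device instead of being fixed.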

Conductance Drift and Fluctuations
Unfortunately, once the device is programmed with a given precision, the conductance might still change due to time-dependent drift and fluctuations that affect the reliability of IMC. Figure 7a shows measured traces of conductance programmed with 4 levels on a 1 Mb RRAM array as a function of time during annealing at 150 °C [58]. The change of conductance can be explained by the thermally activated atomic diffusion in the conductive filament. Note that, in addition to the conductance drift with time, random fluctuations around the median value are present, as shown in Figure 7a. This is better illustrated in Figure 7b [59], where the resistance of a RRAM device in the HRS is plotted as a function of time. The results show that the resistance can experience abrupt variations, called random walk, in addition to the typical random telegraph noise (RTN). As a consequence, the accuracy of a neural network can decrease with time due to the introduction of a bias in the conductance as a result of the drift [43,58,60]. Figure 7c shows the cumulative distributions of the initial conductance of Figure 6c and of the conductance after annealing at T = 125 °C [43], which confirms the time-dependent drift of the weights of a two-layer neural network. The network schematic is represented in Figure 7d: it is composed of a 14 × 14 input layer (corresponding to a downsized version of the MNIST dataset), a hidden layer of 20 neurons and 10 output neurons corresponding to the 10 handwritten digits, resulting in a total of 14 × 14 × 20 + 20 × 10 weights, which can be mapped in a 4 kB array. Figure 7e shows the confusion map for testing on the MNIST dataset before annealing, showing a relatively good average accuracy of 83%. However, the accuracy drops to 72% after annealing, as shown in Figure 7f, demonstrating the need for stable states for reliable neural network inference.

RRAM Conductance Mapping Techniques
While Figures 2-5 focus on multilevel conductance mapping, aiming at an increase in the number of programmable levels and a reduction of the programming error, other techniques can be adopted to map a given computing coefficient, such as a synaptic weight, into one or multiple memory devices. Figure 8 summarizes the main programming methodologies for IMC [51], including multilevel (a) [43,61], binary (b) [62], unary (c) [63], multilevel with redundancy (d) [64] and slicing (e) [23]. In the following, we compare the various techniques in terms of mapping accuracy, maximum number of bits and number of devices. The mapping accuracy is evaluated by the standard deviation of the error σ in programming a certain coefficient with a given number of bits N. The accuracy is evaluated by analytical formulas assuming a constant σ_G in programming an individual memory device with a maximum conductance G_max [51].

Multilevel
Analog memories can naturally map discretized levels, which is referred to as multilevel mapping. To store N bits, 2^N equally spaced conductance levels between 0 and G_max are needed. As a result, each level is separated from the adjacent ones by a conductance step ΔG = G_max/(2^N − 1). Given a standard deviation of the programming error σ_G, such as the value σ_G = 3.8 µS of Figure 4d, the resulting standard deviation σ of the error in programming N bits, in units of ΔG, can be obtained as:

σ = σ_G/ΔG = σ_G (2^N − 1)/G_max. (2)

Equation (2) allows one to estimate the maximum number of bits N_max that can be mapped in a RRAM device with a given acceptable error σ ≪ 1, which yields N_max = log2(1 + σ G_max/σ_G). For example, by considering σ_G = 2.2 µS, G_max = 225 µS [57] and targeting a maximum error of σ = 10%, the resulting maximum number of bits is N_max = 3.35, corresponding to about 10 levels that can be written in the memory while satisfying the precision requirement.
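These estimates can be reproduced numerically; the sketch below uses the values quoted in the text and also makes explicit that the quoted N_max = 3.35 matches the large-argument approximation log2(σ G_max/σ_G) of the inversion of Equation (2):

```python
import numpy as np

sigma_G, G_max, sigma_t = 2.2e-6, 225e-6, 0.10   # values from the text

def sigma_multilevel(N):
    # Equation (2): sigma = sigma_G / dG = sigma_G * (2**N - 1) / G_max
    return sigma_G * (2**N - 1) / G_max

x = sigma_t * G_max / sigma_G      # ~10.2
N_exact = np.log2(1 + x)           # exact inversion, ~3.49 bits
N_approx = np.log2(x)              # large-N approximation, ~3.35 bits
```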

Binary
Binary storage is the typical mapping of conventional digital memories, where a value x is converted into its binary representation with N bits written in N memory cells, each containing two states for 0 and 1. For instance, x = 14 can be written in binary representation as x_bin = 1110, with 4 RRAM cells programmed to [G_max, G_max, G_max, 0], respectively. A weighted summation of the conductance values is possible by multiplying the current flowing in each cell by the corresponding power of 2, namely 2^3, 2^2, 2^1 and 2^0, respectively, thus allowing one to reconstruct the correct number as

x = (2^3 G_3 + 2^2 G_2 + 2^1 G_1 + 2^0 G_0)/G_max.

To estimate σ, we consider the average imprecision divided by the least significant bit (LSB), namely:

σ = (σ_G/G_max) √[(4^N − 1)/3],

where the square-root term combines the independent variations of each memory cell multiplied by its weight. The maximum number of bits can thus be obtained as:

N_max = (1/2) log2[1 + 3 (σ G_max/σ_G)^2].

Assuming the same σ_G, G_max and σ as in the multilevel estimation, we obtain N_max = 4.15 with binary RRAM.
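A short numerical check of the binary scheme, using the error expression above and the same example values (the decode of x = 14 is included for illustration):

```python
import numpy as np

sigma_G, G_max, sigma_t = 2.2e-6, 225e-6, 0.10   # same example values as before

# Decoding x = 14 from four binary cells programmed to [G_max, G_max, G_max, 0]:
g = G_max * np.array([1, 1, 1, 0])
x = (g / G_max) @ np.array([8, 4, 2, 1])         # weighted current summation

def sigma_binary(N):
    # per-cell error sigma_G, weighted by 2**k and normalized to the LSB:
    # sigma = (sigma_G / G_max) * sqrt((4**N - 1) / 3)
    return (sigma_G / G_max) * np.sqrt((4.0**N - 1) / 3)

# inversion at the target error
N_max = 0.5 * np.log2(1 + 3 * (sigma_t * G_max / sigma_G) ** 2)
```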

Unary
To increase the precision of binary mapping, unary (or thermometric) coding uses 2^N − 1 devices to represent the information, each one having equal weight, which requires no bit-specific gain in the current summation. In unary coding, the error relative to the LSB is given by:

σ = (σ_G/G_max) √[(2^N − 1)/2],

where the factor 2 accounts for the fact that, on average, only half of the cells are programmed. This leads to N_max = 7.7 with the same σ_G, G_max and σ used in the previous estimations. However, note that the higher precision is achieved at the cost of a larger number of RRAM devices, namely 2^N_max − 1 = 207 memory devices.
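The unary estimate can be checked in the same way (the factor 2 from the average cell occupancy is an assumption consistent with the quoted N_max = 7.7):

```python
import numpy as np

sigma_G, G_max, sigma_t = 2.2e-6, 225e-6, 0.10

def sigma_unary(N):
    # 2**N - 1 unit-weight cells; the factor 2 assumes that, on average,
    # half of the cells are programmed for a uniformly distributed coefficient
    return (sigma_G / G_max) * np.sqrt((2.0**N - 1) / 2)

N_max = np.log2(1 + 2 * (sigma_t * G_max / sigma_G) ** 2)
n_devices = 2.0**N_max - 1       # number of RRAM cells required (~207)
```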

Multilevel with Redundancy
To reduce the impact of σ_G on multilevel coding, M memory devices having the same nominal conductance can be programmed and operated in parallel. As a result, the error is reduced by a factor √M thanks to the averaging among the M redundant cells (see Table 1), and the maximum number of bits is correspondingly enhanced. Assuming M = 4 and the same σ_G, G_max and σ used in the previous estimations, we obtain N_max = 4.36 bits, i.e., one additional bit compared to the pure multilevel case with no redundancy.
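The √M error reduction from redundancy is easy to verify with a small Monte Carlo experiment (synthetic Gaussian errors, not measured data):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_G = 2.2e-6     # per-cell programming error (as in the text)
M = 4                # redundancy factor
trials = 20000

# Reading M redundant cells programmed to the same target and averaging their
# currents reduces the effective error by a factor sqrt(M) = 2.
cell_errors = rng.normal(0.0, sigma_G, size=(trials, M))
sigma_eff = cell_errors.mean(axis=1).std()
```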

Slicing
By encoding a given number in base l, with l = 2^N the number of levels stored in a single memory element with multilevel mapping, and using M different memories to represent the data, it is possible to significantly increase the precision in a compact implementation. For example, x = 14 encoded in base l = 4 yields x_4 = 32, i.e., the digits 3 and 2, such that by using two memory elements with weights 4^1 and 4^0, the current summation yields x = 4^1 × 3 + 4^0 × 2 = 14. Slicing thus increases the number of addressable levels exponentially, to l^M with M cells. The error can be evaluated in the same way as in the binary scheme, by summing the weighted error contribution of each cell; since the weighted sum is dominated by the most significant slice, each of the M slices retains nearly the full multilevel precision, so that N_max ≈ M log2(σ G_max/σ_G). Assuming M = 4 and the same values of σ_G, G_max and σ as before, we obtain N_max = 13.41 bits for slicing.
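A numerical sketch of slicing with the x = 14, l = 4 example from the text, plus the approximate N_max estimate for M = 4:

```python
import numpy as np

G_max = 225e-6
l, M = 4, 2                      # 4 levels per slice, two slices: x = (3, 2) in base 4

# x = 14 -> digits (3, 2) in base 4, one multilevel cell per digit
digits = np.array([3, 2])
dG = G_max / (l - 1)             # conductance step within each cell
g = digits * dG                  # programmed conductances
x = (g / dG) @ np.array([l**1, l**0])   # weighted summation reconstructs 14

# Approximate maximum precision with M = 4 slices: roughly M times the
# multilevel value, since the most significant slice dominates the error.
sigma_G, sigma_t = 2.2e-6, 0.10
N_max = 4 * np.log2(sigma_t * G_max / sigma_G)   # ~13.4 bits
```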

Simulation Results
Table 1 summarizes the formulas for calculating the error σ and the maximum number of bits N_max for the different programming techniques. To validate the analytical formulas, we performed Monte Carlo simulations of the various programming conditions and compared the results to the analytical calculations [51].
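As an illustration of this validation approach, the following sketch runs a small Monte Carlo experiment for the multilevel scheme and compares it with Equation (2) (synthetic Gaussian programming errors, N = 4 bits):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_G, G_max = 2.2e-6, 225e-6
N = 4                          # bits per cell, multilevel mapping
trials = 50000

# Program random N-bit levels with additive Gaussian conductance error and
# measure the read-back error in LSB units; compare with Equation (2).
dG = G_max / (2**N - 1)
levels = rng.integers(0, 2**N, size=trials)
g_prog = levels * dG + rng.normal(0.0, sigma_G, size=trials)
sigma_mc = ((g_prog - levels * dG) / dG).std()
sigma_analytic = sigma_G * (2**N - 1) / G_max
```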

Figure 9 shows the results of the analytical formulas (top) compared with the results of the MC simulations (bottom) for the standard deviation of the error σ as a function of the standard deviation of conductance σ_G and the number of bits N to encode, for multilevel (a,f), binary (b,g), unary (c,h), multilevel with redundancy M = 4 (d,i) and slicing with M = 2 (e,j) [51]. The analytical formulas and the MC simulations show good agreement, thus confirming the correctness of the models in Table 1. Figure 10a shows the calculated σ as a function of σ_G for a target number of bits N = 7. The results indicate that slicing and unary programming have a significant advantage over the other techniques, by approximately one order of magnitude. This result is confirmed in Figure 10b, showing the maximum number of bits N_max as a function of σ_G for σ = 1%. However, the unary and slicing techniques require several memory elements to encode the coefficients, which increases the energy consumption and chip area per bit. To address the precision/area trade-off, Figure 10c shows the bit density, evaluated as N_max divided by the number of memory cells, as a function of σ_G. The results demonstrate that slicing offers the best trade-off.

Array-Level Reliability Issues
Device variations and errors are not the only origin of accuracy degradation in IMC circuits. In large memory arrays, the interconnect lines are generally affected by non-ideal behavior due to parasitic resistance and capacitance that can deteriorate the analog signal integrity. In particular, the parasitic wire resistance introduces a significant error, due to the current-resistance (IR) drop on the array rows/columns, which results in a difference between the applied voltage and the voltage across each memory cell. Figure 11a shows a sketch of a 4 × 4 memory array highlighting the wire resistances, namely the input resistance R_in, the output resistance R_out and the row/column wire resistance r between adjacent memory cells [65]. Assuming, for instance, an inference operation with a typical read voltage V_read = 0.1 V and an average RRAM resistance R = 100 kΩ, corresponding to a current I = 1 µA for each cell, within a 100 × 100 crosspoint array with wire resistance r = 1 Ω, the overall IR drop along a line can be computed as rI + 2rI + 3rI + ... + NrI ≈ rIN^2/2 = 5 mV, corresponding to a 5% error in the summation current [66]. Note that this error does not include any contribution from variations in the RRAM conductance. To mitigate this effect, it is possible to increase the device resistance to decrease the current in the array wires. Unfortunately, large device resistances are usually more prone to variations, drift, fluctuations and noise [52,55]. Furthermore, a high cell resistance requires a longer readout time because of the CMOS noise in the sensing circuit, thus resulting in longer program/verify algorithms. Finally, a higher cell resistance could increase the RC delay time for charging the BL. IR drop can also be mitigated by changing the memory device topology, for example by inserting a current-controlled synaptic element [67], such as a Flash memory, an ionic transistor or a FeFET, as shown in Figure 11b. A three-terminal memory transistor can be programmed to various saturation currents, each representing a synaptic weight. By encoding the input in the gate pulse width and integrating the current flowing in each column, it is possible to perform a current-based computation where the resulting charge is given by:

Q_j = Σ_i I_ij t_i,

which corresponds to an MVM of a current matrix times a pulse-width vector. Figure 11c shows a comparison of the impact of IR drop for ohmic and saturated characteristics, demonstrating a much smaller impact for the current-controlled devices [67]. At the algorithmic level, IR drop and other non-idealities can be taken into account during training/inference or during the programming of the computational memory devices. For instance, parasitic-aware program-and-verify algorithms have been presented to minimize the impact of IR drop [65,68]. By iteratively programming the target conductance matrix G, resulting in the actual matrix G', evaluating the current i' = vG' and comparing it to the ideal current i = vG, a new target matrix can be computed and programmed. The algorithm is repeated until the error is reduced below a certain tolerance, e.g., ε = |i' − i| < 1%.
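The IR-drop estimate above can be verified with a few lines (values from the text):

```python
# Back-of-the-envelope IR-drop estimate using the values from the text:
# V_read = 0.1 V, R = 100 kOhm per cell (hence I = 1 uA), N = 100, r = 1 Ohm.
v_read = 0.1
r_cell = 100e3
r_wire, n = 1.0, 100

i_cell = v_read / r_cell                 # 1 uA per cell
# accumulated drop along one line: r*I + 2*r*I + ... + N*r*I ~ r*I*N^2/2
v_drop = r_wire * i_cell * n**2 / 2
err_pct = 100 * v_drop / v_read          # relative error on the summation current
```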
At the circuit and architecture level, various techniques have been proposed to mitigate the impact of IR drop. Typically, the synaptic weights of a neural network have a differential structure, so two separate 1T1R memory devices are used to represent a single weight. This is shown in Figure 12a, where two contiguous columns are used to represent positive and negative weights that are summed up in the digital domain [69]. This configuration results in a relatively large impact of the IR drop, since two array locations are used for each matrix entry. To mitigate this issue, Figure 12b shows a signed-weight 2-transistor/2-resistor (SW-2T2R) structure representing the positive and negative part of each weight, where the current summation is performed directly within the memory cell, thus effectively reducing the impact of IR drop by roughly a factor of 2 [69]. Another approach is to use small-size crosspoint arrays and/or organize the IMC architecture in computing tiles [70,71], where the overall problem is divided into smaller operations with smaller summation currents, hence a lower IR drop.
(Figure 13 captions: (b) Linear system solver, where i is the vector of the row input currents and v is the output vector of column voltages. (c) Regression problem solver performing v = −(X^T X)^{−1} X^T i, where i is the vector of the input row currents at the left crosspoint array, X is the matrix of conductances in the two crosspoint arrays and v is the output voltage of the second stage of amplifiers. (d) Analog CAM cell: a range is stored in the memory conductances M1 and M2, the ML is pre-charged and an analog input is applied to the DL line. If the analog input is within the acceptance range given by M1 and M2, the ML remains charged; otherwise it discharges. Reprinted from [31,73].)

MVM Accelerator
Figure 13a shows a circuit schematic implementing the MVM accelerator of Equation (1). The conductance matrix G is stored in the RRAM devices of the crosspoint array, and the input voltage vector V is applied with a digital-to-analog converter (DAC) connected to the rows of the array. The columns are connected to the sensing circuit, consisting of a trans-impedance amplifier (TIA) that converts currents into voltages and an analog-to-digital converter (ADC) that encodes the analog signals into digital words. The MVM is performed in a single step, irrespective of the matrix size, although the sensing operation is typically multiplexed among the columns to reduce the peripheral overhead [60]. Since the forward propagation in a neural network basically consists of extensive MVMs of activation signals multiplied by synaptic weights [74], MVM crosspoint circuits have been heavily used for accelerating both the inference [22,43] and the training [41,42] stage of neural networks. Since the neuron activation function is typically performed in the digital domain, various training algorithms have been developed, including supervised training [42], unsupervised learning [75] and reinforcement learning [76]. MVM can serve as the core operation of various neural networks, including fully-connected neural networks [43], convolutional neural networks (CNNs) [77] and recurrent neural networks, such as long short-term memory (LSTM) networks [78]. Integrated circuits comprising the crosspoint memory array for MVM and the routing/sensing units have been reported [24,79,80], demonstrating strong improvements in throughput and energy efficiency compared to conventional digital accelerators.
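The signal chain of Figure 13a can be sketched as follows; the conductances, TIA gain and ADC model are arbitrary assumptions for illustration, not the design of [60]:

```python
import numpy as np

# Toy signal chain: DAC voltages -> crosspoint MVM -> TIA -> ADC.
# G, R_tia and the ADC parameters are illustrative assumptions.
G = np.array([[100e-6, 50e-6],
              [ 25e-6, 75e-6]])
v_in = np.array([0.1, 0.2])          # DAC output voltages

i_out = G @ v_in                     # analog MVM (Kirchhoff current summation)
R_tia = 1e4                          # TIA trans-impedance (ohms), assumed value
v_tia = -R_tia * i_out               # inverting current-to-voltage conversion

def adc(v, bits=8, v_ref=1.0):
    # uniform quantizer on the voltage magnitude; real converters differ
    codes = np.round(np.abs(v) / v_ref * (2**bits - 1))
    return np.clip(codes, 0, 2**bits - 1).astype(int)

codes = adc(v_tia)
```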
Among recurrent neural networks, the Hopfield neural network (HNN) [81] provides a brain-inspired structure that is capable of storing and recalling attractors, thus allowing an associative memory to be realized in hardware [50,82,83]. Interestingly, the HNN can also naturally perform gradient descent, thus accelerating the solution of optimization problems such as constraint satisfaction problems (CSPs) [84]. By performing inference on an appropriate Hamiltonian representing the problem [85], the HNN converges to a stable state representing a minimum of the energy function of the problem, given by:

E = −(1/2) Σ_ij G_ij v_i v_j,

where G is the encoded matrix and v_i, v_j are the states of neurons i and j, respectively. However, the most difficult problems are usually not expressed by a convex energy landscape; thus, even the HNN cannot solve them directly, as the steady state remains trapped in a local minimum of the energy function. To prevent the HNN from locking into a local minimum, noise can be added to the system to perform simulated annealing [86], thus allowing the state to escape from local minima and converge to the global minimum corresponding to the solution of the optimization problem. Since memristive devices are inherently stochastic, various implementations combining MVM acceleration with noise generation have been proposed for accelerating the solution of CSPs [47][48][49][50][87,88]. Within this type of IMC accelerator, a major challenge is the extension of the circuit size to the scale of real-world intractable problems.
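A toy software analogue of simulated annealing on a Hopfield energy function is sketched below (a 3-neuron example with synthetic weights; real implementations exploit the intrinsic RRAM stochasticity as the noise source):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy simulated annealing on a 3-neuron Hopfield energy E = -1/2 v^T W v,
# with bipolar states v_i = +/-1. W is a synthetic symmetric weight matrix.
W = np.array([[ 0.0,  1.0, -1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0, -1.0,  0.0]])

def energy(v):
    return -0.5 * v @ W @ v

v = rng.choice([-1.0, 1.0], size=3)
T = 1.0
for _ in range(200):
    k = rng.integers(3)
    v_new = v.copy()
    v_new[k] *= -1.0                   # propose a single spin flip
    dE = energy(v_new) - energy(v)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        v = v_new                      # accept downhill moves, uphill with prob e^(-dE/T)
    T *= 0.97                          # annealing schedule: noise decreases over time
# after annealing, v typically settles in one of the ground states with E = -3
```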
The MVM circuit can also be used to accelerate the training of a neural network according to the concept of the outer product [91]. According to the back-propagation technique of stochastic gradient descent (SGD) training, the error between the true output and the actual output at a certain stage during training is used to update each synaptic weight according to the formula [74]:

Δw_ij = η x_i δ_j, (9)

where Δw_ij is the synaptic weight increase/decrease, η is the learning rate, x_i is the input or activation value and δ_j is the back-propagated error. Equation (9), which represents the outer product between the input vector x and the error vector δ, can be executed in hardware, e.g., by applying fixed-pulse-width pulses with amplitude proportional to x at the array rows and fixed-amplitude pulses with pulse width proportional to δ at the array columns [91]. This concept, which is the main idea behind the hardware acceleration of time- and energy-consuming DNN training, relies on the linearity of the weight update in both the time and the voltage amplitude, which is rarely demonstrated in practical memory devices [3,92].
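Equation (9) is a rank-1 outer-product update, which can be sketched as follows (all values illustrative):

```python
import numpy as np

eta = 0.1                            # learning rate (illustrative)
x = np.array([0.5, 1.0, 0.0])        # inputs/activations applied at the rows
delta = np.array([0.2, -0.1])        # back-propagated errors applied at the columns

# Equation (9): dW[i, j] = eta * x[i] * delta[j], a rank-1 outer-product
# update that the crosspoint applies in a single parallel step
# (pulse amplitude ~ x on the rows, pulse width ~ delta on the columns).
dW = eta * np.outer(x, delta)
```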

Analog Closed-Loop Accelerators
Analog in-memory circuits can solve algebraic problems without the need for digital iterations [28,29,38]. Figure 13b shows a circuit for the non-iterative solution of a system of linear equations [28]. Given a matrix problem Ax = b, it is possible to encode the coefficients of A in the conductance matrix G, while the input vector b is applied as a current vector i at the crosspoint rows, which are connected to the negative inputs of the operational amplifiers. Since the outputs of the operational amplifiers are connected to the array columns, Kirchhoff's law for the crosspoint array reads i + Gv = 0, where v is the output voltage vector on the columns and where we have neglected the input currents entering the high-impedance input nodes of the operational amplifiers. This leads to

v = −G^{−1} i,

which corresponds to the solution of the system of linear equations Ax = b. Note that the solution is obtained in one step, without iterations. It has been shown that, to a first approximation, the time complexity for solving linear systems in this circuit does not depend explicitly on the matrix size [93]; i.e., it displays O(1) complexity, since the speed of solution solely depends on the configuration of the poles, which are limited by the smallest eigenvalue of the conductance matrix G. The circuit can implement both positive and negative coefficients of the matrix A by adding an inverting buffer along each array column and connecting a second crosspoint array with the negative coefficients [28]. In addition to linear systems Ax = b, closed-loop crosspoint arrays allow one to solve eigenvector problems of the form (A − λI)x = 0, where λ is the principal eigenvalue and I is the identity matrix [38]. For instance, the PageRank algorithm [94], which is the backbone of many computing tasks for searching, ranking and recommendation, relies on the calculation of the principal eigenvector of a given link matrix. Similar to the linear system solution, the IMC-accelerated computation of
eigenvectors has a time complexity that does not depend explicitly on the matrix size, thus displaying O(1) time complexity [95]. Figure 13c illustrates the IMC circuit for the analog closed-loop solution of linear-regression problems [29]. Assuming a set of N data points where the values of M independent variables are stored in a matrix X of dimension N × M and y contains the values of the dependent variable, a regression consists of the calculation of the M coefficients α which minimize the square error of the matrix equation Xα = y. This problem is non-iteratively solved by the circuit in Figure 13c, where the input current vector i is applied to the rows of a rectangular crosspoint array that maps the matrix X; the rows are in turn connected to the negative input terminals of operational amplifiers A1 in transimpedance-amplifier (TIA) configuration with gain G_T. The TIA output terminals are connected to the rows of a second rectangular crosspoint array that also maps X. The columns of the second array are connected to the positive input terminals of operational amplifiers A2, whose output terminals are connected to the columns of the first array. Applying Kirchhoff's laws at the first array, the output voltage vector of the set of TIAs is v' = G_T(i + Xv), where v is the output voltage of the operational amplifiers A2. Since no current can flow into the high-impedance input terminals of the operational amplifiers A2, we can assume Xᵀv' = 0, which can be rearranged as v = −(XᵀX)⁻¹Xᵀi. Here, the output voltage v is given by minus the Moore-Penrose pseudo-inverse of X times the input vector i mapping y, which provides the least-squares solution α by minimizing the norm ||Xα − y|| [29]. Linear and logistic regression accelerators with the circuit configuration of Figure 13c have been demonstrated [29], where the applications can range from predicting the price of a house based on a set of descriptors to the training of the output layer of an extreme learning machine, a particular neural network with a wide and random
input layer and an output layer that can be trained with logistic regression [29].
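As a numerical sanity check, the fixed points of the two closed-loop circuits above can be reproduced with a few lines of linear algebra. The sketch below (matrix and vector values are arbitrary illustrations, not data from [28,29]) verifies that the node equation i + Gv = 0 yields the solution of Ax = b, and that the two-array regression circuit settles at the least-squares coefficients:

```python
import numpy as np

# Linear system solver (Figure 13b): encode A in the conductance matrix G
# and b in the input currents i. The feedback forces i + G v = 0, so the
# column voltages settle at v = -G^{-1} i, i.e. -x with A x = b.
A = np.array([[3.0, 1.0, 0.5],
              [1.0, 4.0, 1.0],
              [0.5, 1.0, 2.0]])
b = np.array([1.0, 2.0, 0.5])
v = -np.linalg.solve(A, b)      # circuit output voltages
x = -v                          # recovered solution of A x = b
assert np.allclose(A @ x, b)

# Regression solver (Figure 13c): the two-array loop settles at
# v = -(X^T X)^{-1} X^T i, minus the Moore-Penrose pseudo-inverse times i.
X = np.array([[1.0, 0.2],
              [1.0, 0.9],
              [1.0, 1.7],
              [1.0, 2.4]])
y = np.array([0.5, 1.1, 1.9, 2.6])
v_reg = -np.linalg.pinv(X) @ y
alpha = -v_reg                  # least-squares coefficients of X alpha ~ y
assert np.allclose(alpha, np.linalg.lstsq(X, y, rcond=None)[0])
```

Both assertions pass because the algebraic fixed points coincide with the direct and least-squares solutions; the analog circuits reach these fixed points in a single settling transient rather than through digital iterations.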

Analog CAM
In a conventional random access memory (RAM), an address is given as input and the stored word is returned as output. CAMs work in the opposite direction; i.e., the data content is provided as the input word, while its location in the memory (or address) is returned as output, thus serving as a data-search and data-matching circuit [96]. CAM is generally implemented with SRAM-based circuits, which can be relatively bulky and power-hungry; thus, RRAM-based CAMs have been proposed [97][98][99]. A distinctive advantage of RRAM with respect to SRAM is the possibility of multilevel CAM [72,100] and analog CAM [73], which allows the memory density to be significantly increased.
Figure 13d shows the schematic of an analog CAM cell [73] based on RRAM devices. By storing two different values in the conductances of RRAMs M1 and M2, pre-charging the match line (ML) and applying an analog search input to the data line (DL), the ML will remain charged only if the voltage on the DL is such that f(M1) < V_DL < f(M2). This property can be used to implement a multilevel CAM; e.g., if the stored number to be searched is x = 15, M1 could be set to the level corresponding to 14.5 and M2 to the level corresponding to 15.5, where the 0.5 range represents the acceptance tolerance within an error of LSB/2. Interestingly, analog CAMs have been used for accelerating machine learning tasks such as one-shot learning in memory-augmented neural networks [72] and tree-based models [101]. In the latter case, each threshold of a root-to-leaf path in a decision tree is mapped to an analog CAM row, thus allowing the inference to be accelerated in parallel across a large number of trees.
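The match condition of such a cell, and the mapping of a decision-tree path onto a multi-cell row, can be sketched in a few lines of Python (the function and variable names are illustrative, not taken from [73,101]):

```python
def acam_cell_match(v_dl, v_low, v_high):
    """Analog CAM cell: the pre-charged match line stays charged only if
    the data-line voltage lies inside the stored range (f(M1), f(M2))."""
    return v_low < v_dl < v_high

# Searching for x = 15 with an LSB/2 tolerance: M1 -> 14.5, M2 -> 15.5
assert acam_cell_match(15.0, 14.5, 15.5)
assert not acam_cell_match(16.2, 14.5, 15.5)

def acam_tree_row(features, lows, highs):
    """One root-to-leaf decision-tree path mapped to a CAM row: the row
    matches only if every feature falls within its per-node range."""
    return all(acam_cell_match(x, lo, hi)
               for x, lo, hi in zip(features, lows, highs))

# In hardware all rows (paths) are searched in parallel; here, sequentially.
assert acam_tree_row([0.3, 2.0], [0.0, 1.5], [0.5, 2.5])
assert not acam_tree_row([0.3, 3.0], [0.0, 1.5], [0.5, 2.5])
```

The hardware advantage comes from evaluating every row's match condition simultaneously in the analog domain, whereas this sketch checks the ranges one by one.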

Outlook on Memory Technologies and Computing Applications
For each specific application of analog IMC, such as the training of a neural network or the accelerated solution of algebra problems, different properties are required from a device and circuit perspective. To illustrate the device- and application-dependent requirements, Figure 14a shows a radar/spider chart summarizing the device properties in terms of cycling endurance, low-current operation, scaling and possibility of 3D integration, ohmic/linear conduction behavior, programming speed, analog precision and linear conductance update. Each property is shown on a relative scale for various IMC tasks, including algebra accelerators, DNN training/inference accelerators and spiking neural networks (SNNs). Among these computing applications, DNN training acceleration is the one with the highest demand for high-performance memory devices. This is because training acceleration relies on the synaptic weights being updated online, typically in parallel via the outer-product operation, which requires linearity in both time and voltage for accurate and fast convergence [41,102]. This property is extremely difficult to achieve with resistive memory technologies due to the tendency for an abrupt increase/decrease of conductivity, followed by a saturating change of conductance [3,92]. A typical figure of merit for linearity is the exponent n_pot in the formula G = G_min + G_0 (1 − e^(−n_pot·p)), which describes the increase in conductance G as a function of the number p of potentiation pulses, where G_min and G_0 are fitting parameters. A similar exponent n_dep describes the linearity under depression. Parameters n_pot and n_dep should generally have similar values to allow for symmetric potentiation/depression, which is another key requirement for online training. Previous simulation results have shown that the recognition accuracy can range between about 82% for PCMO-based RRAM and a highest possible accuracy of about 94% for perfectly linear and symmetric characteristics in the case of a 3-layer neural network for MNIST
recognition [103]. Online training accelerators also generally require a high programming speed and cycling endurance, due to the programming-intensive update of the synaptic elements. On the one hand, the gradual set/reset operation reduces the stress on the memory device compared with the full set/reset of binary memory applications. On the other hand, the ability for analog-type programming generally degrades with cycling [104,105]. Analog closed-loop algebra accelerators also place a high demand on device properties, including highly-linear conduction characteristics to prevent unstable and oscillatory behavior of the system. DNN inference accelerators show less stringent requirements, thanks to the mostly-read operation of the memory array for accelerating the MVM, while a relatively small number of program/verify operations is needed to reconfigure the system for a new AI task, which considerably reduces the requirements in terms of endurance, programming speed and linear weight update. In the case of non-ohmic conduction, the array can be operated in a shift-and-add fashion, such that the input is applied as a digital word and the output is reconstructed with post-processing [23]. Non-linear characteristics are generally observed in RRAM devices at low conductance, a regime that is essential for all computing schemes. When a device is programmed in the low-conductance range, close to the HRS, transport typically takes place via Poole-Frenkel phenomena, which have a non-linear dependence on voltage [106]. The desired conductance range for parallel MVM is generally below 10 µS, which would allow for an overall error due to IR drop of around 5% for a 100 × 100 array (see Section 5). Achieving a lower conductance in the sub-µS range would enable the scaling-up of the computational array, with advantages in throughput, energy efficiency and area efficiency due to the smaller peripheral circuits.
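To make the linearity figure of merit concrete, the sketch below generates a synthetic potentiation trace and extracts n_pot by a simple grid search, assuming the saturating-exponential form G = G_min + G_0 (1 − e^(−n_pot·p)) for Equation (12); the model choice and all numerical values are illustrative assumptions, not measured data from [103] or [118]:

```python
import numpy as np

def g_pot(p, g_min, g0, n_pot):
    # Assumed potentiation model: smaller n_pot gives a more gradual,
    # more linear conductance increase before saturation at g_min + g0.
    return g_min + g0 * (1.0 - np.exp(-n_pot * p))

pulses = np.arange(50.0)
rng = np.random.default_rng(0)
# Synthetic "measured" trace (conductance in uS) with true n_pot = 0.35
g_meas = g_pot(pulses, 2.0, 8.0, 0.35) + rng.normal(0.0, 0.05, pulses.size)

# Least-squares grid search for n_pot (g_min and g0 assumed known here)
grid = np.linspace(0.05, 2.0, 400)
sse = [np.sum((g_pot(pulses, 2.0, 8.0, n) - g_meas) ** 2) for n in grid]
n_pot_fit = grid[int(np.argmin(sse))]
print(round(float(n_pot_fit), 2))   # close to the true value 0.35
```

The same fit applied to a depression trace yields n_dep; comparing the two extracted exponents quantifies the potentiation/depression asymmetry discussed above.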
Another key general requirement is the precision of conductance, i.e., ensuring a low σ_G. For the case of inference accelerators, it has recently been shown that the network accuracy decreases only from 91.6% to 91.2% for a σ_G between 0 and 10 µS for a 2-layer perceptron for MNIST recognition [57]. Studies on deeper networks have indicated that the sensitivity to conductance variation can vary widely depending on the specific neural network [107]. For instance, a ResNet model shows an increasing sensitivity to conductance variation for an increasing number of layers, which can be understood as error accumulation in the forward propagation. On the other hand, the AlexNet CNN shows a decreasing sensitivity at increasing size of the convolutional filters, due to error averaging within larger filters.
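The impact of σ_G on a single MVM can be estimated with a short Monte Carlo experiment; the array size, conductance window and error levels below are arbitrary illustrations, not the conditions of [57] or [107]:

```python
import numpy as np

rng = np.random.default_rng(42)

# Nominal 100x100 conductance matrix (uS) and one random input vector (V)
G = rng.uniform(1.0, 10.0, (100, 100))
v = rng.uniform(0.0, 0.2, 100)
i_ideal = G @ v

# Perturb each conductance with a Gaussian programming error of std sigma_g,
# clipping at zero since a conductance cannot be negative.
rel_err = {}
for sigma_g in (0.1, 0.5, 1.0):                     # uS
    trials = []
    for _ in range(50):
        G_prog = np.clip(G + rng.normal(0.0, sigma_g, G.shape), 0.0, None)
        trials.append(np.linalg.norm(G_prog @ v - i_ideal)
                      / np.linalg.norm(i_ideal))
    rel_err[sigma_g] = float(np.mean(trials))

# The relative MVM error grows roughly in proportion to sigma_g
assert rel_err[0.1] < rel_err[0.5] < rel_err[1.0]
```

Such a column-level error model captures only a single MVM; as noted above, how these errors accumulate or average out across layers depends on the network architecture.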
SNNs show the most relaxed requirements thanks to neuro-biological frequencies in the 100 Hz range, which significantly relax the demand in terms of programming speed. Furthermore, update/conduction non-linearity and stochasticity are generally well tolerated or even potentially harnessed to perform brain-inspired adaptation and computation [50,108]. All applications generally require high scalability and 3D integration of the memory elements to take advantage of a high density of information for data-intensive computing. For instance, a recent neural network model for natural language processing (NLP) called generative pre-trained transformer 3 (GPT-3) includes 175 billion parameters, which approximately corresponds to 175 GB of memory devices [109].
Figure 14b shows the figures of merit for two-terminal devices, namely RRAM, phase change memory (PCM) [33,110,111] and spin-transfer torque (STT) magnetic random access memory (MRAM) [112]. RRAM and PCM show comparable properties, the main difference being the analog precision, which is typically lower in PCM devices because of drift phenomena [113]. STT-MRAM offers high programming speed, good endurance and highly-ohmic conduction [11]; however, the conductance is generally limited to two states, corresponding to the parallel and antiparallel magnetic polarizations of the magnetic tunnel junction. As a result, the use of STT-MRAM devices is limited to digital computing, such as binarized neural networks (BNNs) [114,115]. Figure 14c shows the relative performance of three-terminal devices for accelerating analog IMC, including both CMOS-based and memristive technologies [116]. Static random access memory (SRAM) is typically limited to digital operation, whereas Flash, the ferroelectric field-effect transistor (FEFET) [117] and electrochemical random access memory (ECRAM) [118] show well-tunable analog conductance and low-current operation. Compared to 2-terminal devices, transistor-type memory devices can display a lower conductance thanks to the sub-threshold operation regime [119]. The relatively low conductance of ECRAM transistors can be explained by the use of low-mobility channels, usually consisting of metal oxides such as WO3 [120] and TiO2 [121]. ECRAM devices have also shown well-tunable conductance levels, which translates into a large number of multilevel states [120]. The precision of conductance can be attributed to the bulk-type conduction process within the switchable metal-oxide channel, as opposed to the filamentary conduction in typical 2-terminal RRAM [121]. The weight of ECRAM can be updated with extremely high linearity, thus offering an ideal device for online training accelerators [118]. For instance, the non-linearity exponents in
Equation (12) are n_pot = 0.347 and n_dep = 0.268 in Li-based ECRAM, compared to a minimum n_pot of about 2 for most RRAM and PCM devices [118]. While CMOS technology has a limited capability for 3D integration, the memristive FEFET and ECRAM seem most suitable for high-density, back-end-integrated memory arrays for the analog computation of large-scale problems.

Conclusions
This work reviews the status and outlook of in-memory computing with RRAM devices. The RRAM device concept and the programming techniques aimed at high-precision analog conductance are presented. The possible codings of computational coefficients in the RRAM, including binary, unary and various multilevel approaches, are compared in terms of precision and circuit area. The typical challenges for the analog precision of conductance are discussed, in terms of both device reliability (programming variations, drift and time-dependent fluctuations) and circuit-level parasitic resistance leading to IR-drop errors. The most relevant analog IMC circuit primitives, including MVM, linear algebra accelerators and CAM, are presented. Finally, RRAM is compared to other computational memory devices in terms of reliability, precision, low-current operation and scaling. From this overview, RRAM appears as one of the most mature and most promising technologies, although significant challenges remain in reducing the operating current and controlling time-dependent fluctuations and programming variations.

Figure 1.
Figure 1. A conceptual illustration of the different scales of in-memory computing. Computing relies on fundamental physical laws that are implemented in various types of memory devices and circuit designs. To perform large-scale computation, new architectures have to be developed for accelerating real-world applications. New applications can also arise given the possibility of performing highly-parallel computation. The design and optimization of each level should be performed by considering the whole hierarchical stack.

Figure 2.
Figure 2. RRAM structure and operation. (a) RRAM device made of a metallic TE and BE, with an interposed dielectric layer. After a forming procedure, it is possible to set/reset the device and switch from the LRS to the HRS and vice versa. (b) Typical I-V curve of an RRAM device in 1T1R configuration for increasing gate voltage V_G during the set operation, demonstrating analog programmability. Adapted from [31,38].

Figure 3.
Figure 3. Crosspoint memory structure to perform analog MVM. At the intersection of each TE line (orange) with each BE line (grey), an RRAM device is placed (blue). By programming the RRAM conductances to the matrix entries of A and applying a voltage vector V on the columns, the current flowing into each grounded row j is, according to Kirchhoff's law, I_j = ∑ⁿᵢ₌₁ G_ij V_i. Adapted from [29].

Figure 4.
Figure 4. Analog programming with set pulses at increasing I_C according to the IGPP algorithm. (a) Conceptual schematic of the IGPP algorithm. (b) Conductance as a function of pulse number for multiple iterations (grey lines) and the average behavior (blue). (c) Cumulative distribution function (CDF) of the conductance for increasing gate voltage V_G. (d) Standard deviation of the conductance σ_G as a function of the average conductance G. Adapted from [51].

Figure 5.
Figure 5. Analog programming with reset pulses. (a) Conceptual schematic of the IRPP algorithm. (b) Conductance as a function of pulse number for multiple iterations (grey lines) and the average behavior (blue). (c) Cumulative distribution function (CDF) of the conductance for increasing stop voltage V_STOP. (d) Standard deviation of the conductance σ_G as a function of the average conductance G for the IRPP and IGPP algorithms. Adapted from [51].

Figure 6.
Figure 6. Program and verify algorithm. (a) Conceptual schematic of the ISPVA program and verify algorithm. (b) Mean conductance as a function of set voltage V_TE for multiple values of the gate voltage V_G. (c) Probability density function (PDF) of the programmed conductance levels. Reprinted from [43,57].

Figure 7.
Figure 7. Drift and fluctuations in RRAM devices. (a) Conductance as a function of time for 4 different analog levels after heating at 150 °C. (b) Effect of different fluctuations as a function of time. (c) Cumulative distributions of 5 programmed levels before and after annealing at T = 125 °C. (d) Conceptual schematic of the neural network used to evaluate the effect of drift and its accuracy in classifying the MNIST dataset before (e) and after (f) annealing. Adapted from [43,58,59].

Figure 9.
Figure 9. Comparison between the analytical formula (top) and MC simulations (bottom) of the standard deviation of the programming error σ as a function of the number of bits N and the standard deviation of the conductance σ_G for multilevel (a,f), binary (b,g), unary (c,h), multilevel with redundancy factor M = 4 (d,i) and slicing in M = 2 units (e,j). Adapted from [51].

Figure 10.
Figure 10. Figures of merit of various programming strategies. (a) Standard deviation of the overall programming error σ as a function of the conductance error σ_G, assuming N = 7 bits. (b) Maximum number of bits N_max as a function of the conductance error σ_G, considering a programming error σ = 1%. (c) Bit density as a function of the conductance error σ_G. Adapted from [51].

Figure 11.
Figure 11. IR drop in crosspoint arrays. (a) Explicit representation of the parasitic resistances in a crosspoint array, namely the input resistance R_in, the output resistance R_out and the wire resistance r. (b) Memory array with current-controlled memory elements programmed to various saturation currents. (c) Comparison of the impact of IR drop for ohmic and saturated characteristics. Reprinted from [67].

Figure 12.
Figure 12. Bipolar conductance mapping and IR drop. (a) Typical bipolar conductance mapping in two adjacent 1T1R columns, with the positive one (red) encoding the positive part of the weights and the negative one (blue) encoding the negative part. The currents are then subtracted in the digital domain after conversion. (b) To reduce the impact of IR drop, the conductances representing the positive and negative weights can be summed in the analog domain with a dedicated ST-2T2R circuit structure. Reprinted from [69].

Figure 13.
Figure 13. Schematic of IMC circuits for various applications. (a) MVM accelerator performing I = AV, where V is the input voltage vector, A is the conductance matrix stored in the crosspoint array and I is the vector of currents in each row. (b) Linear system solver performing V = A⁻¹I, where I is the vector of row input currents and V is the output vector of column voltages. (c) Regression problem solver performing v = −(XᵀX)⁻¹Xᵀi, where i is the vector of input row currents at the left crosspoint array, X is the matrix of conductances in the two crosspoint arrays and v is the output voltage of the second stage of amplifiers. (d) Analog CAM cell: a range is stored in the conductances M1 and M2, the ML is pre-charged and an analog input is applied to the DL. If the analog input is within the acceptance range given by M1 and M2, the ML remains charged; otherwise, it discharges. Reprinted from [31,73].

Figure 14.
Figure 14. Application requirements (a) and figures of merit for various memory technologies with 2-terminal (b) and 3-terminal (c) structures.

Table 1.
Table 1. Comparison of various mapping schemes in terms of conductance range, variability-induced error and resulting maximum number of programmable bits.