Compute-in-Memory for Numerical Computations

In recent years, compute-in-memory (CIM) has been extensively studied to improve the energy efficiency of computing by reducing data movement. At present, CIM is frequently used in data-intensive computing. Data-intensive computing applications, such as all kinds of neural networks (NNs) in machine learning (ML), are regarded as ‘soft’ computing tasks. The ‘soft’ computing tasks are computations that can tolerate low computing precision with little accuracy degradation. However, ‘hard’ tasks aimed at numerical computations require high-precision computing and are also accompanied by energy efficiency problems. Numerical computations exist in lots of applications, including partial differential equations (PDEs) and large-scale matrix multiplication. Therefore, it is necessary to study CIM for numerical computations. This article reviews the recent developments of CIM for numerical computations. The different kinds of numerical methods solving partial differential equations and the transformation of matrixes are deduced in detail. This paper also discusses the iterative computation of a large-scale matrix, which tremendously affects the efficiency of numerical computations. The working procedure of the ReRAM-based partial differential equation solver is emphatically introduced. Moreover, other PDEs solvers, and other research about CIM for numerical computations, are also summarized. Finally, prospects and the future of CIM for numerical computations with high accuracy are discussed.


Introduction
Different from data-intensive computing (typically all kinds of neural networks), highprecision computing is aimed at accurate numerical computations like large-scale matrix multiplication and partial differential equations (PDEs). At present, neural networks (NNs) have a lot of applications and are widely used in daily life. However, applications of high-precision computing could not be solved by NNs, whether in scientific research or in the actual scene. The requirements of throughput and energy efficiency for computing are constantly improving; therefore, CIM (computing-in-memory) is proposed as the solution of the Von Neumann bottleneck. The progress of CIM for numerical computations has great value in finance, engineering, computer science and other disciplines. It is ubiquitous in the field of scientific research and engineering. For example, improving the physical authenticity of virtual reality (VR), analyzing SIS infectious diseases with age structure, studying the BSM equations of derivative pricing theory, preprocessing and extracting image information and many other practical problems involve partial differential equations.

The Development of ReRAM as NVM
With the explosive growth of portable electronic devices, the requirement and storage capacity of memory devices have increased rapidly. Higher density, faster speed and lower cost have become the goal of new memory devices. ReRAM, as a kind of nonvolatile memory (NVM) [8], was regarded as one of the continuations of NAND flash memory [9], though there were many emerging nonvolatile memory (eNVM) devices, such as phase-change memory (PCM) [10], magnetic random-access memory (MRAM) [11] and ferroelectric random-access memory (FeRAM) [12], over the same period. Table 1 lists the types of NVMs and the category of ReRAM. ReRAM is a two-terminal device with a variable resistance based on a physical mechanism of conducting filament formation and rupture [13]. According to the types of filamentary, ReRAM can be divided into oxide ReRAM (OxRAM) and conductive bridge ReRAM (CBRAM) [14]. ReRAM changes between high-resistance states (HRS) and lowresistance states (LRS) under different operating conditions, representing logic 0 and 1, separately. The formation of the conducting filament corresponds to the LRS, and the HRS is the opposite. Figure 1a is the structure of the OxRAM, and there is a metal oxide material between the two electrodes in the OxRAM. When a positive voltage is applied between the top electrode (TE) and the bottom electrode (BE), a conductive filament is formed between the two electrodes. While in Figure 1b, the electrode of the CBRAM is injected with copper or silver metal (Cu or Ag). Moreover, the CBRAM forms the conductive bridge by diffusing Cu or Ag into the oxide or chalcogenide (like GeS 2 ). When the voltage is sufficiently positive, there will be the oxidation of Cu or Ag at TE, and they will be reduced and deposited at BE. When the voltage transfers to the negative, there will be the reduction of Cu or Ag at Micromachines 2022, 13, 731 3 of 16 TE, and then the conductive filament connecting the two electrodes, and the state of the CBRAM, changes from HRS to LRS [15].
Micromachines 2022, 13, x FOR PEER REVIEW 3 of 16 positive, there will be the oxidation of Cu or Ag at TE, and they will be reduced and deposited at BE. When the voltage transfers to the negative, there will be the reduction of Cu or Ag at TE, and then the conductive filament connecting the two electrodes, and the state of the CBRAM, changes from HRS to LRS [15].
(a) (b) Since 2000, the research on ReRAM have explosively increased. The first NiOx-based ReRAM with promising device characteristics and reliability was proposed by I. Baek in 2005 [16]; the HfO2/Ti device was made with fully conventional fab materials [17]; the 3D vertical ReRAM emerged in 2009 [18]; the 10 × 10 nm 2 Hf/HfOx crossbar resistive RAM was produced in 2011 [19]; the first 16-Gb ReRAM integrated chip with copper oxide material [20], etc. However, because of the 15 nm critical dimension (CD) and the development of 3D NAND flash memory, using the ReRAM in high-density applications became more and more difficult.

ReRAM in CIM
In recent years, compute-in-memory widely emerged in machine learning (ML) and data-intensive computing. CIM is an effective method to break the Von Neumann bottleneck when computing large-scale data [21], which can achieve high speed and low-power computing by reducing data handling. Edge AI applications based on deep neural networks (DNNs) are designed to find the solution to achieving portable, fast, accurate and convenient computing. Computing efficiency (defined as terra-operations-per-secondper-millimeter-squared, TOPS/mm 2 ) and energy efficiency (defined as terra-operationsper-second-per-watt, TOPS/W) are the two most significant parameters to measure the performance of computing. For digital neural network accelerators, multiplication and addition are calculated in the processing element (PE). However, the global buffer or cache is urgently needed to store the weights and the inputs/outputs, which increases the data storage and handling. There is quite a lot of research on optimizing data flow at the chip, micro and SOC (system on chip) levels, but computation and memory are all separated, which leads to efficiency degradation. The memory not only stores the weights and the inputs but also achieves analog computation. That is what computing-in-memory means. Compared with traditional digital signal accelerators, CIM as a mix-signal Since 2000, the research on ReRAM have explosively increased. The first NiO x -based ReRAM with promising device characteristics and reliability was proposed by I. Baek in 2005 [16]; the HfO 2 /Ti device was made with fully conventional fab materials [17]; the 3D vertical ReRAM emerged in 2009 [18]; the 10 × 10 nm 2 Hf/HfO x crossbar resistive RAM was produced in 2011 [19]; the first 16-Gb ReRAM integrated chip with copper oxide material [20], etc. However, because of the 15 nm critical dimension (CD) and the development of 3D NAND flash memory, using the ReRAM in high-density applications became more and more difficult.

ReRAM in CIM
In recent years, compute-in-memory widely emerged in machine learning (ML) and data-intensive computing. CIM is an effective method to break the Von Neumann bottleneck when computing large-scale data [21], which can achieve high speed and low-power computing by reducing data handling. Edge AI applications based on deep neural networks (DNNs) are designed to find the solution to achieving portable, fast, accurate and convenient computing. Computing efficiency (defined as terra-operations-per-secondper-millimeter-squared, TOPS/mm 2 ) and energy efficiency (defined as terra-operationsper-second-per-watt, TOPS/W) are the two most significant parameters to measure the performance of computing. For digital neural network accelerators, multiplication and addition are calculated in the processing element (PE). However, the global buffer or cache is urgently needed to store the weights and the inputs/outputs, which increases the data storage and handling. There is quite a lot of research on optimizing data flow at the chip, micro and SOC (system on chip) levels, but computation and memory are all separated, which leads to efficiency degradation. The memory not only stores the weights and the inputs but also achieves analog computation. That is what computing-in-memory means. Compared with traditional digital signal accelerators, CIM as a mix-signal processing tremendously increases throughput, area efficiency and energy efficiency, but with the decline of accuracy. Though the requirement of analog-to-digital converters (ADCs) is inevitable, CIM still has enormous appeal in power consumption, no matter whether now or in the future.
Because of the resistive properties of ReRAM, ReRAM could be a natural electrical multiplier, with the function of storage following Ohm's laws and Kirchhoff's current laws (KCL). Utilizing I = V·G and the sum of the current, multiplication and accumulation are calculated by the ReRAM array in the analog domain, respectively. Obviously, eNVM, including resistive random-access memory, is an excellent memory device for CIM, and other eNVM devices such as PCM, MRAM and FeRAM are also of interest for CIM [22]. ReRAM is a better choice for compute-in-memory because of the 22 nm high reliability and the compatibility with the complementary metal-oxide-semiconductor (CMOS) process at present. In addition, ReRAM could potentially offer multi-bit per cell capability [22]. Figure 2a,b shows the principle of calculation in the ReRAM crossbar array. One terminal of each RRAM is connected to the bit line (BL) collecting the current, and the other is connected to the word line (WL) as the input of the voltage. Additionally, the currents through the BLs are added as the outputs of the ReRAM array to achieve the accumulation. One of the outputs could be given by I j = ∑ V i ·G i,j (where I j is the current of the jth column, V i is the input voltage of the ith row, and G i,j is the conductance of the ith row and jth column in ReRAM array). Due to the fact that the outputs are the analog currents, ADC is an indispensable part of the ReRAM crossbar peripheral circuit.

ReRAM Crossbar
processing tremendously increases throughput, area efficiency and energy efficiency, but with the decline of accuracy. Though the requirement of analog-to-digital converters (ADCs) is inevitable, CIM still has enormous appeal in power consumption, no matter whether now or in the future.
Because of the resistive properties of ReRAM, ReRAM could be a natural electrical multiplier, with the function of storage following Ohm's laws and Kirchhoff's current laws (KCL). Utilizing = • and the sum of the current, multiplication and accumulation are calculated by the ReRAM array in the analog domain, respectively. Obviously, eNVM, including resistive random-access memory, is an excellent memory device for CIM, and other eNVM devices such as PCM, MRAM and FeRAM are also of interest for CIM [22]. ReRAM is a better choice for compute-in-memory because of the 22 nm high reliability and the compatibility with the complementary metal-oxide-semiconductor (CMOS) process at present. In addition, ReRAM could potentially offer multi-bit per cell capability [22]. Figure 2a,b shows the principle of calculation in the ReRAM crossbar array. One terminal of each RRAM is connected to the bit line (BL) collecting the current, and the other is connected to the word line (WL) as the input of the voltage. Additionally, the currents through the BLs are added as the outputs of the ReRAM array to achieve the accumulation. One of the outputs could be given by = ∑ • , (where is the current of the th column, is the input voltage of the th row, and , is the conductance of the th row and th column in ReRAM array). Due to the fact that the outputs are the analog currents, ADC is an indispensable part of the ReRAM crossbar peripheral circuit.

Partial Differential Equation
A partial differential equation is any equation with a function of multiple variables and their partial derivatives [23]. The function : and the Partial differential equation can be defined as: , , ⋯ , , , , , ⋯ , , , ⋯ = 0 Typical results of partial differential equations have two forms: the analytical solution and the numerical solution. The analytical solution that can be expressed by an analytical expression is an exact combination of finite common operations. Given any

Partial Differential Equation
A partial differential equation is any equation with a function of multiple variables and their partial derivatives [23]. The function u: and the Partial differential equation can be defined as: Typical results of partial differential equations have two forms: the analytical solution and the numerical solution. The analytical solution that can be expressed by an analytical expression is an exact combination of finite common operations. Given any independent variable, its dependent variable could be solved, so the analytical solution is also known as the closed-form solution. The numerical solution needs to be calculated iteratively from the boundary condition step-by-step [24], and it is the emphasis of the method in PDEs solver research. With the decrease of the step size, the numerical solution will be more accurate. There are multiple numerical methods, including the Euler method, Runge-Kutta, finite-difference [25], finite-element method [26] and finite-volume method [23].

Finite-Difference Method
The principle of the finite-difference method (FDM) is understandable: converting the continuous problem to its corresponding discrete form and getting results within a finite number of calculations. The core mechanism of FDM is to approximate the partial derivatives at each point using its nearby values based on Taylor's theorem. There are three basic steps in the finite-difference method: (1) Regional discretization. According to the appropriate step size, the domain that needs to be calculated is divided into finite grids and using the function values on discrete grid points to approximate the continuous function values.
(2) Transformation of partial differential equations. Using the difference coefficient to approximate the exact derivatives. (3) Solution of partial differential equations. Bringing the boundary conditions into the equation and repeating calculations to solve a large number of equations.
The first-order finite-difference of g(x) of variable x can be defined as: where ∆x is step size, or the so-called spacing between two grid points, and the first-order difference coefficient of g(x) of variable x can be defined as: So that the forward difference coefficient, backward difference coefficient and central difference coefficient can be expressed as: The finite-difference method uses the difference coefficient to approximate the exact derivative. Similarly, the second-order difference coefficient can be expressed as: Taking a simple one-dimensional heat diffusion equation without a heat source as an example, the equation and boundary conditions are as follows: where u(x, t) is the temperature at grid point x at time t, a 2 is the thermal diffusivity, f (x, t) is the heat source and v1 and v2 are the constant boundary conditions at grid point a and b, respectively. Then, the solution domain is divided into finite grids in Figure 3. The step size in the t-axis direction is taken as ∆t, and the step size in the x-axis direction is taken as ∆x. Therefore, According to (4) and (6), the continuous function (7) will change to the function of values on discrete grid points: Excluding the boundary values, when = ∆ , the actual temperature of each point is The above system of linear equations contains − 1 equations, and we can rewrite them into a matrix equation: (13) can also be expressed as: where is the coefficient matrix, is the matrix at time and ∆ is the matrix at time + ∆ . The one-dimensional equation is the most ideal and simplest physical scenario. However, researchers and engineers need to solve the problems of two-dimensional or even multidimensional space aimed at the actual requirements. At this time, the two-point difference method is not suitable. As an example, there is a two-dimensional equation, a Laplace equation: According to (4) and (6), the continuous function (7) will change to the function of values on discrete grid points: Excluding the boundary values, when t = ∆t, the actual temperature of each point is The above system of linear equations contains N − 1 equations, and we can rewrite them into a matrix equation: (13) can also be expressed as: where A is the coefficient matrix, U t is the matrix at time t and U t+∆t is the matrix at time t + ∆t. The one-dimensional equation is the most ideal and simplest physical scenario. However, researchers and engineers need to solve the problems of two-dimensional or even multidimensional space aimed at the actual requirements. At this time, the two-point difference method is not suitable. As an example, there is a two-dimensional equation, a Laplace equation: According to (6), the Laplace equation can be written as: where the step size in the x-axis direction is taken as ∆x, and the step size in the y-axis direction is taken as ∆y. In Figure 4, the function value of each point can be calculated by the function values of its four neighboring points, and the whole domain is divided into N × N grid points.
According to (6), the Laplace equation can be written as: where the step size in the x-axis direction is taken as ∆ , and the step size in the y-axis direction is taken as ∆ . In Figure 4, the function value of each point can be calculated by the function values of its four neighboring points, and the whole domain is divided into × grid points.
Thus, (15) can be transformed into a matrix form as: Consequently, (18) can also be expressed as: where is the coefficient matrix, is the matrix in grid point of ( , ) and is the matrix composed of the boundary conditions.

Runge-Kutta Method
The Runge-Kutta method includes two kinds of methods: second-order Runge-Kutta and fourth-order Runge-Kutta. The equation in the grid point of (i, j) can be expressed as: Thus, (15) can be transformed into a matrix form as: where Consequently, (18) can also be expressed as: where A is the coefficient matrix, U is the matrix in grid point of (i, j) and U b is the matrix composed of the boundary conditions.

Runge-Kutta Method
The Runge-Kutta method includes two kinds of methods: second-order Runge-Kutta and fourth-order Runge-Kutta.
Second-Order Runge-Kutta: k 1 and k 2 can be calculated by the following equations, The solution of the PDE can be expressed as: Fourth-Order Runge-Kutta: k 1 , k 2 , k 3 and k 4 can be calculated by the following equations, The solution of the PDE can be expressed as:

Matrix Iterative Methods
After getting the matrix equations, the next main task is the iterative computation of a large-scale matrix, and common iterative methods include the Jacobi method, the Guass Seidel method and the SOR method.

Jacobi Method
The principle of the Jacobi method is to disassemble the coefficient matrix A into a diagonal matrix D, a negative upper triangular matrix U and a negative lower triangular matrix L. Consequently, A can be written as: . . . . . . . . . 0 −a n,1 · · · −a n,n−1 0 For an equation of a matrix like B = A·X, replacing the coefficient matrix A, it will change to: After the iterative calculation, the calculation result of the (K + 1)th iteration is

Guass Seidel Method
The principle of the Guass Seidel method is similar to the Jacobi method, where the difference is the derivation process, the Guass Seidel method rewrites B = A·X into: After the iterative calculation, the calculation result of the (K + 1)th iteration is In most cases, the Guass Seidel method converges faster than the Jacobi method, and only one set of storage units is needed to store (D − L) −1 , but D −1 (L + U) and D −1 are required to store in the Jacobi method.

SOR Method
Based on the Guass Seidel method, a convergence factor ω is added to the SOR method in order to improve the convergence speed. The calculation result of the (K + 1)th iteration is

Krylov Subspace Method
For an equation of a matrix like B = A·X, the result X can be expressed as A −1 ·B directly. However, if the matrix A has a large size or is a sparse matrix, A −1 will be very hard to solve. The principle of the Krylov subspace method is to approximate A −1 ·B.
where β 0 , β 1 , β 2 , · · · , β m−1 are unknown coefficients, the step size m is related to the accuracy of the approximation and is less than the dimension of matrix A. After the discussion above, the number of iterations used by the Jacobi method, the Guass Seidel method and the SOR method is decreasing, which also means the improving of the calculation accuracy. Additionally, in terms of the hardware consumption and hardware implementation, the Guass Seidel method is better than the others. Moreover, the Krylov subspace method sacrifices accuracy to improve speed and is generally used large-scale matrixes. Relatively speaking, the SOR method is the most accurate and efficient method. However, the matrixes could not be inputted into the ReRAM arrays iteratively as weights. Calculating the matrixes directly will waste computing power in the multiplication of zero elements, because the matrixes used in numerical computations are generally sparse matrixes which include more than 60% of zero elements.

Rearrangement and Split
Rearrangement and split are proposed to solve the multiplication of sparse matrixes whose majority of elements are zero elements. There are a number of studies focusing on cutting or splitting matrixes to improve computational efficiency, while many matrixes cannot be cut or split directly all the time. Therefore, rearrangement of sparse matrixes is needed to change the layout of matrix elements. In sparse matrix-vector multiplication (SpMV), the rearrangement matrix and the splitting matrix can replace sparse matrixes with dense matrix operations in many cases, which can greatly save memory and reduce computational overhead [27]. Moreover, the methods reducing the bandwidth of sparse matrixes in SpMV are quite useful for matrixes got from the PDEs. Taking a 1024 × 1024 sparse matrix as an example in Figure 5, the gray part of the matrix is composed of zero elements, and the blue part is composed of nonzero elements. Firstly, the 1024 × 1024 sparse matrix is rearranged to a diagonal aggregation matrix (not a diagonal matrix). In addition, it is rewritten as a combination of B and I. At last, the matrix is divided into four types of 16 × 16 slices (a, b, c, d).
Micromachines 2022, 13, x FOR PEER REVIEW 10 of 16 addition, it is rewritten as a combination of and . At last, the matrix is divided into four types of 16 × 16 slices (a, b, c, d). According to the discussion of the matrix iterative methods in Section 3.2, the number of iterations used by the Jacobi method, the Guass Seidel method and the SOR method is decreasing. Though the Guass Seidel method has a lower hardware consumption and the SOR method has the highest computational efficiency, utilizing the Jacobi method with the splitting matrix can not only improve the accuracy but also cut down the number of iterations efficiently. Because the principal concern is which method has fewer zero elements, the Jacobi method is still the best choice for the CIM-based PDEs solver at present.

CIM-Based Partial Differential Equation Solver
With the discussion of matrix iterative methods, the problem of the PDEs will be changed to the multiplication of the large-scale matrix. Therefore, the essence of the PDEs Solver is to achieve the high-performance multiplication. The CIM-based PDEs Solver could be composed with input drivers, shifters, adders, a computing array and DAC/ADCs. Matrix will be stored in the array as weight, and matrix will be entered into the array as input. Because of the limited size and precision of the array, the array maps a single column of the matrix to serval columns of an array. After the shifters and adders, the results of the serval columns will be collected and then be quantified in ADCs. The ADCs in the CIM PDEs solvers are usually SAR ADCs to balance the area, power and speed. Recently, CIM-based partial differential equation solvers can be based on different CIMs, including ReRAM, SRAM, flash memory and PCM. Each of them has advantages and their own suitable calculating methods and circuits.

ReRAM-Based Partial Differential Equation Solver
The coefficient matrix has been divided into several slices, and next they will be written as resistance values into the ReRAM array. Matrix will be entered into the ReRAM array as input, and the slices can be used several times without replacement. That means the ReRAM array rarely needs to be written. The larger the size of the slices, the smaller the number of slices with the decrease of write times. On the contrary, the times of the iterative calculation will increase, and the computational efficiency will be lower. After the multiplication in the simulation domain, ADCs are required to convert the analog According to the discussion of the matrix iterative methods in Section 3.2, the number of iterations used by the Jacobi method, the Guass Seidel method and the SOR method is decreasing. Though the Guass Seidel method has a lower hardware consumption and the SOR method has the highest computational efficiency, utilizing the Jacobi method with the splitting matrix can not only improve the accuracy but also cut down the number of iterations efficiently. Because the principal concern is which method has fewer zero elements, the Jacobi method is still the best choice for the CIM-based PDEs solver at present.

CIM-Based Partial Differential Equation Solver
With the discussion of matrix iterative methods, the problem of the PDEs will be changed to the multiplication of the large-scale matrix. Therefore, the essence of the PDEs Solver is to achieve the high-performance multiplication. The CIM-based PDEs Solver could be composed with input drivers, shifters, adders, a computing array and DAC/ADCs. Matrix A will be stored in the array as weight, and matrix X will be entered into the array as input. Because of the limited size and precision of the array, the array maps a single column of the matrix to serval columns of an array. After the shifters and adders, the results of the serval columns will be collected and then be quantified in ADCs. The ADCs in the CIM PDEs solvers are usually SAR ADCs to balance the area, power and speed. Recently, CIM-based partial differential equation solvers can be based on different CIMs, including ReRAM, SRAM, flash memory and PCM. Each of them has advantages and their own suitable calculating methods and circuits.

ReRAM-Based Partial Differential Equation Solver
The coefficient matrix has been divided into several slices, and next they will be written as resistance values into the ReRAM array. Matrix X will be entered into the ReRAM array as input, and the slices can be used several times without replacement. That means the ReRAM array rarely needs to be written. The larger the size of the slices, the smaller the number of slices with the decrease of write times. On the contrary, the times of the iterative calculation will increase, and the computational efficiency will be lower. After the multiplication in the simulation domain, ADCs are required to convert the analog signals into digital signals. Then, after the digital signal processing, the result will be iterative, as will the input of the ReRAM array, and the final result can be got from the ReRAM-based partial differential equation solver.
In the work of Mohammed A. Zidan, a general memristor-based partial differential equation solver is proposed with the finite-difference method. To solve the general matrix Equation (20), they use the Jacobi method to decrease the calculation of zero elements. Their memristor is composed of a Ta top electrode, a Pd bottom electrode and a thin Ta 2 O 5−x metal oxide. Additionally, the memristor crossbar has extremely high energy efficiency and area utilization, but a lower accuracy because of the variation. With the write-verify approach, they decrease the conductance variation from 5.3% to 0.85%, which overcomes the accuracy defects of the ReRAM immensely. They divide the matrix into equally sized slices, and practical crossbar sizes can be mapped onto the active slices exactly. Cutting the matrix not only minimizes the effects of the series resistance, sneak currents and virtual grounds, but also reduces the operation of zero elements [23]. With the memristor-based hardware and matching software system, they get high-precision computing results. Shichao Li and Wenchao Chen simulated fully coupled multiphysics based on bipolar resistive random-access memory in 2017 [28]. They utilized the finite-difference method and the Scharfetter-Gummel method to solve the PDEs, solving three fully coupled partial differential equations by the crossbar of the HfO x -based ReRAM [29]. Like the work of Mohammed A. Zidan, the accuracy has not been effectively improved, while more PDEs are discussed in this work.
In recent work by S. S. Ensan and S. Ghosh, a ReRAM-based linear first-order PDE solver (ReLOPE) is proposed to solve PDEs of the following form: Unlike the general memristor-based partial differential equation solver [23], ReLOPE is the first PDEs solver purely based on hardware and used only for linear first-order PDEs [20]. Moreover, the principle of the ReLOPE is based on the second-order Runge-Kutta method. Though theoretically fourth-order Runge-Kutta offers higher accuracy, the value of the resistance tremendously interferes with the iteration under the limitation of the accurate programming of the ReRAM. Substituting (3) and (4) into (37), (37) will change to: Similarly, substituting Equations (3)-(6) into (37), (37) will change to: y n+2 − y n+1 = 1 + 2 3 h·a + 1 3 h 2 ·a 2 + h 3 ·a 3 +2h 3 ·a 2 +h 4 ·a 4

24
(y n+1 − y n ) Figure 6 is the overview of ReLOPE. ReLOPE includes a fully ReRAM crossbar-based CIM, shifters, adders and DAC/ADCs. It expands the operating range of the solution by exploiting shifters to shift input data and output data. ReLOPE improves its power consumption, solving a PDE by 31.4×. The above methods with ReLOPE used have two limitations: (1) it is unachievable to program RRAMs with this accuracy (six decimal points for fourth-order Runge-Kutta in the paper of ReLOPE) at the current technical level; (2) the accuracy has loss due to the nonlinear variation of resistance with voltage and the iterative use of ADC. Therefore, ReLOPE cannot further improve its accuracy with the method of Runge-Kutta. Micromachines 2022, 13, x FOR PEER REVIEW 12 of 16 Figure 6. Overview of ReLOPE.

SRAM-Based Partial Differential Equation Solver
Yannis Tsividis et al. proposed a programmable, clockless, continuous-time 8-bit hybrid (mixed analog/digital) architecture (ADC + SRAM + DAC) for solving ordinary and partial differential equations. The architecture, shown in Figure 7, is used to achieve nonlinear functions. The system consisted of an analog multiplier and analog adder/subtractor. The hybrid nonlinear function generator achieves 16× lower power dissipation and the computational accuracy of about 0.5% to 5%. In 2019, Thomas Chen and Jacob Botimer proposed a SRAM-based accelerator for solving PDEs [30]. They reformulated the multigrid Jacobi method in a residual form. By interleaving coarse-grid iterations with fine-grid iterations, their system reduced low-frequency errors to accelerate convergence. Their system contains 4 MAC-SRAMs, and each is a 320 × 64 8T SRAM array. The architecture is shown in Figure 8a

SRAM-Based Partial Differential Equation Solver
Yannis Tsividis et al. proposed a programmable, clockless, continuous-time 8-bit hybrid (mixed analog/digital) architecture (ADC + SRAM + DAC) for solving ordinary and partial differential equations. The architecture, shown in Figure 7, is used to achieve nonlinear functions. The system consisted of an analog multiplier and analog adder/subtractor. The hybrid nonlinear function generator achieves 16× lower power dissipation and the computational accuracy of about 0.5% to 5%.

SRAM-Based Partial Differential Equation Solver
Yannis Tsividis et al. proposed a programmable, clockless, continuous-time 8-bit hybrid (mixed analog/digital) architecture (ADC + SRAM + DAC) for solving ordinary and partial differential equations. The architecture, shown in Figure 7, is used to achieve nonlinear functions. The system consisted of an analog multiplier and analog adder/subtractor. The hybrid nonlinear function generator achieves 16× lower power dissipation and the computational accuracy of about 0.5% to 5%. In 2019, Thomas Chen and Jacob Botimer proposed a SRAM-based accelerator for solving PDEs [30]. They reformulated the multigrid Jacobi method in a residual form. By interleaving coarse-grid iterations with fine-grid iterations, their system reduced low-frequency errors to accelerate convergence. Their system contains 4 MAC-SRAMs, and each is a 320 × 64 8T SRAM array. The architecture is shown in Figure 8a  In 2019, Thomas Chen and Jacob Botimer proposed a SRAM-based accelerator for solving PDEs [30]. They reformulated the multigrid Jacobi method in a residual form. By interleaving coarse-grid iterations with fine-grid iterations, their system reduced lowfrequency errors to accelerate convergence. Their system contains 4 MAC-SRAMs, and each is a 320 × 64 8T SRAM array. The architecture is shown in Figure 8a,b. A year later, they updated the mapping of the MAC-SRAM [31] on the basis of their previous research. Finally, the SRAM-based accelerator achieved 56.9-GOPS, consuming 16.6 mW at 200-MHz. However, the SRAM-based CIM has the same accuracy problems due to limited multiplicand precision and limited ADC resolution.
MHz. However, the SRAM-based CIM has the same accuracy problems due to limited multiplicand precision and limited ADC resolution.

Flash Memory-Based Partial Differential Equation Solver
Jiezhi Chen et al. proposed a flash memory-based CIM hardware system to improve the computation efficiency of the time-dependent partial differential equations [32]. Based on the FDM and Jacobi algorithm, they got the matrix equation, then the coefficient matrix was mapped into the flash memory array as threshold voltages. Matrix is transformed into pulse time as input. Compared with ReRAM, flash memory enables the realization of vector-matrix multiplication with high accuracy and a good tolerance for device error. Moreover, it also has the advantages of ultra-high density and low cost.

PCM-Based Partial Differential Equation Solver
In 2018, Manuel Le Gallo and Abu Sebastian et al. used mixed-precision in-memory computing which combined a von Neumann machine with a computational memory unit to solve PDEs [33]. The mixed-precision CIM PDEs solver uses a low-precision computational memory unit to obtain the approximate solution of the first part and high-precision processing iteratively to improve accuracy in the second part. It achieves the low-precision matrix-vector multiplication by using a PCM crossbar array based on the iterative Krylov-subspace method. The PCM-based PDEs solver could offer up to 80 times lower energy consumption than the FPGA solution because of the architecture of PCM-based CIM and the mixed-precision system.

Discussion of Partial Differential Equation Solver
The memristive crossbar CIM PDEs solver based on the PCM chip could already offer up to 80 times lower energy consumption than the FPGA solution. The energy efficiency and speed of the memristive crossbar CIM PDEs solver is several hundred times than the PDEs solvers based on IBM POWER8 central processing unit (CPU) and NVIDIA Titan RTX graphics processing unit (GPU) [26,34]. However, the CIM-based PDEs usually solves the 4-bit to 8-bit PDEs computation in satisfaction of accuracy requirements. Table 2 summaries the representative CIM-based PDEs solvers recently and their performance are compared.

Flash Memory-Based Partial Differential Equation Solver
Jiezhi Chen et al. proposed a flash memory-based CIM hardware system to improve the computation efficiency of the time-dependent partial differential equations [32]. Based on the FDM and Jacobi algorithm, they got the matrix equation, then the coefficient matrix was mapped into the flash memory array as threshold voltages. Matrix X is transformed into pulse time as input. Compared with ReRAM, flash memory enables the realization of vector-matrix multiplication with high accuracy and a good tolerance for device error. Moreover, it also has the advantages of ultra-high density and low cost.

PCM-Based Partial Differential Equation Solver
In 2018, Manuel Le Gallo and Abu Sebastian et al. used mixed-precision in-memory computing which combined a von Neumann machine with a computational memory unit to solve PDEs [33]. The mixed-precision CIM PDEs solver uses a low-precision computational memory unit to obtain the approximate solution of the first part and highprecision processing iteratively to improve accuracy in the second part. It achieves the low-precision matrix-vector multiplication by using a PCM crossbar array based on the iterative Krylov-subspace method. The PCM-based PDEs solver could offer up to 80 times lower energy consumption than the FPGA solution because of the architecture of PCMbased CIM and the mixed-precision system.

Discussion of Partial Differential Equation Solver
The memristive crossbar CIM PDEs solver based on the PCM chip could already offer up to 80 times lower energy consumption than the FPGA solution. The energy efficiency and speed of the memristive crossbar CIM PDEs solver is several hundred times than the PDEs solvers based on IBM POWER8 central processing unit (CPU) and NVIDIA Titan RTX graphics processing unit (GPU) [26,34]. However, the CIM-based PDEs usually solves the 4-bit to 8-bit PDEs computation in satisfaction of accuracy requirements. Table 2 summaries the representative CIM-based PDEs solvers recently and their performance are compared. The deficiencies of ReRAM, including low inherent accuracy, nonlinearity and susceptibility to environmental changes, directly determine the use of ReRAM for high-precision computing. With the constant development of process levels, ReRAM devices will be manufactured accurately and get close to high-precision computing to a certain extent. At the same time, the write-verify method or multiread/write method can also solve accuracy problems, but with an increase in latency. Both of them have a lot of worth in research in ReRAM-based PDEs solvers in the future. To reduce the memristor device variability and nonlinearity, in the work of Mohammed A. Zidan, a general memristor-based partial differential equation solver used a write-verify method to write and update the coefficient values in the crossbar. The write-verify operation is based on a sequence of write-read pulse pairs, each pair including a programming (set or reset) pulse and a subsequent read pulse. When the conductance reaches within a predetermined range of the target value, the write operation is considered complete. The write-verify feedback method could decrease the cell-to-cell variation of <1% from 5.3%.
Though flash memory and SRAM have higher accuracy and density compared with ReRAM for CIM in numerical computations, ReRAM with nonvolatile and multivalued characteristics is still the most extensively studied for CIM-based PDEs solvers.
The sparse coefficient matrix, with only a small number of nonzero elements, usually has a large matrix size. The coefficient matrix must be divided into a certain number of slices, whose sizes are typically 16 × 16 or 32 × 32. When the ReRAM array, having a huge size and multiple rows or columns, are unopened, the result of the ReRAM array may have a large error because of the unopened leakage currents. Hence, the size of the ReRAM array, and the size of the slices above-discussed, not only depends on the system working, but also is determined by the hardware operation of the ReRAM array. Moreover, the collected currents need ADC for the next computation, and there are losses of accuracy in the process. Other CIM-based PDEs solvers also have similar problems.
How to cut apart the matrix into slices, how small the size of crossbar arrays is and how the single array is to work are urgent problems to be solved at the system level. At the same time, rearrangement of sparse matrixes and SpMV will play an important role in the future of CIMs for high-precision computing tasks.

Summary and Outlook
Different from neural network computations, numerical computations are used in early, basic disciplines, receiving general attention from researchers all the time. In the past few years, the research has become more and more popular to solve numerical computations using CIM. CIM for numerical computations improves energy efficiency and computational efficiency. All kinds of CIM PDEs based ReRAM, SRAM, flash memory and PCM have been proposed with various characteristics. The recent developments of CIM for numerical computations were compared. This article described the ReRAM-based CIM technology in detail. Then, it reviewed the numerical methods of PDEs and matrix iterative methods. Finally, the future of CIM for numerical computations can be summarized as follows: