An 8-bit Radix-4 Non-Volatile Parallel Multiplier

Abstract: Data movement between processing and storage units is one of the most critical issues in modern computer systems. The emerging Resistive Random Access Memory (RRAM) technology has drawn considerable attention because of its non-volatility and its potential for computation, which make it a strong candidate for modern computing systems. In this paper, an 8-bit radix-4 non-volatile parallel multiplier with improved computational capabilities is proposed. The corresponding Booth encoding scheme, read-out circuit, simplified Wallace tree, and Manchester carry chain are presented, all of which help to shorten the delay of the proposed multiplier. Because one operand is stored in RRAM beforehand, computation time and overall power are reduced. The area of the proposed non-volatile multiplier is reduced while its computing speed is improved. The proposed multiplier has an area of 785.2 µm² in the 45 nm Generic Process Design Kit (GPDK) process. The simulation results show that the proposed multiplier structure has a low computing power of 161.19 µW and a short delay of 0.83 ns at a 1.2 V supply voltage. Comparative analyses are performed to demonstrate the effectiveness of the proposed design. Compared with conventional Booth multipliers, the proposed multiplier structure reduces energy and delay by more than 70% and 19%, respectively.


Introduction
In the past decade, advances in cloud computing, the Internet of Things (IoT), machine learning, and image processing have created vast amounts of data that must be transferred and processed. These technologies require both high data transfer rates and high processing speeds. However, modern computing systems are limited by the data transfer rate of the bus [1]. Conventional computing architectures cannot satisfy high computing demands at low energy because their computing and storage units are segregated [2]. For instance, the Von Neumann architecture carries out processing in the central processing unit (CPU) and storage in memory [3]. These operations require data to be transferred between the CPU and memory, which limits the overall computational speed. This phenomenon is known as the Von Neumann bottleneck [4,5]. A large number of studies have been conducted to address this problem. Recent research has shown that all Boolean operations can be implemented using non-volatile memory-based computing systems [6]. These computing systems integrate non-volatile memory and computing capabilities in the same physical location to increase the computing speed.
Multipliers and adders are critical building blocks of modern computer systems, as virtually every arithmetic calculation involves addition and multiplication. These units have become even more significant because of their widespread use in computationally heavy applications such as machine learning and the IoT. Several adder and multiplier designs have been reported in the literature [7][8][9][10]. Multiplier circuits can be classified by the logic they use as either serial or parallel [11]. Serial multipliers generate partial products sequentially and add each newly generated partial product to an accumulated sum to obtain the product. Because serial multipliers have a small area and a simple design, they are still used in applications where delay is not an important factor. In array multipliers, partial products are generated in parallel. The array multiplier structure is simple and easy to expand; it has a small area and is easy to design, but its delay is still high [12]. The Wallace tree multiplier is another structure, which is faster than the array multiplier. However, the Wallace tree design is more complex and harder to implement. The Booth multiplier is another fast multiplier with low area and low power. It can be used in different modes, i.e., radix-2, radix-4, radix-8, etc. The modified Booth multiplier is used to avoid the variable size of the generated partial products [13]. A comparison of these multiplier designs is presented in Table 1. Nevertheless, multiplier designs remain comparatively scarce in the literature, which is remarkable since multiplication is among the most predominant functions of arithmetic logic units (ALUs). In this paper, a non-volatile parallel multiplier design based on Resistive Random Access Memory (RRAM) is proposed with improved computational capabilities.
Only a few examples have been reported that demonstrate the effectiveness of non-volatile memory-based multiplier circuits for complex computing tasks [15,16]. In this paper, we propose an RRAM-based non-volatile multiplier design that is suitable for low-power applications and has a short delay. We chose RRAM for our multiplier because of its potential in computation: a small cell dimension, a fast switching speed, low current and voltage demands for read-write operations, and a high OFF/ON resistance ratio [17]. In addition, it offers high reliability, data retention, and cycle endurance [18][19][20][21]. These characteristics make it a strong option for in-memory computing. Our design significantly improves the computational speed and reduces power consumption. Specifically, the four partial products of 8 × 8 radix-4 Booth encoding are formulated and connected sequentially to construct a multiplier. Results are obtained using a new Wallace tree structure containing only two stages; having fewer stages makes our design faster than conventional multipliers. The proposed 8-bit radix-4 non-volatile parallel multiplier has a higher density than the conventional multiplier. The proposed circuit stores the multiplier operand beforehand, which considerably shortens the overall computation time. Furthermore, because the multiplier is built with RRAM, it consumes zero standby power. The simulation results show that the proposed multiplier performs better in terms of delay, power, and power-delay product (PDP): it has 70% less PDP, 63% lower power consumption, and 19% less delay than the radix-4 Booth multilevel resistance cell switching multiplier design.
The rest of the paper is structured as follows: Section 2 reviews related work on conventional and non-volatile multipliers, Section 3 describes the proposed 8-bit non-volatile radix-4 Booth multiplier in detail, and Section 4 presents the simulation results and the corresponding comparison. Finally, conclusions are drawn in Section 5.

Resistive Non-Volatile Memory
Emerging Non-Volatile Memory (NVM) technologies provide a high-quality hardware platform because of their instant power-on, CMOS compatibility, high density [22,23], and zero standby power [24]. To build a non-volatile multiplier from NVMs, designing the full adder [25], the flip-flop [26], and other basic arithmetic logic circuits is the most critical step in constructing an in-memory computing system. NVM technology is emerging with a variety of representative candidates, such as RRAM, Ferroelectric Random Access Memory (FeRAM), Phase Change Memory (PCM), and Magnetic Random Access Memory (MRAM) [27].
The development of crossbar memory structures based on RRAM technology may lead to feasible and innovative solutions for effective non-volatile multipliers [19,20]. Among the different types of NVM, RRAM has shown great potential for in-memory applications because of its nanometer-scale cell dimension [18,28,29], nanosecond switching speed [19,30,31], and microampere-level read-write current at a low operating voltage [20,32,33]. Furthermore, it has a high OFF/ON resistance ratio [17,34,35] and good reliability, including data retention and cycle endurance [21,36,37]. Therefore, RRAM is chosen in this design to implement the 8-bit radix-4 non-volatile logic. Over the last decade, several non-volatile multiplier circuits based on innovative devices and circuits have been proposed. For instance, Wang et al. proposed an implementation method with functionally complete Boolean logic incorporated in the 1T1R RRAM structure [38]. The method has low computational complexity and is compatible with the existing CMOS integration process. Further, in an NVM-based multiplier, power and computation time can be reduced without changing the computing algorithm. However, technical solutions are still lacking for the implementation of complex computing tasks such as multiplication.
The RRAM has two terminals, the top electrode (TE) and the bottom electrode (BE). Between these two electrodes, the generation and recombination of oxygen vacancies in the oxide layer under an applied positive or negative voltage play a significant role in the resistance transformation. In the model of [39], the RRAM is simplified to a single one-dimensional Conductive Filament (CF). The gap distance (g), defined as the average distance between the TE and the tip of the CF, is used to determine the RRAM resistance. Because conduction is governed by electron tunneling, the RRAM resistance increases exponentially with g. Furthermore, the authors of [39] provide the nonlinear relationship between current and applied voltage (V) as follows:

$$I = I_0 \exp\left(-\frac{g}{g_0}\right) \sinh\left(\frac{V}{V_0}\right)$$

where $I_0$, $g_0$, and $V_0$ are I-V fitting parameters. The gap growth/dissolution velocity can be expressed by the following equations:

$$\frac{dg}{dt} = -v_0 \left[ \exp\left(-\frac{q E_{ag}}{kT}\right) \exp\left(\frac{\gamma a_0 q V}{L k T}\right) - \exp\left(-\frac{q E_{ar}}{kT}\right) \exp\left(-\frac{\gamma a_0 q V}{L k T}\right) \right]$$

$$\gamma = \gamma_0 - \beta g^3$$

where k, q, $a_0$, $E_{ag}$, $E_{ar}$, and L are the Boltzmann constant, the elementary charge, the atomic hopping distance, the activation energy for vacancy generation, the activation energy for vacancy recombination, and the oxide thickness, respectively; $v_0$, $\gamma_0$, and $\beta$ are gap dynamics fitting parameters; and $\gamma$ and T are the local field enhancement factor and the temperature, respectively. When a positive SET voltage is applied to the RRAM, oxygen vacancies and oxygen ions are generated at the tip of the CF. The CF grows and the gap distance decreases, which switches the RRAM from the high resistance state (HRS) to the low resistance state (LRS). The reverse transition is called the RESET process. This mechanism allows the RRAM to implement a logic storage function.
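The exponential dependence of resistance on the gap distance can be illustrated numerically. The following is a minimal sketch that assumes the tunneling relation has the commonly used form I = I0 · exp(−g/g0) · sinh(V/V0); the parameter values, the two gap distances, and the read voltage are illustrative placeholders and are not the values of Table 4.

```python
import math

# Illustrative placeholder parameters (assumptions, NOT the values of Table 4).
I0 = 1e-3       # A, I-V fitting parameter
G0 = 0.25e-9    # m, I-V fitting parameter
V0 = 0.25       # V, I-V fitting parameter


def rram_current(gap_m, voltage):
    """Tunneling current, assuming I = I0 * exp(-g/g0) * sinh(V/V0)."""
    return I0 * math.exp(-gap_m / G0) * math.sinh(voltage / V0)


def rram_resistance(gap_m, voltage):
    """Static resistance V / I at a given (non-zero) read voltage."""
    return voltage / rram_current(gap_m, voltage)


V_READ = 0.1            # V, hypothetical read voltage
GAP_SMALL = 0.2e-9      # m, hypothetical gap of a low-resistance cell
GAP_LARGE = 1.7e-9      # m, hypothetical gap of a high-resistance cell

r_small = rram_resistance(GAP_SMALL, V_READ)
r_large = rram_resistance(GAP_LARGE, V_READ)
ratio = r_large / r_small

# At a fixed read voltage the ratio depends only on the gap difference.
print(f"resistance ratio = {ratio:.1f}")
print(f"exp(delta_g/g0)  = {math.exp((GAP_LARGE - GAP_SMALL) / G0):.1f}")
```

Under these assumptions, the resistance ratio between two states is set entirely by the difference in gap distance, which is why a modest change in g is enough to separate HRS from LRS.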
The RRAM has two general circuit-level applications: one is programming scheme design in the 1-Transistor-1-Resistor (1T1R) configuration, and the other is 1-Selector-1-Resistor (1S1R) cross-point array design for large-scale arrays that takes device variations into account. The 1S1R structure is generally used for large arrays and storage because of its small area of 4F², where F is the feature size. Although 1T1R has a larger area (6F²), it solves the cross-talk problem between adjacent storage cells and protects the resistor cell better [39]. Therefore, 1T1R is more suitable for the multiplier design described in this paper. It is chosen to configure and store the Booth encoding of the multiplier operand.

Booth Multiplier
The hardware multiplier is one of the essential components in modern control and high-speed arithmetic, with widespread use in micro control units and digital signal processors. A hardware multiplier uses logic circuits to realize multiplication [40]. Generally, it adds partial products to complete the multiplication. For N-bit multiplication, there are N² partial product bits in the array multiplier [41]. When N becomes large, the adder tree becomes complex, which increases area, delay, and power consumption. Therefore, a growing number of algorithms such as the Booth algorithm, and optimized adder trees such as the Wallace adder tree and Baugh-Wooley, have been proposed to accelerate multiplication. In current design practice, the mainstream approach is to reduce the number of partial products, and thus speed up the multiplication, by combining the Booth algorithm with the Wallace tree architecture.
A conventional Booth multiplier mainly consists of four parts: a Booth encoder, a partial product generator, a carry-save adder (CSA) tree, and a final adder such as a carry-look-ahead adder (CLA), as shown in Figure 1. For radix-k Booth encoding, the number of partial products is inversely proportional to log2(k), which means that the number of partial products is halved when the radix is squared (for example, from radix-4 to radix-16). Although a high radix reduces the number of partial products and simplifies the adder tree, it causes other problems. Firstly, a high radix needs extra logic to implement Booth encoding. For example, radix-4 requires only operations such as ±2A and ±A, while radix-16 generates odd multiples of the multiplicand A, such as ±3A, ±5A, and ±7A [42], which makes the design more complex. Secondly, multiplexers with more inputs are required in high-radix multipliers. For instance, 5-to-1 multiplexers (MUXs) are used in the radix-4 Booth encoder, while 9-to-1 MUXs and 17-to-1 MUXs are needed in radix-8 and radix-16 Booth encoders, respectively. Thus, a high radix inevitably results in higher power consumption and delay. To balance power consumption, delay, and computational speed, radix-4 has been extensively adopted in Booth multiplier design [43].
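The radix trade-off described above can be tabulated with a few lines of Python. This is a rough sketch rather than a design tool: it assumes the partial product count is ceil(N / log2 k) with no extra correction row, and that the selecting multiplexer needs one input per Booth digit from −k/2 to +k/2. The function name `booth_costs` is hypothetical.

```python
import math


def booth_costs(n_bits, radix):
    """Return (partial products, MUX inputs, odd multiples > 1) for radix-2^m Booth."""
    m = radix.bit_length() - 1              # log2(radix) for a power of two
    if radix < 2 or (1 << m) != radix:
        raise ValueError("radix must be a power of two")
    partial_products = math.ceil(n_bits / m)
    half = radix // 2                        # largest digit magnitude
    mux_inputs = 2 * half + 1                # digits -half .. +half
    # Odd multiples above 1 cannot be produced by shifting alone.
    hard_multiples = list(range(3, half + 1, 2))
    return partial_products, mux_inputs, hard_multiples


for radix in (2, 4, 8, 16):
    pp, mux, hard = booth_costs(8, radix)
    print(f"radix-{radix}: {pp} partial products, {mux}-to-1 MUX, odd multiples {hard}")
```

For an 8-bit operand this reproduces the figures quoted in the text: radix-4 needs no odd multiples beyond A itself and a 5-to-1 selection, whereas radix-16 halves the partial product count again at the cost of generating 3A, 5A, and 7A.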

Proposed 8-bit Non-Volatile Booth Multiplier
In this paper, we present an 8-bit non-volatile radix-4 Booth multiplier. In our multiplier, we combine the Booth encoder with the partial product generator, in which RRAM is used to configure and store operand B. Therefore, the proposed multiplier has shorter latency and lower dynamic power. A novel readout circuit is also proposed to read out the partial product bits for subsequent addition. Taking advantage of the special output bits of the partial products, we also simplify the full adder and the 4-2 compressor in the subsequent Wallace tree. Finally, the Manchester carry chain is used as the final-stage adder. The structure of the proposed RRAM-based multiplier is shown in Figure 2, along with the cell of the partial product generator array, the analog-to-digital (AD) circuits, and the two-phase non-overlap clock circuit.
The structures of a conventional CMOS-based computing system and a non-volatile computing system are shown in Figure 3. In a conventional CMOS-based computing system, each computation must wait for the operands to be transferred from the register file (regfile) to the CMOS-based multiplier. The regfile stores the operands and outputs (see Figure 3a). The CMOS-based multiplier structure can be found in [44,45]. In contrast, the non-volatile memory-based multiplier avoids the migration of operand B as long as B does not change very often. How often B changes can be reduced by a proper algorithm and by software compilation. Because operand B is encoded directly in the RRAM array, which is already inside the computing unit, operand B does not need to migrate again (see Figure 3b). Thus, non-volatile memory-based multipliers have low overall computation power and high speed. Non-volatile memory-based multipliers can be found in [3,46].
Our proposed multiplier has three main parts: the combined Booth encoder and partial product generator, the adder tree (proposed Wallace tree), and the final adder (Manchester carry chain). These three parts are discussed in the following subsections.

Booth Encoding and Partial Product Generator
In contrast to the conventional Booth multiplier structure in Figure 1, we combine the Booth encoding unit with the partial product generator, as shown in Figure 2.
As shown in Figure 2, we develop two multipliers that arrange the four partial products in a two-row structure and a one-row structure, respectively. The two designs are similar except for the AD circuits: the two-row structure uses only two kinds of ADs, whereas the one-row structure uses four. The detailed structure of each cell of the partial product array is shown in Figure 2.
As mentioned earlier, we use RRAMs to configure and realize the different Booth encoding operations. Each cell shown in Figure 2 uses four RRAMs in the 1T1R configuration to select the Booth coding type: ±2A, ±1A, or ±0A. Similarly, every cell is set with the same pattern throughout the partial product array.
By setting different values in the four types of RRAMs, we can implement the Booth encoding of operand B. Taking B = 0110 0011 as an example, its Booth encoding is +2A, −2A, +1A, −1A (most significant to least significant digit) [47], and the RRAMs in each cell of the partial product generator array are configured accordingly.
When $A_i$ or $A_{2C,i}$ (bit i of multiplicand A or of its two's complement) equals 1, the transistor connected to the corresponding branch switches on and a specific current flows. After conversion by the readout circuit, we obtain the corresponding partial product bits.
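The Booth encoding of the example operand can be reproduced in software. The sketch below uses the standard radix-4 recoding rule d_i = −2·b(2i+1) + b(2i) + b(2i−1) with b(−1) = 0; it models only the arithmetic of the encoding, not the RRAM configuration, and the function name is a hypothetical helper.

```python
def booth_radix4_digits(value, n_bits=8):
    """Radix-4 Booth digits of an n_bits two's complement word, LSB digit first.

    n_bits is assumed to be even.
    """
    bits = [(value >> i) & 1 for i in range(n_bits)]
    digits = []
    prev = 0                      # implicit bit b(-1) = 0
    for i in range(0, n_bits, 2):
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


B = 0b01100011
digits = booth_radix4_digits(B)

# Print most significant digit first, as in the text.
print([f"{d:+d}A" for d in reversed(digits)])

# The digits must recombine to the original operand: sum(d_i * 4^i) == B.
recombined = sum(d * 4 ** i for i, d in enumerate(digits))
print(recombined == B)
```

Each digit lies in {−2, −1, 0, +1, +2}, so every partial product is obtained from A by at most a one-bit shift and a negation, which is what makes it possible to select it with a small number of configured branches per cell.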

Sign Bit
For the 8-bit radix-4 Booth multiplier, each partial product from Booth encoding is extended to 15 bits by adding sign bits, which means that more full adders are required. The conventional extension for partial products can be found in [47]. The complexity can be reduced with the following method. A two's complement number can be written as:

S S S S S S S S Z7 Z6 Z5 Z4 Z3 Z2 Z1

This pattern can be replaced by:

$$-s \cdot 2^{14} + \left(s \cdot 2^{13} + s \cdot 2^{12} + s \cdot 2^{11} + s \cdot 2^{10} + s \cdot 2^{9} + s \cdot 2^{8}\right)$$

Then, for the 8-bit radix-4 Booth multiplier, the four partial products can be simplified as in [47]. For convenience of digital circuit design, we change the pattern as shown in Figure 4.

Figure 4. Generation of modified partial products for radix-4 Booth multiplier.

With this method, we can significantly reduce the number of storage units for sign bits. In the above example, which corresponds to a Booth encoding of ±1A, ±1A, ±1A, ±1A, only eight storage units are needed for sign bits, a decrease of 60% compared with the conventional extension [47]. Thus, reducing the sign bits in this way results in less area and power consumption.
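The replacement pattern above can be checked with a short geometric-series argument. The derivation below is a sketch based only on the expression given in the text; the final rewriting in terms of the inverted sign bit is a standard identity and is an assumption about how constant entries arise in the modified partial products, not a statement taken from [47].

```latex
\begin{aligned}
-s \cdot 2^{14} + \sum_{i=8}^{13} s \cdot 2^{i}
  &= -s \cdot 2^{14} + s \left( 2^{14} - 2^{8} \right) \\
  &= -s \cdot 2^{8} \\
  &= \bar{s} \cdot 2^{8} - 2^{8}, \qquad \bar{s} = 1 - s .
\end{aligned}
```

Under this reading, a whole run of identical sign bits carries the same weight as a single inverted sign bit plus a data-independent constant, so the constants of all partial products can be summed once in advance instead of being stored bit by bit.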

Readout Circuit
One of the key components in our design is the AD circuit. The ADs are divided into four types: AD1b with one output bit, AD1.5b with two output bits, AD3b with three output bits, and AD4b with four output bits. The schematic of AD1.5b is shown in Figure 2d. The encoding of each type of AD is given in Table 2, together with its capacitance power. In each branch of an AD, only one RRAM can be set in each cell. The AD1b consists of two parts: (1) a current sensing circuit that converts the load current into a corresponding voltage $V_X$; and (2) a latched comparator that distinguishes the different states of $V_X$ and latches the data.

Current Sensing Circuit
The current sensing circuit is based simply on charging and discharging a capacitance. MOSFET $M_1$ controls the discharge circuit, while MOSFET $M_3$ controls the charge circuit (Figure 2d). $M_2$ is used to reduce the effects of charge injection and clock feed-through when $M_1$ switches.
In the charge circuit, when $M_3$ switches on and $M_1$ switches off, the voltage reference charges the $V_X$ node to nearly $V_{ref}$.
In the discharge circuit, when $M_1$ switches on and $M_3$ switches off, $C_1$ begins to discharge through the load resistance connected to $M_1$. As mentioned above, each branch has only one of three load resistance values: HRS/4, LRS, and LRS/2. The discharge is described by:

$$V_X(t) = V_{ref} \exp\left(-\frac{t}{R C_1}\right)$$

where R is the load resistance. With different resistance values, the capacitance discharges at different rates, so the resulting values of $V_X$ sampled at the same instant can be distinguished. To satisfy the design requirement, a two-phase non-overlap clock circuit is needed to generate the gate signals Charge and Discharge, as shown in Figure 2e. Two delay units set the non-overlap time on both sides. Using a non-overlap clock has two major advantages. On the one hand, Charge and Discharge are never valid at the same time, which prevents the charge and discharge circuits from switching on simultaneously. On the other hand, when both signals are invalid, the capacitance voltage $V_X$ is held for the subsequent comparison.
To obtain a wide swing, a latched comparator is also needed so that the output swings between $V_{DD}$ and $V_{SS}$.
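How different load resistances translate into distinguishable capacitor voltages can be sketched with a simple RC model. The example assumes an ideal exponential discharge from $V_{ref}$ = 0.4 V (the pre-charge level used in the simulations) and assumes the three load values are HRS/4, LRS, and LRS/2; the capacitance, the LRS value, the OFF/ON ratio of exactly 100, and the sampling time are hypothetical placeholders chosen only for illustration.

```python
import math

V_REF = 0.4           # V, pre-charge level used in the paper's simulations
C1 = 10e-15           # F, hypothetical sensing capacitance
LRS = 10e3            # ohm, hypothetical low-resistance state
HRS = 100 * LRS       # ohm, hypothetical OFF/ON ratio of 100
T_SAMPLE = 0.2e-9     # s, hypothetical end of the discharge window


def v_x(r_load, t=T_SAMPLE):
    """Capacitor voltage after discharging through r_load for time t (ideal RC)."""
    return V_REF * math.exp(-t / (r_load * C1))


# Assumed load values seen by one branch.
loads = {"HRS/4": HRS / 4, "LRS": LRS, "LRS/2": LRS / 2}
levels = {name: v_x(r) for name, r in loads.items()}

for name, volts in levels.items():
    print(f"{name:>6}: V_X = {volts * 1e3:.1f} mV")
```

A smaller load resistance discharges the capacitor faster, so sampling all branches at the same instant yields an ordered set of voltage levels that the latched comparators can then separate.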

Latched Comparator
The capacitance voltage $V_X$ varies as described above, which changes the current passing through $M_6$. The current $I_{M6}$ is compared with the constant current generated by $M_7$ under the constant voltage $V_c$. Transistors $M_{10}$ and $M_{13}$ are used to reduce kick-back noise, in which the Latch signal voltage couples back to the input signal and may alter the data.
The value of $V_c$ must be set properly to separate adjacent levels among $V_{OL}$, $V_{OMID}$, and $V_{OH}$. In this way, AD1b can distinguish between the encodings 0 and 1. Similarly, AD1.5b needs two latched comparators and two different values of $V_c$ to distinguish among the encodings 00, 01, and 11.
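The role of the two reference levels in AD1.5b can be modeled behaviorally as two threshold comparisons. The sketch below is an idealized model, not the transistor-level comparator: the threshold voltages, the input levels, the output bit order, and the polarity (a higher $V_X$ giving a 1) are all assumptions made for illustration.

```python
def ad1p5b(v_x, vc_low, vc_high):
    """Idealized AD1.5b: two comparators with different references.

    Returns a 2-bit thermometer code (low comparator, high comparator).
    Only three codes can occur because vc_low < vc_high.
    """
    if not vc_low < vc_high:
        raise ValueError("vc_low must be below vc_high")
    out_low = 1 if v_x > vc_low else 0
    out_high = 1 if v_x > vc_high else 0
    return out_low, out_high


# Hypothetical reference voltages and capacitor levels, in volts.
VC_LOW, VC_HIGH = 0.10, 0.30
for name, v in (("low level", 0.05), ("mid level", 0.20), ("high level", 0.37)):
    print(name, ad1p5b(v, VC_LOW, VC_HIGH))
```

Because the second comparator can only fire when the first one already has, the code in which only the high comparator is set never appears; this restriction to three valid codes is what the simplified adders in the Wallace tree exploit.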

Proposed Wallace Tree
A conventional Wallace tree uses carry-save adders to decrease the critical path delay and the number of adder units [48]. In our design, however, the output signals Out0 and Out1 generated by the readout circuit take only three values: 00, 10, and 11. We can therefore simplify the logic functions of the 1-bit full adder, the 4-bit adder, and the 4-2 compressor, which further reduces the critical path delay and the number of transistors. For example, the truth table of the simplified full adder is given in Table 3.
From the corresponding K-map, we obtain the simplified logic function of the proposed full adder. The schematic of the full adder used in our work is shown in Figure 5a. Compared with a conventional full adder consisting of 28 transistors [49], our proposed full adder uses 26 transistors to achieve the same function. Furthermore, our full adder decreases the critical path delay, which leaves room to increase the clock frequency and thus improve performance. Similarly, the 4-bit full adder and the 4-2 compressor can be simplified, as shown in Figure 5b,c. The conventional Wallace tree is shown in Figure 6, and the proposed Wallace tree, built from the simplified key components above, is shown in Figure 7. Compared with the conventional Wallace tree [50], three stages are reduced to two in our work, which means that fewer transistors and less area are used. Further, the speed of the whole multiplier is enhanced.

Manchester Carry Chain
At the final stage, we use an 11-bit Manchester carry chain to generate the final product. Compared with CLAs or other carry chains, the Manchester carry chain is simple and efficient. Its worst case occurs when the carry chain discharges through the entire path, at which point the path delay reaches its maximum. In addition, the Manchester carry chain is clock-controlled, with a pre-charge stage and a discharge stage, so its output varies over the clock cycle. Therefore, 15-bit D flip-flops are required to latch and synchronize the final result. The schematic of the Manchester carry chain can be found in [51].
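The carry behavior of such a chain can be sketched at the logic level with the usual generate and propagate signals. This is a behavioral model only: it ignores the dynamic pre-charge and discharge phases and the circuit of [51], and the function name is a hypothetical helper.

```python
def manchester_add(a, b, n_bits=11, carry_in=0):
    """Bit-serial model of a carry chain: returns (sum, carry_out)."""
    carry = carry_in
    total = 0
    for i in range(n_bits):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        generate = ai & bi          # this stage creates a carry by itself
        propagate = ai ^ bi         # this stage passes an incoming carry on
        total |= (propagate ^ carry) << i
        carry = generate | (propagate & carry)
    return total, carry


# Worst case for the chain: every stage propagates the incoming carry.
print(manchester_add(0b11111111111, 0, carry_in=1))

# Consistency check against ordinary integer addition.
a, b = 0b10110011010, 0b01001110111
s, c = manchester_add(a, b)
print((c << 11) | s == a + b)
```

In the all-propagate case the carry has to travel through every stage of the chain, which corresponds to the maximum path delay mentioned above.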

Structure of Partial Product Generator
In Figure 2, the partial products are obtained in two ways: one-row and two-row. In the one-row case, four partial products are needed in the row, while in the two-row case only two partial products are needed in each row. Compared with the one-row structure, the two-row structure (both are shown in Figure 2) uses fewer types of AD circuits. In the one-row structure, however, some branches can take more resistance values, approximately HRS, LRS, LRS/2, LRS/3, and LRS/4. The capacitance therefore needs slightly more time to discharge and charge because of the smaller resistance values, and the new components AD3b and AD4b are needed for some branches. One significant advantage of the one-row structure is that it allows further improvement of our proposed Wallace tree. For example, in the one-row structure, the output of AD4b takes only five values: 0000, 0001, 0011, 0111, or 1111. The 4-2 compressor can thus be further simplified because there are fewer input conditions. Similarly, the proposed 1-bit and 4-bit full adders can also be simplified.

Simulation Result and Comparison
The proposed non-volatile multiplier is implemented and simulated with a low-$V_t$ 45 nm Generic Process Design Kit (GPDK) at supply voltages of 1 V and 1.2 V.

RRAM Circuit Simulation
The RRAM model is taken from [39] and implemented in Verilog-A. The key parameters of the RRAM model are shown in Table 4.
The MOSFET schematic diagram is shown in Figure 8b. The gate of the MOSFET is controlled by the word line (WL), the source of the transistor is connected to the source line (SL), and the drain is connected to the bit line (BL). The SET/RESET operation is performed by applying voltage pulses at the WL and BL/SL terminals. The DC switching I-V characteristic of the RRAM used to build the radix-4 8 × 8 multiplier circuit is shown in Figure 8a. The device is simulated from −3 V to 3 V to obtain the I-V curve, which fits the results given in [39]. As shown in Figure 8a, the reading voltage is kept at 0.002 V in our design. The OFF/ON resistance ratio is chosen to be more than 100 to ensure the robustness of the RRAM design, and the reading voltage is kept below the set voltage ($V_{th}$ = 1.5 V).
The simulation result for the OFF/ON resistance ratio is presented in Figure 8c. The resistances of the LRS and the HRS differ by two orders of magnitude, which meets the design requirements [35]. It is difficult to distinguish the two states if the resistances differ by less than two orders of magnitude.

Table 4. Key parameters of the RRAM model. Surviving entries: gap dynamics fitting parameter, 1 × 10⁻⁹; β, gap dynamics fitting parameter, 1.25.

AD Circuit Simulation
A transient simulation of the current sensing circuit is performed, and the waveform of $V_X$ with the two-phase non-overlap clock is shown in Figure 9. When the Latch signal is valid (low), different data inputs A0 and A1 result in different values of $V_X$.

Figure 9. Transient simulation of the current sensing circuit considering different combinations of data inputs A0 and A1.

In Figure 10, to verify the function of AD1.5b, we change the data inputs A0 and A1 applied to AD1.5b. When the Latch signal is valid (low), the conversion results Out0 and Out1 are valid and correct.

Transient Simulation Analysis
The transient simulation results of the proposed non-volatile Booth multiplier are shown in Figure 11. In this simulation, multiplicand A = 01101101 comes from external registers, while multiplier operand B is configured as 00110011 for functional verification. According to Figure 11, the Pre_config signal is low during the configuration state. The corresponding RRAMs are set to LRS, and operand B is stored in RRAM when SET becomes high. After 14 ns, Pre_config rises to high and the multiplier enters the normal working state. During this state, BL becomes high-impedance. The branches are connected to the ADs, and the gates of the transistors in the 1T1R cells accept the multiplicand A input.
In the working state, we use the primitive CLK signal to generate the two-phase non-overlap clock signals Charge and Discharge. The initial voltage of the capacitance is set to 400 mV. Firstly, at the rising edge of Discharge, the discharge circuits are enabled and the capacitances begin to discharge. When the capacitance voltage falls to a certain value, Discharge becomes invalid (low) and the voltage settles at a constant $V_X$. Then, at the falling edge of Latch, the latched comparators compare $V_X$ with the given reference voltage $V_c$ and output the partial products. At the rising edge of Charge, the outputs of the latched comparators are latched and processed by the proposed Wallace tree, and the capacitances begin to recharge. When Discharge is low, the Manchester carry chain is enabled. In this state, the outputs of the Wallace tree are fed to the Manchester carry chain for the final addition. Finally, at the falling edge of Charge, the DFFs output the multiplication result. Based on the input data, the output should be 001010110110111. The simulation results show that the proposed multiplier works correctly, since the waveforms of D0, D1, D2, D3 . . . D14 match the expected data.
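The expected output word can be cross-checked in software by forming the four radix-4 Booth partial products of the test operands and summing them. This is a purely arithmetic sketch of what the hardware should compute; it uses the standard 3-bit Booth lookup table, and the helper name is hypothetical.

```python
# Standard radix-4 Booth table: bits (b[2i+1], b[2i], b[2i-1]) -> digit.
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_partial_products(a, b, n_bits=8):
    """Weighted partial products of a * b, with b read as an n_bits two's complement word."""
    padded = b << 1                          # append the implicit bit b(-1) = 0
    products = []
    for i in range(n_bits // 2):
        group = (padded >> (2 * i)) & 0b111  # overlapping 3-bit window
        products.append(BOOTH_TABLE[group] * a * 4 ** i)
    return products


A = 0b01101101   # multiplicand from the simulation
B = 0b00110011   # operand stored in RRAM in the simulation

partials = booth_partial_products(A, B)
product = sum(partials)

print(partials)
print(format(product, "015b"))
```

The 15-bit result agrees with the expected output stated in the text, and the Booth digits of B are all ±1, which matches the ±1A, ±1A, ±1A, ±1A encoding used in the sign-bit example.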

PVT Analysis
In this subsection, numerous simulations of the proposed multiplier architecture are conducted as a PVT analysis to verify that the circuit operates correctly under different processes, voltages, and temperatures. The average input current and delay of the multiplier under the SS, FF, and MC (Monte Carlo) processes are shown in Table 5, with the supply voltage kept at 1 V and at room temperature. As shown in Table 5, the SS process has the lowest average current but the largest delay, while the FF process has the highest average current but the smallest delay. The temperature and voltage sweeps are carried out under the MC process. The temperature ranges from −40 °C to 80 °C and the supply voltage ranges from 0.9 V to 1.2 V. If the supply voltage is lower than 0.9 V, the multiplier cannot work correctly. Figure 12 shows the temperature and voltage response curves of the average input current. The average input current shows an overall tendency to increase with temperature.

Performance Analysis and Comparison
This subsection compares power, delay, and area. These factors are needed to estimate and better understand the working efficiency of the resistive memory circuit design. Compared with conventional CMOS multipliers, our non-volatile multiplier has special characteristics in its partial product generator. The total power consumption in one computing cycle is the sum of the static and dynamic power of the CMOS circuit and the power consumed in charging and discharging the capacitances of the circuit. The total power of the circuit is given as:

$$P_{sum} = P_{static} + P_{dynamic} + P_{capacitance} \tag{6}$$

Note that the capacitances charge and discharge once in each cycle. $P_{capacitance}$ can therefore be expressed more specifically by Equation (7), where $V_{charge}$ and $V_{discharge}$ represent the difference between $V_X$ and $V_{ref}$ while charging and discharging, respectively; $V_X$ is the stable reading voltage and $V_{ref}$ is the fully charged voltage. In Equation (7), different values of $V_{charge}$ and $V_{discharge}$ result in different values of $P_{capacitance}$. Thus, to reduce the total power of the proposed multiplier circuits, the capacitance value should be kept low and the discharge time should not be too long. In the proposed multiplier, the CMOS parasitic capacitors serve as the sensing capacitances. Since the one-row and two-row structures have different numbers of current sensing circuits and different types of ADs, the resulting values of $P_{capacitance}$ differ, as shown in Table 2. The results in Table 2 are obtained with a supply voltage of 1 V and $V_{ref}$ = 0.4 V. Note that $P_{capacitance}$ is almost the same for the two structures. Dynamic power is a significant part of the total power of the proposed multiplier. The energy of one computing cycle depends on the multiplicand A. The transistors in the 1T1R structure are switched off when the multiplicand bits and the two's complement bits are 0. In that case the partial product bits are 0 and the dynamic power is at its minimum. Conversely, when all data bits are 1, the dynamic power is at its maximum. The worst case is used in the comparison here.
To verify the robustness of the proposed design against process and device mismatch, 2000 Monte Carlo simulation runs are performed at various voltages (Figure 13).

Table 6. Comparison of conventional Booth multipliers [44,45], NVM-based multipliers [3,46], and the proposed multiplier.

The multiplier circuits in [3,44,45] are simulated for the comparison. The critical parameters that determine the effectiveness of the multiplier are presented in Table 6. The proposed 8-bit radix-4 non-volatile parallel multiplier performs better in terms of delay, power, and PDP. For instance, the one-row and two-row multipliers achieve about a 63% reduction in power and a 70% reduction in PDP compared with the conventional radix-4 8 × 8 Booth multiplier. The proposed multipliers also have a 19% lower delay under the same supply voltage condition. Since the proposed multipliers use the 1T1R structure to integrate the Booth encoder into the partial product generator, the area of the partial product generator is smaller. The additional Manchester carry chain, which is employed to improve the power and delay of the proposed multipliers, results in extra area. Furthermore, the proposed multiplier is non-volatile, i.e., the data are not lost when the system is powered off. Thus, reliability and security are further improved.
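The relationship between the reported power, delay, and PDP figures can be checked with simple arithmetic. The sketch below takes the power and delay quoted in the abstract (161.19 µW and 0.83 ns at 1.2 V) and the rounded percentage reductions quoted in the text; treating the reductions as exact fractions is an approximation, so the combined figure is only indicative.

```python
# Figures quoted in the paper (abstract): power and delay at 1.2 V.
POWER_W = 161.19e-6
DELAY_S = 0.83e-9

# Power-delay product in joules and in femtojoules.
pdp_j = POWER_W * DELAY_S
pdp_fj = pdp_j * 1e15
print(f"PDP = {pdp_fj:.1f} fJ")

# Rounded relative reductions quoted in the text.
POWER_REDUCTION = 0.63
DELAY_REDUCTION = 0.19

# PDP scales with the product of the remaining power and delay fractions.
pdp_reduction = 1.0 - (1.0 - POWER_REDUCTION) * (1.0 - DELAY_REDUCTION)
print(f"implied PDP reduction = {pdp_reduction * 100:.0f}%")
```

The implied PDP reduction is consistent with the roughly 70% figure reported alongside the 63% power and 19% delay reductions.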

System Power Comparison
The proposed RRAM-based non-volatile Booth multiplier can be applied to neural networks, filters, and other similar applications. RRAM-based systems have a configuration mode and a normal working mode; the configuration mode is used to set the system parameters. Here we mainly estimate the system power and compare it with that of the conventional computing system.
Power comparison: the total power consumption in one computing cycle is the sum of the static and dynamic power. The non-volatile computing system has no static power, whereas conventional computing systems do. The conventional and non-volatile computing systems consume the same power for loading A and for computing, but differ in loading B. The conventional computing system needs to reload operand B every time, whereas the non-volatile system only needs to load B once, which also saves power.
Data storing: in the conventional SRAM multiplier-based system [52], one SRAM cell consists of six transistors. By contrast, in the RRAM multiplier-based system, one storage unit consists of only one transistor and one RRAM. In addition, RRAMs can be 3D stacked to reduce the storage area substantially.

Conclusions
In this paper, we have proposed an 8-bit RRAM-based non-volatile radix-4 Booth multiplier. Specifically, the partial product generator, Wallace tree, Manchester carry chain, and associated AD circuit structures are designed. A two-stage Wallace tree is designed for the proposed multiplier, which reduces delay and area compared with the conventional Wallace tree. Furthermore, the use of non-volatile RRAM enables the multiplier to store one operand beforehand, reducing the overall computation time. The comparative results show that the proposed multiplier has 70% less PDP, 63% lower power consumption, and 19% less delay than the regular CMOS-based radix-4 Booth multiplier. The results suggest that our proposed multiplier could be a promising component in processing units. It can be employed in various low-power and high-speed applications in which certain data do not change frequently, such as the weights of a convolutional neural network.