Multi-Input Logic-in-Memory for Ultra-Low Power Non-Von Neumann Computing

Logic-in-memory (LIM) circuits based on the material implication logic (IMPLY) and resistive random access memory (RRAM) technologies are a candidate solution for the development of ultra-low power non-von Neumann computing architectures. Such architectures could enable the energy-efficient implementation of hardware accelerators for novel edge computing paradigms such as binarized neural networks (BNNs) which rely on the execution of logic operations. In this work, we present the multi-input IMPLY operation implemented on a recently developed smart IMPLY architecture, SIMPLY, which improves the circuit reliability, reduces energy consumption, and breaks the strict design trade-offs of conventional architectures. We show that the generalization of the typical logic schemes used in LIM circuits to multi-input operations strongly reduces the execution time of complex functions needed for BNNs inference tasks (e.g., the 1-bit Full Addition, XNOR, Popcount). The performance of four different RRAM technologies is compared using circuit simulations leveraging a physics-based RRAM compact model. The proposed solution approaches the performance of its CMOS equivalent while bypassing the von Neumann bottleneck, which gives a huge improvement in bit error rate (by a factor of at least 10^8) and energy-delay product (projected up to a factor of 10^10).


Introduction
With the number of connected devices in use exceeding 17 billion, the volume of exchanged data is rapidly rising. From this standpoint, edge computing ensures a decrease in the amount of data to be exchanged, relaxing data transfer and power constraints, with obvious benefits for consumer and industrial Internet of Things (IoT), smart cities, artificial intelligence (AI), machine learning, and the 5G industry. Still, its implementation requires ultra-low power hardware solutions, mainly hindered by the von Neumann bottleneck (VNB) [1][2][3], i.e., the slow and energy-hungry data transfer between CPUs and off-chip non-volatile memories. As suggested in the latest IRDS report [4], logic-in-memory (LIM) architectures that allow executing Boolean operations directly inside the memory could circumvent the VNB. Developing LIM hardware accelerators would enable the deployment at the edge of powerful and data-intensive computing paradigms such as binarized neural networks (BNNs) [5][6][7] and hyperdimensional computing [8][9][10], which strongly rely on the energy-efficient execution of logic operations. Among LIM solutions [11][12][13][14][15][16][17], circuits based on resistive memory (RRAM) technology and the implication logic (IMPLY) offer ultra-dense back end of line (BEOL) integration. Currently, the main showstoppers [12,18] hindering the introduction of RRAM-based LIM circuits are the high energy per operation (as compared to CMOS gates), the degradation of the logic values of RRAMs during circuit operation, and the need to apply very precise voltage pulses (mV accuracy may be required [12,18,19]). Moreover, while in CMOS logic multiple operations can be computed in parallel on the same inputs, in IMPLY-based LIM circuits operations are carried out sequentially. Thus, reducing the execution time is critical.
Recently, a novel low-power non-von Neumann LIM solution, called smart-IMPLY (SIMPLY) [20], was introduced and shown to reduce the energy per operation, to solve the problem of logic state degradation of RRAM cells, and to eliminate the need for very precise voltage pulses. Moreover, active research interest has been directed towards the study of circuit implementations of the multi-input IMPLY operation: Siemon et al. [21] proposed the three-input IMPLY operation (named ORNOR by the authors). However, studies on circuit implementations of the IMPLY operation generalized to more than three inputs have never been presented.
In this paper, we demonstrate the multi-input IMPLY logic scheme in the framework of the SIMPLY architecture by exploiting multi-input (>2) operations, strongly reducing the execution time of complex logic functions (e.g., the set of logic operations needed to implement BNN inference tasks). We experimentally verify the correct circuit functionality of the core operation used in SIMPLY on TiN/Ti/HfOx/TiN devices from SEMATECH [22] (technology 4 in this work, electrical characteristics shown in Figure 1d,h). Then, for accurate and trustworthy results, circuit simulations are performed using a fully physics-based RRAM compact model from [23,24] that includes thermal effects (including self-heating), variability, and multilevel random telegraph noise (RTN). We demonstrate the feasibility of the proposed architecture by comparing the performance obtained with three RRAM technologies from the literature (a multi-layer Pt/TiOx/HfOx/TiOx/HfOx/TiN RRAM, referred to in this work as technology 1 [25,26], a TiN/HfO2/Ti/TiN RRAM, technology 2 [27,28], and a TiN/HfOx/AlOx/Pt RRAM, technology 3 [29]). The results obtained with all the technologies show that combining the proposed innovations allows realizing LIM circuits that outperform both existing LIM solutions and CMOS gates when the VNB time and energy overhead is considered, and that come close to the performance of CMOS gates alone (excluding the VNB). Finally, we benchmark the performance improvement brought by the introduction of the multi-input IMPLY operation when executing a BNN inference task, with respect to the previous IMPLY-based implementation from [30].

RRAM Physics-Based Compact Model for Circuit Simulations
To study LIM circuits, physics-based compact models that include device non-idealities, such as device-to-device and cycle-to-cycle variability, RTN, self-heating, and thermal effects, represent a very important tool that enables achieving accurate results [18] through circuit simulations. Neglecting RRAM non-idealities can easily result in poor circuit designs [12,18,31] with low reliability, or in circuits that do not work at all. In this work, we use the Verilog-A RRAM compact model from [23], which is fully physics-based and supported by the results of advanced physical multi-scale simulations. The total device resistance is modeled as the sum of a conductive filament (CF) contribution and a dielectric barrier contribution. The barrier thickness is modeled dynamically with differential equations that reproduce the device behavior during the reset operation by considering the field-driven oxygen ion drift and recombination, and during the set operation by considering the field-accelerated bond breaking and related defect generation [23,24]. Moreover, thermal effects are modeled dynamically, considering both the thermal conductance and thermal capacitance of the CF and the barrier, thus enabling accurate predictions also when using very short pulses [18]. The model includes the intrinsic variability of both resistive states and the effects of multilevel RTN and its statistical variations [24]. In particular, to introduce cycle-to-cycle variability, we add appropriate zero-mean normally distributed random noise sources on the dielectric barrier thickness during reset, and on the CF cross-section while performing a set transition [23,24].
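As an illustration of how such noise sources translate into resistance spread, the sketch below draws zero-mean Gaussian perturbations on the barrier thickness (reset) and on the CF cross-section (set) and builds the resulting HRS and LRS distributions. Every constant is a hypothetical placeholder, not a calibrated parameter of the model in [23,24]:

```python
import math
import random
import statistics

# Illustrative sketch only, NOT the calibrated compact model of [23,24]:
# all constants below are hypothetical placeholders.
RHO_CF = 3e-6         # effective CF resistivity [ohm*m] (hypothetical)
L_CF = 5e-9           # CF length [m] (hypothetical)
R0, X0 = 1e3, 0.4e-9  # barrier prefactor [ohm] / decay length [m]

def lrs_resistance(rng, mu_s=30e-18, sigma_s=5e-18):
    """LRS: ohmic conduction through the CF; zero-mean Gaussian noise is
    injected on the CF cross-section S at each set transition."""
    s = max(rng.gauss(mu_s, sigma_s), 1e-18)
    return RHO_CF * L_CF / s

def hrs_resistance(rng, mu_x=1.2e-9, sigma_x=0.1e-9):
    """HRS: the barrier contribution grows roughly exponentially with the
    barrier thickness x, perturbed by Gaussian noise at each reset."""
    x = max(rng.gauss(mu_x, sigma_x), 0.0)
    return R0 * math.exp(x / X0)

rng = random.Random(42)
lrs = [lrs_resistance(rng) for _ in range(1000)]   # LRS distribution
hrs = [hrs_resistance(rng) for _ in range(1000)]   # HRS distribution
```

With these placeholder values, the median HRS sits well above the median LRS, while both states show a cycle-to-cycle spread, which is the mechanism that erodes read margins in LIM circuits.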
The model, for all four RRAM technologies [22,25,27,29] (see Figure 1a-h) explored in this work, correctly reproduces the quasi-static I-V, the response to fast reset pulses, and the experimental cycle-to-cycle variability using a single set of parameters per technology (see Figures 1 and 2), thus highlighting the quality of the modeling approach and avoiding the need for multiple parameter calibrations to reproduce the device behavior under different operating conditions. Technology 4 was experimentally characterized using the Keithley 4200-SCS parameter analyzer.
[Figure 1 caption fragment: ... 500 µA, technology 4 in this work. (d) DC I-V curve. (h) Pulsed reset curves using a train of 10 µs pulses with different voltages. Data collected experimentally.]

The calibrated compact model was used to perform circuit simulations with the Cadence Virtuoso® software, to determine the performance and reliability of different LIM solutions.

Logic-in-Memory with RRAM Devices and the Material Implication Logic
The material implication logic is a functionally complete logic that can be effectively implemented with RRAM devices [11]. In fact, RRAMs enable its efficient implementation using a circuit architecture like the one in Figure 3a. A control logic equipped with analog tri-state buffers is needed to deliver appropriate voltages at the top electrode (TE) of each RRAM device of the array, with the devices in the array having their bottom electrodes (BE) connected to the same resistor RG. Differently from traditional logic gates, in RRAM-based LIM schemes the logic values are not encoded as voltages at circuit nodes, but as non-volatile resistive states of RRAMs (HRS for logic 0, LRS for logic 1). All possible logic gates can be defined with two fundamental operations [32,33], namely the IMPLY (2-input, 1-output operation, truth table in Figure 3d) and the FALSE (1-input, 1-output operation always yielding logic 0). The COPY function (realized with IMPLY and FALSE) allows cascading operations [12]. FALSE is executed by applying a negative voltage pulse to the RRAM (i.e., the "classical" reset operation), see Figure 3c. IMPLY between two logic values (stored in RRAMs P and Q) is executed by simultaneously applying two different positive voltage pulses at P (VCOND) and Q (VSET), see Figure 3b. The result is stored in Q, while P must preserve its state. These requirements introduce several constraints that make the design space for the definition of the VSET and VCOND values very narrow [12,18], requiring fine control down to the tens of mV, which is hard to achieve without huge overheads [12,18]. Further, the choice of RG suffers from trade-offs [12]. Moreover, the repeated execution of IMPLY causes state drifts in P and Q, eventually causing bit corruption, as reported in [12,18]. Upon this, a refresh is needed that requires moving all data back and forth to a new memory location, wasting energy and time [12,18].
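For readers unfamiliar with implication logic, a minimal behavioral sketch (plain Boolean values stand in for resistive states) shows the two primitives and one standard three-step NAND construction, illustrating their functional completeness:

```python
def IMPLY(p: int, q: int) -> int:
    """Material implication: q <- (NOT p) OR q (the result overwrites q)."""
    return int((not p) or q)

def FALSE() -> int:
    """Unconditional reset of a device to logic 0."""
    return 0

def NAND(p: int, q: int) -> int:
    """NAND from the two primitives in three steps on one support device s:
    s = FALSE; s = IMPLY(q, s) -> NOT q; s = IMPLY(p, s) -> NAND(p, q)."""
    s = FALSE()
    s = IMPLY(q, s)   # s = NOT q
    s = IMPLY(p, s)   # s = (NOT p) OR (NOT q) = NAND(p, q)
    return s

# Truth-table check: NAND alone is functionally complete.
for p in (0, 1):
    for q in (0, 1):
        assert NAND(p, q) == int(not (p and q))
```

In the physical circuit each of these three steps is one pulse sequence on the array, which is why step count directly sets the latency of any composed function.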
[Figure caption fragment (cycle-to-cycle variability): ... only at IC = 20 µA (variability parameters σx = 0.8 nm, σS = 24.9 nm²). The inset shows the experimental (square symbols) and simulated (lines) DC I-V curves, where only the kcf (kcf,20µA = kcf,200µA/100), kex (kex,20µA = kex,200µA/10), and S (S,20µA = S,200µA/10) parameters were changed to account for the different characteristics of the very narrow CF obtained with such a low current compliance. (c) Variability obtained with technology 3 [29] under quasi-static DC conditions (variability parameters σx = 0.81 nm, σS = 0.7785 nm²). (d) Variability experimentally measured on technology 4; HRS variability data are obtained using 10 µs reset voltage pulses, while LRS data are obtained with a quasi-static positive voltage ramp, as this was the only way to provide an accurate current compliance with our test setup (variability parameters σx = 0.7 nm, σS = 0.350 nm²).]


The "SIMPLY" Architecture
To overcome the issues of the traditional IMPLY architecture, SIMPLY was introduced in [20]. SIMPLY is based on the implication logic and originates from the observation that Q must change its state only when P = Q = 0 (case 1 of the truth table in Figure 3d). SIMPLY detects this condition by applying a small VREAD voltage pulse (200 mV in this work) to both P and Q, Figure 4b. The voltage at node N (VN, Figure 4a) is compared against a threshold (VTH) to determine if P = Q = 0. If so, the control logic, which is equipped with tri-state buffers as in the IMPLY architecture, pulses VSET on Q, keeping the driver of P at high impedance; Figure 4b, blue lines. Otherwise, both P and Q are kept at high impedance; Figure 4b, dashed red lines. The condition P = Q = 0 is easily detectable since VN is lower in this case than in all other cases, ensuring a sufficient margin (i.e., read margin, RM) also when considering RTN and variability, as shown in Figure 4e. Moreover, the RM increases with VREAD, Figure 4e, which allows trading a higher RM (i.e., a lower Bit Error Rate, BER) for power, thus setting precise parameter tradeoffs. SIMPLY requires no VCOND and no fine control of the voltage pulses, overcoming the tradeoff between VSET and VCOND and the need for very precise voltage control. The problem of logic state degradation is virtually resolved, as only VREAD is applied to the devices that must retain their logic state. In fact, no degradation is observed up to (at least) 10^8 cycles, reducing the BER by at least 10^8 compared to IMPLY, with no energy penalty (see Figure 5). Besides, SIMPLY needs significantly less energy as, for 3 out of 4 input combinations, only the two VREAD pulses are delivered instead of the much larger VSET and VCOND pulses, Figure 4b.
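The read-compare-conditionally-set flow can be sketched behaviorally as follows; the resistance, voltage, and threshold values are hypothetical placeholders chosen only so that the P = Q = 0 case separates cleanly from the others:

```python
# Behavioral sketch of one SIMPLY step. All component values below are
# hypothetical placeholders, not calibrated technology parameters.
V_READ, R_G = 0.2, 2e3          # 200 mV read pulse, grounding resistor
R_LRS, R_HRS = 1e3, 100e3       # logic 1 / logic 0 resistive states
V_TH = 0.07                     # comparator threshold [V]

def v_node(states):
    """Voltage at node N when V_READ is applied to the TE of every
    selected device: V_N = V_READ * R_G / (R_G + R_P || R_Q || ...)."""
    r_par = 1.0 / sum(1.0 / (R_HRS if s == 0 else R_LRS) for s in states)
    return V_READ * R_G / (R_G + r_par)

def simply(p, q):
    """Read P and Q together; pulse V_SET on Q only if the comparator
    detects P = Q = 0 (the only case where Q must flip to 1)."""
    if v_node([p, q]) < V_TH:
        return p, 1              # conditional SET on Q
    return p, q                  # otherwise both drivers stay at high-Z

# SIMPLY reproduces the IMPLY truth table: q' = (NOT p) OR q.
for p in (0, 1):
    for q in (0, 1):
        assert simply(p, q)[1] == int((not p) or q)
```

With these placeholder values, VN is about 8 mV for P = Q = 0 and above 130 mV in all other cases, so a single comparator threshold separates the two groups with ample margin.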
To ensure high energy efficiency, we use the low-power and compact comparator design from [30], implemented in a 45-nm technology [34] (Figure 4d), which consumes 8 fJ per comparison when T is in the range 0 to 85 °C and VDD is 2 V. The energy breakdown, including the comparator contribution, for all input combinations for IMPLY and SIMPLY stresses the remarkable performance improvement, as SIMPLY reduces the energy per operation by 4× on average and up to 57× for the P = 0, Q = 1 input configuration, when considering technology 3 as an example (see Table 1). Depending on the RRAM technology, the energy consumption of the FALSE operation may be higher or lower than that of the IMPLY operation (see Tables 1 and 2). So, it is important to also reduce the energy of the FALSE. This can be easily done in the SIMPLY architecture by preventing the use of the high |VFALSE| when a device is already in HRS. The state of the device is read using the same VREAD and comparator threshold used for the IMPLY operation, and the analog buffers are enabled by considering the inverted output of the comparator, see Figure 4c. This smart FALSE (sFALSE [30]) operation results in relevant energy savings (see Table 2).
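The sFALSE flow can be sketched in the same behavioral style (all component values are again hypothetical placeholders):

```python
# Sketch of the smart FALSE (sFALSE): the device is read first and the
# costly |V_FALSE| pulse is skipped if it is already in HRS.
# All component values are hypothetical placeholders.
V_READ, R_G, R_LRS, R_HRS, V_TH = 0.2, 2e3, 1e3, 100e3, 0.07

def sfalse(state):
    """Returns (new_state, reset_pulse_delivered). The analog buffer is
    enabled by the INVERTED comparator output: the pulse is applied only
    when the read says the device is NOT already at logic 0."""
    r = R_HRS if state == 0 else R_LRS
    v_n = V_READ * R_G / (R_G + r)       # single-device read
    if v_n >= V_TH:                      # device reads as LRS (logic 1)
        return 0, True                   # deliver the reset pulse
    return 0, False                      # already HRS: pulse skipped
```

The logical result is always 0, but the expensive reset pulse is delivered only when the device state actually has to change, which is where the energy saving of Table 2 comes from.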
In addition, both the traditional IMPLY and the SIMPLY architectures can be implemented in a crossbar array [12,30]. An example of the SIMPLY crossbar implementation is shown in Figure 6. FET devices in the array periphery are used to implement RG, to connect adjacent columns on-demand to realize COPY between any two devices, and to enable specific columns of the array. The parallel computation of SIMPLY or sFALSE operations can be implemented by increasing the number of comparators in the architecture, as discussed in [30] and shown in Figure 6.

[Figure caption fragment (VN distributions, technology 3 [29]): VN distributions resulting from circuit simulations including variability and RTN (50 trials) when P = Q = 0 (violet bands) and P ≠ Q (grey bands) for the 2-SIMPLY operation at different VREAD. The read margins (RM, blue arrows) and associated threshold voltages (VTH, orange stars) for the comparator are evidenced. Black whiskers indicate the distributions. Red crosses indicate outliers due to RTN. VN when P = Q = 1 is always much higher than in all other cases (thus not reported in these box plots).]

[Figure 5 caption fragment (technology 3 [29]): RP degrades when repeating IMPLY for input combination P = Q = 0; RQ degrades when repeating IMPLY for input combination P = 1 and Q = 0. The degradation depends on the initial values of RQ and RP (worst cases for RQ and RP shown here, considering variability and RTN). Bit corruption occurs potentially after only 7 cycles (solid black line). No noticeable degradation occurs using SIMPLY (only worst case reported) up to at least 10^8 cycles, also when performing multi-input operations (i.e., n-SIMPLY with n = 2, 3, 4; dotted red line, red triangles, and red squares, respectively).]

Multi-Input Operations: "n-SIMPLY"
In traditional static complementary CMOS logic gates, the computation of more complex operations (e.g., large fan-in operations) involves the use of additional transistors, and therefore an increase in the required circuit area and complexity. In IMPLY-based architectures, the computation of complex operations instead translates to an increased number of steps of IMPLY and FALSE operations. Therefore, in these architectures, minimizing the number of steps is critical to improve the circuit performance. As suggested by theoretical works, the IMPLY logic can be extended to multi-input (>2) operations [32,35] by simultaneously applying VSET to one device (Q) and VCOND to many devices (P, S, T, ...). It is only recently that the three-input IMPLY operation (i.e., equivalent to OR(Q, NOR(P, S))) has been demonstrated by performing circuit simulations of a conventional IMPLY architecture using a physics-based compact model calibrated on a Pt/TaOx/Ta RRAM technology [21], and it was shown to reduce the number of steps required for implementing a full adder. However, the reliability of the traditional IMPLY architecture worsens when the number of inputs of the IMPLY operation is increased [21]. Nevertheless, generalizing the IMPLY operation to more than three inputs allows further reducing the number of computing steps, by executing in a single step the more complex function OR(Q, AND(NOT P, NOT S, NOT T, ...)) = OR(Q, NOR(P, S, T, ...)), hereafter called n-IMPLY (n being the number of inputs), which generalizes the 2-input IMPLY (henceforth 2-IMPLY). Just like in 2-IMPLY, also in n-IMPLY Q must change its state to logic 1 only if all inputs are concurrently 0, which is exactly the condition detected by SIMPLY. So, n-IMPLY can be implemented in SIMPLY (henceforth n-SIMPLY) by applying VREAD to all n devices simultaneously with no architectural changes.
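Functionally, the n-IMPLY operation can be captured in one line, which also makes the equivalence with the 2-input IMPLY and with the ORNOR of [21] explicit:

```python
def n_imply(q, *inputs):
    """n-IMPLY: Q <- OR(Q, NOR(P, S, T, ...)). Q flips to logic 1 only
    when ALL inputs are 0 -- exactly the condition that the SIMPLY
    comparator already detects, so no architectural change is needed."""
    return int(q or not any(inputs))

# n = 2 reduces to the classical 2-input IMPLY: q' = (NOT p) OR q.
assert all(n_imply(q, p) == int((not p) or q) for p in (0, 1) for q in (0, 1))

# n = 3 is the ORNOR operation of [21]: OR(Q, NOR(P, S)).
assert n_imply(0, 0, 0) == 1   # all devices at 0 -> Q is set
assert n_imply(0, 1, 0) == 0   # any input at 1, Q = 0 -> Q unchanged
assert n_imply(1, 1, 1) == 1   # Q = 1 is always preserved
```

Because the whole function collapses into a single conditional SET, one n-IMPLY step replaces a sequence of 2-IMPLY steps, which is the source of the step-count savings quantified later.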
We simulated the n-SIMPLY (with n = 2, 3, 4) for the 4 RRAM technologies explored in this work using the Cadence® Virtuoso software and verified its correct functionality by comparing the circuit parameters and energy per operation (see Table 3), including variability and RTN. While the minimum and maximum energy per n-SIMPLY operation do not change considerably for increasing values of n, the RM decreases when the number of RRAMs read in parallel increases. Nevertheless, the RM with n = 2, 3, 4 was verified to be always large enough, experimentally on technology 4 (see Figure 7a), and through simulations on the other three technologies, as shown in Figure 7b. Moreover, in all cases, no logic state degradation is observed (see red curves in Figure 5).

Table 3. Circuit parameters and energy consumption of the n-SIMPLY operation (n = 2, 3, 4) with tpulse = 1 ns.
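The RM reduction with increasing n follows directly from the voltage-divider picture of the parallel read: with all n devices in HRS, the parallel resistance RHRS/n drops as n grows, pushing VN towards the closest competing case (one device in LRS). A numerical sketch with hypothetical component values:

```python
# Why the read margin (RM) shrinks with n. All values are hypothetical
# placeholders, chosen only to expose the trend.
V_READ, R_G, R_LRS, R_HRS = 0.2, 2e3, 1e3, 100e3

def v_node(resistances):
    """Divider voltage at node N for the given set of parallel devices."""
    r_par = 1.0 / sum(1.0 / r for r in resistances)
    return V_READ * R_G / (R_G + r_par)

def read_margin(n):
    """Separation between the 'all n inputs at 0' level (lowest V_N) and
    the nearest other case (exactly one device in LRS)."""
    v_all_hrs = v_node([R_HRS] * n)
    v_one_lrs = v_node([R_LRS] + [R_HRS] * (n - 1))
    return v_one_lrs - v_all_hrs

rms = {n: read_margin(n) for n in (2, 3, 4)}   # RM decreases with n
```

With these values the margin shrinks from roughly 126 mV at n = 2 to roughly 120 mV at n = 4; the absolute numbers are meaningless outside this toy example, but the monotonic decrease mirrors the simulated and measured trend of Figure 7.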



Full Adder Implementation with n-SIMPLY
Exploiting n-SIMPLY, we propose a novel implementation of a LIM 1-bit full adder (FA). The schematic of the array and the sequence of operations are shown in Figure 8a,d. We simulated the FA at 0.5 GHz (see the example in Figure 8c) using the 4 RRAM technologies and compared the total energy consumption (including the comparator and the effects of variability and RTN) in Figure 8b. Not surprisingly, technology 3 [29], which is the technology with the lowest current compliance (IC = 100 µA, see Figure 1c), results in the lowest energy consumption. We considered the worst-case result (i.e., the Energy FA Max column in Figure 8b) obtained with technology 3 and benchmarked it in detail against existing LIM solutions in the literature and its CMOS counterpart in Tables 4 and 5, respectively. Considering the worst-case energy contribution for each 1-bit FA operation results in an energy consumption overestimation that, as a first-order approximation, provides enough room to account for additional energy dissipated by components in the peripheral circuitry that were not included in the circuit simulations (i.e., the analog tri-state buffers). The proposed FA outperforms all existing IMPLY-based LIM solutions (both simulation and experimental works [12,20,36,37,38]) in terms of the number of steps (delay) and energy consumption, using few devices and lowering the energy-delay product (EDP) by a factor >10^3, bringing it much closer to the CMOS one. We compare the performance of the proposed solution vs. CMOS (with and without considering the VNB energy and time overhead) in Table 5, considering the parallel execution of 512 32-bit FA operations (simple ripple carry architecture), which entails 4 kB of data, the common memory page size [39].
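Since the exact n-SIMPLY step sequence is given in Figure 8 and not reproduced here, the sketch below only states the target logic of the 1-bit FA and the Table 5 benchmark configuration (32-bit ripple-carry addition) as a functional reference:

```python
def full_adder(a, b, cin):
    """Target logic of the 1-bit FA (the n-SIMPLY step sequence that
    realizes it is given in Figure 8 and is not reproduced here)."""
    s = a ^ b ^ cin                          # sum = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)   # carry = majority(a, b, cin)
    return s, cout

def ripple_carry_add(x, y, width=32):
    """32-bit ripple-carry adder; 512 of these running in parallel on
    4 kB of data is the benchmark configuration of Table 5."""
    carry, out = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out                               # final carry-out is dropped
```

In the ripple-carry configuration the carry chain is the critical path, so the per-step latency of each 1-bit FA multiplies directly into the total delay used in the EDP comparison.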
The proposed solution strongly outperforms different CMOS FA solutions [40][41][42] (>10^6 improvement in EDP) when the VNB data exchange overhead [39] is included. Projections show that substantial improvements are obtained by using devices formed at lower IC [43] and shorter pulses [44], as in Table 5, achieving the performance of CMOS gates without the huge VNB penalty (up to a ≈2.4 × 10^10 improvement in EDP as compared to a CMOS FA that suffers from the VNB).

Tech.     Energy FA Min    Energy FA Max
Tech. 1   6.3 pJ           9.9 pJ
Tech. 2   2.5 pJ           5.8 pJ
Tech. 3   2.2 pJ           4.2 pJ
Tech. 4   5.94 pJ          7.28 pJ

Notes to Tables 4 and 5: * Reported means that the value was explicitly reported by the authors in the paper; estimated means that the corresponding value has been inferred from data in the paper, although not explicitly reported by the authors. ** This information can be trusted only if a physics-based model is used. *, ** Estimates with (w/) and without (w/o) including the VNB energy and delay overhead for reading and writing 4 kB of data [39]. ** CMOS FA performances were estimated by projecting the times and energies of different 1-bit FA schemes in different CMOS technologies (i.e., 0.18 µm, 45 nm, and 10 nm) from [40][41][42] combined in a ripple carry configuration. ***, **** Worst-case energy estimates considering the maximum energy for each single 1-bit FA operation using technology 3, see Figure 8b. This results in an energy overestimation that approximately accounts for the additional energy overhead possibly introduced by the peripheral circuit (i.e., the analog tri-state buffer contributions). **** Projections using optimized devices with IC = 10 nA [43] and f = 1 GHz [44].

Binarized Neural Networks Applications
As the adoption of deep learning and neural networks becomes more and more pervasive, the demand for energy-efficient hardware accelerators is rapidly growing. When considering neural networks, the most effective way to improve energy efficiency is to avoid the VNB by performing computations in-memory [45][46][47][48]. A common approach exploits resistive memory crossbar arrays to implement the vector-matrix multiplication in the analog domain in a single step [45]. However, reliability issues affecting RRAM devices, such as the large cycle-to-cycle variability, limit the number of bits that can be reliably stored in a single device [49], suggesting that low-bit-precision neural networks are a more suitable solution for current state-of-the-art RRAM devices. The extreme case of low-bit-precision neural networks is represented by BNNs, which have been shown to retain high classification accuracy despite the use of single-bit neuron weights and activations [5,50,51]. While the in-memory computation of the analog vector-matrix multiplication is extremely attractive [6,7,50,52], it has some limitations and tradeoffs. In fact, this approach lacks the possibility to reconfigure the type of operation computed in-memory.
Since neurons in BNNs perform logic operations, the BNNs inference can be implemented in the SIMPLY architecture, as shown in [30] where a BNN was implemented using the 2-SIMPLY. Compared to an analog implementation, the SIMPLY architecture trades reconfigurability for larger computation latency. To address the higher latency issue, the multi-input IMPLY operation can be used as discussed below.
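As a concrete illustration of how implication alone composes Boolean functions, the sketch below simulates the two-input IMPLY primitive and builds XNOR from a generic, unoptimized implication sequence. The step count and number of support devices here are larger than in the optimized SIMPLY sequences discussed below (9 steps with 2 support devices using the 2-SIMPLY, 5 steps with 1 device using the 3-SIMPLY); this decomposition is illustrative only and is not the paper's sequence.

```python
def imply(p, q):
    """Material implication p -> q. In IMPLY-based RRAM logic the
    result overwrites the state of the target device q."""
    return (not p) or q

def xnor_via_imply(a, b):
    # Support devices s1..s5, all initialised to logic 0 (FALSE operation).
    s1 = s2 = s3 = s4 = s5 = False
    s2 = imply(b, s2)    # s2 = NOT b
    s3 = imply(s2, s3)   # s3 = b (copy via double implication)
    s3 = imply(a, s3)    # s3 = a -> b
    s1 = imply(a, s1)    # s1 = NOT a
    s2 = imply(s1, s2)   # s2 = b -> a
    s4 = imply(s2, s4)   # s4 = NOT (b -> a)
    s4 = imply(s3, s4)   # s4 = NOT(a -> b) OR NOT(b -> a)
    s5 = imply(s4, s5)   # s5 = (a -> b) AND (b -> a) = XNOR(a, b)
    return s5

# Exhaustive truth-table check.
assert all(xnor_via_imply(a, b) == (a == b) for a in (0, 1) for b in (0, 1))
```

The point of the sketch is that every intermediate value must be materialized in a support device, which is exactly why reducing the step count, as the multi-input operation does, matters so much for latency.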
As shown in [30], in BNNs, the multiplications between the inputs and each neuron's weights are implemented with bitwise XNOR operations, the accumulation as the popcount operation, and the activation as a comparison with a trained threshold. In addition, batch normalization layers can be implemented with full adders, for which we have already discussed the optimized n-SIMPLY implementation. The effectiveness of the multi-input IMPLY operation in reducing the number of computing steps is analyzed for each logic function, since it strongly depends on the specific sequence of operations involved. As shown in Figure 9, by using the 3-SIMPLY the XNOR operation is computed in 5 steps with a single additional device, instead of requiring 9 steps and 2 additional support devices when using only the 2-SIMPLY. For the popcount and activation function operations, different optimization strategies that minimize the number of computing steps for different fan-in (i.e., n) of the n-SIMPLY operation can be employed. As shown in [30], the accumulator can be implemented as a chain of log2(#input bits) 1-bit half adders (HAs), where the first HA receives as inputs the bit to be accumulated and its current output (i.e., S_0 in Figure 10a), while the following HAs receive as inputs the carry-out of the previous HA and their current output (i.e., as in Figure 10a, but also for more than two HAs). When accumulating a sequence of bits, all the bits are input sequentially to the HA representing the LSB. The operations of the subsequent HAs are performed only after a number of input bits equal to 2 raised to their relative bit position has been summed (e.g., the HA at bit position 3 is activated only after 2^3 input bits have been summed). Thus, the length of the HA chain grows as more input bits are summed. By using the 3-SIMPLY operation, the number of steps for each HA operation is reduced from 13 to 11 (see Supplementary Figure S1).
An effective strategy that further reduces the total number of computing steps involves using the n-SIMPLY operation to compute the equivalent logic function resulting from a sequence of more than one 1-bit HA. Thereby, the equivalent carry-out of a sequence of more than one HA (output C in Figure 10a) can be computed in a single step by exploiting the n-SIMPLY operation (with n = #HAs + 2). For instance, the carry-out can be computed with a 4-SIMPLY or a 5-SIMPLY when a sequence of 2 or 3 HAs is considered, respectively. Moreover, the number of intermediate computing steps required to calculate the output bits S_i (see Figure 10a) is reduced when using the n-SIMPLY, as shown in Figure 10b and Supplementary Figures S2–S4: for a sequence of 2, 3, and 4 HAs, fewer steps are required than the 26, 39, and 52 steps needed when using only the 2-SIMPLY. The overall number of computing steps scales rapidly with the number of output bits, but using the n-SIMPLY with n up to 3 (i.e., optimizing only the chain of 2 HAs) saves around 15% of the steps, while n > 3 leads to step savings above 25%. However, most of the reduction in the number of computing steps is already achieved with n up to 4, while n > 4 provides only a limited additional improvement, as shown in Figure 10d, which does not justify the risk of reducing the circuit reliability due to the smaller available RMs. For the activation (i.e., the comparison with a trained threshold), we improved the 2-SIMPLY implementation from [30] by removing unnecessary computing steps.
The resulting sequence of computing steps can be built using the flowchart reported in Supplementary Figure S5, which consists of the computation of a bitwise XNOR between the input and the trained threshold, followed by additional AND/OR operations. With this method, the number of computing steps scales exponentially with the number of compared input bits (i.e., 2-SIMPLY baseline in Figure 11). Thus, we also consider an alternative approach in which the number of computing steps scales linearly with the number of input bits, reducing the number of computing steps when more than 9 bits are compared (see 2-SIMPLY Opt. Wide Words in Figure 11 and Supplementary Figure S6). The same two approaches can be optimized with the 3-SIMPLY operation (see 3-SIMPLY Only XNOR Opt. and 3-SIMPLY Opt. in Figure 11), providing a considerable reduction in the number of computing steps (see Supplementary Figures S7 and S8). The 4-SIMPLY operation enables additional step savings (see 4-SIMPLY Opt. in Figure 11 and Supplementary Figure S9).
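One way to realize a linearly scaling comparison, sketched below under our own assumptions (the exact flowcharts are in the Supplementary Figures), is an MSB-first comparator that combines a running bitwise-XNOR equality prefix with AND/OR gates; each bit adds a constant number of gate evaluations, hence the linear step count.

```python
from itertools import product

def geq_threshold(x_bits, th_bits):
    """Bit-serial x >= TH comparator, MSB first, built only from
    XNOR/AND/OR/NOT-style operations so that the number of steps
    scales linearly with the word width. Illustrative sketch, not
    the paper's optimized SIMPLY sequence."""
    ge, eq = 0, 1
    for x, t in zip(x_bits, th_bits):
        ge = ge | (eq & x & (1 - t))  # first position where x=1 and TH=0
        eq = eq & (1 - (x ^ t))       # equality prefix = running bitwise XNOR
    return ge | eq  # strictly greater at some bit, or fully equal

# Exhaustive check for 3- and 4-bit words (MSB-first bit lists).
for n in (3, 4):
    for xv, tv in product(range(2 ** n), repeat=2):
        xb = [(xv >> i) & 1 for i in reversed(range(n))]
        tb = [(tv >> i) & 1 for i in reversed(range(n))]
        assert geq_threshold(xb, tb) == (xv >= tv)
```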

Thus, some applications, such as BNN inference, can largely benefit from a limited increase in the parallelism of the n-SIMPLY operation. In fact, most of the step reduction for the 1-bit FA, the bitwise XNOR operation, the popcount operation, and the activation function is achieved by using only up to the 4-SIMPLY operation. Further increasing the parallelism of the n-SIMPLY operation would not provide relevant additional energy savings, while causing a reduction in the available RM, thus potentially reducing the circuit reliability and increasing the instruction set complexity.
To benchmark the latency improvement brought by the n-SIMPLY operation when performing an inference task, we consider the results of the 2-SIMPLY implementation of a BNN reported in [30], which consists of a single fully connected hidden layer with 1000 neurons performing a handwritten digit recognition task, achieving an accuracy of 91.4%. The use of n-SIMPLY reduces the number of computing steps from 190,657 to 135,539, therefore resulting in a remarkable 29% latency reduction leading to a 542 µs inference time at 0.5 GHz, which further highlights the advantages of the SIMPLY architecture as a solution for ultra-low power devices for edge computing applications.
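The reported figures are mutually consistent, as the quick check below shows. Note that matching the 542 µs inference time at 0.5 GHz requires assuming two clock cycles per computing step (e.g., a read phase followed by a conditional write phase); this factor is our inference, not a figure stated in the text.

```python
steps_2simply = 190_657   # computing steps with the 2-SIMPLY implementation [30]
steps_nsimply = 135_539   # computing steps with the multi-input n-SIMPLY

# Latency saving: (190,657 - 135,539) / 190,657 ≈ 28.9%, i.e. the reported 29%.
reduction = 1 - steps_nsimply / steps_2simply
assert round(100 * reduction) == 29

# Inference time: 542 µs at 0.5 GHz implies 2 clock cycles per step
# (assumption: one read phase plus one conditional write phase).
f_clk = 0.5e9           # Hz
cycles_per_step = 2     # inferred, not stated in the excerpt
t_inf = steps_nsimply * cycles_per_step / f_clk
assert abs(t_inf - 542e-6) < 1e-6  # ≈ 542 µs
```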
Additionally, unlike conventional IMPLY architectures, SIMPLY is not constrained to employ bipolar RRAM devices [18]: it can use unipolar RRAM devices as well as other memory technologies such as phase change memories (PCM) [53], ferroelectric FETs (FeFETs) [54], ferroelectric tunnel junctions (FTJs) [54], and virtually any 2- or 3-terminal non-volatile memory. Therefore, device/circuit co-design strategies can easily be implemented to further improve performance. For instance, memory technologies with ultra-low power consumption, such as FeFETs and FTJs [54], may be targeted to improve power consumption. Further, devices with higher endurance than most RRAM technologies, such as PCM [54] and STT-MRAM [53], could be evaluated on the SIMPLY architecture.

Conclusions
In this work, we presented the advantages of the multi-input IMPLY operation performed on SIMPLY (n-SIMPLY), a new LIM edge computing architecture that overcomes all the relevant issues of traditional IMPLY solutions. The advantages of multi-input SIMPLY operations were analyzed in detail using a comprehensive physics-based RRAM compact model calibrated on three different RRAM technologies from the literature and validated by the experimental analysis carried out on fabricated devices. The performance comparison of a 1-bit full adder based on n-SIMPLY and state-of-the-art LIM alternatives in the literature shows that n-SIMPLY allows realizing a LIM solution that approaches the performance of CMOS gates while bypassing the VNB, with a huge improvement in BER (by a factor of at least 10^8) and EDP (up to a factor of 10^10). Moreover, we analyzed the advantages brought by n-SIMPLY on a BNN inference task and showed that it enables a latency reduction of 29% with respect to previous studies.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.