Readout Circuit Design for RRAM Array-Based Computing in Memory Architecture

Xu, Xingjie; Wang, Aili; Shui, Yuhang

doi:10.3390/electronics13132478


by Xingjie Xu, Aili Wang * and Yuhang Shui

Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Haining 314400, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(13), 2478; https://doi.org/10.3390/electronics13132478

Submission received: 15 May 2024 / Revised: 6 June 2024 / Accepted: 17 June 2024 / Published: 25 June 2024

(This article belongs to the Special Issue Analog and Mixed-Signal Circuit Designs and Their Applications)

Abstract

In recent advancements, the traditional von Neumann architecture has been challenged by the computational needs of AI. This is due to its high power and data transfer costs. As a solution, the computing-in-memory (CIM) architecture, which combines storage and computation, has gained attention for its superior computational power and energy efficiency. Within CIM, using resistive random access memory (RRAM) arrays, the readout circuit, which converts analog outputs from multiply–accumulate operations into digital signals, faces limitations due to its area and power consumption. There are mainly two types of CIM readout circuits for analog types: the traditional ADC type and the non-traditional type. This paper presents two types of readout circuit designs. The first is a low-power, compact successive approximation register (SAR) analog-to-digital converter (ADC) readout circuit. The core circuit is an 8-bit SAR ADC operating at 70 MS/s. It incorporates a linearity-improved bootstrapped switch to minimize leakage and enhance linearity, whose spurious-free dynamic range (SFDR) has been improved by 10.1 dB from 76.78 dB to 86.88 dB, and whose signal-to-noise and distortion ratio (SNDR) has increased by 4.56 dB from 75.13 dB to 79.69 dB. The delay of a transconductance-enhanced dynamic comparator is reduced from 184 ps to 149 ps, presenting a performance improvement of approximately 20%. Concurrently, the energy consumption decreased from 178 μm to 132 μm, attaining an improvement of roughly 26%. A “sandwich” capacitor structure is used that reduces the overall area of the layout. After layout and post-simulation, this circuit occupies only 49.6 μm × 51.5 μm, consumes 553 μW power, has a SINAD of 46.22 dB, and has an SFDR of 57.21 dB. The second is a current controlled oscillator (CCO)-type readout circuit, which comprises a CCO oscillator with low process-sensitivity. 
The readout circuit also utilizes an op-amp and current mirrors in a negative feedback loop, ensuring a constant voltage across the RRAM arrays. The frequency generated by the CCO is controlled by the current and quantified by a counter, which supports a different weight for each RRAM column without additional digital weighting. This circuit achieves 95-level resolution, 5.2 μs delay, and an average power consumption of 183.1 μW. A comparative analysis highlights that traditional ADC readout circuits offer high resolution and speed but are limited by their high power and area costs, often overshadowing CIM arrays’ benefits. Thus, for applications with more lenient resolution and speed requirements, non-traditional readout circuits present considerable advantages.

Keywords:

CIM; RRAM; AI; analog-to-digital converter; SAR ADC; CCO

1. Introduction

In the field of information technology, an AI revolution is unfolding. The computational prowess required for AI is predominantly provided by proprietary hardware chips. However, due to the deceleration of Moore’s law and bottlenecks associated with semiconductor advancement, hardware performance enhancements have not kept pace with the demands of algorithms, leading to the so-called “von Neumann bottleneck”. To transcend these constraints, various solutions, particularly CIM technology, have been proposed. CIM synergistically integrates storage and computation, thereby reducing data transfer needs. At the algorithmic level, the convolutional neural network (CNN) is suitable for feature extraction and data classification tasks. On the hardware front, RRAM displays competitive prowess within the non-volatile memory market due to its speed, low power consumption, and simplicity. Integrating RRAM into neural networks allows for the analog-domain computation of multiply-and-accumulate (MAC), mapping weight parameters to resistance values, which accelerates computing and improves energy efficiency ratios [1].

The parallel processing offered by RRAM arrays facilitates rapid and efficient data handling. Nonetheless, in certain CIM chips, the power consumption associated with the readout circuitry can be nearly 70%, while its area cost might reach up to 90%, severely constraining the energy efficiency of CIM [2]. Consequently, innovating high energy-efficient and space-conserving readout circuits becomes critical within the realm of CIM research.

There are mainly two types of CIM readout circuits for analog types: the traditional ADC type and the non-traditional type. For the first type, [3] features a variable precision circuit based on split-capacitance for 5/6 bit accuracy. This design reduces the ADC area by employing split-capacitance in the DAC and adjusting the total capacitance at the DAC output node by controlling the MSB capacitor. The authors of [4] utilize a Flash ADC readout circuit designed for 4-bit array operations. The choice of Flash ADC is attributed to its rapid speed. The strategy implements input sparsity sensing (ISS) with the Flash ADC to enhance chip energy efficiency and integration. In [5], a shared scheme is suggested where a single readout circuit is utilized across multiple columns. Further, [6] achieves a hybrid approach with analog-domain floating-point and single-slope A/D conversion to derive 2-bit exponent codes and 5-bit mantissa codes. For the second type, [7] introduces a voltage controlled oscillator (VCO)-based readout circuit, which boasts a smaller area and reduced power consumption compared to shared traditional ADC structures, making it applicable on a per-column basis. In [8], a scalable integrate-and-fire (IF) readout circuit is described, demonstrating greater suitability for spiking neural network algorithms.

This research presents the design of two readout circuit types of CIM based on RRAM arrays: a traditional ADC-based circuit and a non-traditional circuit. Section 2 outlines the design of an SAR ADC for traditional ADC readouts. Section 3 details the design of a readout circuit based on a CCO circuit. Section 4 discusses simulation results for the readout circuits described in Section 2 and Section 3. Section 5 summarizes these two types of readout circuits.

2. SAR ADC Design for Traditional ADC Readout Circuits

This section focuses on designing a low-power, compact SAR ADC architecture for traditional ADC-type readout circuits. The module converts analog voltage signals, obtained through current sampling circuits, into digital signals. We first present the overall framework for an 8-bit SAR ADC, elaborating on performance metrics and key technologies related to readout circuits in CIM chips. We then detail the design process for the components of the SAR ADC circuit, including the sampling switch circuit, the dynamic latch comparator, and the DAC capacitor array.

The overall structure and timing diagram of the designed SAR ADC are depicted in Figure 1. The design includes an improved bootstrapped switch, a capacitive digital-to-analog converter (CDAC) circuit, an enhanced dynamic comparator, and asynchronous SAR sequencing logic. During the sampling phase, the differential inputs are sampled onto the CDAC. Triggered by the falling edge of the sampling clock, the comparator activates to resolve the most significant bit (MSB). The MSB is then stored in the successive approximation register and used to switch the higher-order capacitors in the CDAC, and the process continues until the least significant bit (LSB) is resolved. In the timing diagram, CLKs represents the sampling period and Clkc represents the comparison period of the comparator (Figure 1 draws it as a uniform period, although the actual comparison time varies with the voltages being compared); a low-to-high transition of Clkn marks the completion of the nth comparison.
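The bit-by-bit search described above can be illustrated with a minimal behavioural sketch. It models an ideal, single-ended, binary-weighted converter only; the differential CDAC, the split/monotonic switching and the asynchronous timing of the actual design are not represented. The function name `sar_convert` and the default `v_ref = 1.2` V are assumptions made for illustration (the text gives the supply voltage but not the reference level).

```python
def sar_convert(v_in, v_ref=1.2, n_bits=8):
    """Ideal SAR search: return the n-bit code for v_in in [0, v_ref).

    v_ref = 1.2 V is an illustrative assumption, not a value from the paper.
    """
    code = 0
    for bit in reversed(range(n_bits)):          # MSB first, LSB last
        trial = code | (1 << bit)                # tentatively set this bit
        v_dac = v_ref * trial / (1 << n_bits)    # ideal DAC level for the trial code
        if v_in >= v_dac:                        # comparator decision
            code = trial                         # keep the bit, otherwise drop it
    return code


if __name__ == "__main__":
    for v in (0.0, 0.61, 1.19999):
        print(f"v_in = {v} V -> code {sar_convert(v)}")
```

Each loop iteration corresponds to one comparison cycle, so an 8-bit conversion needs eight comparator decisions after the sampling phase.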

Design highlights are shown as follows:

Improved Bootstrapped Switch: From the hold to the sampling phase, the linearity is enhanced by reducing the charge leakage related to Vin through the addition of a MOS transistor;
Upgraded Dynamic Comparator: During the latching phase, the comparison speed is raised by increasing the equivalent transconductance (gm);
CDAC Capacitor Array: Employing a “sandwich” three-layer capacitor structure augmented with a dummy structure, this element features a minimal area and high matching precision;
Switching Strategy: For the top four bits, a split-capacitor scheme is used to hold the common-mode voltage at $V_{ref}/2$. The lower four bits employ a monotonic switching strategy to avoid excessive capacitor mismatches caused by excessively small unit capacitors;
SAR Logic: To enhance the overall comparison efficiency, we adopt an asynchronous timing scheme in SAR logic [9].

2.1. Linearity-Improved Bootstrapped Switch

To mitigate the effect of $V_{IN}$ variation on sampling accuracy, researchers have proposed a gate voltage bootstrapped switch that reduces the nonlinearity of the on-resistance $R_{on}$. The schematic of the classical gate voltage bootstrapped switch is shown in Figure 2. When CK = 0, the switch is in the hold phase, charging the bootstrapped capacitor $C_{B}$ up to VDD; when CK = 1, the switch enters the sampling phase, and the gate of the sampling switch $M_{1}$ is connected to the bootstrapped capacitor $C_{B}$, resulting in a gate voltage of $V_{IN} + V_{DD}$. Therefore, the gate-to-source voltage of $M_{1}$ becomes a fixed $V_{DD}$, decoupling its resistance from input variations, effectively reducing switch nonlinearity and enhancing overall sampling accuracy.

The inclusion of $M_{4}$ in the classical gate voltage bootstrapped switch is to limit the $V_{DS5}$ voltage in $M_{5}$ to $V_{X} - V_{TH4}$, preventing excessive source–drain voltage in $M_{5}$ from causing breakdown [10]. However, the introduction of $M_{4}$ introduces new nonidealities; as the mode changes from hold to sampling, the voltage at point X rises from 0 V to $V_{X1}$ due to the presence of parasitic capacitance $C_{P}$, as shown in the equation below:

$$V_{X1} = V_{IN} + \frac{C_{B}}{C_{B} + C_{P}} V_{DD} \qquad (1)$$

where $C_{B}$ is the bootstrapped capacitor. When $M_{5}$ is off and point Y is floating during the sampling phase, if $V_{DD} - V_{X1}$ is less than $V_{TH4}$, $M_{4}$ does not fully turn off, causing charge to leak from X to Y. Since $V_{X1}$ is related to $V_{IN}$, charge leakage becomes related to $V_{IN}$, introducing nonlinearity [11].
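As a quick numerical illustration of Equation (1), the sketch below evaluates $V_{X1}$ for a few input levels. $C_{B}$ = 135 fF is the value chosen in this design and 1.2 V is the supply used in the simulations; the parasitic capacitance `c_p = 15e-15` F is a purely hypothetical value, and the function name `v_x1` is introduced here only for illustration.

```python
def v_x1(v_in, c_b=135e-15, c_p=15e-15, v_dd=1.2):
    """Equation (1): voltage at node X after C_B shares charge with C_P.

    c_p is a hypothetical parasitic value used only for illustration.
    """
    return v_in + c_b / (c_b + c_p) * v_dd


if __name__ == "__main__":
    for v_in in (0.0, 0.3, 0.6):
        print(f"V_IN = {v_in:.1f} V -> V_X1 = {v_x1(v_in):.3f} V")
```

Under these assumptions the boosted voltage tracks $V_{IN}$ one-for-one, which is why any leakage that depends on $V_{X1}$ is also input-dependent.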

To address the nonlinearity issue of the bootstrapped switch, an improved gate voltage bootstrapped switch is proposed, as depicted in Figure 2. This structure incorporates an additional PMOS transistor $M_{11}$ to reduce nonlinearity. During the transition from hold to sampling, $M_{11}$ is turned on, setting the voltage at point Y to $V_{DD}$ and ensuring $V_{DS4}$ of $M_{4}$ is much less than $V_{TH4}$. Therefore, by turning off $M_{4}$, the influence of $V_{X}$ is minimized, which, in turn, decreases the leakage from X to Y through $M_{4}$, thereby enhancing the dynamic performance of the bootstrapped switch.

Several considerations for the gate voltage bootstrapped switch design are as follows: To achieve area optimization while ensuring stable elevation of the bootstrapped switch’s gate-to-source voltage and to fulfill the SNDR design requirement, the bootstrapped capacitor $C_{B}$ is set to 135 fF; other MOS transistors are sized relatively small to diminish the effects of parasitic capacitance.

The dynamic performance of the proposed linearity-improved bootstrapped sampling circuit is compared with the traditional one in Figure 3. The SFDR improved by 10.1 dB from 76.78 dB to 86.88 dB, and the SNDR improved by 4.56 dB from 75.13 dB to 79.69 dB at 70 MS/s.

2.2. Transconductance-Enhanced Dynamic Comparator

In SAR ADC design, comparators are utilized primarily for comparing voltages across the CDAC capacitor array. Comparator designs are classified into static and dynamic types: static comparators provide lower input offset voltages with rapid response speeds, but high-resolution instances usually depend on multi-stage amplifiers, leading to high static power consumption and larger area occupancy; in contrast, dynamic comparators merge preamplification and latching stages, controlled by clock signals, typically devoid of static power consumption. Given the low energy consumption and compact size requirements of the designed readout circuit, this research opts for the dynamic comparator approach.

A traditional two-stage dynamic comparator structure is depicted in Figure 4a, comprising two operation phases: reset and compare. During reset, with CLK at low level, the comparator’s preamplification stage outputs are charged up to VDD through MP3 and MP4, activating MN1 and MN4, and grounding the output of the latching stage. In the compare phase, with CLK at high level, MN7 conducts, causing differential voltages Vinp and Vinn to discharge the preamplification stage outputs at different rates, generating a differential voltage $\Delta V$. Concurrently, in the latching stage, MP5 conducts, allowing $\Delta V$ to propagate through MN1 and MN4 to the cross-coupled inverter structure, initiating a positive feedback mechanism. Ultimately, one high and one low level are produced at the output nodes.

The two-stage dynamic comparator, consisting of a preamplifier stage and a latching stage, allows for a flexible balance between speed and offset in low-voltage operations. During latching, the effective transconductance $g_{m,eff}$ significantly influences speed. In the outlined structure, when MN2 and MN3 are off, $g_{m,eff}$ is solely contributed by MP1 and MP2, affecting the comparator’s comparison speed. To achieve low power consumption and compact design in the readout circuit, while ensuring the comparator’s response speed meets specifications, a transconductance-enhanced two-stage dynamic comparator is proposed [12], illustrated in Figure 4b.

In the transconductance-enhanced two-stage dynamic comparator, the latching stage structure is improved by adding M11, M12 and M13, M14. During reset, with CLK low, the preamplification stage functions as in the conventional structure. The high preamplification stage outputs turn on M5 and M6, which drive M9 and M10. Together with the conduction of M13 and M14, this sets the comparator outputs $V_{outp}$ and $V_{outn}$ high. During comparison, with CLK high, the preamplification stage outputs discharge at differing rates. When $V_{inp}$ exceeds $V_{inn}$, point A discharges faster than B, causing the $V_{DS}$ of M5 and M6 to diverge. A high CLK1 then leads to differing discharge rates at points C and D. The positive feedback formed by M7, M8, M9, and M10 ultimately sets $V_{outp}$ high and $V_{outn}$ low.

M11 and M12 ensure NMOS and PMOS (M7, M9, or M8, M10) stay in the deep inversion region during latching, thus improving $g_{m,eff}$, as shown in Equation (2):

$$g_{m,eff} = g_{mn} + g_{mp} = \mu_{n} C_{OX} \left(\frac{W}{L}\right)_{n} V_{dsn} + \mu_{p} C_{OX} \left(\frac{W}{L}\right)_{p} |V_{dsp}| \qquad (2)$$

CLK1 is CLK plus a delay $\Delta t$ and an increased $\Delta V$, aimed at reducing power consumption and ensuring effective conduction of M11, M12: $\Delta t$ delays M11 and M12’s conduction at CLK’s rising edge, reducing short-circuit current through M5 and M6; $\Delta V$ ensures M11 and M12 better meet $V_{gs} > V_{thn}$. However, this structure results in longer reset times, affecting the comparator’s overall comparison speed. Thus, M13 and M14 are introduced to reduce reset time.

Comparator offset is a critical factor because process mismatch directly degrades dynamic performance. A Monte Carlo simulation of the transconductance-enhanced comparator is shown in Figure 5. The results indicate an average offset voltage of $\mu$ = −0.836 mV, with a standard deviation of $\sigma$ = 5.03669 mV.

Waveform comparisons between the transconductance-enhanced two-stage dynamic comparator and traditional two-stage dynamic comparator are shown in Figure 6. The data indicate that for a 1 LSB input, comparator comparison delay decreased from 184 ps to 149 ps, a performance improvement of around 20%. Simultaneously, energy consumption dropped from 178 μm to 132 μm, an improvement of approximately 26%.
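The quoted improvements follow directly from the before/after figures, as the short check below shows. Only ratios are computed, so the energy figures are used exactly as printed; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


delay_gain = relative_reduction(184, 149)    # comparison delay, ps
energy_gain = relative_reduction(178, 132)   # energy figures as printed; ratio only

if __name__ == "__main__":
    print(f"delay reduction:  {delay_gain:.1%}")
    print(f"energy reduction: {energy_gain:.1%}")
```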

2.3. CDAC Switching Scheme and Unit Capacitor

To reduce the energy consumption during the switching in conventional strategies, a split-capacitor switching strategy was introduced, as detailed in [13]. This approach halves the capacitor at the most significant bit into two equal sub-capacitors, resulting in a 38% reduction in power consumption compared to traditional strategies. The monotonic switching strategy, which involves discharge only, requires $2^{(n-1)}$ unit capacitors (single-ended) for an n-bit ADC, reducing the total capacitor value by half and decreasing energy consumption by 81% relative to traditional strategies. Although the monotonic strategy is efficient in saving energy, it changes the common-mode voltage of the top-plate in a differential structure. In contrast, the split-capacitor strategy maintains the common-mode voltage during switching, but the unit capacitor $C_{unit}$ used for the least significant bit is twice as much as that in the monotonic strategy.

To achieve low power consumption, reduce common-mode interference, and improve ADC accuracy, a combined switching strategy of split-capacitor and monotonic is adopted: the first four bits (MSBs) use the split-capacitor method to prevent substantial changes in the common-mode voltage during the bit variations; the last four bits (LSBs) employ the monotonic method to avoid the excessive use of $C_{unit}$ and reduce power consumption.

For a low-power, compact readout circuit applied in CIM, a “sandwich” capacitor structure is used to meet the requirements of small area, which is shown in Figure 7. The structure in SMIC 110 nm process design uses layers 4, 5, and 6, with M5 as the top plate and M4 and M6 as the bottom plate of the capacitor. To reduce the influence of parasitic capacitance on the relative accuracy of the unit capacitor, a shielding layer is added between the unit capacitors, essentially providing better protection by dummy layers and reducing capacitor mismatches at the boundaries.

The DNL and INL are estimated by the method of relative standard deviation. In the designed process, the relative standard deviation of a 10 fF MOM capacitor is 1.3%. The relative standard deviation of the designed 0.8 fF MOM capacitor can be inferred to be approximately 4.2% based on the ratio of standard deviation to capacitance [14]. As shown in Figure 8, the DNL and INL are +0.74/−0.44 LSB and +0.75/−0.73 LSB, respectively.

3. CCO-Type Readout Circuit

In general, the operation speed and accuracy of CIM arrays are at least an order of magnitude lower than those of the SAR ADC converters designed in Section 2. Therefore, the SAR ADC readout circuits proposed in these applications are typically utilized in shared CIM arrays, where different columns of a single CIM array or similar columns across multiple CIM arrays share a single SAR ADC readout circuit. While this type of readout circuit offers the advantage of high precision, its power consumption and area requirements are less than ideal. Hence, this section introduces a non-traditional type of readout circuit design. Leveraging the benefits of low power consumption and compact size, this readout circuit design can be applied to every column of the CIM array.

For the study of CIM readout circuits in this section, RRAM is chosen as the storage compute array element. RRAM modulates resistance through the formation and rupture of conductive filaments to enable data storage and read-write operations. The dynamism of the conductive filaments involves migration of oxygen ions and the generation and annihilation of oxygen vacancies, influenced by stochastic variations and thermal effects, leading to the switching of RRAM resistance values [15]. To simplify the model, this study adopts a linear RRAM model for designing the readout circuit, disregarding other nonlinear factors. For quantification ease, this model utilizes a dual-weight system, where high-resistance states (HRS) and low-resistance states (LRS) store binary data “0” and “1”, respectively, with analog resistance using a high/low resistance ratio of 20 kΩ/2 kΩ. The 1T1R configuration is selected for simulating the array, as shown in Figure 9.

In this model, the source line (SL) connects to the transistor’s source; the bit line (BL) connects the storage unit with the column signal sensor; the word line (WL) is used for selecting the row address and triggering row signals. To use a specific 1T1R unit, the WL is enabled, and current flows out from the BL when the SL voltage is 1.2 V. A 4 × 4 CIM array constructed with 1T1R to minimize sneak current is illustrated in Figure 9. Sneak current leakage occurs when some computing units in a CIM column are disabled (SL = 0), and the current from enabled computing units (SL = 1) may flow through the BL of the disabled units and exit through SL.

The 4 × 4 CIM array’s SL1–SL4 end voltage is 1.2 V; WL1–WL4 are enabled at 0 V and disabled at 1.2 V; BL1–BL4 serve as current output ends. The readout circuit designed for the 4 × 4 CIM array, as shown in Figure 10, includes a front-end circuit, a CCO circuit, and a digital logic circuit. BL1–BL4 represent the current output from the 4 × 4 CIM array. The voltage at Point A (A1–A4) is clamped to $V_{ref}$ through the operational amplifier, after which the current is mirrored into the CCO circuit (comprising three current control inverters) to produce variable frequency waveforms. Finally, a digital logic circuit quantizes waveforms of different frequencies and outputs them in digital form.

Notable design highlights within this approach include:

Clamp voltage
The current within an RRAM array is computed as $I = \sum_{i} V_{i} G_{i}$. Consequently, variations in column output voltage within the CIM array induce changes in the column’s current, thus impacting accuracy. The CCO-type readout circuit designed in this paper leverages a clamping action generated by a negative feedback circuit, constructed from an operational amplifier and a current mirror, to clamp the voltage at Point A (A1–A4) to $V_{ref}$. This maintains a constant voltage difference across the array, thereby enhancing the linearity of the CIM storage compute array’s output current.
Selection of size in current mirror
Given the order of magnitude variation in column current within the CIM array, careful consideration should be given to sizing MN1,1–MN4,1: these NMOS transistors must remain in the saturation region during column current changes. Therefore, prudent selection of the dimensions for these four NMOS transistors is critical during the design process. Considering the mA level current in the CIM array columns, direct replication of current via a current mirror results in prohibitive power consumption. Hence, scaling of the current is performed using (MN1,2, MN1,3, MN1,4); moreover, different proportions of (MN1,2, MN1,3, MN1,4) to (MN4,2, MN4,3, MN4,4) facilitate the assignment of varied weights to different columns within the CIM.
Design of CCO
This study uses a design for a CCO characterized by low process-sensitivity, thereby eliminating the need for an additional reference voltage. Additionally, leveraging the complementary nature of technology-induced variations in capacitance and voltage, the design’s sensitivity to manufacturing process variations is mitigated [16].

3.1. Front-End Circuit Design

In the voltage-controlled oscillator (VCO)-type readout circuit proposed in [7], the authors used a current-to-voltage converter (a MOS transistor) to convert current into voltage. This voltage then controls the output frequency of the VCO, and a digital counter finally quantizes that frequency to produce the digital output.

A structural drawback of this setup is that while it is designed to quantify different column currents, the conversion from current to voltage results in changes in the voltage at point Y reflecting the variations in column currents. Changes in the voltage at point Y (point A in this paper) lead to modifications in the voltage difference across columns in the RRAM Crossbar, thus causing deviations in the column currents from their ideal values. To address this issue, this paper implements a negative feedback circuit formed by an operational amplifier and a current mirror, as shown in Figure 11a.

In the front-end circuit, the voltage at point A1 is clamped to $V_{ref}$. With this setup, as the number of LRS cells in the CIM array changes, the voltage at point A1 remains almost unchanged, as illustrated in Figure 11b. The voltage variation at point A1 in the circuit designed in [7] is around 200 mV, whereas in the circuit designed in this paper it is about 5 mV, which can be considered negligible. Hence, this structure enhances the linearity of the output column currents in the CIM array.

Within the circuit, the operational amplifier adopts a five-transistor architecture, as depicted in Figure 12a. Due to the requirement for the input to remain stable at 0.8 V, NMOS pair transistors are utilized for the input. The amplitude and phase frequency characteristic curves are shown in Figure 12b. The operational amplifier exhibits an approximate DC gain of 37.5 dB, with a phase margin of about 92°.

To achieve better current mirror matching in the circuit, the current mirror for the first array is divided into MN1,2, MN1,3, and MN1,4, with IB1, IB2, and IB3 each connected to one of the three CCOs in the CCO circuit. Furthermore, considering the near-mA scale current flowing through the columns of the CIM array, the width-to-length ratio of (MN1,2, MN1,3, MN1,4) is proportionally reduced to decrease power consumption in subsequent circuits.

The front-end circuit designed in this paper categorizes the four columns of the CIM array into four distinct weight classes: (MN1,2, MN1,3, MN1,4) represent weight 1, (MN2,2, MN2,3, MN2,4) have weight 2, (MN3,2, MN3,3, MN3,4) carry weight 4, and (MN4,2, MN4,3, MN4,4) are assigned weight 8. These weights are implemented at the physical layer, directly assigning weights to the columns, which effectively reduces the power consumption and area costs that would be incurred by digital logic operations for weight assignment. Table 1 provides the MOS transistor dimensions for the current mirrors shown in Figure 10.
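The clamped-column behaviour and the 1/2/4/8 mirror weights can be sketched with the linear RRAM model. This is a simplified illustration, not the simulated circuit: `V_REF = 0.8` V is assumed from the op-amp input level mentioned above, the voltage drop of the access transistor is ignored, and the common scale-down ratio of the mirrors (Table 1) is left out, so the weighted sum is in units of unscaled column current. All names are introduced here for illustration.

```python
R_LRS = 2e3               # low-resistance state, ohms
R_HRS = 20e3              # high-resistance state, ohms
V_SL = 1.2                # source-line voltage, volts
V_REF = 0.8               # assumed clamp voltage at point A, volts
WEIGHTS = (1, 2, 4, 8)    # mirror weights of columns 1 to 4


def column_current(n_lrs, rows=4):
    """I = sum(V_i * G_i) for one column holding n_lrs LRS cells (rest HRS)."""
    v_cell = V_SL - V_REF                     # constant thanks to the clamp
    n_hrs = rows - n_lrs
    return v_cell * (n_lrs / R_LRS + n_hrs / R_HRS)


def weighted_sum(lrs_counts):
    """Weighted total of the four column currents (before mirror scale-down)."""
    return sum(w * column_current(n) for w, n in zip(WEIGHTS, lrs_counts))


if __name__ == "__main__":
    for n in range(5):
        print(f"{n} LRS cells -> {column_current(n) * 1e3:.3f} mA")
    print(f"weighted sum, all LRS: {weighted_sum((4, 4, 4, 4)) * 1e3:.2f} mA")
```

With these assumptions an all-LRS column carries roughly ten times the current of an all-HRS column, which matches the order-of-magnitude variation and near-mA scale noted above.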

3.2. Design of CCO

Traditional CCOs have limited suppression of fabrication process-sensitivity, which can lead to variations in the output frequency of the CCO, particularly in high-gain CCO designs. These frequency shifts can incur significant correction costs in subsequent circuit outputs. To address this challenge, this paper proposes a new low process-sensitive CCO structure, as demonstrated in Figure 13.

The IBn is the mirrored current from the current mirror in the front-end circuit flowing out of each delay unit. $V_{DD}$ denotes the power supply voltage, and $V_{thn1}$ represents the threshold voltage of MN1. When the input $V_{in}$ shifts from a low to a high level, MP1 and MP3 disconnect, MN3 conducts. Subsequently, the capacitor $C_{int}$ discharges through the current $I_{CCO}$. Once the voltage on the bottom plate of $C_{int}$ drops to $V_{thn1}$, MN1 turns off, leading to a high inverter output and thus a low $V_{out}$. Conversely, when $V_{in}$ shifts from high to low, MP1 conducts and the current through MP1 recharges $C_{int}$, resetting its voltage. With MP3 conducting and MN3 off, $V_{out}$ goes high. The output period of the CCO is given by the following equation:

$$T = N \times \left[\frac{C_{int} (V_{DD} - V_{thn1})}{I_{CCO}} + t_{rise} + t_{fall}\right] \qquad (3)$$

Here, N is the number of delay units (N = 3 in this design), $t_{rise}$ is the signal’s rising delay time, and $t_{fall}$ is the falling delay time. In practical circuits, $t_{rise}$ and $t_{fall}$ can be neglected compared to the first term within the brackets in Equation (3). Thus, for simplification, the output waveform’s frequency is assumed to be a first-order (linear) function of $I_{CCO}$.
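Equation (3) can be evaluated numerically to see the linear current-to-frequency relation. The sketch below uses N = 3, $C_{int}$ = 1 pF and the 1.2 V supply from this design; the threshold voltage `v_thn1 = 0.4` V is a hypothetical value (the text does not give it), so the absolute frequencies are illustrative only. Function names are introduced here for illustration.

```python
def cco_period(i_cco, n_stages=3, c_int=1e-12, v_dd=1.2,
               v_thn1=0.4, t_rise=0.0, t_fall=0.0):
    """Equation (3): oscillation period for a given control current.

    v_thn1 = 0.4 V is a hypothetical threshold; t_rise and t_fall default
    to zero, i.e. the simplification described in the text.
    """
    return n_stages * (c_int * (v_dd - v_thn1) / i_cco + t_rise + t_fall)


def cco_frequency(i_cco, **kwargs):
    """Output frequency, the reciprocal of the period."""
    return 1.0 / cco_period(i_cco, **kwargs)


if __name__ == "__main__":
    for i_ua in (7.2, 25.0, 42.0):
        f = cco_frequency(i_ua * 1e-6)
        print(f"I_CCO = {i_ua:5.1f} uA -> f = {f / 1e6:.2f} MHz")
```

With the delay terms neglected, doubling $I_{CCO}$ doubles the frequency, which is the first-order behaviour assumed above.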

The low process-sensitivity of this CCO structure is due to the complementary nature of changes in the value of $C_{int}$ with those of ($V_{DD} - V_{thn1}$) with respect to process variations. For instance, in an FF process corner, the value of $C_{int}$ and the threshold voltage $V_{thn1}$ would decrease, while the value of ($V_{DD} - V_{thn1}$) would increase, thus reducing the changes in the product $C_{int} (V_{DD} - V_{thn1})$. Table 2 provides the MOS transistor dimensions for the current mirrors shown in Figure 13. Moreover, $C_{int}$ is equal to 1 pF.

Considering the approximate current range of 7.2 μA to 42 μA, a Monte Carlo simulation of the CCO was run with $I_{CCO}$ = 25 μA, as depicted in Figure 14.

3.3. Digital Logic Circuit

The final output of the readout circuit under discussion is generated as a digital signal via a time-to-digital converter (TDC). Traditional TDC implementations typically use either counters or encoders. In time-delay-based ADC, encoder-based T/D conversion is common due to the circuit structure’s requirements. However, in this research, digital counters are the primary means for achieving T/D conversion [17].

The readout circuit does not impose high requirements on conversion speed. There are no stringent demands on the delay times of the digital modules. To fulfill objectives of low power consumption and reduced area, true single-phase clock (TSPC)-type flip-flops are employed within this structure. In TDCs utilizing digital counters, the output is expressed as:

$$F(T) = N_{CCO} - \left\lfloor 2^{N_{clk}} \times \frac{f_{clk}}{f_{CCO}} \right\rfloor \qquad (4)$$

Here, $N_{clk}$ denotes the bit depth of the reference side counter, $f_{clk}$ is the input frequency of the reference signal, $N_{CCO}$ represents the bit depth of the CCO output counter, and $f_{CCO}$ is the output waveform frequency of the CCO, with $\lfloor \cdot \rfloor$ indicating a floor function.

When the reference counter reaches a full count of N bits, it triggers the CCO side counter, and the count value of the CCO side counter at that moment is the digital output. The reference counter therefore has fewer bits than the CCO side counter, which is given additional bits, typically with some margin for redundancy, to prevent counting errors caused by an insufficient bit count.
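The gating behaviour described in this paragraph can be sketched as follows. This models only one reading of the description (the CCO side counter accumulates CCO cycles while the reference counter runs to full count) and is not a restatement of Equation (4). The 25 MHz reference matches the clock used in the simulations; `n_clk_bits = 7` and the function name `gated_count` are hypothetical.

```python
import math


def gated_count(f_cco, f_clk=25e6, n_clk_bits=7):
    """CCO cycles counted while the reference counter runs to full count.

    n_clk_bits = 7 is a hypothetical reference-counter width.
    """
    window = (2 ** n_clk_bits) / f_clk      # gate time in seconds
    return math.floor(window * f_cco)       # whole CCO cycles inside the gate


if __name__ == "__main__":
    for f_mhz in (3.0, 10.0, 17.5):
        print(f"f_CCO = {f_mhz:4.1f} MHz -> count = {gated_count(f_mhz * 1e6)}")
```

In this model the CCO side counter needs enough bits to hold the largest count reached inside the gate window, which is why it is wider than the reference counter.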

4. Simulation Results

4.1. SAR ADC Readout Circuit Layout and Post-Simulation Results

The SAR ADC proposed in this study was designed utilizing SMIC 110 nm CMOS technology. By employing custom-designed capacitors, active devices were arranged beneath the CDAC array, utilizing metal layers M1 to M3 for routing. We present the simulation results conducted with a 1.2 V supply voltage in this subsection.

Figure 15a presents the performance metrics for the SAR ADC with a Nyquist-rate input. The results demonstrate SNDR of 46.22 dB, SFDR of 57.21 dB, and ENOB of 7.38 bits for the 70 MS/s sampling rate. Furthermore, Figure 15b displays the variation in SNDR and SFDR across different input frequencies. The observations reveal that the SFDR remains above 57 dB and the SNDR stays above 46 dB, with both metrics exhibiting a decrement of less than 2 dB as the input frequency escalates to the Nyquist rate.

With a 1.2 V supply, the simulated power dissipation of the system is 553 μW. The DAC accounts for 7% of the total, the comparator for 31%, the S/H circuit for 2%, and the digital circuits for 60%, as illustrated in Figure 16. The resulting FoM is 47.26 fJ/Conv.
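The reported figures can be cross-checked with a short calculation. The sketch below assumes the standard sine-wave relation ENOB = (SNDR − 1.76)/6.02 and the Walden figure of merit, FoM = P/(2^ENOB · f_s). The text does not state which FoM definition is used, so that choice is an assumption, as are the helper names.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB (standard sine-wave relation)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w: float, enob_bits: float, fs_hz: float) -> float:
    """Walden figure of merit in joules per conversion step (assumed definition)."""
    return power_w / (2 ** enob_bits * fs_hz)


enob = enob_from_sndr(46.22)                     # SNDR reported at Nyquist input
fom_fj = walden_fom(553e-6, enob, 70e6) * 1e15   # 553 uW at 70 MS/s, in fJ
print(f"ENOB = {enob:.2f} bits, FoM = {fom_fj:.1f} fJ/conv.-step")
```

Under these assumptions the calculation lands within rounding of the reported 7.38 bits and 47.26 fJ/Conv.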

In this design, active devices are incorporated beneath custom-designed capacitors, employing M1 to M3 layers for routing. This arrangement culminates in a core area of 0.0025544 mm² (49.6 μm × 51.5 μm), utilizing 110 nm CMOS technology, as depicted in Figure 17. A comparison with preceding SAR ADCs is presented in Table 3. It reveals that the current design achieves the minimum area footprint while delivering a competitive FoM at moderate operational speed. Consequently, this design is particularly suitable for CIM applications, demanding an ADC that is both space-efficient and energy-efficient.

4.2. Simulation of Current-Controlled Oscillator-Type Readout Circuit

The $4 \times 4$ CIM array utilized in this paper assigns weightings of 1, 2, 4, and 8 to the first, second, third, and fourth columns, respectively. The RRAM employed has two states: an HRS (20 kΩ) and an LRS (2 kΩ). For each column, there are five types of current outputs associated with RRAM resistances of (2 kΩ, 2 kΩ, 2 kΩ, 2 kΩ), (20 kΩ, 2 kΩ, 2 kΩ, 2 kΩ), (20 kΩ, 20 kΩ, 2 kΩ, 2 kΩ), (20 kΩ, 20 kΩ, 20 kΩ, 2 kΩ), and (20 kΩ, 20 kΩ, 20 kΩ, 20 kΩ). Given the distinct weights of the four columns in the CIM array, theoretically, this structure can yield $5^{4} = 625$ outcomes, denoted as $N_{S} = 625$.
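The count of 625 configurations and the five per-column levels can be enumerated directly. The sketch below assumes an ideal parallel combination of the four cells in a column, neglecting access devices, wire resistance, and the front-end circuit; the helper name `column_conductance` is hypothetical.

```python
from itertools import product

R_LRS = 2e3    # ohms, low-resistance state
R_HRS = 20e3   # ohms, high-resistance state
CELLS_PER_COLUMN = 4


def column_conductance(n_lrs: int) -> float:
    """Ideal parallel conductance (S) of one column with n_lrs cells in the LRS."""
    n_hrs = CELLS_PER_COLUMN - n_lrs
    return n_lrs / R_LRS + n_hrs / R_HRS


# Five possible LRS counts (0..4) per column, four columns in the array.
configs = list(product(range(CELLS_PER_COLUMN + 1), repeat=4))
levels_mS = [column_conductance(n) * 1e3 for n in range(CELLS_PER_COLUMN + 1)]

print(len(configs))
print([round(g, 2) for g in levels_mS])
```

In this idealized model, each column takes one of five conductance levels between 0.2 mS (all HRS) and 2.0 mS (all LRS), and four such columns give the 625 combinations.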

With the reference clock (CLK) frequency set to 25 MHz, the quantized digital outputs are derived based on Equation (4). Altering the number of high- and low-resistance cells in each RRAM column of the CIM array yields different digital outputs at the OUTPUT terminal of the digital logic module. When the RRAM in the $4 \times 4$ CIM array is entirely in the HRS, the OUTPUT yields 0001101; when entirely in the LRS, the OUTPUT yields 1111111. Statistical analysis of the output results using Python reveals 95 distinct quantized outcomes within the range of 0001101 to 1111111, hence $N_{D} = 95$. These 95 outcomes are arranged in ascending order, as shown in Figure 18. The frequency discontinuities observed at both ends occur because the semiconductor device within the CCO (specifically, transistor MN1,1 in the current mirror of the front-end circuit) reaches its operational limits. At very low currents, the transistor enters the cut-off region, while at very high currents, it transitions to the linear region. This leads to a non-ideal current–frequency relationship.

The delay L represents the interval between the OUTPUT terminal delivering an 8-bit result and the subsequent output of the next 8-bit result. With a reference CLK of 25 MHz, after 128 counts, the reference counter issues a STOP signal ceasing the counting at the CCO counter; concurrently, the parallel-to-serial state machine initiates conversion. After 12 CLK1 (with CLK1 frequency equal to CLK) cycles, the OUTPUT is complete. For efficiency, as OUTPUT begins, the CCO is reset, and the next counting cycle commences. The time from OUTPUT delivering a binary result until the next cycle’s binary output, as shown in Figure 19, sets the circuit delay L to 5.2 μs.
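The timing figures above can be checked with simple arithmetic. The sketch below uses only the values quoted in the text (25 MHz clock, 128 reference counts, 12 serialization cycles); the variable names are illustrative. How the interval beyond the counting window is spent is not broken down in the text, so this is only a rough consistency check.

```python
F_CLK_HZ = 25e6        # reference clock frequency
REF_COUNTS = 128       # reference counts before the STOP signal
SERIAL_CYCLES = 12     # CLK1 cycles for the parallel-to-serial output

t_clk = 1 / F_CLK_HZ                           # one clock period in seconds
t_count_us = REF_COUNTS * t_clk * 1e6          # counting window in microseconds
t_serial_us = SERIAL_CYCLES * t_clk * 1e6      # serialization time in microseconds
print(f"counting window: {t_count_us:.2f} us, serialization: {t_serial_us:.2f} us")
```

The counting window alone is 5.12 μs, which accounts for most of the 5.2 μs delay. The 0.48 μs serialization overlaps with the next counting cycle, since the CCO is reset as OUTPUT begins.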

When varying numbers of HRS and LRS are configured in the CIM RRAM array, there is a significant variation in mirrored current from the front-end circuit, leading to a considerable change in power consumption for the front-end and CCO circuits. The average power consumption, $P_{average}$, in this study is quantified by summing and then averaging the power consumption across all 625 scenarios:

$$P_{average} = \frac{P_{1} + P_{2} + \cdots + P_{N}}{N} \tag{5}$$

Under a supply voltage of 1.2 V, transient simulations yield an estimation of power dissipation for the current-controlled oscillator-type readout circuit, as shown in Figure 20.

P1 is the power consumption of the front-end circuit, P2 is that of the CCO circuit, and P3 is the power consumption of the digital logic circuit. It is apparent from the figure that as the HRS/LRS change in the CIM array, resulting in current variations to the CCO, there is a corresponding fluctuation in power consumption due to the different frequencies of output. Calculating the average power consumption as per Equation (5), the average total power dissipation of the circuit structure is found to be 183.1 μW, of which the front-end circuit accounts for 94.7 μW, the CCO circuit for 86.8 μW, and the digital logic circuit for 1.6 μW. Their proportional contributions are illustrated in Figure 21.
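The breakdown can be reproduced from the three reported averages. The sketch below sums them and derives each block's share; `average_power` is a hypothetical helper that implements the arithmetic mean of Equation (5), and the per-case simulation data are not available here, so only the reported averages are used.

```python
# Average power per block in microwatts, as reported in the text.
power_uw = {"front-end": 94.7, "CCO": 86.8, "digital logic": 1.6}

total_uw = sum(power_uw.values())
shares = {name: 100 * p / total_uw for name, p in power_uw.items()}


def average_power(samples):
    """Equation (5): arithmetic mean over the simulated cases."""
    return sum(samples) / len(samples)


print(f"total = {total_uw:.1f} uW")
for name, pct in shares.items():
    print(f"{name}: {pct:.1f} %")
```

The three averages sum to the stated 183.1 μW, with the front-end and CCO circuits each taking roughly half and the digital logic under 1%.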

The CCO-type readout circuit designed in this paper achieves a resolution of 95 levels, a delay of 5.2 μs, and a power dissipation of 183.1 μW at a supply voltage of 1.2 V in TSMC 65 nm technology.

Table 4 provides a comparative analysis between the non-traditional readout circuit presented in this study and those reported in other research. This study adopts a method of distinct weighting for different CIM columns, providing a marked advantage in resolution relative to [7,22]. Compared with [7,22], however, the circuit delay in this study is larger, for two main reasons. First, conversion time and resolution are positively correlated. Taking [7] as an example, that work designs a VCO-based readout circuit whose results apply only to a single column of an RRAM array, with a resolution of 32 levels, whereas our design quantizes four columns of an RRAM array simultaneously (with column weights of 1, 2, 4, and 8) at a resolution of 95 levels. Second, we scale down the current using current mirrors, which lowers the CCO frequency and in turn lengthens the conversion time. As power consumption metrics are not provided in [7,22], and the number of cycles is not specified, a power comparison is omitted.

5. Conclusions

With the rapid evolution of AI technology and its expanding applications, increasingly sophisticated algorithms require superior computing power. The traditional von Neumann architecture is facing power limitations, causing substantial computational bottlenecks. To address this challenge, CIM architectures offer distinct advantages in processing CNN algorithms. However, the readout circuitry incurs significant power consumption and area overhead, often exceeding that of the compute array itself. This study investigates readout circuits within CIM using RRAM as the computational storage unit.

There are two prevalent designs: traditional ADC-type readout circuits and non-traditional types. Based on the former, a low-power, compact SAR ADC is designed as the core of the readout circuit for digital signal extraction. The prototype ADC occupies an active area of only 0.00255 mm² (49.6 μm × 51.5 μm) and achieves an SNDR of 46.22 dB and an SFDR of 57.21 dB with a Nyquist-rate input at a sampling rate of 70 MS/s. The power consumption is 553 μW, resulting in a FoM of 47.26 fJ/Conv. at a 1.2 V supply.

Nevertheless, the complex conversion process in traditional ADC types incurs significant power and area costs, which can surpass those of the compute array. To overcome this drawback, the paper designs a CCO-type readout circuit, which achieves a resolution of 95 levels, a delay of 5.2 μs, and a power dissipation of 183.1 μW at a supply voltage of 1.2 V in TSMC 65 nm technology.

Author Contributions

X.X., under the supervisor’s guidance, designed two types of CIM readout circuits in both the schematic and layout phases and wrote the paper. A.W. proposed the topic, provided guidance and ideas for the design, and revised the paper. Y.S. assisted with the CLK generation and pad modules for the first readout circuit, as well as the CCO module for the second readout circuit. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Meng, F.H.; Lu, W.D. Compute-in-Memory Technologies for Deep Learning Acceleration. IEEE Nanotechnol. Mag. 2024, 18, 44–52.
2. Liu, Q.; Gao, B.; Yao, P.; Wu, D.; Chen, J.; Pang, Y.; Zhang, W.; Liao, Y.; Xue, C.X.; Chen, W.H.; et al. 33.2 A fully integrated analog ReRAM based 78.4 TOPS/W compute-in-memory chip with fully parallel MAC computing. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 500–502.
3. Lee, K.; Cheon, S.; Jo, J.; Choi, W.; Park, J. A charge-sharing based 8t sram in-memory computing for edge dnn acceleration. In Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; pp. 739–744.
4. Xiao, K.; Cui, X.; Qiao, X.; Song, J.; Luo, H.; Wang, X.; Wang, Y. A 28nm 32Kb SRAM computing-in-memory macro with hierarchical capacity attenuator and input sparsity-optimized ADC for 4b MAC operation. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 1816–1820.
5. Chou, T.; Tang, W.; Botimer, J.; Zhang, Z. Cascade: Connecting rrams to extend analog dataflow in an end-to-end in-memory processing paradigm. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; pp. 114–125.
6. Liu, H.; Qian, Z.; Wu, W.; Ren, H.; Liu, Z.; Ni, L. AFPR-CIM: An Analog-Domain Floating-Point RRAM-based Compute-In-Memory Architecture with Dynamic Range Adaptive FP-ADC. arXiv 2024, arXiv:2402.13798.
7. Mayahinia, M.; Singh, A.; Bengel, C.; Wiefels, S.; Lebdeh, M.A.; Menzel, S.; Wouters, D.J.; Gebregiorgis, A.; Bishnoi, R.; Joshi, R.; et al. A voltage-controlled, oscillation-based adc design for computation-in-memory architectures using emerging rerams. ACM J. Emerg. Technol. Comput. Syst. 2022, 18, 1–25.
8. Singh, A.; Lebdeh, M.A.; Gebregiorgis, A.; Bishnoi, R.; Joshi, R.V.; Hamdioui, S. Srif: Scalable and reliable integrate and fire circuit adc for memristor-based cim architectures. IEEE Trans. Circuits Syst. Regul. Pap. 2021, 68, 1917–1930.
9. Harpe, P.; Zhou, C.; Wang, X.; Dolmans, G.; de Groot, H. A 30fJ/conversion-step 8b 0-to-10MS/s asynchronous SAR ADC in 90 nm CMOS. In Proceedings of the 2010 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 7–11 February 2010; pp. 388–389.
10. Chen, H.; He, L.; Deng, H.; Yin, Y.; Lin, F. A high-performance bootstrap switch for low voltage switched-capacitor circuits. In Proceedings of the 2014 IEEE International Symposium on Radio-Frequency Integration Technology, Hefei, China, 27–30 August 2014; pp. 1–3.
11. Xu, X.; Shui, Y.; Wang, A. A 0.0025 mm² 8-bit 70MS/s SAR ADC with a Linearity-Improved Bootstrapped Switch for Computation in Memory. In Proceedings of the 2023 8th International Conference on Integrated Circuits and Microsystems (ICICM), Nanjing, China, 20–23 October 2023; pp. 412–416.
12. Khorami, A.; Dastjerdi, M.B.; Ahmadi, A.F. A low-power high-speed comparator for analog to digital converters. In Proceedings of the 2016 IEEE International Symposium on Circuits and Systems (ISCAS), Montreal, QC, Canada, 22–25 May 2016; pp. 2010–2013.
13. Ginsburg, B.P.; Chandrakasan, A.P. An energy-efficient charge recycling approach for a SAR converter with capacitive DAC. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems, Kobe, Japan, 23–26 May 2005; pp. 184–187.
14. Wang, A.; Shi, C.J.R. A 10-bit 50-MS/s SAR ADC with 1 fJ/conversion in 14 nm SOI FinFET CMOS. Integration 2018, 62, 246–257.
15. Jiang, Z.; Wong, H.S.P. Stanford University resistive-switching random access memory (RRAM) Verilog-A model. nanoHUB 2014.
16. Shui, Y.; Wang, A. A 14.17 pJ·K² FoM CMOS Temperature Sensor with 173 μm² Sensing Core for Remote Sensing in 65 nm CMOS. IEEE Sens. J. 2023, 23, 27059–27067.
17. Yoon, Y.G.; Park, S.H.; Cho, S. A time-based noise shaping analog-to-digital converter using a gated-ring oscillator. In Proceedings of the 2011 IEEE MTT-S International Microwave Workshop Series on Intelligent Radio for Future Personal Terminals, Daejeon, Republic of Korea, 24–25 August 2011; pp. 1–4.
18. Liu, S.; Rabuske, T.; Paramesh, J.; Pileggi, L.; Fernandes, J. Analysis and background self-calibration of comparator offset in loop-unrolled SAR ADCs. IEEE Trans. Circuits Syst. Regul. Pap. 2017, 65, 458–470.
19. Tang, F.; Ma, Q.; Shu, Z.; Zheng, Y.; Bermak, A. A 28 nm cmos 10 bit 100 ms/s asynchronous sar adc with low-power switching procedure and timing-protection scheme. Electronics 2021, 10, 2856.
20. Zhao, J.; Huang, Z.; Hou, X. A 10-bit 50-ms/s asynchronous sar adc in 65nm cmos. In Proceedings of the 2022 IEEE 14th International Conference on Advanced Infocomm Technology (ICAIT), Chongqing, China, 8–11 July 2022; pp. 225–229.
21. Huang, Y.; Luo, C.; Guo, G. A cryogenic 8-bit 32 ms/s sar adc operating down to 4.2 k. Electronics 2023, 12, 1420.
22. Liu, C.; Yan, B.; Yang, C.; Song, L.; Li, Z.; Liu, B.; Chen, Y.; Li, H.; Wu, Q.; Jiang, H. A spiking neuromorphic design with resistive crossbar. In Proceedings of the 52nd Annual Design Automation Conference, San Francisco, CA, USA, 7–11 June 2015; pp. 1–6.

Figure 1. Block and timing diagrams of the proposed SAR ADC.

Figure 2. The bootstrapped switch structure.

Figure 3. Bootstrapped switch dynamic performance.

Figure 4. (a) Traditional two-stage dynamic comparator and (b) transconductance-enhanced two-stage dynamic comparator.

Figure 5. Monte Carlo simulation of the comparator offset voltage.

Figure 6. Comparison of $V_{out}$ waveforms from two comparators.

Figure 7. 3D model of capacitance with shielding in one direction [11].

Figure 8. DNL and INL simulation results.

Figure 9. $4 \times 4$ CIM array.

Figure 10. CCO-type readout circuit structure.

Figure 11. (a) The front-end circuit and (b) change of point A1 with the number of LRS ($V_{ref} = 0.8$ V).

Figure 12. (a) The operational amplifier structure and (b) its amplitude frequency response and phase response.

Figure 13. Low process-sensitive CCO circuit.

Figure 14. Monte Carlo simulation of the CCO.

Figure 15. (a) FFT spectrum at 70 MS/s and (b) dynamic performance versus different input frequencies [11].

Figure 16. Power dissipation I.

Figure 17. The layout of the proposed ADC [11].

Figure 18. Distribution of 95 outcomes.

Figure 19. Time delay.

Figure 20. Power distribution in 625 cases.

Figure 21. Power dissipation II.

Table 1. MOS transistors size configuration of current mirror in front-end circuit.

MOS	MN1,1–MN4,1	MN1,2–MN1,4	MN2,2–MN2,4	MN3,2–MN3,4	MN4,2–MN4,4
W/L	20μ/600n	120n/600n	240n/600n	480n/600n	960n/600n

Table 2. MOS transistors size configuration of current mirror in front-end circuit.

MOS	MN1	MN2	MN3	MP1	MP2	MP3
W/L	600n/200n	120n/60n	120n/60n	3μ/60n	1μ/200n	240n/60n

Table 3. Comparison with previous works I.

	[18] *	[19] *	[20] ⁺	[21] ⁺	This Work ⁺
Technology (nm)	130	28	65	180	110
Active Area (mm²)	0.048	0.026	0.105	0.253	0.00255
Resolution (bits)	8	10	10	8	8
Supply (V)	1.2	0.9	1.2	1.8	1.2
Sampling Rate (MS/s)	150	100	50	32	70
SNDR (dB)	42.9	51.54	52.09	47.7	46.2
ENOB (bits)	6.83	8.27	8.36	7.63	7.4
Power	640 μW	1.1 mW	2.79 mW	2.4 mW	553 μW
FOM (fJ/Conv.-step)	37.5	35.6	169	378	47.26

*: testing results, and ⁺: post-layout simulation results.

Table 4. Comparison with previous works II.

	[22]	[7]	This Work ⁺
Technology (nm)	-	28	65
Time delay	400 ns	10 ns	5.2 μs
Resolution (level)	20	32	95

⁺: pre-simulation results.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
