A Novel 8T Cell-Based Subthreshold Static RAM for Ultra-Low Power Platform Applications

Abstract: Subthreshold SRAMs benefit various energy-constrained applications. Traditional 6T SRAMs exhibit poor cell stability under voltage scaling. To address this, several 8T to 16T cell designs have been reported to improve the stability. However, they either suffer from one of the disturbances or incur a large bit-area overhead. Furthermore, some cell options have limited write-ability. This paper presents a novel 8T static RAM for reliable subthreshold operation. The cell employs a fully differential scheme and features cross-point access. An adaptive cell bias for each operating mode eliminates the read disturbance and improves the write-ability as well as the half-select stability in a cost-effective small bit-area. The bit-cell can also support efficient bit-interleaving. To verify the SRAM technique, a 32-kbit macro incorporating the proposed cell was implemented in an industrial 180 nm low-power CMOS process. At 0.4 V and room temperature, the proposed cell achieves 3.6× better write-ability and 2.6× higher dummy-read stability compared with the commercialized 8T cell. The 32-kbit SRAM successfully operates down to 0.21 V (~0.27 V lower than the transistor threshold voltage). At its lowest operating voltage, the sleep-mode leakage power of the entire SRAM is 7.75 nW. The design results indicate that the proposed SRAM design, which is applicable to an aggressively scaled process, might be quite useful in realizing cost-effective, robust ultra-low-voltage SRAMs.


Introduction
Embedded SRAMs are a critical component in memory-rich network-on-chip or SoC designs. They are used as buffer memories or caches, and occupy a large portion of the area in current system VLSIs. Especially in energy-constrained low-speed biomedical devices and other emerging applications like wireless sensor networks and many wearable devices, subthreshold SRAM design that extends the system operating lifetime on a limited energy resource has become an increasingly important issue. Lowering the supply voltage (VDD) down to the transistor threshold-voltage (VTH) level decreases CMOS switching power quadratically and decreases first-order leakage power linearly [1]. However, reliable circuit design for proper subthreshold operation is extremely challenging because of reduced design margins and increased device variations.
The widely adopted conventional 6T SRAMs exhibit poor cell stability with VDD scaling, thus severely limiting the minimum operating voltage (VMIN). To cope with the stability problem in the deep-low-voltage regime, various SRAM cells composed of different numbers of transistors have been explored. The common approach utilizes a dedicated read port to separate the data storage nodes from the read path, thereby eliminating the read disturbance [2][3][4][5][6][7][8][9][10]. However, these read-decoupled (RD) cells suffer from the half-select disturbance while writing into a cell, hence they cannot support an efficient bit-interleaving structure. Some 8T, 9T, 10T, and 12T SRAM cells [11][12][13][14][15][16] employ cross-point (CP) access. The CP cells composed of 8 transistors [11,12] eliminate the half-select disturbance only. They still suffer from the read disturbance. Moreover, the CP 8T cell in [11] has a limited write capability owing to its serial access transistors. Other CP cell options composed of 9, 10, and 12 transistors [13][14][15][16] eliminate the read disturbance as well as the half-select disturbance. However, the cross-point cells in [13][14][15] incur a large bit-area overhead. Furthermore, the CP cells in [13,14,16] have a limited write capability owing to their serial access transistors. The other approaches (Schmitt-trigger-based 10T cell [17], P-P-N-based 10T cell [18], and 1-read/1-write 16T cell [19]) may achieve good read stability and can provide efficient bit-interleaving in the column direction. However, their bit-cell area overhead is as high as 110% [17,18] and 800% [19] compared with the traditional 6T cell.
Besides the cell stability issue, reducing the bitline switching power is also important for achieving a low-power SRAM. The memory bitlines usually have heavy capacitive loading. Whenever a write or read operation is performed, switching the heavy bitlines dissipates significant power.
In this paper, we present a novel 8T cell-based static RAM for robust subthreshold operation. Our previous work [20] outlined the bit-cell operation without detailed core and peripheral circuits. This work, which extends [20], examines the bit performance and all (read, write, row half-select, and column half-select) stability metrics of the proposed cell in greater detail, and verifies the subthreshold SRAM techniques in a 32-kbit prototype designed in an industrial CMOS process technology. This SRAM design greatly reduces the bitline switching power of non-selected columns, and successfully operates down to a deep-low supply voltage.
Electronics 2020, 9, 928

Bit-Cell Structure and Layout
Figure 1 depicts the configuration of the proposed cell, along with the commercialized RD-8T cell [2] for comparison. The proposed 8T cell consists of two load transistors (P1, P2), two drive transistors (N1, N2), two access transistors (N3, N4), and two conducting transistors (P3, P4) between the access device and the drive device. The wordline (WL) is row-based, and the bitline (BL), /bitline (/BL), and column-wise assistline (CAL) are column-based. The cell has a fully differential symmetric configuration, and uses relatively simple one-row/three-column control lines. Table 1 lists the bias conditions of the proposed cell for each operating mode.
[Figure 1: the proposed 8T cell and the commercialized RD-8T cell [2]. WL, wordline; BL, bitline; CAL, column-wise assistline; DN, data-node.]
Table 1. Operational conditions of the proposed 8T SRAM cell. WL, wordline; BL, bitline; CAL, column-wise assistline.
We implemented the proposed 8T and commercialized RD-8T cells in an industrial 180 nm low-power CMOS process. The average MOSFET VTH magnitude of this process is around 0.48 V (NMOS VTH = 0.49 V, PMOS VTH = −0.47 V). The sizes of the bit-cell transistors are listed in Table 2, and the resulting bit-cell layouts are shown in Figure 2, in which the layers from active to metal-1 are shown, including the n-well layer. In the proposed 8T cell, we use the minimum width and length for the load, drive, and conducting transistors. For the access transistors, we use the minimum length but a 1.91× enlarged width to enhance the cell write-ability. The active, poly, contact, via, and metal layers are shared as much as possible to minimize the bit-area. The area of the proposed cell is 2.31 × 4.02 µm², almost equal to that of the RD-8T cell.
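As a quick check on the bit-area figure, the following sketch multiplies out the reported cell dimensions and scales the result to 32 kbit. It counts bit-cell area only; peripheral circuits and any array-level overhead are not included, and the function and constant names are illustrative rather than taken from the paper.

```python
# Reported layout dimensions of the proposed 8T cell (from the text).
CELL_WIDTH_UM = 2.31
CELL_HEIGHT_UM = 4.02


def cell_area_um2(width_um: float = CELL_WIDTH_UM,
                  height_um: float = CELL_HEIGHT_UM) -> float:
    """Area of one bit-cell in square micrometres."""
    return width_um * height_um


def array_area_mm2(n_bits: int, area_per_cell_um2: float) -> float:
    """Cell-only area of n_bits cells in mm^2 (1 mm^2 = 1e6 um^2)."""
    return n_bits * area_per_cell_um2 / 1e6


area = cell_area_um2()
print(f"bit-cell area: {area:.2f} um^2")
print(f"32-kbit cell array (cells only): {array_area_mm2(32 * 1024, area):.3f} mm^2")
```

The macro area of the implemented design would be larger than this cell-only estimate, since sense amplifiers, boosters, and decoders are excluded.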

Standby Mode
The proposed 8T cell employs cross-point access. During standby, BL and /BL are precharged to VDD, WL is forced to ground, and the column-based CAL is forced to ground. This turns on the conducting transistors (P3, P4), so the cross-coupled latch (P1-N1, P2-N2) with on-state conducting transistors holds bi-stable data. For example, if the data-node (DN) stores a low state, the /data-node (/DN) maintains VDD because the conducting transistor P4 can pass a strong high state. If /DN is low, on the other hand, DN maintains a high state through the conducting transistor P3. Therefore, sufficient stability can be achieved in the standby mode.

Read Operation
Figure 3 depicts the read operation of the proposed cell. To perform a read operation, the two conducting transistors of the accessed cell are turned off by switching CAL to high, the bitline pair is fully discharged to ground, and then WL is boosted to VPP, where the boosted voltage (VPP) level is about 1.3VDD in this work. Either BL or /BL is charged depending on the data status. If the data-node (DN) and /data-node (/DN) store a 'low' and a 'high' value, respectively, a cell current (ICELL) through the P2 and N4 transistors flows to /BL. This raises the /BL voltage level, which can be compared with the pre-discharged BL value. If DN and /DN store a 'high' and a 'low' value, then the load transistor P1 conducts. Thus, ICELL flows to BL, raising the BL level, which can be compared with the discharged /BL. As the conducting transistors are turned off during the read access, there is no direct disturbance on the true storage nodes of our cell. The data-nodes are decoupled from the read path. This read mechanism, which is similar to that of the RD-8T cell, prevents disturbing the internally stored data. After reading the cell, WL and CAL return to ground in sequence, inducing the cross-coupled latch (P1-N1, P2-N2) with on-state conducting transistors to restore bi-stable data.
In Figure 4, we show the results of 5000 Monte Carlo (MC) simulations of the read operation. During the read access, CAL goes high for a moment, thus the data-nodes (DN, /DN) of the accessed cell are left floating temporarily. The floating data are actually stored on the gate capacitance of the load and drive transistors and the additional parasitic capacitances connected to the data-nodes. Nevertheless, the cell does not lose the data because the CAL pulse width, namely the CAL activation time needed to develop a sufficient bitline voltage difference, is much shorter than the retention time of the floating data [7,8,15]. Even though there is a small coupling noise caused by the up-down behavior of CAL and WL, the accessed cell still holds the data during the read operation. Ultimately, the read operation of the proposed cell is performed without disturbance from the bitlines.

Write Operation
Figure 5 illustrates the write operation of the proposed cell. At the beginning, the column-based CAL is lowered to a negative voltage NVGG, where the NVGG level is −0.85VDD in this work. This increases the conductivity of the conducting transistors. Let us assume that DN stores 'low', while /DN stores 'high'. To write a '1' to DN, /BL is discharged to ground. When WL is boosted to VPP (~1.3VDD in this work), the node /PN changes from 'high' to 'low'. Because the strength of the conducting PMOSs is sufficiently large, /DN can be easily discharged to ground. The fall of /DN causes the cross-coupled latch (P1-N1, P2-N2) to complete a flip of state.
Besides the read stability, the write capability of an SRAM is equally important for subthreshold operation. In Figure 6, the write-ability of our cell is compared with that of the commercialized RD-8T cell at the worst-case process corner (SF: slow-NMOS, fast-PMOS). To write an SRAM cell, one of the bitlines needs to be discharged. The write margin (WM), a metric of the write-ability, is defined here as the voltage of the discharging bitline at which the cell state flips. The larger the write margin, the easier it is to write. At VDD = 0.4 V and room temperature, the WM of the proposed cell measures 177.3 mV, 3.6× larger than that of the RD-8T cell. This is because we boost the wordline by 30% of the supply voltage to increase the conductivity of the access transistors. Such boosting provides strong write-ability even in the deep subthreshold region. The column-wise write-assist, which increases the strength of the conducting transistors, also facilitates changing the contents of the cell. In contrast, the wordline boosting technique may not be employed on the RD-8T cell, because the stability of half-selected cells during write access is significantly degraded by the boosted write-wordline voltage.
Figure 7 shows the WM distributions at 0.3 V, 25 °C, and the worst-case SF process corner, obtained from 10,000 Monte Carlo iterations. Statistically, our cell achieves a 3.7× higher mean WM than the RD-8T cell. The minimum WM, which is more critical in subthreshold SRAM design, is also 85 mV higher than that of the RD-8T cell. Overall, the proposed cell provides strong write-ability that results mainly from the wordline boosting and the column-wise write-assist.
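The mode-dependent biasing described for the standby, read, and write modes can be collected into a small lookup. The sketch below is only an illustration: VPP = 1.3·VDD and NVGG = −0.85·VDD are taken from the text, while the interpretation of "CAL high" as VDD, the assumption that the non-discharged bitline stays at VDD during a write, and the function and key names are hypothetical.

```python
def cell_bias(mode: str, vdd: float, write_one: bool = True) -> dict:
    """Return the control-line voltages (in volts) of the accessed cell.

    Assumptions (not stated explicitly in the text): 'CAL high' equals VDD,
    and during a write the bitline that is not discharged remains at VDD.
    """
    vpp = 1.3 * vdd      # boosted wordline level reported in the text
    nvgg = -0.85 * vdd   # negative CAL level reported in the text

    if mode == "standby":
        # Bitlines precharged, WL and CAL grounded: P3/P4 conduct.
        return {"WL": 0.0, "CAL": 0.0, "BL": vdd, "/BL": vdd}
    if mode == "read":
        # P3/P4 off, bitlines pre-discharged, WL boosted.
        return {"WL": vpp, "CAL": vdd, "BL": 0.0, "/BL": 0.0}
    if mode == "write":
        # Writing '1' to DN discharges /BL; writing '0' is assumed symmetric.
        bl, blb = (vdd, 0.0) if write_one else (0.0, vdd)
        return {"WL": vpp, "CAL": nvgg, "BL": bl, "/BL": blb}
    raise ValueError(f"unknown mode: {mode!r}")


for m in ("standby", "read", "write"):
    print(m, cell_bias(m, vdd=0.4))
```

At VDD = 0.4 V this gives a boosted wordline of 0.52 V and a CAL write level of −0.34 V, which follow directly from the two scaling factors quoted above.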

Row Half-Select
An efficient bit-interleaving architecture may not be applicable to some previous low-voltage SRAMs [2][3][4][5][6][7][8][9][10]. While writing into a cell in these SRAMs, other column cells sharing a selected row experience dummy-read operation, degrading their hold stability significantly. To cope with the row half-select disturbance, we employed a cross-point access scheme.
During a write or read operation in our SRAM, the CALs of the other column cells sharing a selected row are connected to ground, as depicted in Figure 8. Let us assume that DN holds '0', while /DN holds '1'. When a row is selected, the voltage division across three series devices (the access transistor (N3), the conducting transistor (P3) with poor '0' passing, and the drive transistor (N1)) strongly limits the voltage rise of DN, improving the dummy-read static noise margin (SNM). This is verified in the butterfly curve of Figure 9. When DN rises from '0' to '1' with BL = /BL = '1' and WL = high, /DN changes from '1' to '0'. In this voltage transfer curve, the logic '0' value of the /DN node in our cell is nearly 0 V because of the conducting transistor. This provides a nearly ideal butterfly curve, preferable for a stable SRAM. At VDD = 0.4 V and T = 25 °C, the dummy-read SNM of the proposed cell measures 130 mV, 2.6× higher than that of the RD-8T cell. Figure 10 shows the results of 10,000 Monte Carlo simulations at 0.3 V, 25 °C, and the worst-case FS (fast-NMOS, slow-PMOS) process corner. The results show that the mean and minimum values of the dummy-read SNM of the proposed cell are 2.7× and 3.5× higher than those of the RD-8T cell, respectively. As the stability of row half-selected cells is good enough, an efficient bit-interleaving architecture might be implemented in the proposed SRAM.
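The reported improvement ratios allow a rough back-calculation of the RD-8T baseline values at 0.4 V, which the text does not state directly. The sketch below does that arithmetic for the write margin (177.3 mV, 3.6×) and the dummy-read SNM (130 mV, 2.6×). Because the ratios are rounded to one decimal place, the results are approximate, and the helper name is illustrative.

```python
def implied_baseline_mv(proposed_mv: float, ratio: float) -> float:
    """Baseline value implied by 'proposed = ratio x baseline' (in mV)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return proposed_mv / ratio


# Values for the proposed cell and the ratios are those quoted in the text.
wm_rd8t = implied_baseline_mv(177.3, 3.6)    # write margin at 0.4 V
snm_rd8t = implied_baseline_mv(130.0, 2.6)   # dummy-read SNM at 0.4 V

print(f"implied RD-8T write margin:   ~{wm_rd8t:.1f} mV")
print(f"implied RD-8T dummy-read SNM: ~{snm_rd8t:.1f} mV")
```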

Column Half-Select
Meanwhile, writing or reading the proposed cell hardly affects the stability of the other cells sharing a selected column. While writing a cell, the WLs of the other row cells sharing a selected column are connected to ground, as depicted in Figure 11a. When the CAL of a selected column is lowered to a negative voltage to write a cell, the conductivity of the conducting transistors becomes stronger. Thus, the cell stability, which is shown in Figure 11b, becomes nearly the same as that in the standby mode. Hence, the unwritten cells sharing a selected column can retain their data securely during the write operation.
In Figure 12a, we show the status of column half-selected cells while reading the proposed cell. During the read operation, the WLs of the other row cells sharing a selected column are connected to ground. When the CAL of a selected column goes high, the storage nodes (DN, /DN) of the column half-selected cells are left floating for a moment. The gate capacitance of the load and drive transistors and the additional parasitic capacitances hold the floating data during the high-CAL period. As with the read-accessed cell, the half-selected cells can maintain their data because the high-CAL period needed for a successful read operation is much shorter than the data retention time of the floating storage nodes. This was verified by the Monte Carlo simulations shown in Figure 12b. Although the column half-selected cells experience a small coupling noise caused by the up-down behavior of CAL, they still retain the data for an 87 µs CAL activation time, which is much longer than that required for a 0.3 V read operation. Thus, reading a cell does not cause the other row cells sharing a selected column to lose their stored data.
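The retention argument for a floating storage node can be illustrated with a first-order charge-loss model, in which a constant leakage current drains the node capacitance, so that the retention time is roughly C·ΔV / I_leak. This model and every numeric value in the example are assumptions chosen for illustration; the text reports only the simulated 87 µs figure and gives no capacitance or leakage values. The sketch simply checks whether a given CAL pulse width leaves a chosen margin against the retention time.

```python
def retention_time_s(node_cap_f: float, allowed_droop_v: float,
                     leak_current_a: float) -> float:
    """First-order retention time of a floating node: t = C * dV / I_leak."""
    if leak_current_a <= 0:
        raise ValueError("leakage current must be positive")
    return node_cap_f * allowed_droop_v / leak_current_a


def cal_pulse_is_safe(cal_pulse_s: float, retention_s: float,
                      safety_factor: float = 10.0) -> bool:
    """True if the CAL-high period is shorter than retention / safety_factor."""
    return cal_pulse_s * safety_factor <= retention_s


# Hypothetical node values: 1 fF, 0.1 V tolerable droop, 1 pA leakage.
t_ret = retention_time_s(1e-15, 0.1, 1e-12)
print(f"modelled retention time: {t_ret * 1e6:.0f} us")

# Using the 87 us figure quoted in the text with a hypothetical 1 us CAL pulse.
print("1 us pulse safe:", cal_pulse_is_safe(1e-6, 87e-6))
```

The safety factor is a design choice in this sketch, not a value from the paper; it stands in for the qualitative statement that the CAL pulse is "much shorter" than the retention time.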

Macro Organization
To verify the proposed SRAM cell and its subthreshold operation, we implemented a 32-kbit SRAM macro with an industrial 180 nm low-power CMOS process. Figure 13 shows its basic schematic diagram. The 32-kbit memory macro consists of two 16-kbit memory blocks. Each block contains 128-row × 128-column array. The I/O is eight bits wide. In this design, each high capacitive bitline pair has a local sense amplifier (SA) to reduce the read delay under deep-low voltage regime. During read access, the bitline SAs forward the full-swing read signals to the block sense amplifiers dedicated to each 16-kbit block. In addition, the macro includes two wordline boosters dedicated to each 16-kbit block and one negative voltage generator supplying the NV GG voltage. The write drivers; column signal drivers; and other peripheral units like address buffers, control buffers, data I/O buffers, predecoders, and control logic use static CMOS circuits. The whole memory macro operates at single supply voltage.  Figure 14 shows the configuration of one block SRAM array, and Figure 15 illustrates its read and write signal waveforms. The design employs an eight-column activating architecture to minimize the array power consumption. During the standby, the bitline pairs and datalines (DL, /DL) are precharged to VDD, and WL and CAL are connected to ground.  Figure 14 shows the configuration of one block SRAM array, and Figure 15 illustrates its read and write signal waveforms. The design employs an eight-column activating architecture to minimize the array power consumption. During the standby, the bitline pairs and datalines (DL, /DL) are precharged to V DD , and WL and CAL are connected to ground.

Cell Array Architecture
In the read access, the conducting transistors of accessed cells are turned off by switching CAL to high, and the V DD prechargers are disabled. Next, the selected bitlines are discharged to ground and then floated by toggling RBLP signal. To read out each stored value of the selected cells, WL is switched to V PP . The cell read-current charges one of the bitlines depending on the stored information. Then, the sense amplifiers are enabled by triggering SAN and SAP. After WL is switched to ground, the column gates are turned on to transfer the read signals to the datalines.
In the write access, the external data drive the datalines. After lowering CAL to a negative voltage NV GG , the V DD prechargers are disabled. Next, BL and /BL are driven with the external data by raising a column signal Y. Then, WL is switched to V PP , allowing the external data to be written. After WL and Y return to ground, the bitlines are charged to V DD , and CAL is switched to ground.
Figure 16 illustrates a natural benefit of the proposed cell structure. During a read or write cycle, the CALs of the row half-selected cells within a 128-row × 128-column array are connected to ground. When a WL is activated, the row half-selected cells may discharge the bitlines. However, the current path from the VDD-precharged bitline to the data '0' node suffers a stack effect because of the conducting transistor. Moreover, the conducting PMOS turns off harder owing to its reduced source-to-gate voltage and negative source-to-body potential. Together, these effects drastically suppress the subthreshold current from the bitline to the cell '0' node. This behavior has been confirmed by simulation. Notably, in the simulated waveforms the BL and /BL swings of non-selected columns are nearly 0 V in both the read and write cycles. The proposed 8T SRAM structure therefore greatly reduces the power that would otherwise be wasted by the bitline swing of non-selected columns.
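To give a feel for why a reduced source-to-gate voltage suppresses the half-select leakage so strongly, the sketch below uses the textbook first-order subthreshold model, in which the current scales as exp(VSG / (n·vT)). The slope factor n = 1.5, the 300 K temperature, and the 100 mV example are assumptions rather than values from this work, and the body effect and DIBL are ignored.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_k / ELECTRON_CHARGE_C


def subthreshold_current_ratio(delta_vsg, slope_factor=1.5, temp_k=300.0):
    """Relative change in subthreshold current for a gate-drive change delta_vsg.

    First-order model only: I ~ exp(VSG / (n * vT)). slope_factor (n) is an
    assumed, process-dependent value, not a parameter reported in the paper.
    """
    return math.exp(delta_vsg / (slope_factor * thermal_voltage(temp_k)))


if __name__ == "__main__":
    # Hypothetical example: the gate drive of the conducting PMOS drops by 100 mV.
    ratio = subthreshold_current_ratio(-0.1)
    print(f"current scales by a factor of {ratio:.3f}")
```

Under these assumptions, a 100 mV loss of gate drive already cuts the subthreshold current by more than a factor of ten, which is in line with the qualitative argument above; the stack effect and the negative source-to-body potential would reduce it further.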

Figure 17 shows the row path circuitry used in this work. The wordline booster providing the boosted voltage VPP was implemented with a traditional circuit using one boosting capacitor (CVPP). The booster output WBOUT is the supply node of the 128 row decoders. Each row decoder consists of a static CMOS gate followed by a level shifter. During standby, the node WBOUT is connected to VDD. Whenever a write or read operation is performed, a positive transition of the control signal VPPPRE makes WBOUT float. A subsequent positive transition of the control signal VPPEN allows the boosting capacitor to push the voltage of the WBOUT node above VDD. Then, the selected row decoder transfers the boosted VPP level (about 1.3VDD in this design) to the wordline.

Negative Voltage Generator
Figure 18 shows our negative voltage scheme, which provides the negative voltage NVGG. The negative voltage generator in this work consists of an oscillator, a negative charge pump utilizing a voltage doubler [21], and a level detector. The negative pump has two charge-transfer paths, which transfer charge alternately during each oscillation period.
When the external snooze enable input (/ZZ) is low, the memory macro enters a low-power sleep mode with all data preserved. In this mode, the negative voltage generator is disabled and incoming read or write requests are blocked. During the sleep mode, the negative pump output NPOUT is connected to ground. To enter the normal mode (standby, read, or write), /ZZ has to be raised and held at a high level. The high /ZZ input activates the level detector and initiates the pumping oscillation. Once the pumped voltage falls below the target value, the level detector turns the oscillator off. As shown in Figure 19a, when VDD = 0.4 V, the NPOUT potential initially settles near the target value (−0.85VDD) 65 µs after /ZZ is triggered.
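The level-detector-gated pumping of the negative voltage generator can be mimicked with a toy behavioral model. The Python sketch below is not the authors' circuit: the per-step leakage drift (`leak_v`) and pump step (`pump_v`) are hypothetical values chosen only to make the bang-bang behavior visible, and only the −0.85VDD reference is taken from the text.

```python
def simulate_npout(vdd=0.4, steps=2000, leak_v=0.0005, pump_v=0.002, start_v=0.0):
    """Toy bang-bang model of the NPOUT node.

    Each step, leakage drifts NPOUT toward ground by leak_v (assumed constant),
    and the pump pulls it down by pump_v whenever the level detector sees
    NPOUT above the reference. Returns (reference, [(npout, oscen), ...]).
    """
    reference = -0.85 * vdd          # target level stated in the text
    npout = start_v                  # NPOUT is grounded during sleep mode
    trace = []
    for _ in range(steps):
        oscen = npout > reference    # level detector output enables the oscillator
        npout += leak_v              # leakage raises NPOUT toward ground
        if oscen:
            npout -= pump_v          # one pumping step while the oscillator toggles
        trace.append((npout, oscen))
    return reference, trace


if __name__ == "__main__":
    reference, trace = simulate_npout()
    settled = trace[-500:]
    voltages = [v for v, _ in settled]
    duty = sum(1 for _, on in settled if on) / len(settled)
    print(f"reference: {reference:.3f} V")
    print(f"settled range: {min(voltages):.4f} V to {max(voltages):.4f} V")
    print(f"pump duty after settling: {duty:.2f}")
```

In this model the pump runs continuously until NPOUT first crosses the reference and then fires only intermittently to cancel the drift, so the ripple is set by the assumed step sizes rather than by any circuit detail.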

During the normal mode, on the other hand, the NPOUT potential varies continuously because of leakage currents and the current drawn by a selected CAL driver during write access. When the NPOUT potential rises above the reference level (−0.85VDD), the level detector output (OSCEN) goes high and the oscillator starts toggling. When the NPOUT potential is pulled back below the reference level, OSCEN goes low again, disabling the pumping oscillator. That is, the pumping circuit operates intermittently during SRAM operation. Figure 19b displays the NPOUT voltage variation under successive write operations with a period of 6.2 µs. As shown in the simulation, the negative voltage generator in this work holds the NPOUT voltage at the target value (−0.34 V for VDD = 0.4 V) with a maximum deviation of ±12 mV.
Figure 20 shows the CAD plot of the SRAM layout. The 32-kbit macro is partitioned into two 16-kbit blocks. The wordline decoding circuits are located between the two memory blocks. The organization is 4096-word × 8-bit. The on-chip negative voltage generator is placed above the memory blocks. The wordline boosters, write drivers, block sense amplifiers, and other peripheral units are placed at the bottom. The macro size is 700 × 784 µm². The layout area overhead incurred by the two wordline boosters is 0.41% of the macro size, while the negative voltage generator together with the 32 CAL drivers (shown in Figure 18) consumes 4.75% of the macro area.
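The organization and area figures quoted for the macro can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text; the function names are hypothetical, and the idea that one CAL driver serves each eight-column group is an assumption that happens to reproduce the 32 CAL drivers mentioned above.

```python
def macro_organization(blocks=2, rows=128, cols=128, io_bits=8):
    """Derive word count and column grouping from the stated array dimensions."""
    total_bits = blocks * rows * cols
    words = total_bits // io_bits
    groups_per_block = cols // io_bits      # eight columns are activated per access
    return {
        "total_bits": total_bits,
        "words": words,
        "address_bits": words.bit_length() - 1,   # exact log2 for a power of two
        "column_groups_per_block": groups_per_block,
        "column_groups_total": blocks * groups_per_block,
    }


def area_um2(width_um=700.0, height_um=784.0, fraction=1.0):
    """Area in square micrometres of a given fraction of the macro."""
    return width_um * height_um * fraction


if __name__ == "__main__":
    for key, value in macro_organization().items():
        print(f"{key}: {value}")
    print(f"macro area: {area_um2():.0f} um^2")
    print(f"wordline boosters (0.41%): {area_um2(fraction=0.0041):.0f} um^2")
    print(f"negative generator + CAL drivers (4.75%): {area_um2(fraction=0.0475):.0f} um^2")
```

The result is consistent with the text: two 128 × 128 blocks give 32,768 bits, that is 4096 words of 8 bits, addressed by 12 bits.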
Figure 21 shows the internal voltage waveforms at a 0.4 V supply. All parasitic resistances and capacitances are included in the simulations. The boosted VPP level is 0.52 V, and the negatively pumped NVGG level is −0.34 V. In the read access, the voltage sensing margin of the bitline sense amplifier is larger than 100 mV. The read delay from CAL activation to the block sense amplifier output (BSAOUT) is 4.2 µs. In the write access, new data is written 1.2 µs after CAL activation. Extensive simulations have been performed for supply voltages from 0.6 V down to 0.15 V. The 32-kbit SRAM is fully functional down to 0.21 V. The minimum operating voltage is limited by the read operation: when VDD is lower than 0.21 V, the bitline latch SA cannot correctly sense the voltage difference on the bitline pair.
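The internal bias levels quoted for the 0.4 V case follow directly from the ratios given earlier (VPP of about 1.3VDD and a target NVGG of −0.85VDD). The short Python sketch below reproduces them; the function name is hypothetical, and applying the same ratios at other supply voltages is an assumption, since the text reports the levels only at 0.4 V.

```python
def bias_levels(vdd, vpp_ratio=1.3, nvgg_ratio=-0.85):
    """Internal bias levels derived from the supply using the stated ratios."""
    return {"VPP": vpp_ratio * vdd, "NVGG": nvgg_ratio * vdd}


if __name__ == "__main__":
    # 0.4 V is the simulated case in the text; 0.21 V is the reported minimum
    # supply, where the constant ratios are assumed rather than confirmed.
    for vdd in (0.4, 0.21):
        levels = bias_levels(vdd)
        print(f"VDD = {vdd:.2f} V: VPP = {levels['VPP']:.3f} V, NVGG = {levels['NVGG']:.3f} V")
```

At VDD = 0.4 V this gives VPP = 0.52 V and NVGG = −0.34 V, matching the simulated levels.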
Figure 22 shows the simulated maximum operating frequency and the power consumption at the maximum frequency. The leakage power represents the static leakage power consumption of the entire SRAM during the low-power sleep mode. The read power consumption is a little larger than the write power, because both BL and /BL of selected columns are fully discharged to ground in read

Discussion
In this study, we demonstrated a novel cross-point cell-based 32-kbit subthreshold SRAM in an industrial 180 nm low-power CMOS process. The memory cell consists of eight symmetric transistors, in which the latch storing data is controlled by a column-based assistline. The bit-cell employs a fully differential read and write scheme. In the read access, the true storage nodes are dynamically separated from the read path, thus eliminating the read disturbance. During dummy-read operation, our cell keeps the data '0' node close to ground, thereby improving the dummy-read stability. In the write access, a capacitive wordline boosting technique enhances the cell write-ability. At VDD = 0.4 V, the proposed 8T cell achieves 3.6× better write-ability and 2.6× higher dummy-read stability compared with the commercialized RD-8T cell. As silicon results are not available, we confirmed by simulation that the SRAM is fully functional down to 0.21 V (~0.27 V lower than the transistor threshold voltage) and that all the bit-cells in the 32-kbit array are stable in each operating mode down to the minimum operating voltage. The proposed SRAM design is applicable to an aggressively-scaled process.
Table 3 compares the features of this work with those of prior SRAMs [4,7,9,12–15,18,19] for subthreshold operation. Because SRAM performance is very sensitive to process technology, transistor characteristics, memory array size, and I/O interface, it is difficult to compare various SRAMs point by point. Instead, we compare the key features of each cell comprehensively. Our SRAM cell has several merits that are desirable when building a robust subthreshold SRAM. Our cell uses relatively simple one-row/three-column control lines and adopts a fully differential read/write scheme. The differential cell is more robust than a single-ended one. The differential read exhibits a shorter read delay with an enlarged sensing margin, while the differential write contributes to better write performance. In addition, the proposed cell eliminates the read disturbance and simultaneously enlarges the write-ability and the half-select stability. Unlike [13–15,18,19], all the cell stability metrics are improved in a cost-effective small bit-area. Unlike [4,7,9], our cell can support an efficient bit-interleaving structure. Unlike [12], our cell does not suffer from read disturbance. Moreover, our cell structure provides a natural benefit of reducing the wasted power consumption caused by the bitline swing of non-selected columns. Accordingly, we believe that the proposed cell and circuit techniques might be quite useful in realizing cost-effective, robust ultra-low-voltage SRAMs.

Table 3. Comparison summary with prior works.