1. Introduction
The rapid growth of Artificial Intelligence (AI) and Machine Learning (ML) applications has significantly increased the demand for energy-efficient hardware accelerators capable of performing large-scale multiply-and-accumulate (MAC) operations [1,2,3]. Deep neural network workloads require repeated movement of input activations, weights, and partial sums between memory and processing units, resulting in high latency and energy consumption in conventional von Neumann architectures. As a result, data movement has become a major bottleneck in modern AI hardware systems.
Compute-in-memory (CIM) has emerged as a promising approach to reduce this memory-access overhead by allowing computation inside or near the memory array [4,5,6]. Among different memory technologies, Static Random Access Memory (SRAM)-based CIM architectures are widely explored due to their high-speed operation, CMOS compatibility, and reliable memory behavior. SRAM-CIM designs can generally be classified into analog and digital approaches. Analog CIM architectures perform computation via bitline charge-sharing or voltage-domain accumulation, but they suffer from limited signal margin, nonlinearity, sensitivity to process variation, and reduced precision [7]. These limitations make high-precision MAC operation challenging, especially for low-voltage and edge-AI applications.
Digital CIM architectures address many of the accuracy limitations of analog CIM by using logic-based multiplication and digital accumulation [8]. Since the accumulated output bit-width can be extended to the required precision, digital CIM enables more reliable, loss-free MAC computation. However, this comes at the cost of additional area and power overhead from modified SRAM bitcells, adder trees, shifters, and peripheral logic. Therefore, reducing the bitcell-level overhead while preserving correct memory and computation functionality remains an important design challenge.
Recent work by Tyagi and Mittal proposed an all-digital 6T SRAM-based CIM macro that features a compact 2T multiplier for bitwise multiplication [9]. Their design improves over earlier NOR-gate-based multiplier approaches by reducing multiplier transistor count and eliminating the need for input inversion. The 6T+2T architecture supports reconfigurable input and weight precision, digital MAC computation, concurrent MAC and weight update operations, and wide supply-voltage operation. However, the storage portion of the bitcell still relies on the conventional 6T SRAM structure, which contributes to the total transistor count and area overhead when scaled to large memory arrays.
In parallel, reduced-transistor SRAM cells such as 4T SRAM have been explored for near-threshold and low-power memory operation in scaled CMOS technologies [10]. A lower transistor count can improve area density and reduce parasitic capacitance, thereby reducing switching energy. Motivated by this, the present work investigates a compact 4T+2T SRAM-based digital CIM bitcell implemented in 45 nm CMOS technology. The proposed design replaces the conventional 6T storage cell with a 4T SRAM cell while retaining the compact 2T multiplier path for bitwise computation. The objective is to evaluate whether the transistor count and hardware overhead can be reduced while maintaining correct write, read, and CIM functionality.
This work focuses on bitcell-level and small array-level validation of the proposed 4T+2T architecture. The design is evaluated through transient simulation, delay measurement, power and energy analysis, Monte Carlo variation study, CIM waveform validation, and noise/SNR estimation. The results are compared with the reference 6T+2T structure to analyze the potential benefits and trade-offs of the proposed reduced-transistor CIM bitcell. The remainder of this paper is organized as follows. Section 2 presents the background, research gap, and proposed implementation. Section 3 describes the proposed architecture and methodology. Section 4 discusses simulation results and performance analysis. Finally, Section 5 discusses the results, limitations, and future work.
Beyond SRAM, alternative non-volatile memory technologies have also been actively explored as computational substrates for in-memory computing. Resistive RAM (ReRAM) has attracted considerable attention due to its crossbar array structure, which enables highly parallel analog MAC operations through Ohm’s law and Kirchhoff’s current law and offers potentially higher area density and energy efficiency than SRAM-based approaches. Prior work has explored a wide range of CIM designs, including analog bitline-based SRAM approaches [11,12,13], all-digital CIM macros [14,15], compute-in-memory circuits for deep learning [16,17,18], classifier implementations in standard SRAM arrays [19,20], ReRAM-based CIM designs [21], and high-precision MAC-oriented SRAM architectures [22,23,24,25]. Chen et al. provide a comprehensive review of ReRAM-based processing-in-memory architectures for neural network acceleration, discussing various design schemes and their associated challenges, including device non-linearity, soft and hard faults, and limited precision [26]. Phase-change memory (PCM) has similarly been explored as a CIM substrate, leveraging multi-level cell programming to store multi-bit weights in a compact footprint. More recently, embedded resistive RAM (eRRAM) implementations in advanced CMOS nodes have demonstrated logic-compatible integration without extra masks, as reported by Huang et al. for a 2T bipolar eRRAM macro fabricated in a 28 nm HKMG process [27]. While these non-volatile CIM approaches offer compelling advantages in standby power and density, they face challenges in write endurance, device variability, and CMOS process compatibility. These challenges make SRAM-based digital CIM a preferred choice for applications requiring high reliability, full CMOS compatibility, and deterministic digital computation, which motivates the present work.
2. Background
SRAM-based CIM architectures aim to overcome the data-transfer bottleneck by allowing memory arrays to participate directly in computation. In machine learning workloads, MAC operations dominate the overall computation, and repeated access to stored weights creates significant energy overhead in conventional memory-processing systems. CIM reduces this overhead by reusing memory cells not only for storage but also for computation. Analog CIM approaches commonly perform multiplication or accumulation through bitline voltage development, charge sharing, or current-domain summation. While these techniques can offer high energy efficiency, they are limited by signal margin, nonlinearity, device variation, and the need for ADC-based readout. These limitations become more severe as the required computation precision increases.

Digital CIM architectures improve computational accuracy by performing bitwise multiplication and accumulation using digital logic. Since the accumulation path is digital, the output precision can be extended to support the required MAC bit-width. The reference 6T+2T CIM design integrates a conventional 6T SRAM bitcell with a 2T multiplier to perform bitwise multiplication between the stored weight and the input activation [9], as illustrated in Figure 1. In this structure, the complement of the stored weight is already available inside the SRAM cell, so the 2T multiplier generates its output without requiring additional input inversion. This reduces complexity compared with earlier NOR-gate-based multiplier structures and improves delay and energy efficiency.

Although the 6T+2T CIM architecture improves over prior digital CIM bitcells, its storage portion still uses a conventional 6T SRAM cell. Since CIM macros are composed of a large number of repeated bitcells, even a small reduction in transistor count at the bitcell level can yield measurable reductions in parasitic capacitance and aggregate switching energy. This creates a research gap: most prior digital CIM work focuses on improving the multiplier or the peripheral MAC circuitry, while the storage cell itself remains relatively transistor-intensive. A reduced-transistor SRAM cell integrated with a compact multiplier can reduce switching overhead, provided that stable storage and correct computation are maintained.

The proposed work addresses this gap by replacing the 6T SRAM storage cell with a compact 4T SRAM cell and integrating it with the 2T multiplier concept, forming a 4T+2T digital CIM bitcell. The 4T SRAM cell stores the weight value, while the 2T multiplier receives the input activation and produces the bitwise product of the input and the stored weight. Compared with the 6T+2T reference design, the proposed 4T+2T structure reduces the total number of transistors from eight to six. This reduction lowers device-level switching capacitance, which is reflected in the measured power and energy results reported in this work.

The proposed implementation is evaluated in 45 nm CMOS technology. The bitcell is first validated for write and read functionality to confirm stable memory operation. The CIM mode is then verified by applying input activation pulses while maintaining the stored weight inside the SRAM cell. The output waveform is analyzed to confirm correct bitwise multiplication behavior. In addition, delay, average power, energy, Monte Carlo variation, and noise/SNR behavior are evaluated and compared with the 6T+2T reference structure. At this stage, the work focuses on bitcell-level feasibility and comparative analysis rather than complete macro-level integration.
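To make the scaling argument concrete, the short Python sketch below counts bitcell transistors for an array of repeated cells. It is a minimal illustration under stated assumptions: only the storage-plus-multiplier transistors stated in the text (eight versus six per cell) are counted, peripheral circuits such as adder trees, shifters, and drivers are ignored, and the 256 × 64 array size is hypothetical.

```python
# Transistor counts per bitcell, as stated in the text.
TRANSISTORS_6T2T = 8  # 6T SRAM storage + 2T multiplier (reference)
TRANSISTORS_4T2T = 6  # 4T SRAM storage + 2T multiplier (proposed)


def array_transistor_count(rows: int, cols: int, per_cell: int) -> int:
    """Bitcell transistors in a rows x cols array (peripheral logic excluded)."""
    return rows * cols * per_cell


def bitcell_savings(rows: int, cols: int) -> tuple:
    """Absolute and fractional transistor savings of 4T+2T over 6T+2T."""
    ref = array_transistor_count(rows, cols, TRANSISTORS_6T2T)
    new = array_transistor_count(rows, cols, TRANSISTORS_4T2T)
    return ref - new, (ref - new) / ref


# Hypothetical 256 x 64 array, used only as an illustration.
saved, fraction = bitcell_savings(256, 64)
print(saved, f"{fraction:.0%}")  # prints: 32768 25%
```

The fractional saving is 25% for any array size, because it is fixed by the per-cell counts; transistor count is only a proxy, and the actual area saving depends on layout.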
3. Proposed Architecture & Methodology
This section presents the proposed SRAM-based digital compute-in-memory (CIM) architecture, which uses a compact 4T SRAM storage cell integrated with a 2T bitwise multiplier. The main objective of the proposed design is to reduce transistor count while preserving memory functionality and enabling in-memory bitwise computation. In the proposed approach, the SRAM cell stores the weight bit, while the input activation is applied to the multiplier path to generate the bitwise multiplication output. The design was implemented and evaluated in 45 nm CMOS technology using HSPICE simulations. All simulations were performed at a supply voltage of , temperature of , and output load capacitance of .
3.1. Proposed 4T+2T Bitcell
The proposed 4T+2T bitcell consists of two functional parts: a 4T SRAM storage section and a 2T multiplier section. The 4T SRAM section is responsible for storing the binary weight at the internal storage nodes Q and QB. The additional 2T multiplier path uses the stored data and the input activation to produce the bitwise CIM output.
Compared to a conventional 6T SRAM cell, the 4T SRAM structure uses fewer transistors, thereby improving memory density and reducing device-level switching overhead, as shown in Figure 2. This makes the proposed structure suitable for compact CIM arrays, where both storage density and computation efficiency are important. Since the proposed design retains the basic storage functionality while adding a separate multiplier output path, the read/write operation and the CIM operation can be evaluated independently.
In CIM mode, the stored SRAM data acts as the weight, and the external input activation is applied to the multiplier path. The multiplier output node represents the bitwise product of the input activation and the stored weight. When the stored weight is logic high and the input activation switches high, the multiplier output also switches high, corresponding to the 1 × 1 = 1 case. For all other input-weight combinations, the output remains at logic low. Therefore, the proposed 4T+2T bitcell supports bitwise multiplication inside the memory structure without requiring a separate external multiplier for the evaluated bit-level operation.
For performance evaluation, conventional SRAM read and write operations were analyzed along the word-line and bit-line paths, whereas CIM operations were analyzed along the input-to-multiplier-output path. Specifically, CIM delay is measured from the input activation node to the multiplier output node, since this path directly represents the bitwise computation behavior of the proposed cell. The same nominal simulation conditions (supply voltage, temperature, and load capacitance) are used for read, write, and CIM performance evaluation to ensure a consistent comparison across the bitcell designs.
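The input-to-output delay measurement described above can be reproduced from exported waveform samples in the same trigger/target style used by SPICE `.measure` statements. The Python sketch below is a minimal post-processing example, assuming 50%-of-VDD trigger and target levels (a common convention; the text does not state the thresholds used) and rising edges on both the input activation and the multiplier output. The function names are hypothetical.

```python
from typing import Optional, Sequence


def first_crossing(t: Sequence[float], v: Sequence[float], level: float,
                   rising: bool, t_start: float = 0.0) -> Optional[float]:
    """Linearly interpolated time at which v crosses `level` at or after t_start."""
    for i in range(len(t) - 1):
        if t[i + 1] < t_start:
            continue  # segment lies entirely before the search window
        v0, v1 = v[i], v[i + 1]
        crossed = (v0 < level <= v1) if rising else (v0 > level >= v1)
        if crossed:
            tc = t[i] + (level - v0) * (t[i + 1] - t[i]) / (v1 - v0)
            if tc >= t_start:
                return tc
    return None


def cim_delay(t: Sequence[float], v_in: Sequence[float],
              v_out: Sequence[float], vdd: float) -> Optional[float]:
    """50%-to-50% delay from the input activation edge to the output edge."""
    half = 0.5 * vdd
    t_trig = first_crossing(t, v_in, half, rising=True)
    if t_trig is None:
        return None
    t_targ = first_crossing(t, v_out, half, rising=True, t_start=t_trig)
    return None if t_targ is None else t_targ - t_trig
```

Searching for the output crossing only after the input crossing mirrors a trigger/target measurement and avoids pairing the input edge with an earlier, unrelated output transition.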
The truth table of the proposed multiplier operation is shown in Table 1.
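The expected truth table follows directly from the behavior described above: the output is high only when both the stored weight and the input activation are high, which is logically a two-input AND. The Python sketch below is a behavioral reference for that table only; it does not model the transistor-level 2T circuit, and the labels `IN`, `W`, and `OUT` are chosen here for illustration.

```python
def multiply_bit(in_act: int, weight: int) -> int:
    """Ideal 1-bit CIM product: high only when input and weight are both 1."""
    if in_act not in (0, 1) or weight not in (0, 1):
        raise ValueError("operands must be single bits (0 or 1)")
    return in_act & weight


# Enumerate the four input/weight combinations of the truth table.
TRUTH_TABLE = [(w, a, multiply_bit(a, w)) for w in (0, 1) for a in (0, 1)]
for w, a, out in TRUTH_TABLE:
    print(f"W={w}  IN={a}  ->  OUT={out}")
```

A table like this is useful as the expected-value reference when checking simulated output levels against each input-weight combination.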
By integrating the bitwise multiplication path directly with the memory cell, the proposed 4T+2T bitcell reduces the need to transfer stored weight data to external computation units. The stored SRAM data is used locally during CIM operation, while the input activation propagates through the 2T multiplier path to generate the output. Furthermore, the reduced transistor count can lower parasitic and switching capacitance, thereby improving power and energy efficiency.
Figure 3 illustrates the CIM operation flow of the proposed 4T+2T bitcell. In CIM mode, the stored weight remains inside the SRAM cell, while the input activation is applied to the 2T multiplier path. The resulting output signal represents the bitwise multiplication result used for compute-in-memory operation.
3.2. Overall Architecture and Methodology
The proposed CIM architecture is constructed using an array of 4T+2T bitcells arranged in rows and columns. Each bitcell stores a binary weight in the SRAM storage nodes and uses the additional 2T multiplier path to generate a bitwise product with the applied input activation. In memory mode, the cell operates as a conventional SRAM bitcell for read and write access. In CIM mode, the stored weight remains inside the memory cell, while the input activation is applied to the multiplier path to produce the bitwise output.
During CIM operation, input activation bits are applied to the multiplier path of the selected bitcells. Each cell locally computes a bitwise product using the stored weight and the input activation. The generated bitwise products can then be processed by peripheral digital circuits, such as adder trees, barrel shifters, and accumulators, to complete a multibit MAC operation. In this evaluation, the cell-level CIM functionality is assessed by measuring the propagation delay, power, and energy from the input activation node to the multiplier output node.
For a multibit MAC operation, the multiplication between an input activation and a stored weight can be performed using bit-level partial products. For a 4-bit input activation $IN_r = (IN_{r,3}\,IN_{r,2}\,IN_{r,1}\,IN_{r,0})_2$ and a 4-bit weight, the MAC operation can be expressed as:

$$
\mathrm{MAC} = \sum_{r=1}^{R} \left( 2^{3}\, IN_{r,3}\, W_r + 2^{2}\, IN_{r,2}\, W_r + 2^{1}\, IN_{r,1}\, W_r + 2^{0}\, IN_{r,0}\, W_r \right)
$$

where $IN_{r,0}$ to $IN_{r,3}$ represent the 4-bit input activation bits of the $r$-th row, $W_r$ represents the stored weight bit, and $R$ is the number of rows participating in the CIM operation. Each term represents a bitwise multiplication between an input activation bit and the stored weight bit, followed by a shift according to the input bit significance. The shifted partial sums are then accumulated to generate the final MAC output.
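The shift-and-add structure of this MAC expression can be checked with a small behavioral model. The Python sketch below computes, for one column of stored weight bits, the sum over rows r and input bit positions k of 2^k · IN[r,k] · W[r], exactly as in the expression above. The `multibit_mac` extension, which combines one column per weight bit with an additional 2^j shift, is an assumption about how the peripheral shifters and accumulators would complete a 4-bit × 4-bit MAC; it is not specified in this work. Only unsigned operands are modeled.

```python
from typing import Sequence


def column_mac(inputs: Sequence[int], weight_bits: Sequence[int],
               in_bits: int = 4) -> int:
    """Bit-serial MAC for one column of stored weight bits W[r]:
    sum over rows r and input bits k of 2**k * IN[r, k] * W[r]."""
    total = 0
    for k in range(in_bits):                  # input bit significance
        partial = 0
        for x, w in zip(inputs, weight_bits):
            in_bit = (x >> k) & 1             # IN[r, k]
            partial += in_bit & w             # 1-bit product from each bitcell
        total += partial << k                 # shift by input bit significance
    return total


def multibit_mac(inputs: Sequence[int], weights: Sequence[int],
                 in_bits: int = 4, w_bits: int = 4) -> int:
    """Hypothetical extension: one column per weight bit j, combined with a
    2**j shift in the periphery (an assumption, not from the text)."""
    total = 0
    for j in range(w_bits):
        col = [(w >> j) & 1 for w in weights]
        total += column_mac(inputs, col, in_bits) << j
    return total


inputs = [3, 15, 0, 9]    # hypothetical 4-bit activations, one per row
weights = [5, 2, 7, 12]   # hypothetical 4-bit weights, one per row
print(multibit_mac(inputs, weights),
      sum(x * w for x, w in zip(inputs, weights)))  # prints: 153 153
```

Matching the bit-serial result against the plain integer dot product is a quick way to confirm that the shift amounts and bit ordering are consistent.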
The architecture supports two modes of operation:
Memory Mode: Standard SRAM read and write operations are performed through the word-line and bit-line paths to store or update weight values.
CIM Mode: The stored weight is retained inside the SRAM cell, and the input activation is applied to the multiplier path to generate the bitwise multiplication output.
Read and write simulations verify the storage functionality of the proposed cell, while CIM simulations verify the bitwise multiplication path. Since the computation is performed digitally, the proposed methodology avoids the precision loss and non-linearity issues commonly associated with analog CIM approaches. The use of a compact 4T storage cell with a lightweight 2T multiplier provides a hardware-efficient approach for low-power SRAM-based CIM design in 45 nm CMOS technology.
4. Results
4.1. Functional Validation of the 4T+2T Bitcell
The proposed 4T+2T SRAM-based compute-in-memory (CIM) bitcell was validated using transient simulations in 45 nm CMOS technology. The simulations were performed to verify both the conventional memory operation and the CIM bitwise multiplication functionality of the proposed structure. We analyzed the bitcell under read, write, and CIM operating conditions to confirm that the reduced-transistor storage structure and the integrated 2T multiplier path function correctly.
The overall performance was evaluated in terms of delay, power, and energy for read, write, and CIM operations. The conventional 6T SRAM cell served as the baseline memory structure, while the 6T+2T design served as the reference CIM bitcell. The proposed 4T+2T design was compared against both to analyze the trade-off between memory functionality and compute-in-memory capability.
During memory operation, the internal storage nodes Q and QB were monitored to verify stable data storage. The write operation confirms that the proposed bitcell can switch the internal nodes according to the applied input condition, while the read operation verifies that the stored data can be accessed without a destructive change in the stored state. Since reduced-transistor SRAM cells are more sensitive to node disturbances, the stability of Q and QB was carefully monitored during the transient analysis.
For CIM operation, the stored data acts as the weight, while the input activation is applied to the multiplier path. The multiplier output was observed to follow the expected bitwise multiplication behavior. When the stored weight is logic high and the input activation switches high, the multiplier output also switches high, confirming the 1 × 1 = 1 case. For all other input-weight combinations, the output remains at logic low. This verifies that the proposed 4T+2T bitcell supports in-memory bitwise multiplication.
4.2. Read Operation Analysis
The read performance of both the 6T+2T and proposed 4T+2T bitcells was evaluated using transient simulations. The read delay was measured from the applied word-line signal to the response at the output node.
The measured read delay was 2023 ps for the 6T+2T design and 26.91 ps for the proposed 4T+2T design. The reduced delay in the 4T+2T bitcell indicates faster read response due to lower parasitic capacitance and simplified transistor structure.
In addition, the proposed design demonstrates significantly lower power consumption during read operations. The average read power is reduced from 10.03 nW in the 6T+2T design to 1.35 nW in the 4T+2T design.
Similarly, the read energy is reduced from 0.02006 fJ to 0.005403 fJ, highlighting the energy-efficient nature of the proposed architecture.
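The relative improvements implied by these read measurements can be computed directly from the reported values. The Python sketch below only performs that arithmetic; the metric names are labels chosen here, and no values beyond those stated in this section are used.

```python
def percent_change(reference: float, proposed: float) -> float:
    """Relative change of the proposed design versus the reference, in %."""
    return 100.0 * (proposed - reference) / reference


# Read metrics reported in this section: (6T+2T reference, 4T+2T proposed).
READ_METRICS = {
    "power_nW": (10.03, 1.35),
    "energy_fJ": (0.02006, 0.005403),
}

for name, (ref, prop) in READ_METRICS.items():
    print(f"{name}: {percent_change(ref, prop):+.1f}%")
# prints: power_nW: -86.5%  then  energy_fJ: -73.1%
```

Expressing both metrics as relative changes makes it easier to compare them with the CIM results, which are reported in different units.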
4.3. Write Operation Performance
The write operation of the proposed 4T+2T bitcell was evaluated by measuring the rise and fall transition delays of the storage nodes Q and QB, along with average write power and write energy.
The proposed 4T+2T bitcell achieved a write delay of , calculated as the average of the rise and fall transitions, with a write power of and write energy of . Compared with the 6T+2T reference, the proposed design shows a higher write delay of compared with , but substantially lower write power of compared with , and lower write energy of compared with . The faster write speed of the 6T+2T design is attributed to its stronger transistor-assisted write path. However, the proposed 4T+2T design provides a more energy-efficient write operation, and its write-delay penalty matters least for workloads with infrequent weight updates.
4.4. CIM Operation and Bitwise Multiplication Performance
The CIM functionality of the proposed 4T+2T bitcell was verified by measuring the propagation delay from the input activation node to the multiplier output node. This measurement directly represents the bitwise multiplication path of the CIM cell. In this mode, the stored SRAM data remains at the internal nodes Q and QB, while the input activation is applied to the 2T multiplier path.
The waveform results confirm that the multiplier output switches according to the expected bitwise multiplication operation: for the selected case in which the stored weight is logic high, the output follows the applied input activation. The storage node Q remains stable during this operation, indicating that the CIM evaluation does not disturb the stored weight. A small variation was observed at the complementary node QB; however, it remained below the logic threshold and did not affect computational correctness.
The CIM performance of the proposed 4T+2T design was compared with the reference 6T+2T design. It was observed that the proposed design improves CIM energy efficiency while incurring only a minor delay penalty.
The compute-in-memory (CIM) capability of the proposed 4T+2T architecture was validated through transient simulation by evaluating the bitwise multiplication path. In CIM mode, the stored SRAM data acts as the weight, while the input activation is applied to the 2T multiplier path. The multiplier output represents the bitwise product between the input activation and the stored weight.
Unlike conventional SRAM read operations, the CIM delay was not measured along the word-line-to-bit-line path. Instead, the delay was measured from the input activation node to the multiplier output node, since this path directly represents the computation performed inside the bitcell. For the evaluated case, the stored weight was initialized as logic high, and the input activation was switched from logic low to logic high. The resulting low-to-high transition at the multiplier output confirms the bitwise multiplication condition 1 × 1 = 1.
The CIM performance of the proposed 4T+2T design was compared with the reference 6T+2T CIM bitcell using the same input transition, supply voltage, and measurement window. The results show that the proposed 4T+2T design has a slightly higher CIM delay of 50.89 ps compared to 47.76 ps for the 6T+2T design. However, the proposed design significantly reduces CIM power and energy. The CIM power decreases from 1.772 µW to 0.8014 µW, while the CIM energy decreases from 10.63 fJ to 4.808 fJ.
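The reported CIM numbers can be cross-checked with simple arithmetic. The Python sketch below computes the relative changes and, under the assumption that each energy value was obtained as average power multiplied by the measurement window, the window implied by energy divided by power. With the values above, both designs imply a window of about 6 ns, which is consistent with the statement that the same measurement window was used. The dictionary keys are labels chosen here.

```python
# CIM metrics reported in this section: (6T+2T reference, 4T+2T proposed).
CIM_METRICS = {
    "delay_ps": (47.76, 50.89),
    "power_uW": (1.772, 0.8014),
    "energy_fJ": (10.63, 4.808),
}


def percent_change(reference: float, proposed: float) -> float:
    """Relative change of the proposed design versus the reference, in %."""
    return 100.0 * (proposed - reference) / reference


for name, (ref, prop) in CIM_METRICS.items():
    print(f"{name}: {percent_change(ref, prop):+.1f}%")

# Assumption: energy = average power x measurement window.
# fJ / uW = 1e-15 J / 1e-6 W = 1e-9 s, so the ratio is directly in ns.
for label, idx in (("6T+2T", 0), ("4T+2T", 1)):
    window_ns = CIM_METRICS["energy_fJ"][idx] / CIM_METRICS["power_uW"][idx]
    print(f"{label}: implied window of about {window_ns:.2f} ns")
```

Run as written, this prints a delay change of +6.6%, power and energy changes of -54.8%, and an implied window of about 6.00 ns for both designs.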
4.5. Monte Carlo Analysis
We performed Monte Carlo analysis on the proposed 4T+2T SRAM bitcell to evaluate its robustness under process variations. A total of 100 simulation samples were considered, with Gaussian variations applied to the transistor dimensions while maintaining a load capacitance of 5 fF. The analysis was carried out under the write-‘0’ condition to observe the stability of the internal storage nodes, propagation delay, power, and energy across different process samples.
The simulation results show that the write delay remains stable across all Monte Carlo samples. The rise delay varies approximately between 56 ps and 68 ps, while the fall delay varies between 58 ps and 64 ps. This limited spread indicates that the write delay of the proposed 4T+2T bitcell has low sensitivity to process-induced changes in transistor dimensions.
The average power consumption also shows only a small variation, remaining approximately within the range of 29 nW to 32 nW. Similarly, the write energy remains nearly constant across the samples, varying approximately from 0.059 fJ to 0.064 fJ. This confirms that the proposed design maintains consistent energy behavior under process variations.
The internal storage nodes Q and QB remain stable for all Monte Carlo samples. During the write-‘0’ condition, node Q settles close to ground, while node QB remains near the supply voltage. This confirms that the proposed bitcell stores the intended logic state without bit-flip failures across the simulated variation range.
The multiplier output node also maintains valid logic behavior during the simulations. The output reaches a valid logic-high level when expected, while a small transient undershoot may appear due to capacitive coupling and switching effects. Since this undershoot remains small and does not cross the logic threshold, it does not affect the circuit’s functional correctness.
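The pass criteria described above (Q near ground and QB near the supply after a write '0', and an output undershoot that never crosses the logic threshold) can be applied to each Monte Carlo sample in post-processing. The Python sketch below is a hypothetical helper: the `McSample` fields and the single threshold at a fraction of VDD (0.5 by default) are assumptions made for illustration, and the per-sample voltages would come from the simulator's measurement output for each of the 100 runs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class McSample:
    """Node voltages extracted from one Monte Carlo run, in volts (hypothetical)."""
    q_final: float        # storage node Q after the write-'0' operation
    qb_final: float       # complementary node QB after the write-'0' operation
    out_high_min: float   # lowest output level while a logic high is expected


def count_failures(samples: Iterable[McSample], vdd: float,
                   threshold_frac: float = 0.5) -> int:
    """Count samples violating the storage or output-level checks.

    Assumption: one logic threshold at threshold_frac * vdd; the text
    does not state the exact threshold used.
    """
    vth = threshold_frac * vdd
    failures = 0
    for s in samples:
        stored_ok = s.q_final < vth < s.qb_final   # Q low, QB high: no bit flip
        output_ok = s.out_high_min > vth           # undershoot stays above threshold
        if not (stored_ok and output_ok):
            failures += 1
    return failures
```

A failure count of zero over all samples corresponds to the "no bit-flip failures" and "valid logic-high level" observations reported above.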
The Monte Carlo waveforms in Figure 4 show the process-variation behavior of the reference 6T+2T and proposed 4T+2T bitcells across 100 samples. The maximum multiplier output voltage remains close to the supply voltage for both designs, confirming that the multiplier output maintains a valid logic-high level under transistor-dimension variations. The proposed 4T+2T bitcell also shows lower average power under variation, remaining within approximately 29–32 nW, whereas the 6T+2T reference spans a higher range. Similarly, the write energy of the proposed design remains lower and tightly distributed, varying from approximately 0.059 fJ to 0.064 fJ, while the 6T+2T reference spans a higher range. These results indicate that the proposed 4T+2T bitcell maintains stable output behavior while reducing power and energy under process variations.
Table 2 summarizes the nominal CIM performance and Monte Carlo write variation results for the 6T+2T and proposed 4T+2T bitcells. The proposed 4T+2T design achieves an approximately 55% reduction in both CIM power (1.772 µW to 0.8014 µW) and CIM energy (10.63 fJ to 4.808 fJ), with only an approximately 6.6% increase in CIM delay (47.76 ps to 50.89 ps). Under process variation, the 4T+2T bitcell consistently shows lower write delay, average power, and write energy across all 100 Monte Carlo samples, while the multiplier output remains at a valid logic-high level for both designs.
Overall, the Monte Carlo analysis confirms that the proposed 4T+2T bitcell maintains stable write operation under process variations. The limited variation in delay, power, energy, and storage-node voltages demonstrates the robustness of the proposed design for low-power SRAM-based compute-in-memory applications.
5. Discussion
The proposed 4T+2T SRAM-based CIM bitcell has been validated at the circuit level for read, write, and bitwise CIM multiplication operations. The current work demonstrates that the proposed design can reduce read power, read energy, and CIM energy while maintaining correct storage and computation behavior. The reduction in transistor count from eight to six lowers device-level switching capacitance, which is consistent with the measured improvements in power and energy consumption reported in Section 4.
The transient write waveform shown in Figure 5 confirms the correct switching behavior of the internal storage nodes Q and QB. Across multiple write cycles over the simulation window, Q and QB switch between fully complementary logic states, with Q transitioning from logic high to logic low while QB transitions from logic low to logic high, and vice versa. The clean rail-to-rail transitions observed at both storage nodes, without intermediate metastable levels, indicate that the proposed 4T structure can reliably accept new data during word-line activation. The concurrently applied input and W signals drive the write and CIM operations as expected. The multiplier output rises to logic high only when the input and stored weight satisfy the multiplication condition 1 × 1 = 1, and returns to logic low otherwise. The exponential decay observed at the multiplier output during falling transitions is consistent with the RC discharge behavior of the 2T multiplier path and does not indicate a functional error.
Table 3 presents a consolidated comparison across bitcell designs. The proposed 4T+2T bitcell achieves the lowest read delay of 26.91 ps, read power of 1.35 nW, and read energy of 0.005403 fJ among all compared designs. This read performance advantage is attributed primarily to the reduced transistor count. By eliminating two transistors from the storage cell, the proposed design lowers the aggregate bitline capacitance seen during read access, which shortens the discharge time and reduces dynamic switching energy. This behavior is consistent with the well-established relationship between bitline parasitic capacitance and read performance in SRAM design. In the broader CIM design space, Tyagi and Mittal’s 6T+2T design [9] demonstrates that a compact 2T multiplier improves over NOR-gate-based multiplier approaches. The present work extends this direction by also compressing the storage cell, trading a modest increase in write delay for substantially lower write power and write energy. The 6T+4T and 8T designs in Table 3 show markedly higher read and write energy values, reflecting both process-node differences and the overhead of more complex multiplier structures. Unlike the conventional 6T SRAM cell, both the 6T+2T and proposed 4T+2T designs support bitwise multiplication, confirming that the transistor reduction in the proposed design does not sacrifice CIM capability.
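The capacitance argument can be made explicit with a common first-order SRAM read model. This is a textbook approximation, not a model fitted to the simulated data, and it assumes that the cell read current is comparable between the two designs:

```latex
t_{\mathrm{read}} \approx \frac{C_{BL}\,\Delta V_{BL}}{I_{\mathrm{read}}},
\qquad
E_{\mathrm{read}} \approx C_{BL}\,V_{DD}\,\Delta V_{BL}
```

Here $C_{BL}$ is the total bitline capacitance, $\Delta V_{BL}$ is the bitline swing required for a correct read, $I_{\mathrm{read}}$ is the cell read current, and $V_{DD}$ is the supply voltage. For a fixed swing and read current, both the read delay and the energy needed to restore the bitline scale linearly with $C_{BL}$, which is the mechanism invoked above; if the 4T cell's read current differed substantially from the 6T cell's, the delay term would change accordingly.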
Figure 6 illustrates the CIM waveform of the proposed 4T+2T bitcell under sustained switching operation. The input signal is applied as a periodic pulse train, and the multiplier output correctly follows the input when the stored weight supports a logic-high product. This confirms the intended bitwise multiplication behavior of the proposed CIM path. However, two important waveform characteristics are observed. First, the storage node Q, plotted with an expanded scale and a voltage offset, shows small positive and negative transient spikes synchronized with the switching events of the input. These spikes remain small and decay rapidly, indicating capacitive coupling from the multiplier path to the storage node through the shared transistor. Although these transients do not disturb the stored logic state, they indicate a coupling mechanism that should be further characterized in a full array environment, where bitline and wordline parasitics are present.
Second, the complementary node QB shows a slow staircase-like drift that accumulates across switching cycles and reaches its largest value by the end of the simulation window. Based on the period of the pulse train visible in Figure 6, this accumulation occurs over roughly 10 switching cycles, so the average drift per cycle at the QB node is approximately one tenth of the total accumulated drift. This drift is attributed to charge accumulation at the floating QB node in the asymmetric 4T topology, which lacks the strong restoring path present in a conventional 6T SRAM cell.
Since the observed drift remains below the logic threshold, no bit flip occurs during the simulated interval. However, this effect may become more significant over longer operating periods or at elevated temperatures. Two practical mitigation techniques should be investigated in future work. First, a periodic write-back scheme can be used to refresh the stored weight at fixed intervals, analogous to DRAM refresh, thereby resetting the drift without requiring structural changes to the bitcell. Second, a weak keeper transistor can be inserted at the QB node to provide a continuous restoring current, at the cost of a small increase in static power and transistor count.
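The per-cycle drift estimate and the periodic write-back idea can be combined into a rough refresh budget. The Python sketch below assumes that the QB drift accumulates linearly and uniformly per switching cycle, which the short simulated window supports only approximately, and it ignores the temperature dependence noted above. The `guard` factor is a hypothetical safety margin, not a parameter from this work.

```python
import math


def per_cycle_drift(total_drift_v: float, window: float, period: float) -> float:
    """Average QB drift per switching cycle, assuming uniform accumulation.

    window and period only need to share the same time unit.
    """
    return total_drift_v / (window / period)


def cycles_before_refresh(margin_v: float, drift_per_cycle_v: float,
                          guard: float = 0.5) -> int:
    """Switching cycles allowed before a write-back refresh (linear model).

    margin_v: headroom between QB's settled level and the logic threshold.
    guard: hypothetical safety factor, the fraction of that headroom the
           drift may consume before the stored weight is rewritten.
    """
    if drift_per_cycle_v <= 0.0:
        raise ValueError("drift per cycle must be positive")
    return math.floor(guard * margin_v / drift_per_cycle_v)
```

Multiplying the returned cycle count by the switching period gives a candidate refresh interval; in practice it would need to be re-derived from longer simulations and at the worst-case temperature.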
Figure 7 presents the static noise margin (SNM) analysis of the proposed 4T SRAM storage cell using the voltage transfer characteristic (VTC) and butterfly curve methods under FreePDK45 technology at the nominal supply voltage and temperature. The VTC shows a clear inverting transition around the switching point. The unity-gain points, where the VTC slope equals −1, are extracted from the VTC and used to compute the noise margin of the cell. Unlike a conventional 6T SRAM cell, the 4T topology does not produce two symmetric butterfly lobes; instead, the butterfly curve shows a single crossing point, indicating an asymmetric hold condition. Therefore, the butterfly-based SNM is approximately zero, and the stability of the proposed 4T cell is instead evaluated using the unity-gain noise-margin metric. The obtained margin indicates that the proposed cell can maintain its stored state under nominal simulation conditions. However, since this margin is lower than that of a conventional 6T SRAM cell, process variation analysis is required to further evaluate the robustness of the proposed design.
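The unity-gain-point extraction can be reproduced from an exported DC-sweep VTC. The Python sketch below assumes a strictly increasing input sweep, estimates the slope by finite differences at segment midpoints, and locates where the slope crosses −1 by linear interpolation, returning (V_IL, V_IH). These points feed the standard noise-margin definitions NM_L = V_IL − V_OL and NM_H = V_OH − V_IH. The tanh-shaped VTC in the example is a hypothetical stand-in, not the simulated 4T cell.

```python
import math
from typing import List, Sequence, Tuple


def unity_gain_points(vin: Sequence[float],
                      vout: Sequence[float]) -> Tuple[float, float]:
    """Return (V_IL, V_IH): input voltages where dVout/dVin crosses -1."""
    mids: List[float] = []
    slopes: List[float] = []
    for i in range(len(vin) - 1):
        dv = vin[i + 1] - vin[i]
        if dv <= 0.0:
            raise ValueError("vin must be strictly increasing")
        mids.append(0.5 * (vin[i] + vin[i + 1]))
        slopes.append((vout[i + 1] - vout[i]) / dv)

    crossings: List[float] = []
    for i in range(len(slopes) - 1):
        a, b = slopes[i] + 1.0, slopes[i + 1] + 1.0   # zero where slope == -1
        if a == 0.0:
            crossings.append(mids[i])
        elif a * b < 0.0:                             # sign change: interpolate
            crossings.append(mids[i] + a * (mids[i + 1] - mids[i]) / (a - b))
    if len(crossings) < 2:
        raise ValueError("VTC never reaches a gain magnitude of 1")
    return crossings[0], crossings[-1]


# Example on a hypothetical smooth inverter-like VTC (VDD = 1 V).
vin_demo = [i / 2000 for i in range(2001)]
vout_demo = [0.5 * (1 - math.tanh(10 * (v - 0.5))) for v in vin_demo]
print(unity_gain_points(vin_demo, vout_demo))
```

A dense, uniform sweep step keeps the finite-difference slope error small. For measured or noisy data, the VTC would typically be smoothed before differentiation.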
Given the static noise margin and the drift behavior characterized above, the proposed 4T+2T bitcell is best suited for moderate-scale CIM arrays, on the order of tens to low hundreds of rows, where weight refresh intervals can be bounded and the supply voltage remains near its nominal value. Workloads that involve sparse or infrequent weight updates, such as inference-only edge-AI acceleration for keyword spotting or low-resolution image classification, are particularly well matched to the proposed design. These workloads minimize write-cycle stress on the asymmetric storage node while benefiting from the read and CIM energy advantages of the 4T+2T bitcell.