Article

Low-Rank Compensation in Hybrid 3D-RRAM/SRAM Computing-in-Memory System for Edge Computing

by Weiye Tang 1,2, Long Nie 1,2, Cailian Ma 1,2, Hao Wu 1,2, Yiyang Yuan 1,2, Shuaidi Zhang 1,2, Qihao Liu 3 and Feng Zhang 1,2,*
1 State Key Laboratory of Fabrication Technologies for Integrated Circuits, Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China
2 School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China
3 Shandong SinoChip Semiconductors Co., Ltd., Jinan 250100, China
* Author to whom correspondence should be addressed.
Eng 2025, 6(12), 332; https://doi.org/10.3390/eng6120332
Submission received: 27 September 2025 / Revised: 7 November 2025 / Accepted: 19 November 2025 / Published: 21 November 2025

Abstract

Artificial intelligence (AI) has made significant strides, with computing-in-memory (CIM) emerging as a key enabler for energy-efficient AI acceleration. Resistive random-access memory (RRAM)-based analog CIM offers better energy efficiency and storage density compared to static random-access memory (SRAM)-based digital CIM. Building on this, three-dimensional (3D) RRAM further improves storage density through vertical stacking. However, 3D-RRAM-CIM is susceptible to variation, which degrades accuracy and poses a significant challenge for system-level deployment in edge computing. Furthermore, the constrained capacity of CIM limits the multitasking performance. In this work, low-rank adaptation is applied to the Hybrid CIM system (Hybrid-CIM) for the first time, which leverages high-density 3D RRAM and high-precision SRAM, to address these challenges. Simulation results illustrate the feasibility of our approach, reducing accuracy degradation by 86% and achieving an 8.5× reduction in area with less than 2% weight overhead. In ResNet-18, with the backbone stored in 3D-RRAM kept fixed, the proposed low-rank adaptation branch (LoBranch) approach achieves an accuracy of 94.0% on CIFAR-10, which is only 0.4% lower than the noise-free digital baseline. This work strikes a favorable balance between accuracy and area, thereby facilitating reliable and efficient 3D-RRAM-based edge computing.

1. Introduction

In recent years, artificial intelligence (AI) has witnessed remarkable progress, particularly in the field of deep neural network (DNN) algorithms, which have been extensively applied across various domains, especially in edge computing. Training allows the general model to fit specific tasks encountered by edge devices. Fine-tuning algorithms, as opposed to traditional training methods, significantly reduce training overhead and enable faster adaptation to specific computational demands at the edge, thereby expanding the potential for AI applications at the edge [1,2]. However, the conventional von Neumann architecture separates memory from computation. This separation incurs significant time and energy costs for memory access, which hinders the deployment of AI algorithms at the edge [3].
Computing-in-memory (CIM) can address the bottleneck by integrating storage and computation, thereby reducing data movement overhead and improving computational efficiency. Currently, most CIM-based works use either emerging non-volatile memories, such as RRAM and MRAM, or volatile SRAM for neural network acceleration [4,5,6]. CIM-based architecture can efficiently support both the inference and training of neural networks, particularly the matrix-vector multiplication (MVM) involved in forward inference and backward error propagation [7,8,9]. RRAM-based analog CIM offers computing efficiency and memory capacity advantages, especially 3D-RRAM-based CIM. In contrast, SRAM-based CIM offers higher computing accuracy, lower write overhead, and better endurance, but limited density [10,11].
Furthermore, the exploration of novel materials such as two-dimensional (2D) materials for constructing memristive devices highlights the ongoing pursuit of higher density and performance. However, comprehensive reviews indicate that fundamental challenges like device-to-device variation, limited yield, and reliability concerns persist from the device level upwards, which are critical hurdles for the commercialization and widespread adoption of RRAM technology [12]. These inherent device-level non-idealities directly manifest as computational errors at the system level, imposing stringent requirements on cross-layer algorithm–hardware co-design.
The non-volatile RRAM eliminates the need for weight data transfer during each wake-up and inference, making it particularly suitable for edge computing. However, when neural networks are deployed on RRAM-CIM, they are susceptible to non-ideal distortions arising from circuit and device imperfections. Specifically, factors such as resistance variation, limited on/off ratio, device-to-device and cycle-to-cycle deviation, and circuit non-ideal effects can distort the summation currents in analog computing [13,14,15], as illustrated in Figure 1a. Although 3D RRAM can achieve higher storage density, it is more prone to variation compared to 2D RRAM owing to 2D peripheral circuits. Once deployed, these distortions compromise the computational results, making it challenging for RRAM-based CIM to maintain computing accuracy in edge computing [16,17,18,19], as shown in Figure 1b.
To mitigate this issue, different solutions have been investigated. Recent works, for instance, have explored reconfigurable digital RRAM-CIM designs that achieve high precision and enable in situ learning by leveraging RRAM as a digital switch [20]. While such approaches effectively address computational errors and offer flexibility, they potentially underutilize the native analog storage capability of RRAM for high-density, parallel MVM operations. In contrast, analog 3D-RRAM-CIM promises greater density and energy efficiency for these core operations, yet its susceptibility to device-level variation remains a fundamental bottleneck. Other solutions range from on-chip variation-aware training to post-mapping error compensation and special computing approaches [7,10,15,21,22,23,24]. However, these solutions often suffer from limited applicability [7,23], high write overhead for RRAM [8,10,17], restricted compensation range [24], and reduced storage density [15,21]. There remains a lack of an efficient system-level compensation strategy that leverages both the algorithm and the features of different CIM architectures to mitigate errors in 3D-RRAM-CIM and expand its applicability in edge computing. To address this gap, this work concentrates on the variation-mitigation aspect of Hybrid-CIM, where accuracy and hardware overhead are jointly assessed at the system level through simulation.
Our prior work [11,13] reported the 3D-RRAM array and the corresponding CIM macro, demonstrating the feasibility of 3D-RRAM-CIM. Building on this foundation, an algorithm–hardware co-design approach is further developed to explore the optimization space of mapping strategies and hybrid computing systems with respect to recognition accuracy. To better accommodate hybrid mapping, this paper adopts a system-level Hybrid CIM architecture (Hybrid-CIM) that integrates accurate SRAM-CIM with high-density 3D-RRAM-CIM, leveraging digital computing to compensate for the non-ideality of analog computing while balancing accuracy and storage density. Figure 1c illustrates the overall workflow. In the visualization of multi-task recognition results, blue denotes correct outputs and red denotes errors. Based on the Hybrid-CIM, a low-rank adaptation branch (LoBranch) is presented, which combines low-rank adaptation with the characteristics of different CIMs. Simulation results demonstrate the promise of our approach in enabling reliable edge computing with low overhead. Compared with existing approaches relying on either intensive RRAM rewriting, large SRAM provisioning, or purely algorithm-level training compensation, our method jointly leverages 3D-RRAM density and SRAM precision, offering both variation robustness and multi-task adaptability with minimal overhead.
In summary, the key novelties of this work are threefold: (1) To the best of our knowledge, this work presents the first integration of the Low-Rank Adaptation (LoRA) algorithm with a hybrid CIM architecture, creating a dedicated Low-Rank Adaptation Branch (LoBranch) for hardware error compensation. (2) A system-level solution is introduced, which synergistically combines the high density of non-volatile 3D-RRAM with the high precision of volatile SRAM, achieving robust computing without costly RRAM write operations. (3) This co-design approach simultaneously addresses two major challenges for edge AI: variation-induced accuracy loss in analog CIM and the limited capacity for multi-task computing, as validated through extensive simulations.

2. Background and Motivation

2.1. Three-Dimensional RRAM and SRAM

Much like the human mind perceives and interprets the world through a delicate interplay between long-term and short-term memory, efficient information processing systems must also harmonize different memory modalities. This philosophical parallel underscores a fundamental design principle in modern computing systems: no single type of memory is universally optimal; rather, it is the combination of heterogeneous memory units that enables both capacity and flexibility. Guided by this insight, the architectural distinctions and functional roles of various memory devices are explored.
The cell structures of RRAM, 3D RRAM, and 6T SRAM are illustrated in Figure 2. Both RRAM and 3D RRAM utilize memristor-based mechanisms to store data in the form of resistance. 3D RRAM employs vertical stacking technology to achieve multilayer integration, significantly enhancing storage density compared to RRAM and SRAM. An 8-layer 3D-RRAM-CIM chip was reported in prior work [11], which achieves a practical and substantial increase in storage density through 3D stacking. Like 2D RRAM, 3D RRAM is compatible with standard CMOS logic processes and can be monolithically integrated with peripheral circuits through additional back-end-of-line (BEOL) processing steps. Additionally, the non-volatile nature of RRAM devices facilitates low standby power consumption, which is particularly advantageous for edge computing applications [25].
SRAM, in contrast, excels in write efficiency and operational robustness owing to its digital mechanism. This makes it well-suited for applications that require frequent data updates. However, SRAM’s volatile nature leads to higher standby power consumption and less robust data retention compared to RRAM. Despite 3D RRAM’s superior storage density and non-volatile characteristics, its higher write overhead limits its applicability in scenarios requiring rapid and frequent data modifications.
A major challenge in promoting the application of 3D-RRAM lies in its limited flexibility and reliability. If large-scale neural network models can be effectively deployed on 3D-RRAM, achieving minimal accuracy degradation and functional flexibility comparable to those of SRAM, the low-density issue associated with SRAM can be effectively addressed. Nevertheless, a mature hardware–software co-design methodology that efficiently integrates the two technologies to extend the applicability of 3D-RRAM is still lacking.

2.2. Computing-in-Memory

CIM, integrating computing within memory, effectively overcomes the memory access bottleneck by executing MVM operations directly in the memory array. In terms of implementation, CIM is mainly divided into analog and digital categories. As shown in Figure 3a, the basic principles of these two categories are illustrated. Analog CIM, based on RRAM, can store multi-bit data in an analog state and more efficiently implement basic MVM computations through the principle of current summation. During computation, ADCs and DACs are required to convert data between the analog and digital domains. Digital CIM, based on SRAM, maintains computing accuracy effectively through digital computation and offers greater flexibility in updating weights.
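As a rough illustration of the current-summation principle, the sketch below models an idealized crossbar: each cell contributes a current equal to its read voltage times its conductance (Ohm's law), and each column sums the currents of its cells (Kirchhoff's current law). The function names, the voltage and conductance values, and the uniform ADC model are illustrative assumptions and are not taken from the chip reported in [11].

```python
def crossbar_mvm(voltages, conductances):
    """Ideal analog MVM: column current I_j = sum_i V_i * G_ij."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


def adc_quantize(current, full_scale, bits):
    """Hypothetical uniform ADC: clip to [0, full_scale], map to 0..2**bits - 1."""
    levels = (1 << bits) - 1
    clipped = min(max(current, 0.0), full_scale)
    return round(clipped / full_scale * levels)


# Illustrative values only: 3 input rows x 2 output columns.
V = [0.2, 0.0, 0.1]            # read voltages supplied by the DACs (volts)
G = [[10e-6, 50e-6],           # cell conductances (siemens)
     [20e-6, 40e-6],
     [30e-6, 5e-6]]

currents = crossbar_mvm(V, G)  # one summed current per column (amperes)
codes = [adc_quantize(i, full_scale=20e-6, bits=4) for i in currents]
print(currents, codes)
```

Any deviation of a stored conductance from its target value shifts the summed column current directly, which is why device variation translates into MVM error in the analog domain.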
Moreover, CIM architectures require peripheral circuits to fully support NN acceleration based on MVM. These components, including global accumulators, buffers, and inter-macro buses, are systematically integrated at the architectural level. System-level architectural simulators such as NeuroSim [14] enable end-to-end NN evaluation of such systems across diverse memory technologies. By reconfiguring computation directions, the CIM architecture can support both neural network forward and backward propagation without altering weights, as shown in Figure 3b. This integration effectively addresses the computational resource constraints of edge devices.
However, RRAM-device-based CIM is facing a challenge in analog-domain computing. Non-ideal effects in analog storage and computation introduce variation and degrade the accuracy [15]. The accuracy decline limits the computational parallelism during CIM operations, thereby diminishing the high-density advantage of 3D RRAM. If 3D RRAM-CIM could achieve computing accuracy comparable to digital implementations, its applicability would be significantly enhanced in edge computing scenarios. These considerations impose more stringent requirements on cross-layer algorithm–hardware co-design.

2.3. Fine-Tuning and Multi-Task Computing

With the rapid advancement of neural network technologies, their functional capabilities have become increasingly sophisticated. When deploying pre-trained neural network models to edge devices, task-specific functional adaptation is often required to optimize performance in particular environments. Such adaptation typically necessitates additional training, which imposes substantial computational demands. To facilitate efficient adaptation of cloud-based pre-trained models for edge computing tasks with minimal computational overhead, fine-tuning techniques have emerged as a research focus [2].
Fine-tuning enables the functional adaptation of pre-trained models to enhance their task-specific performance. Current approaches primarily include Full Fine-tuning (FFT), Layer-wise Fine-tuning [26], and Adapter-based Fine-tuning [27]. While FFT achieves optimal performance by updating all model parameters, it incurs prohibitively high computational costs. Linear probing, a representative Layer-wise Fine-tuning method, reduces computational overhead by modifying only the final layer parameters, albeit at the cost of limited effectiveness. Adapter-based methods offer a balanced solution by integrating trainable adapter modules while keeping the original parameters frozen [2]. These methods achieve performance comparable to FFT while maintaining efficiency. Furthermore, adapter-based methods enable seamless task switching through adapter replacement, making them particularly suitable for multi-task computing at the edge.
In multi-task computing scenarios, existing CIM works often face capacity limitations [6]. While some studies have demonstrated multi-functional deployment capabilities, they require complete model replacement for task switching [7,8,10]. Although 3D RRAM technology offers potential capacity improvements through high-density integration, its practical implementation is hindered by three limitations: excessive write overhead, limited endurance, and accuracy decline [11]. These constraints severely restrict its viability for edge computing’s multitask requirements. Therefore, developing an efficient multi-task solution that maintains 3D RRAM’s density advantages while overcoming its write limitations becomes imperative. To address this challenge, a novel integration of fine-tuning with 3D RRAM is proposed.

2.4. Related Work

2.4.1. Low-Rank Adaptation

Foundational models are usually pre-trained on large-scale datasets, creating a highly parameterized and generalizable representation framework. These models require fine-tuning to adapt to specific downstream applications. Low-Rank Adaptation (LoRA) addresses the computational challenges of fine-tuning large-scale pre-trained language models through an innovative approach. First introduced by Hu et al. [2], LoRA leverages low-rank matrix approximation to model weight updates during fine-tuning, thereby significantly reducing the number of trainable parameters. This method lowers hardware requirements by minimizing gradient computations and optimizer states for most parameters. Furthermore, LoRA can be integrated with model weight quantization [28] to further reduce resource demands. Building on this concept, AdaLoRA [29] extends the LoRA framework by introducing a dynamic rank adjustment mechanism for low-rank matrices during fine-tuning. Additionally, other variants have been explored, focusing on parameter compression, rank adaptation, and optimization of training methods [27].
The effectiveness of LoRA is evidenced by its widespread adoption across domains—from LLaMA series models using QLoRA [28] for efficient quantization, to Vision Transformers (ViTs) for visual tracking and face forgery detection [30].
While LoRA-related work primarily targets the fine-tuning of large models, edge computing applications are often constrained by hardware resources. In such scenarios, a lightweight neural network, such as a convolutional neural network (CNN), is more suitable. Moreover, LoRA and its related work have primarily focused on improving computational performance under ideal conditions, without considering robustness in computing. In this work, the LoRA structure is combined with lightweight convolutional neural networks and its fine-tuning capabilities are applied to noise compensation, with a focus on hardware deployment of LoRA.

2.4.2. Previous CIM

To accelerate the MVM operations in neural networks, numerous CIM approaches have been proposed, primarily utilizing RRAM and SRAM. Representative works span from early SRAM-CIM macros [1,31] that established digital CIM foundations, to RRAM-CIM prototypes [7,8] demonstrating analog computing advantages, and more recently, 3D-RRAM architectures [11] pushing storage density boundaries. The progression towards hybrid CIM systems [10,15] further highlights the ongoing effort to balance computational accuracy with hardware efficiency.
In RRAM-CIM, the analog storage mechanism allows individual devices to store multiple states. However, due to the susceptibility of analog storage and computation to noise, RRAM-CIM operations typically incur accuracy loss [15]. Some studies have mitigated this issue by reducing computational parallelism, specifically by controlling the number of activated rows and stored states [10] in each computation. Furthermore, benefiting from the small cell size, RRAM has an advantage in storage density, providing higher on-chip storage capacity [7]. The implementation of 3D RRAM has increased storage and computational density significantly [11]. However, non-idealities limit the parallelism and performance of 3D RRAM CIM. Additionally, the high write power and time overhead, as well as the limited endurance of the RRAM device, restrict the application of RRAM-based CIM [19]. Compared with RRAM-CIM, SRAM-CIM can perform computations in digital form [31]. This not only eliminates computational errors but also ensures compatibility with advanced processes such as 5 nm [32], offering more possibilities for CIM design.
The analog storage of RRAM exhibits non-idealities, leading to accuracy decline and limiting computational effectiveness. Some studies have extracted these non-ideal deviations and incorporated them into algorithm training, using variation-aware training (VAT) to mitigate post-programming precision loss [8,19,20,21,22,23,24]. Other works have improved the storage principles of RRAM arrays by proposing encoding schemes and high-precision write strategies [33] to directly reduce non-ideal interference caused by write deviations, although at the cost of additional write overhead. Zhang et al. adopted an on-chip training approach [7], leveraging the algorithm training process to learn the non-ideal characteristics of RRAM arrays online and thereby improve computational precision. However, the frequent updates during on-chip training may affect the lifetime of RRAM devices. Wen et al. utilized the high accuracy and low write overhead of SRAM to compensate for the computational precision loss of RRAM [10]. However, due to the low density of SRAM, the proposed solution can only compensate for a part of the model and still requires write operations on RRAM. Seeking robust computational approaches for RRAM is a cross-layer effort that requires collaborative design from the device to the algorithm level [8].
Because CIM technology is still maturing, existing hardware prototypes are generally limited to single-macro demonstrations or small-scale neural-network validations, which remain insufficient for practical deployments. To bridge the gap between hardware and algorithmic maturity, system-level simulation frameworks, such as NeuroSim [14] and ISAAC [34], have been developed. After calibration, NeuroSim delivers chip-level error below 1% [35]. These frameworks provide comprehensive evaluations of CIM architectures and offer indispensable guidance for subsequent system-level implementations. Leveraging such frameworks, refs. [22,23] optimized RRAM-CIM computation from a hardware–software co-design perspective and experimentally confirmed the hardware feasibility of their proposals, thereby advancing cross-layer methodologies for expanding RRAM applicability.
In this work, to compensate for the variation in RRAM devices and enhance the application prospects of 3D RRAM, a hybrid computational approach (LoBranch) was designed and validated for its effectiveness in error compensation. Moreover, our proposed approach integrates 3D RRAM with multi-task computing, further leveraging the high-density characteristics of 3D RRAM and eliminating the write operations to RRAM devices. To assess the effectiveness of our hardware–software co-design and compare alternative architectures, algorithm- and system-level evaluations were conducted using a mature simulation framework.

3. Methodology

Variation in RRAM devices adds noise to the weights of the pre-trained model, leading to accuracy degradation. Since multiple local minima for a given dataset can achieve similar accuracy levels, adapting the model to an alternative local minimum can effectively mitigate this noise and ensure accuracy when mapped to 3D-RRAM. Hence, to increase the robustness of the model under device variation, a fine-tuning-based computing approach is proposed. In this approach, the 3D-RRAM stores the variation-aware pre-trained model, and its output is corrected by the SRAM, which acts as an adaptation branch. This section describes the proposed approach combining the LoBranch structure with the Hybrid-CIM architecture.

3.1. Low-Rank Adaptation Branch

Fine-tuning algorithms are particularly suited for models with massive parameters, such as DNNs and large language models. LoRA is one such fine-tuning method, which allows for targeted adjustments to the model’s functionality by altering only a small subset of parameters while keeping the pre-trained model’s parameters unchanged [2]. This makes it suitable for resource-constrained edge computing systems. LoRA introduces two additional low-rank matrices alongside the model weights to approximate the changes in the original weight matrix, as illustrated in Figure 4. When the pre-trained model is fine-tuned to enhance its performance, the modifications essentially involve changes to the weights. Thus, the function of the model can be viewed as the result $Y_{\text{updated}}$ obtained from the input $X$ and the weight $W_{\text{updated}}$ through the model $f$, which can be described as:
$$Y_{\text{updated}} = f(X; W_{\text{updated}}) = f(X; W + \Delta W).$$
Here, $\Delta W$ corresponds to updates in the original weight matrix $W$, enabling adjustments to the function of the model. The core idea of LoRA is to approximate the changes in the weight matrix by introducing two additional low-rank matrices. Assuming the original weight of the model is $W \in \mathbb{R}^{m \times n}$, LoRA expresses the weight update $\Delta W$ using two low-rank matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll m, n$. The weight update $\Delta W$ can be expressed as follows:
$$\Delta W = AB.$$
Thus, the updated weight matrix $W_{\text{updated}}$ can be expressed as:
$$W_{\text{updated}} = W + AB.$$
During inference, the same input X is simultaneously subjected to forward computation with both W and AB, but the primary storage and computation are dominated by W. For instance, in the ResNet-18 model utilized in our experiment, with r set to 2, the proportion of weights in the A and B matrices is less than 2%. This choice of a small, static rank (r = 2) is a key design decision. While adaptive rank selection methods (e.g., AdaLoRA [29]) can offer a more fine-grained trade-off for large models with massive parameter redundancy, they introduce additional computational overhead and complexity for determining rank importance dynamically, which is less suitable for resource-constrained edge devices. For our target scenario of lightweight CNNs, a small, fixed rank effectively captures the core directions necessary for error compensation while minimizing the number of additional parameters. The rank r directly controls the trade-off: a higher rank increases the number of trainable parameters in A and B, potentially offering better error compensation at the cost of increased SRAM area overhead. Our choice of r = 2 ensures a minimal area penalty while achieving a significant reduction in accuracy loss, as demonstrated in Section 4.1, which represents a favorable balance for edge computing.
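To make the rank versus overhead trade-off concrete, the following sketch counts the extra parameters that A (m × r) and B (r × n) add to a single m × n weight matrix, namely r(m + n) against mn. The layer shape used here is a hypothetical example; how convolution kernels are reshaped for LoBranch is not specified in this section, so the flattening is an assumption.

```python
def lora_overhead(m, n, r):
    """Return (extra_params, ratio) for adding A (m x r) and B (r x n) to W (m x n)."""
    extra = r * (m + n)
    return extra, extra / (m * n)


# Hypothetical layer: a 3x3 convolution with 512 input and 512 output channels,
# flattened into a 512 x (512 * 3 * 3) matrix (an assumed mapping).
extra, ratio = lora_overhead(m=512, n=512 * 3 * 3, r=2)
print(f"extra parameters: {extra}, overhead: {ratio:.2%}")

# The overhead grows linearly with the rank r.
for rank in (1, 2, 4, 8):
    print(rank, lora_overhead(512, 512 * 3 * 3, rank))
```

The ratio depends on the layer shape and is larger for small layers, so this single-layer figure is only indicative and does not reproduce the network-wide number quoted above.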
During training, gradients are backpropagated through the original weights W as well as the low-rank matrices A and B. However, only A and B are updated, while W remains unchanged.
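The update rule can be sketched on a single linear layer as follows. This is a minimal plain-Python illustration rather than the PyTorch implementation used in the evaluation: the frozen matrix is a noisy copy of hypothetical ideal weights, the targets come from the ideal weights, and gradient descent is applied to A and B only. The layer sizes, noise level, learning rate, mean-squared-error loss, and the row-vector convention Y = XW + (XA)B are all assumptions made for this example.

```python
import random


def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    return [list(col) for col in zip(*P)]


def forward(X, W, A, B):
    """Y = X W + (X A) B: frozen main path plus low-rank branch."""
    main = matmul(X, W)
    branch = matmul(matmul(X, A), B)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(main, branch)]


def mse(Y, T):
    """Sum of squared errors per sample, averaged over the batch."""
    return sum((y - t) ** 2 for ry, rt in zip(Y, T) for y, t in zip(ry, rt)) / len(Y)


def train_step(X, T, W, A, B, lr):
    """One gradient-descent step on A and B only; W is read but never modified."""
    Y = forward(X, W, A, B)
    scale = 2.0 / len(X)
    dY = [[scale * (y - t) for y, t in zip(ry, rt)] for ry, rt in zip(Y, T)]
    grad_B = matmul(transpose(matmul(X, A)), dY)              # shape r x n
    grad_A = matmul(transpose(X), matmul(dY, transpose(B)))   # shape m x r
    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B


rng = random.Random(0)
m, n, r, batch = 4, 3, 2, 16
W_ideal = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
W_noisy = [[w + rng.gauss(0, 0.2) for w in row] for row in W_ideal]  # frozen weights
X = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(batch)]
T = matmul(X, W_ideal)                        # targets from the noise-free weights
A = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(m)]
B = [[0.0] * n for _ in range(r)]             # zero init, so A B = 0 at the start

W_before = [row[:] for row in W_noisy]
loss_start = mse(forward(X, W_noisy, A, B), T)
for _ in range(300):
    A, B = train_step(X, T, W_noisy, A, B, lr=0.05)
loss_end = mse(forward(X, W_noisy, A, B), T)
print(loss_start, loss_end)
```

Because the branch has rank r, it can only absorb the part of the weight error that lies in an r-dimensional subspace, so the loss is reduced rather than driven to zero in general.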
Typically, LoRA is employed in the linear layers of large models, where its low-rank characteristics enable effective model fine-tuning. However, the effectiveness of LoRA in CNNs has not yet been fully explored. Given that CNNs also exhibit low-rank properties [36], the LoBranch structure is applied to CNNs in this work, as shown in Figure 5. Moreover, previous applications of low-rank adaptation in fine-tuning neural network models have demonstrated that the accuracy and function of neural networks can be adjusted by updating only a small fraction of the parameters. This update mechanism of LoRA shows strong compatibility with both high-density 3D-RRAM-CIM architectures and low-write-overhead SRAM-CIM designs, suggesting significant potential for efficient and robust hardware implementation.

3.2. Hybrid-CIM for LoBranch

To validate the effectiveness of combining the LoBranch structure with CIM, a system-level Hybrid-CIM architecture was proposed to deploy the CNN model enhanced with LoBranch. As illustrated in Figure 6, the Hybrid-CIM architecture comprises two primary components: 3D-RRAM-CIM and SRAM-CIM, each leveraging the advantages of its respective storage medium. The frozen pre-trained weights are stored in 3D-RRAM-CIM, while the low-rank parameters are kept in SRAM-CIM. The Hybrid-CIM architecture is designed to accelerate both forward inference and backward error propagation. The high-density 3D RRAM is employed to store the model’s original weights in the form of resistance without updating, while the flexible SRAM is utilized to store the low-rank weights A and B in digital form.
During computation, 3D-RRAM-CIM, which relies on analog computation and storage, is susceptible to variations. These variations may introduce discrepancies between the actual results obtained from the 3D-RRAM-CIM and the ideal computational outcomes, thereby contributing to errors. In a multi-layered network, these computational errors propagate across layers, ultimately degrading the accuracy. However, by leveraging the characteristic of low-rank neural networks, our approach can compensate for these errors through the digital SRAM-CIM. The SRAM-CIM enables precise storage and computation of low-rank weights, effectively mitigating the impact of analog non-idealities. This error correction mechanism operates at each layer of the network, calibrating the computational noise introduced by 3D-RRAM-CIM, thereby improving accuracy.
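The per-layer compensation can be pictured with a toy example in which the output of the analog path, computed with perturbed weights, is summed with the output of the digital low-rank path. In this sketch the weight error is deliberately constructed as a rank-1 matrix so that a rank-1 branch cancels it exactly. This is an assumption chosen for clarity: measured device variation is generally full rank, in which case a low-rank branch can only reduce the error. All names and values are hypothetical.

```python
def matvec(x, M):
    """Row vector times matrix: returns x M."""
    return [sum(xi * row[j] for xi, row in zip(x, M)) for j in range(len(M[0]))]


def hybrid_layer(x, W_rram, A, B):
    """Sum of the analog path (x W_rram) and the digital low-rank path ((x A) B)."""
    analog = matvec(x, W_rram)
    digital = matvec(matvec(x, A), B)
    return [a + d for a, d in zip(analog, digital)]


# Ideal weights and a rank-1 error E = u v^T added when they are stored in RRAM.
W_ideal = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]]
u = [0.1, -0.2, 0.05]
v = [1.0, 2.0]
W_rram = [[w + ui * vj for w, vj in zip(row, v)] for row, ui in zip(W_ideal, u)]

# Low-rank branch chosen so that A B = -E.
A = [[-ui] for ui in u]   # shape 3 x 1
B = [v]                   # shape 1 x 2

x = [1.0, 2.0, -1.0]
ideal = matvec(x, W_ideal)
uncompensated = matvec(x, W_rram)
compensated = hybrid_layer(x, W_rram, A, B)
print(ideal, uncompensated, compensated)
```

The RRAM weights are never rewritten in this scheme; the correction lives entirely in A and B.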
Compared to the CIM-based system, which relies solely on a single type of memory, Hybrid-CIM can combine—at the system level—the high-density of 3D-RRAM-CIM with the high flexibility of SRAM-CIM, enhancing analog computation accuracy, maintaining high storage efficiency, and reducing the need for RRAM write operations. With the digital LoBranch, Hybrid-CIM can compensate for errors arising from the variation of 3D-RRAM-CIM. Additionally, since the approach is based on a fine-tuning algorithm, it is more conducive to supporting multi-task edge computing.

3.3. Simulation on Hybrid-CIM with Variation

To investigate the computing accuracy of RRAM and advance the development of CIM technology, many studies have conducted simulation validations of RRAM-CIM computational performance [14,15,22]. Simulation-based validation not only significantly mitigates the expenses associated with chip development and augments design efficiency but also catalyzes advancements in automated design processes and architectural exploration. In this work, the performance of the proposed approach was evaluated based on the sampled variation from 3D RRAM and its hardware overhead was assessed using NeuroSim [11,14,22]. To obtain and compare the accuracy performance under different approaches with non-ideal interference, the evaluation was implemented in Python 3.12 with PyTorch 2.3.0 and executed on an RTX 3090 GPU. Furthermore, the proposed approach incorporates device variation extracted from our 8-layer 3D-RRAM-CIM chip [11] to align closely with hardware behavior.
In noise modeling, the analog computation process on 3D RRAM-CIM chips is highly susceptible to variation interference. To characterize this variation, Gaussian noise is employed in this work. The normal distribution can reflect the statistical characteristics of actual noise and describe its main features [37], which also facilitates experimental validation. It has been validated to fit the non-idealities of RRAM devices, thereby reducing post-deployment precision loss. Therefore, Gaussian noise has been widely used in RRAM-based CIM chip design and simulation works to evaluate and enhance performance [8,14,15,18].
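A minimal sketch of this noise model is shown below: independent zero-mean Gaussian noise with standard deviation σ is added to every weight. Whether σ is defined in absolute terms or relative to the weight or conductance range is not stated in this section, so an absolute σ is assumed here. The function name and the sample values are hypothetical.

```python
import random


def inject_gaussian_noise(weights, sigma, rng):
    """Return a copy of `weights` with i.i.d. N(0, sigma^2) noise added to each entry."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


rng = random.Random(42)            # seeded so that the perturbation is reproducible
W = [[0.5, -0.25, 0.0],
     [1.0, 0.75, -0.5]]

W_noisy = inject_gaussian_noise(W, sigma=0.05, rng=rng)
W_clean = inject_gaussian_noise(W, sigma=0.0, rng=rng)   # sigma = 0 leaves weights as-is
print(W_noisy)
```

Drawing a fresh noise sample at each training iteration corresponds to variation-aware training, whereas drawing one sample and keeping it fixed mimics a single programmed chip.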
Figure 7 illustrates the overall evaluation framework of the proposed algorithm–hardware co-design approach, which integrates algorithm-level validation with hardware-level simulation. The evaluation is conducted using NeuroSim, a CIM simulation platform calibrated with SPICE simulation [14,35]. This platform is widely adopted in variation-aware CIM studies [21,23], thereby ensuring credible system-level estimation. At the algorithm level, the workflow includes pre-mapping VAT and post-mapping adaptation [22], which together ensure model robustness before and after hardware deployment. At the hardware level, algorithm weights are transformed into matrix–vector multiplication (MVM) operations and mapped onto the system-level Hybrid-CIM architecture, where 3D RRAM and SRAM macros operate as distinct computational units. Rather than merging at the array level, their contributions are coordinated through the peripheral circuitry, which orchestrates dataflow and accumulates results across macros. The NeuroSim-based simulation incorporates peripheral components such as buffers, ADCs/DACs, accumulators, and write circuits, while also modeling activation circuits and bus-based interconnects at each level. This framework enables system-level evaluation of both robustness and method-induced hardware overhead, providing the basis for comparing different memory technologies and verifying the practicality of Hybrid-CIM for full-scale neural network deployment. The architectural simulation is used only for consistent system-level comparison, rather than to claim detailed hardware implementation. This use of NeuroSim follows the same simulation-only convention as in [23].
Following this evaluation framework, the proposed algorithm-hardware co-design is realized through a cohesive workflow that integrates algorithmic training with hardware deployment, consisting of four key stages:
  • Pre-mapping Variation-Aware Training: The model, incorporating the LoBranch, undergoes training where Gaussian noise is injected into the weights to simulate device-level variation. This stage enhances the model’s inherent robustness before hardware mapping.
  • Hybrid-CIM Hardware Mapping: The robust model is then deployed onto the system-level Hybrid-CIM architecture. The pre-trained main weights (W) are mapped to the variation-prone 3D-RRAM macro, while the low-rank LoBranch parameters (A, B) are mapped to the precise SRAM-CIM macro.
  • Post-mapping Adaptation: To achieve optimal performance on the specific hardware, the LoBranch weights (A, B) in SRAM are fine-tuned based on the output of the deployed system. This step compensates for write variation, while the main RRAM weights are kept frozen.
  • System-Level Performance Evaluation: The final performance of the adapted model is evaluated using a simulation framework calibrated for CIM architectures. This evaluation provides credible estimates for metrics such as accuracy, energy efficiency, latency, and area overhead, representing the optimized system’s capability.
This end-to-end workflow ensures that both algorithmic robustness and hardware efficiency are jointly optimized. The core computational procedures of the first and third stages are formalized in Algorithm 1 (Pre-mapping Variation-Aware Training) and Algorithm 2 (Post-mapping Adaptation), respectively.
Algorithm 1 Pre-mapping Variation-Aware Training
Input: Pre-trained weights W; Rank r; Noise level σ
Output: Robust weights W, A, B
1: Initialize low-rank matrices A, B.
2: while not converged do
3: //Inject non-idealities during training
4:  Wnoisy = W + N (0, σ²)
5:  //Hybrid forward pass with LoBranch
6:  Ypred = f (X; Wnoisy) + (A B) ⋅ X
7:  Compute loss L = Loss (Ypred, Yideal)
8:  Compute gradients g with respect to W, A, B.
9:  Update W, A, B
10: end while
11: return W, A, B
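Steps 4 to 6 of Algorithm 1 can be sketched for a single fully connected layer as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `inject_noise`, `hybrid_forward`), the pure-Python matrix routines, the zero initialization of A, and the choice to scale σ by the largest weight magnitude are all assumptions made here; the paper adopts its injection method from [38].

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def inject_noise(W, sigma, rng):
    """Return W + N(0, (sigma * max|W|)^2); this scaling rule is an assumption."""
    scale = sigma * max(abs(w) for row in W for w in row)
    return [[w + rng.gauss(0.0, scale) for w in row] for row in W]

def hybrid_forward(W_noisy, A, B, x):
    """y = W_noisy x + A (B x): noisy main path plus exact low-rank branch."""
    main = matvec(W_noisy, x)           # stands in for the analog 3D-RRAM path
    branch = matvec(A, matvec(B, x))    # stands in for the digital SRAM path
    return [m + b for m, b in zip(main, branch)]

rng = random.Random(0)
d_out, d_in, r = 4, 6, 2
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[0.0] * r for _ in range(d_out)]   # zero init: branch starts as a no-op (assumption)
B = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(r)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

y_ideal = matvec(W, x)
y_noisy = hybrid_forward(inject_noise(W, 0.10, rng), A, B, x)  # 10% noise
y_clean = hybrid_forward(inject_noise(W, 0.0, rng), A, B, x)   # no noise
```

With A initialized to zero, the branch contributes nothing at the start of training, so the noise-free hybrid output equals the ideal output and any deviation in `y_noisy` comes from the injected weight noise alone.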
Algorithm 2 Post-mapping Adaptation
Input: Frozen weights W; Low-rank matrices A, B
Output: Adapted matrices A, B
1: The model is deployed on Hybrid-CIM with variation.
2: while not converged do
3:  //Forward pass on the Hybrid-CIM system
4:  Yrram = 3DRRAM-CIM (X, W)
5:  Ysram = SRAM-CIM (X, A, B)
6:  //Layer-wise Hybrid summation
7:  Y’ = Yrram + Ysram
8:  Compute loss L = Loss (Y’, Yideal)
9:  Compute gradients g with respect to A, B.
10:  Update A, B stored in SRAM-CIM only.
11: end while
12: return A, B
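The loop in Algorithm 2 can be illustrated on one linear layer with plain gradient descent. The sketch below is a toy model under assumptions: the squared-error loss, the learning rate, the layer sizes, the absolute noise standard deviation of 0.2, and all function names are chosen here for illustration, and the device output is simulated by a fixed noisy copy of W rather than read from hardware.

```python
import random

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def loss_and_grads(W_dev, A, B, X, Y_ideal):
    """Mean squared error of the hybrid output and its gradients w.r.t. A and B."""
    d_out, r, d_in = len(A), len(B), len(B[0])
    gA = [[0.0] * r for _ in range(d_out)]
    gB = [[0.0] * d_in for _ in range(r)]
    loss = 0.0
    for x, y in zip(X, Y_ideal):
        h = matvec(B, x)             # low-rank projection B x
        y_rram = matvec(W_dev, x)    # frozen, noisy main path
        y_sram = matvec(A, h)        # exact LoBranch path
        e = [a + b - t for a, b, t in zip(y_rram, y_sram, y)]
        loss += 0.5 * sum(v * v for v in e) / len(X)
        At_e = [sum(A[i][k] * e[i] for i in range(d_out)) for k in range(r)]
        for i in range(d_out):
            for k in range(r):
                gA[i][k] += e[i] * h[k] / len(X)
        for k in range(r):
            for j in range(d_in):
                gB[k][j] += At_e[k] * x[j] / len(X)
    return loss, gA, gB

def adapt(W_dev, A, B, X, Y_ideal, lr=0.1, steps=300):
    """Update A and B in place; W_dev is only read, never written."""
    for _ in range(steps):
        _, gA, gB = loss_and_grads(W_dev, A, B, X, Y_ideal)
        for M, G in ((A, gA), (B, gB)):
            for row, grow in zip(M, G):
                for k in range(len(row)):
                    row[k] -= lr * grow[k]
    return loss_and_grads(W_dev, A, B, X, Y_ideal)[0]

rng = random.Random(0)
d_out, d_in, r, n = 4, 6, 2, 32
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
# Simulated write variation, fixed once the weights are mapped (assumed model).
W_dev = [[w + rng.gauss(0.0, 0.2) for w in row] for row in W]
W_snapshot = [row[:] for row in W_dev]
A = [[0.0] * r for _ in range(d_out)]
B = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(r)]
X = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(n)]
Y_ideal = [matvec(W, x) for x in X]

loss_before = loss_and_grads(W_dev, A, B, X, Y_ideal)[0]
loss_after = adapt(W_dev, A, B, X, Y_ideal)
```

Because the branch has rank 2 while the weight error is generally full rank, the toy example reduces the output error rather than removing it, which mirrors the partial compensation reported in the paper.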
In the experiments, the variation was simulated by adding Gaussian noise to the weights, following a practical deployment workflow [14]. The noise injection method is adopted from [38]. The 20% Gaussian noise level used here is calibrated against the measured 3D-RRAM-CIM distribution shown later in Figure 8 [11]. Describing the variation in RRAM devices with Gaussian noise supports the subsequent comparative experiments across the different schemes, models, and datasets presented in this work. The simulation accounts for the distinct characteristics of analog and digital computation. Specifically, as outlined in the workflow, the original weights are stored in high-density 3D RRAM with variation, whereas the low-rank weights are stored digitally in lower-density SRAM. The original weights are static and do not require updates, while the low-rank weights in SRAM are updated during post-mapping adaptation to improve model performance. During computation in the Hybrid-CIM, both sets of weights share the same input, and their outputs are combined to produce the result, as illustrated in Figure 6. Thus, the weights in 3D RRAM are subject to the noise caused by device variation, whereas the low-rank parameters in SRAM remain free from storage noise. This hybrid computational approach and the hardware-software co-design of the Hybrid-CIM aim to compensate for the precision loss of analog computation in 3D RRAM-CIM using limited digital computation, thereby facilitating the application of 3D RRAM.

3.4. Model and Dataset Selection

In the context of edge computing, where devices have limited hardware resources, a lightweight CNN is more suitable than a large model. LoRA has been widely applied in large models, but its potential in smaller CNNs for error compensation remains underexplored. Similarly, due to the limited development and small capacity of CIM, related works usually employ smaller CNNs to verify the computational effectiveness of their designs. In this work, the potential of the LoBranch for error compensation was evaluated on CNNs. The ResNet18 [39] was selected and integrated with the LoBranch structure. The original weight accounts for most of the model size (11.7 M), overshadowing the LoBranch weight (0.19 M). Additionally, to validate the generalizability of our approach, error compensation experiments were also conducted on VGG16 (138.4 M) [40], ResNet50 (25.6 M), and MobileNetV2 (3.4 M) [41]. The selection of these models is strategic, covering a diverse range of modern CNN architectural paradigms:
  • ResNet-18 and ResNet-50 represent architectures with residual connections, which are robust to gradient vanishing and are widely used as backbones for vision tasks.
  • VGG16 represents a plain, deeper network with a large number of parameters primarily in convolutional and fully connected layers, testing the method on an architecture without advanced skip connections.
  • MobileNetV2 represents modern lightweight, depthwise-separable architectures designed specifically for mobile and edge devices, with the fewest parameters but many layers.
Evaluation on this diverse network set demonstrates that the proposed approach is not specific to a single network topology but is broadly applicable. It is important to clarify the roles of the components in our co-design. In this work, ResNet, VGG16, and MobileNetV2 were employed as the host network architectures for method evaluation. The Low-Rank Adaptation (LoRA) technique is the inspiration for our proposed LoBranch structure, which is integrated into these CNNs to enable hardware error compensation. Therefore, the relationship is not one of comparison but of synergy: we adapt the LoRA methodology to function as a robust compensation branch within established CNN architectures for edge computing. Across all these models, the proportion of weights in the LoBranch was kept below 2% of the original model’s parameters, ensuring a minimal hardware overhead.
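The parameter budget of the LoBranch can be checked with simple arithmetic. The sketch below assumes a LoRA-style factorization in which a convolution weight of shape c_out × (c_in·k·k) is paired with factors of shape c_out × r and r × (c_in·k·k); this factorization, the rank of 4, the layer size, and the function names are illustrative assumptions, while the 0.19 M and 11.7 M figures are the ones quoted above for ResNet-18.

```python
def lobranch_params(c_out, c_in, k, r):
    """Parameter count of a rank-r branch for a c_out x (c_in*k*k) weight."""
    return r * (c_in * k * k + c_out)

def overhead(branch_params, backbone_params):
    """Branch size as a fraction of the backbone size."""
    return branch_params / backbone_params

# One 3x3 convolution with 512 input and 512 output channels, rank 4
# (illustrative values, not taken from the paper).
full = 512 * 512 * 3 * 3
branch = lobranch_params(512, 512, 3, 4)
layer_ratio = overhead(branch, full)

# Model-level figures quoted in the text for ResNet-18.
model_ratio = overhead(0.19e6, 11.7e6)

print(f"layer: {layer_ratio:.2%}, model: {model_ratio:.2%}")
# prints "layer: 0.87%, model: 1.62%"
```

The model-level ratio of about 1.6% is consistent with the stated bound of less than 2%.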
In terms of dataset selection, the widely recognized image classification dataset CIFAR-10 was selected for this study. During the error compensation experiment, the CNN models were first pre-trained on CIFAR-10 and then deployed onto the Hybrid-CIM with noise interference. The effectiveness of the proposed approach was evaluated by comparing the post-mapping on-chip accuracy. To further explore its potential for edge computing, additional experiments were conducted using the CIFAR-100, MNIST, FashionMNIST, and Flowers102 datasets for the multi-task and robust computing experiment. The use of multiple diverse datasets reflects the variety and complexity of tasks that edge computing might address in different scenarios, thereby enriching the multi-task computing experiment. In the multi-task computing experiment, models pre-trained on the ImageNet dataset were fine-tuned on these datasets to represent the performance of our approach when encountering different tasks from different scenarios in edge computing.

4. Results and Discussion

4.1. Implementation of Error Correction

In the error correction experiment, the objective is to compensate for the variation in 3D-RRAM-CIM and mitigate the accuracy loss when deploying neural networks on RRAM. Previous approaches have either incorporated hardware errors into the training process for variation-aware training to enhance model robustness through offline learning or directly employed on-chip online learning to reduce the nonlinearity of RRAM. Our proposed approach is an algorithm-hardware co-design, leveraging LoRA and Hybrid-CIM to compensate for the efficient but error-prone analog computation with accurate digital computation.
This co-design is realized through a deliberate two-stage optimization strategy. First, pre-mapping Variation-Aware Training (VAT) employs a 10% Gaussian noise level to instill a foundation of general robustness into the model, enhancing its stability without severely compromising ideal performance. Subsequently, after deployment, post-mapping adaptation fine-tunes the LoBranch parameters in SRAM to perform a precision calibration, specifically compensating for the target hardware’s actual non-idealities, which are characterized by 20% write variation. This two-stage approach ensures the model is both broadly robust and precisely calibrated to its operational environment.
Our approach improves performance at two key stages: pre-mapping VAT and post-mapping adaptation. During pre-mapping VAT, the LoBranch module provides an additional precise computational path. This enhances model stability when injecting noise into pre-trained weights. For post-mapping adaptation, the Hybrid-CIM architecture increases computational accuracy by updating the SRAM component, eliminating RRAM write operations. Consequently, this approach significantly boosts the robustness of 3D RRAM-based neural network computing with minimal hardware overhead. The experimental setup utilized a ResNet-18 model pre-trained on the CIFAR-10 dataset as the baseline.
To verify the feasibility of the proposed approach, the variation was first extracted from the 3D-RRAM-CIM [11], as it directly impacts accuracy. Figure 8 illustrates the resistance state distribution of 3D RRAM when storing multi-bit data. The data were measured from our 3D-RRAM-CIM chip and show an on/off ratio greater than 10× with a device variation of 20%. Therefore, to simulate such variation during computing, corresponding noise was injected into the weights [38].
In our approach, the original neural network weights are stored as resistance values in multi-level 3D-RRAM. During neural network computation, the MVM operations involving these weights are accelerated through the CIM architecture. Due to the distribution range of the resistance, variations accumulate during current summation in analog computation, leading to errors. In neural networks, especially deep neural networks with tens or even hundreds of layers, errors induced by non-idealities propagate through subsequent layers and are progressively amplified with increasing model depth, ultimately leading to accuracy decline.
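The accumulation of weight-induced error with depth can be illustrated with a toy experiment. The sketch below is a simplification under assumptions: it uses a stack of random linear layers without activations, a multiplicative Gaussian perturbation of every weight, and illustrative sizes (`dim=32`, `depth=16`, `sigma=0.2`); it is not a model of the measured chip or of the networks evaluated in this work.

```python
import math
import random

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def relative_error_by_depth(depth, dim, sigma, rng):
    """Relative output error after each layer of a noisy linear stack."""
    x_ideal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x_noisy = x_ideal[:]
    errors = []
    for _ in range(depth):
        # Scale by 1/sqrt(dim) so the signal norm stays roughly constant.
        W = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
             for _ in range(dim)]
        # Multiplicative Gaussian perturbation of every weight (assumed model).
        W_dev = [[w * (1.0 + rng.gauss(0.0, sigma)) for w in row] for row in W]
        x_ideal = matvec(W, x_ideal)
        x_noisy = matvec(W_dev, x_noisy)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_noisy, x_ideal)))
        errors.append(diff / math.sqrt(sum(a * a for a in x_ideal)))
    return errors

errors = relative_error_by_depth(depth=16, dim=32, sigma=0.2,
                                 rng=random.Random(0))
for layer_index in (0, 3, 7, 15):
    print(f"after layer {layer_index + 1}: relative error {errors[layer_index]:.2f}")
```

In this toy setting each layer adds a fresh perturbation on top of the error inherited from the previous layers, so the relative error after the last layer is larger than after the first.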
Accuracy degradation is a key challenge for the practical application of 3D RRAM. The degradation stems from multiple sources of interference during computation, including device yield, endurance, weight storage errors, and computational nonlinearity [13,19,22]. For weight storage errors, the limited on/off ratio, resistance drift, endurance, and device-to-device non-idealities prevent the written data from perfectly matching the ideal values. This results in a distribution range that introduces analog variations to the original weights, thereby affecting network accuracy. Pre-mapping VAT can enable models to better adapt to nonlinear interference, enhancing the model’s performance before deployment.
Within the pre-mapping VAT experiment, the impact of a mismatch between the training and testing variation on model performance was investigated by examining the accuracy under different noise conditions. Such mismatches can occur in practical deployment scenarios, for instance, due to device aging or temperature changes. As shown in Figure 9a, when a small amount of noise is injected during training, the model achieves accuracy comparable to the baseline under low variation. However, as the device variation increases, the accuracy degrades rapidly, indicating poor robustness. Conversely, injecting larger noise during training improves the model’s resilience to noisy environments, but at the cost of reduced accuracy under noise-free conditions.
Traditional VAT strategies often adopt stronger noise during training to account for potential worst-case scenarios, which inevitably compromises the model’s peak accuracy. As shown in Figure 9b, standard VAT was compared with LoBranch-enhanced VAT under varying conditions. Here, Baseline denotes the original model deployed without any VAT; w/o LoBranch denotes the model trained with standard VAT (noise injected into all weights); and w/LoBranch denotes our method incorporating the LoBranch structure during VAT. Owing to the noise-immune computational path introduced by LoBranch, model performance under high-variation conditions is improved. Nevertheless, excessive noise introduced during VAT still leads to a drop in model accuracy.
To address this trade-off, the Hybrid-CIM architecture enables post-mapping adaptation to achieve better performance. The Hybrid-CIM introduces a small portion of digital computation to compensate for the nonlinear interference in 3D-RRAM and mitigate the accuracy degradation. As shown in Figure 10, SRAM-CIM, leveraging accurate digital computation, achieves the best performance, maintaining the same accuracy as the pre-trained model. In contrast, 3D-RRAM-CIM suffers from accuracy decline, with a recognition accuracy of 91.5%. The accuracy of the proposed Hybrid-CIM lies between those of 3D-RRAM-CIM and SRAM-CIM.
The experimental results indicated that 3D-RRAM-CIM, based on analog computation, suffered from model accuracy loss due to nonlinear interference, with a 2.9% accuracy drop. The Hybrid-CIM architecture is also affected by variation, but it introduces a digital compensation mechanism through SRAM-CIM, which limits the degradation to only 0.4% and thus reduces the accuracy loss by 86% compared to 3D-RRAM-CIM. Both 3D-RRAM-CIM and Hybrid-CIM need to employ online learning to achieve the best accuracy. The 3D-RRAM approach involves direct weight updates within the 3D-RRAM during training, which can partially eliminate variation-induced nonlinear interference and achieve higher recognition accuracy than direct deployment [7]. However, the high write overhead and the new stochastic variation introduced during weight updates still result in a 2.9% accuracy drop and lead to unstable performance across training runs. Hybrid-CIM also requires online training, but it only needs to update the low-rank weights in SRAM, avoiding the high write overhead and the instability stemming from write variation during weight updates. Moreover, since no write operations to RRAM are needed, the Hybrid-CIM approach avoids the endurance impact that online training imposes on 3D-RRAM-CIM. Overall, Hybrid-CIM can effectively compensate for the computational errors in analog computation with minimal additional digital computation, achieving accuracy comparable to that of SRAM-CIM and reducing accuracy loss by 86%.
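The 86% figure follows directly from the two reported accuracy drops. The short calculation below only restates that arithmetic; the function name is chosen here for illustration.

```python
def loss_reduction(drop_reference, drop_compensated):
    """Fraction of the reference accuracy drop removed by compensation."""
    return (drop_reference - drop_compensated) / drop_reference

# Accuracy drops in percentage points, as reported for ResNet-18 on CIFAR-10.
rram_drop = 2.9     # 3D-RRAM-CIM
hybrid_drop = 0.4   # Hybrid-CIM
reduction = loss_reduction(rram_drop, hybrid_drop)

print(f"{reduction:.0%}")  # prints "86%"
```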
From the perspective of area overhead, although both SRAM-CIM and Hybrid-CIM utilize lower-density SRAM, Hybrid-CIM only employs SRAM to store the weights of the LoBranch structure. For the ResNet18 model used in this experiment, these low-rank weights constitute less than 2% of the original model. Consequently, the pre-trained weights are stored in high-density 3D-RRAM, enabling Hybrid-CIM to achieve significantly lower area overhead compared to a pure SRAM solution, with an area reduction by a factor of 8.5 relative to SRAM-CIM, as shown in Figure 10. Compared to 3D-RRAM-CIM, the additional area overhead of Hybrid-CIM is minimal, demonstrating that it retains the density advantage of 3D-RRAM while maintaining area efficiency.
Due to their distinct architectures, different neural networks exhibit varying degrees of inherent robustness to noise and perturbations. To further validate the effectiveness of the proposed approach, we conducted experiments on multiple CNN models. As illustrated in Figure 11, our approach effectively reduces computational error across all models, achieving a reduction in accuracy loss of at least 74%. The variation in improvement ratio can be attributed to the intrinsic architectural properties of each model. Specifically, the superior improvement observed in both ResNet-50 (87%) and ResNet-18 (86%) is likely due to the robust residual learning framework, which mitigates error propagation through skip connections [39,42]. The deeper ResNet-50 might offer a slightly greater capacity for compensation within this robust framework. In contrast, VGG-16, despite having the most parameters, lacks such built-in resilience. Its plain, deeply stacked structure is more susceptible to the accumulation of analog errors, resulting in a lower relative improvement. Notably, our method still provides substantial compensation (74%) for the lightweight MobileNetV2, which, despite having the fewest parameters, contains a large number of layers where errors can propagate. These results collectively demonstrate that our approach provides significant and generalizable error compensation across diverse and common convolutional architectures.
The consistent effectiveness across diverse architectures stems from a fundamental design principle that distinguishes our LoBranch from a typical low-rank network. Unlike approaches that replace the original backbone with a low-rank structure, our method preserves the full-precision, high-rank backbone network and augments it with a parallel, low-rank adaptation branch. This architecture ensures that the backbone’s rich representational capacity, stored in high-density 3D-RRAM, remains intact. The LoBranch then acts not as a feature extractor, but as a dedicated co-processor for hardware error correction, stored in precise SRAM. This synergistic combination—a dense analog backbone for capacity and a sparse digital compensator for accuracy—is the cornerstone of achieving both robust performance against variation and the flexibility for multi-task computing.
Although non-idealities such as peripheral noise, sneak paths and variation over time are not explicitly considered, their effects are inherently reflected in the computation results used during post-mapping adaptation [22], enabling our LoBranch-based approach to simultaneously compensate for these additional non-idealities.
This adaptive nature of our method also defines its interaction with fundamental device-level properties. The proposed low-rank adaptation compensates for these properties through its core adaptive compensation mechanism. In this work, the Gaussian noise model primarily captures the effects of conductance quantization and write-induced variation, which are the dominant and immediate non-idealities. The post-mapping adaptation process fine-tunes the LoBranch by directly learning from the hardware’s computational output, allowing it to correct for the aggregate functional deviation caused by these effects.
Furthermore, the same principle suggests its potential utility in mitigating other device-level drifts that manifest as gradual deviations over time, such as resistance drift. Since the core operation is to measure output error and adapt, the system is inherently suited to compensating for a wide range of systematic biases that corrupt the ideal MVM operation, provided they occur on a timescale slower than the adaptation process.
This inherent adaptability also suggests a promising pathway to mitigate performance degradation caused by environmental and long-term reliability issues, such as temperature variations and device aging. Temperature fluctuations can shift the operating point of RRAM devices, while aging effects (e.g., resistance drift) can gradually alter their conductance states over time. Although not the primary focus of this work, the proposed system can, in principle, address these challenges through periodic re-invocation of the post-mapping adaptation process. By periodically calibrating the LoBranch weights based on the current system output, the Hybrid-CIM can continuously learn and correct for the slowly evolving errors introduced by these factors, thereby enhancing its long-term operational robustness in real-world edge scenarios.
While the proposed method achieves consistent improvements across different models, an observation is that its effectiveness is well maintained when scaling from a smaller network (ResNet18) to a larger one (ResNet50). This demonstrates the potential scalability of the approach to networks with increasing size and complexity. Considering that LoRA has been validated as an effective technique in large-scale models [2], our approach is expected to retain adaptability to more complex and larger architectures, such as RNNs, generative models, and Transformers. Future work should further explore the scalability of our approach to enhance the intelligence of edge computing systems.
In summary, the experimental results demonstrate that the proposed computational approach not only reduces accuracy loss by 86% on the ResNet18 model with minimal area overhead but also effectively mitigates accuracy decline in 3D-RRAM across different neural network architectures. This indicates that our approach has general applicability for enhancing the accuracy performance of 3D-RRAM-CIM. During the post-mapping adaptation, our approach only requires updating a small number of weights in SRAM to compensate for computational accuracy, eliminating the need for write operations to 3D RRAM.

4.2. Implementation of Multi-Task Computing

The low-rank neural network, which significantly reduces training overhead by approximating weight updates through low-rank matrices, has been widely adopted in large models. Considering the effectiveness of LoRA in functional fine-tuning of models, the potential of the proposed approach for multi-task edge computing is further explored. Previous LoRA-related studies have primarily focused on large models for functional adjustment, owing to the higher weight redundancy and more pronounced low-rank characteristics of such models. Given the extensive application of CNNs in edge computing [43], this research extends the fine-tuning strategy of this algorithm from large-scale models to smaller convolutional networks. Since CNNs are more suitable for deployment on practical edge devices than large models with massive parameter counts, this extension is of significant importance for edge computing.
The datasets used in the multi-task experiment include CIFAR100, MNIST, FashionMNIST, and Flowers102, which cover common recognition tasks in edge computing, such as object recognition, digit recognition, clothing category recognition, and plant recognition. These datasets encompass a variety of typical recognition task scenarios in edge computing. The multi-task experiment integrates LoRA with convolutional networks to investigate the potential for multi-task edge computing. Specifically, it explores the effect of fine-tuning a pre-trained model to perform multiple tasks simultaneously, thereby leveraging the capabilities of high-density 3D RRAM.
To validate whether the proposed Hybrid architecture can adjust the function of the light-weight model while achieving error compensation, the CNN-based ResNet-18 model was employed. The experiment utilized a model pre-trained on the ImageNet dataset and applied Hybrid-CIM for edge learning. By fine-tuning the model to adapt to different datasets, the process simulates edge learning scenarios that edge devices may face. During the multi-task experiment, only the low-rank weights in SRAM-CIM were updated. Evaluations were conducted on the CIFAR-100, Fashion MNIST, MNIST, and Flowers102 datasets to evaluate the model’s ability in multi-task edge computing.
Figure 12 presents the generalization evaluation results of the proposed approach on ResNet-18. Different tasks were evaluated on the Hybrid and Baseline approaches using the same pre-trained model. It is worth noting that the baseline here is trained under ideal conditions with the LoRA structure, while the Hybrid approach performs low-rank fine-tuning with the pre-trained weights affected by variation from 3D RRAM. For the more complex CIFAR-100 task, some accuracy loss relative to the baseline is observed, although Hybrid-CIM still performs the task. For relatively simpler datasets, such as MNIST, the accuracy of Hybrid-CIM is comparable to the baseline. This is primarily because our approach is sufficiently capable of handling the simpler tasks typically encountered in edge computing. On the Fashion MNIST dataset, Hybrid-CIM also exhibits performance close to the baseline model, effectively reducing accuracy loss. Our approach demonstrates strong adaptability across the CIFAR-100, Fashion MNIST, and MNIST datasets and can effectively support the requirements of edge computing for multitasking in different scenarios.
The performance of Hybrid-CIM across these diverse datasets validates its versatility in multi-task edge learning. It is important to note that the principal advantage of our approach is system-level. For simpler tasks (e.g., MNIST), a robust model deployed directly on a noisy 3D-RRAM-CIM macro might sustain reasonable baseline accuracy due to the task’s inherent simplicity. However, such a pure RRAM-based system cannot switch tasks without a complete and costly model rewrite. In contrast, our Hybrid-CIM architecture enables efficient multi-task support without any write operations to the RRAM. By keeping the RRAM-stored backbone fixed and only adapting the tiny LoBranch in SRAM, we eliminate the associated write energy, latency, and endurance overhead.
This leads to a fundamental improvement in storage efficiency and practicality. The experimental results validate that a single, fixed model in 3D-RRAM can support numerous tasks (e.g., Flowers102, CIFAR-100, Fashion MNIST, MNIST) by merely switching the low-rank weights in SRAM. In contrast, prior CIM works typically require loading a completely different model for each task [8,10], a scheme burdened by the high write overhead of RRAM. Thus, our approach achieves a more practical and scalable solution for edge devices, fully leveraging the high-density advantage of 3D-RRAM while introducing agile multi-task capability.
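Task switching with a fixed backbone can be sketched as a lookup of per-task low-rank factors. The class below is a hypothetical illustration: the name `HybridLayer`, its methods, and the tiny hand-picked matrices are assumptions made here, with an immutable tuple standing in for the frozen 3D-RRAM weights and a dictionary standing in for the SRAM contents.

```python
class HybridLayer:
    """One frozen backbone matrix plus per-task low-rank branches."""

    def __init__(self, W):
        # Immutable copy: stands in for the fixed 3D-RRAM weights.
        self._W = tuple(tuple(row) for row in W)
        # task name -> (A, B): stands in for the SRAM contents.
        self._branches = {}
        self.active = None

    def register(self, task, A, B):
        self._branches[task] = (A, B)

    def switch(self, task):
        """Select another task by swapping only the low-rank branch."""
        if task not in self._branches:
            raise KeyError(task)
        self.active = task

    def forward(self, x):
        A, B = self._branches[self.active]
        h = [sum(b * v for b, v in zip(row, x)) for row in B]
        main = [sum(w * v for w, v in zip(row, x)) for row in self._W]
        branch = [sum(a * v for a, v in zip(row, h)) for row in A]
        return [m + c for m, c in zip(main, branch)]

layer = HybridLayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("mnist", A=[[1.0], [0.0]], B=[[0.5, 0.0]])
layer.register("flowers102", A=[[0.0], [1.0]], B=[[0.0, 2.0]])

layer.switch("mnist")
y_mnist = layer.forward([2.0, 3.0])      # [3.0, 3.0]
layer.switch("flowers102")
y_flowers = layer.forward([2.0, 3.0])    # [2.0, 9.0]
```

The same input produces different outputs after `switch`, while the backbone tuple is never modified, which is the property that removes RRAM writes from task changes.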

4.3. Comparison and Feasibility Analysis

This section presents a comparison of the accuracy performance of the proposed Hybrid-CIM architecture against several state-of-the-art methods, as summarized in Table 1. The evaluation is conducted on the CIFAR-10 dataset using both VGG-16 and ResNet-18 models.
For VGG-16 on CIFAR-10, our method achieves the highest post-mapping accuracy of 93.14%, outperforming previous 8-bit RRAM-based works such as DVA [36] and Unary [21] by 13.04% and 5.2%, respectively. Even compared to advanced heterogeneous approaches like KD + RSA [22], Hybrid IMC [18], and CorrectNet+ [24], our approach provides improved accuracy with a simpler and denser storage scheme. Notably, it achieves this using 8-bit precision, without relying on aggressive quantization or extensive SRAM usage, demonstrating the effectiveness of our algorithm–hardware co-design approach.
For ResNet-18 on CIFAR-10, the proposed Hybrid-CIM achieves an accuracy of 94.0% with 8-bit precision. Compared with [23], the proposed architecture also shows competitive performance. To the best of our knowledge, this work is the first to combine variation mitigation with multi-task computation within a heterogeneous CIM architecture. Our approach not only corrects hardware-induced errors but also enables multi-task computing, achieving higher accuracy and supporting more functionality under limited edge hardware resources.
Although post-mapping approaches, such as the proposed LoBranch and Hybrid IMC [18], can mitigate accuracy degradation caused by the variation in RRAM devices, they may incur additional hardware overhead, including area and latency. To assess the hardware feasibility of the proposed approach, NeuroSim [14], an architectural analysis tool that estimates area, latency, and energy efficiency, was adopted to evaluate its performance on ResNet-18.
Figure 13 presents the system-level estimated area based on NeuroSim analysis, including peripheral circuits. The proposed Hybrid-CIM architecture exhibits a significant reduction in total area compared to both the SRAM-CIM and planar RRAM-CIM baselines. Specifically, Hybrid-CIM achieves an 8.5× area reduction relative to SRAM-CIM and a 4.2× reduction relative to planar RRAM-CIM. This improvement is mainly attributed to the hardware-aware design of the LoBranch structure and the use of high-density 3D RRAM for storing the majority of weights. Although peripheral circuits cannot be vertically stacked with the 3D-RRAM array, the layer-wise computation pattern inherent in neural networks allows each layer to be mapped to a separate physical layer of 3D-RRAM. This mapping enables temporal reuse of peripheral resources across layers, effectively reducing their impact on the overall area. Despite the substantial reduction in area, Hybrid-CIM maintains inference accuracy comparable to that of SRAM-CIM, as shown in Figure 10, indicating a favorable trade-off between accuracy and hardware cost.
This significant area reduction is achieved without compromising speed or reliability. On the contrary, the Hybrid-CIM architecture achieves a favorable balance across all these system-level metrics.
Speed (Latency): As shown in Figure 14, the inference latency of Hybrid-CIM (9.6 ms) is not only lower than that of the RRAM-CIM baseline (10.1 ms) but also comparable to the SRAM-CIM baseline (9.5 ms). This is because the digital computation of the lightweight LoBranch in SRAM is fast and can be efficiently orchestrated with the 3D-RRAM-CIM computation, avoiding a critical path bottleneck.
Reliability: The hybrid approach enhances system reliability in two key aspects. First, by keeping the weights in 3D-RRAM frozen and only updating the parameters in SRAM during adaptation, it eliminates write cycles to the RRAM devices for functional tuning, thereby mitigating RRAM endurance concerns and avoiding the introduction of new write-induced variations. Second, the SRAM-CIM, which stores the critical compensation parameters, provides highly robust digital storage and computation.
Figure 14 presents the estimated energy efficiency and inference latency across different systems. The proposed Hybrid-CIM exhibits higher energy efficiency and competitive inference latency compared to both SRAM-CIM and RRAM-CIM counterparts. Specifically, Hybrid-CIM achieves an energy efficiency of 4.9 TOPS/W, which represents a 1.59× improvement over SRAM-CIM and a 1.19× improvement over RRAM-CIM. In terms of inference latency, Hybrid-CIM achieves 9.6 ms, which is lower than RRAM-CIM (10.1 ms) and comparable to SRAM-CIM (9.5 ms). Remarkably, the proposed Hybrid-CIM offers these system-level energy efficiency gains with negligible accuracy degradation, highlighting its practical feasibility in energy-constrained scenarios. It should be noted that the hardware-cost analysis here, like in [22], is only to show the feasibility of LoBranch.
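The reported ratios can be cross-checked with a few lines of arithmetic. In the sketch below, the Hybrid-CIM efficiency, the two improvement factors, and the three latencies are the values quoted above; the baseline efficiencies are derived here from the rounded ratios and are therefore approximate, not figures reported by the paper.

```python
hybrid_eff = 4.9            # TOPS/W, reported for Hybrid-CIM
gain_vs_sram = 1.59         # reported improvement over SRAM-CIM
gain_vs_rram = 1.19         # reported improvement over RRAM-CIM

# Baseline efficiencies implied by the ratios (derived, not reported values).
sram_eff = hybrid_eff / gain_vs_sram
rram_eff = hybrid_eff / gain_vs_rram

# Reported inference latencies in milliseconds.
latency_ms = {"Hybrid-CIM": 9.6, "SRAM-CIM": 9.5, "RRAM-CIM": 10.1}
rel_to_rram = latency_ms["Hybrid-CIM"] / latency_ms["RRAM-CIM"] - 1.0
rel_to_sram = latency_ms["Hybrid-CIM"] / latency_ms["SRAM-CIM"] - 1.0

print(f"implied SRAM-CIM: ~{sram_eff:.2f} TOPS/W")
print(f"implied RRAM-CIM: ~{rram_eff:.2f} TOPS/W")
print(f"latency vs RRAM-CIM: {rel_to_rram:+.1%}")
print(f"latency vs SRAM-CIM: {rel_to_sram:+.1%}")
```

The latency differences are on the order of a few percent in either direction, which is why the text describes the latency as competitive rather than as a gain.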
In summary, by leveraging 3D RRAM and the proposed LoBranch structure, the Hybrid-CIM system achieves a favorable balance between accuracy and hardware cost without compromising the intrinsic density advantage of 3D RRAM. By combining high-density 3D RRAM with the SRAM-based LoBranch design, Hybrid-CIM significantly outperforms conventional SRAM-CIM and planar RRAM-CIM architectures in terms of system area, while maintaining comparable accuracy. These findings demonstrate the effectiveness and feasibility of the LoBranch compensation approach in addressing stringent energy and area constraints, thereby extending the applicability of 3D RRAM to resource-constrained scenarios such as edge AI systems.

5. Conclusions

This paper demonstrates that the integration of a Low-Rank Adaptation branch (LoBranch) within a 3D RRAM/SRAM Hybrid CIM system effectively mitigates the core challenges of variation and limited multi-task capacity in edge computing. The key contributions are threefold:
First, from a methodological perspective, this work is the first to repurpose the LoRA algorithm as a dedicated hardware-error compensation mechanism. The proposed LoBranch acts as a residual compensator during both pre-mapping variation-aware training and post-mapping adaptation, enabling robust computation on noisy analog hardware.
Second, from a system-level perspective, the co-design achieves remarkable hardware efficiency. It reduces post-mapping accuracy degradation by 86% versus a pure SRAM-CIM, achieving 94.0% accuracy on CIFAR-10. Simultaneously, the Hybrid-CIM architecture yields an 8.5× reduction in system-level area compared to SRAM-CIM and a 4.2× reduction compared to planar RRAM-CIM, while maintaining competitive energy efficiency and latency.
Third, from an application perspective, the approach enables multi-task capability without sacrificing density. By updating less than 2% of the model parameters stored in SRAM, the system supports multiple tasks on a single, fixed 3D-RRAM-stored model, avoiding write overhead and endurance issues associated with RRAM weight updates.
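To give a feel for how a low-rank branch can stay below a 2% parameter budget, the sketch below counts weights for a few convolutional layers. It assumes a common way of applying LoRA to convolutions (a rank-r k×k convolution followed by a 1×1 convolution); the layer shapes, the rank, and the function names are hypothetical and are not taken from the paper's LoBranch configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def branch_params(c_in, c_out, k, rank):
    """Weights in an assumed low-rank branch: (c_in -> rank, k x k) then (rank -> c_out, 1 x 1)."""
    return rank * c_in * k * k + c_out * rank


# Hypothetical layer shapes (c_in, c_out, kernel size) and rank, for illustration only.
HYPOTHETICAL_LAYERS = [(64, 64, 3), (128, 128, 3), (256, 256, 3), (512, 512, 3)]
RANK = 4

backbone_total = sum(conv_params(*layer) for layer in HYPOTHETICAL_LAYERS)
branch_total = sum(branch_params(*layer, RANK) for layer in HYPOTHETICAL_LAYERS)
overhead = branch_total / backbone_total


def sram_params_for_tasks(num_tasks):
    """Branch weights held in SRAM if each task keeps its own branch on a shared backbone."""
    return num_tasks * branch_total


print(f"backbone weights: {backbone_total}")
print(f"branch weights:   {branch_total}")
print(f"overhead:         {overhead:.2%}")
```

With these assumed shapes the branch adds about 1.2% to the backbone weight count, and each additional task costs one more branch in SRAM rather than a rewrite of the RRAM-stored backbone.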
In summary, this work provides a balanced and robust solution for 3D-RRAM-based edge intelligence. Future work will focus on extending the approach to more complex models such as Transformers, exploring adaptive rank selection strategies for further optimization, and refining the Hybrid-CIM for physical implementation.

Author Contributions

Conceptualization, W.T. and L.N.; software simulation and validation, W.T.; data processing, W.T., C.M. and L.N.; experimental guidance, H.W., Y.Y. and S.Z.; resources, Q.L.; writing—original draft preparation, W.T.; writing—review and editing, F.Z.; project administration, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFB4402400 and in part by the National Natural Science Foundation of China under Grants 92464201, U2341218, and 62322412.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Qihao Liu was employed by the company Shandong SinoChip Semiconductors Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
  2. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022; pp. 1–13.
  3. Lepri, N.; Glukhov, A.; Cattaneo, L.; Farronato, M.; Mannocci, P.; Ielmini, D. In-Memory Computing for Machine Learning and Deep Learning. IEEE J. Electron Devices Soc. 2023, 11, 587–601.
  4. Gong, N.; Rasch, M.J.; Seo, S.-C.; Gasasira, A.; Solomon, P.; Bragaglia, V.; Consiglio, S.; Higuchi, H.; Park, C.; Brew, K.; et al. Deep Learning Acceleration in 14nm CMOS Compatible ReRAM Array: Device, Material and Algorithm Co-Optimization. In Proceedings of the 2022 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 3–7 December 2022; pp. 33.7.1–33.7.4.
  5. Jung, S.; Kim, S.J. MRAM In-Memory Computing Macro for AI Computing. In Proceedings of the 2022 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 3–7 December 2022; pp. 33.4.1–33.4.4.
  6. Su, J.-W.; Chou, Y.-C.; Liu, R.; Liu, T.-W.; Lu, P.-J.; Wu, P.-C.; Chung, Y.-L.; Hung, L.-Y.; Ren, J.-S.; Pan, T.; et al. 16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 250–252.
  7. Zhang, W.; Yao, P.; Gao, B.; Liu, Q.; Wu, D.; Zhang, Q.; Li, Y.; Qin, Q.; Li, J.; Zhu, Z.; et al. Edge Learning Using a Fully Integrated Neuro-Inspired Memristor Chip. Science 2023, 381, 1205–1211.
  8. Wan, W.; Kubendran, R.; Schaefer, C.; Eryilmaz, S.B.; Zhang, W.; Wu, D.; Deiss, S.; Raina, P.; Qian, H.; Gao, B.; et al. A Compute-in-Memory Chip Based on Resistive Random-Access Memory. Nature 2022, 608, 504–512.
  9. Su, J.-W.; Si, X.; Chou, Y.-C.; Chang, T.-W.; Huang, W.-H.; Tu, Y.-N.; Liu, R.; Lu, P.-J.; Liu, T.-W.; Wang, J.-H.; et al. Two-Way Transpose Multibit 6T SRAM Computing-in-Memory Macro for Inference-Training AI Edge Chips. IEEE J. Solid-State Circuits 2022, 57, 609–624.
  10. Wen, T.-H.; Hung, J.-M.; Huang, W.-H.; Jhang, C.-J.; Lo, Y.-C.; Hsu, H.-H.; Ke, Z.-E.; Chen, Y.-C.; Chin, Y.-H.; Su, C.-I.; et al. Fusion of Memristor and Digital Compute-in-Memory Processing for Energy-Efficient Edge Computing. Science 2024, 384, 325–332.
  11. Huo, Q.; Yang, Y.; Wang, Y.; Lei, D.; Fu, X.; Ren, Q.; Xu, X.; Luo, Q.; Xing, G.; Chen, C.; et al. A Computing-in-Memory Macro Based on Three-Dimensional Resistive Random-Access Memory. Nat. Electron. 2022, 5, 469–477.
  12. Rehman, M.M.; Samad, Y.A.; Gul, J.Z.; Saqib, M.; Khan, M.; Shaukat, R.A.; Chang, R.; Shi, Y.; Kim, W.Y. 2D Materials-Memristive Devices Nexus: From Status Quo to Impending Applications. Prog. Mater. Sci. 2025, 152, 101471.
  13. Huo, Q.; Song, R.; Lei, D.; Luo, Q.; Wu, Z.; Wu, Z.; Zhao, X.; Zhang, F.; Li, L.; Liu, M. Demonstration of 3D Convolution Kernel Function Based on 8-Layer 3D Vertical Resistive Random Access Memory. IEEE Electron Device Lett. 2020, 41, 497–500.
  14. Peng, X.; Huang, S.; Luo, Y.; Sun, X.; Yu, S. DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies. In Proceedings of the 2019 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 7–11 December 2019; pp. 32.5.1–32.5.4.
  15. Krishnan, G.; Wang, Z.; Yeo, I.; Yang, L.; Meng, J.; Liehr, M.; Joshi, R.V.; Cady, N.C.; Fan, D.; Seo, J.-S.; et al. Hybrid RRAM/SRAM in-Memory Computing for Robust DNN Acceleration. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2022, 41, 4241–4252.
  16. Jeong, S.; Kim, J.; Jeong, M.; Lee, Y. Variation-Tolerant and Low R-Ratio Compute-in-Memory ReRAM Macro with Capacitive Ternary MAC Operation. IEEE Trans. Circuits Syst. I 2022, 69, 2845–2856.
  17. Yao, P.; Wu, H.; Gao, B.; Tang, J.; Zhang, Q.; Zhang, W.; Yang, J.J.; Qian, H. Fully Hardware-Implemented Memristor Convolutional Neural Network. Nature 2020, 577, 641–646.
  18. Wang, Z.; Nalla, P.S.; Krishnan, G.; Joshi, R.V.; Cady, N.C.; Fan, D.; Seo, J.; Cao, Y. Digital-Assisted Analog In-Memory Computing with RRAM Devices. In Proceedings of the 2023 International VLSI Symposium on Technology, Systems and Applications (VLSI-TSA/VLSI-DAT), HsinChu, Taiwan, 17–20 April 2023; pp. 1–4.
  19. Yang, X.; Belakaria, S.; Joardar, B.K.; Yang, H.; Doppa, J.R.; Pande, P.P.; Chakrabarty, K.; Li, H.H. Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise. In Proceedings of the 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), Munich, Germany, 1–4 November 2021; pp. 1–9.
  20. Wang, S.; Zhang, Y.; Chen, J.; Zhang, X.; Li, Y.; Lin, N.; He, Y.; Yang, J.; Yu, Y.; Li, Y.; et al. Reconfigurable Digital RRAM Logic Enables In-Situ Pruning and Learning for Edge AI. arXiv 2025, arXiv:2506.13151.
  21. Ma, C.; Sun, Y.; Qian, W.; Meng, Z.; Yang, R.; Jiang, L. Go Unary: A Novel Synapse Coding and Mapping Scheme for Reliable ReRAM-Based Neuromorphic Computing. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 1432–1437.
  22. Charan, G.; Mohanty, A.; Du, X.; Krishnan, G.; Joshi, R.V.; Cao, Y. Accurate Inference With Inaccurate RRAM Devices: A Joint Algorithm-Design Solution. IEEE J. Explor. Solid-State Comput. Devices Circuits 2020, 6, 27–35.
  23. Sun, Y.; Ma, C.; Li, Z.; Zhao, Y.; Jiang, J.; Qian, W.; Yang, R.; He, Z.; Jiang, L. Unary Coding and Variation-Aware Optimal Mapping Scheme for Reliable ReRAM-Based Neuromorphic Computing. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2021, 40, 2495–2507.
  24. Eldebiky, A.; Zhang, G.L.; Böcherer, G.; Li, B.; Schlichtmann, U. CorrectNet+: Dealing With HW Non-Idealities in In-Memory-Computing Platforms by Error Suppression and Compensation. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2024, 43, 573–585.
  25. Ye, W.; Wang, L.; Zhou, Z.; An, J.; Li, W.; Gao, H.; Li, Z.; Yue, J.; Hu, H.; Xu, X.; et al. A 28-nm RRAM Computing-in-Memory Macro Using Weighted Hybrid 2T1R Cell Array and Reference Subtracting Sense Amplifier for AI Edge Inference. IEEE J. Solid-State Circuits 2023, 58, 2839–2850.
  26. Kosta, A.; Soufleri, E.; Chakraborty, I.; Agrawal, A.; Ankit, A.; Roy, K. HyperX: A Hybrid RRAM-SRAM Partitioned System for Error Recovery in Memristive Xbars. In Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), Antwerp, Belgium, 14–23 March 2022; pp. 88–91.
  27. Yang, M.; Chen, J.; Zhang, Y.; Liu, J.; Zhang, J.; Ma, Q.; Verma, H.; Zhang, Q.; Zhou, M.; King, I.; et al. Low-Rank Adaptation for Foundation Models: A Comprehensive Review. arXiv 2024, arXiv:2501.00365.
  28. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023, arXiv:2305.14314.
  29. Zhang, Q.; Chen, M.; Bukharin, A.; Karampatziakis, N.; He, P.; Cheng, Y.; Chen, W.; Zhao, T. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. arXiv 2023, arXiv:2303.10512.
  30. Mao, Y.; Ge, Y.; Fan, Y.; Xu, W.; Mi, Y.; Hu, Z.; Gao, Y. A Survey on LoRA of Large Language Models. Front. Comput. Sci. 2025, 19, 197605.
  31. Chih, Y.-D.; Lee, P.-H.; Fujiwara, H.; Shih, Y.-C.; Lee, C.-F.; Naous, R.; Chen, Y.-L.; Lo, C.-P.; Lu, C.-H.; Mori, H.; et al. 16.4 An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 252–254.
  32. Fujiwara, H.; Mori, H.; Zhao, W.-C.; Chuang, M.-C.; Naous, R.; Chuang, C.-K.; Hashizume, T.; Sun, D.; Lee, C.-F.; Akarvardar, K.; et al. A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3.
  33. An, J.; Zhou, Z.; Wang, L.; Ye, W.; Li, W.; Gao, H.; Li, Z.; Tian, J.; Wang, Y.; Hu, H.; et al. Write–Verify-Free MLC RRAM Using Nonbinary Encoding for AI Weight Storage at the Edge. IEEE Trans. VLSI Syst. 2024, 32, 283–290.
  34. Shafiee, A.; Nag, A.; Muralimanohar, N.; Balasubramonian, R.; Strachan, J.P.; Hu, M.; Williams, R.S.; Srikumar, V. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Republic of Korea, 18–22 June 2016; pp. 14–26.
  35. Lu, A.; Peng, X.; Li, W.; Jiang, H.; Yu, S. NeuroSim Validation with 40nm RRAM Compute-in-Memory Macro. In Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021; pp. 1–4.
  36. Li, Y.; Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Yuan, L.; Liu, Z.; Zhang, L.; Vasconcelos, N. MicroNet: Improving Image Recognition with Extremely Low FLOPs. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 458–467.
  37. Zhu, Y.; Zhang, G.L.; Wang, T.; Li, B.; Shi, Y.; Ho, T.-Y.; Schlichtmann, U. Statistical Training for Neuromorphic Computing Using Memristor-Based Crossbars Considering Process Variations and Noise. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 1590–1593.
  38. Long, Y.; She, X.; Mukhopadhyay, S. Design of Reliable DNN Accelerator with Un-Reliable ReRAM. In Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 9–13 March 2019; pp. 1769–1774.
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  40. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
  41. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
  42. Tang, S.; Gong, R.; Wang, Y.; Liu, A.; Wang, J.; Chen, X.; Yu, F.; Liu, X.; Song, D.; Yuille, A.; et al. RobustART: Benchmarking Robustness on Architecture Design and Training Techniques. arXiv 2022, arXiv:2109.05211.
  43. Wu, F.; Zhao, N.; Liu, Y.; Chang, L.; Zhou, L.; Zhou, J. A Review of Convolutional Neural Networks Hardware Accelerators for AIoT Edge Computing. In Proceedings of the 2021 International Conference on UK-China Emerging Technologies (UCET), Chengdu, China, 4–6 November 2021; pp. 71–76.
Figure 1. Challenges and the proposed framework: (a) RRAM-CIM variation and error accumulation. (b) Edge computing challenges for RRAM-CIM. (c) Overview of the Hybrid-CIM with LoBranch approach.
Figure 2. Illustration of (a) 2D RRAM, (b) SRAM, (c) vertically stacked 3D RRAM, and (d) the feature comparison between SRAM and 3D RRAM.
Figure 3. (a) Illustration of analog CIM and digital CIM architectures. (b) Computation data flows of CIM during forward and backward propagation.
Figure 4. Structure of low-rank adaptation and its difference from conventional neural network.
Figure 5. Proposed LoBranch structure for convolutional neural network.
Figure 6. Proposed system-level architecture of Hybrid-CIM.
Figure 7. Overall evaluation framework of the proposed approach, implemented with NeuroSim [14] for feasibility analysis.
Figure 8. Distribution of chip-measured resistance of 8-layer multi-level 3D RRAM.
Figure 9. Accuracy analysis: (a) ResNet-18 under the mismatch between noise applied during training and variation experienced during evaluation, and (b) comparison of pre-mapping VAT with and without LoBranch.
Figure 10. Error correction performance and area cost comparison.
Figure 11. Performance analysis of Hybrid-CIM on different neural networks.
Figure 12. Performance of the proposed Hybrid-CIM approach in multi-task scenarios.
Figure 13. System-level area comparison of Hybrid-CIM-based system against SRAM-CIM-based and RRAM-CIM-based systems.
Figure 14. System-level energy efficiency and latency evaluation of Hybrid-CIM-based system against SRAM-CIM-based and RRAM-CIM-based systems.
Table 1. Comparison of Post-mapping Accuracy with State-of-the-art Works.

| Approach | Precision | Storage Scheme | Heterogeneous | Density | Multi-Task Friendly | Accuracy (%) |
|---|---|---|---|---|---|---|
| VGG-16 on CIFAR-10 | | | | | | |
| DVA [38] | 8-bit | RRAM | no | medium | no | 80.1 (−13.04) |
| Unary [21] | 8-bit | RRAM | no | medium | no | 87.94 (−5.2) |
| KD + RSA [22] | 4-bit | RRAM and 15% SRAM | yes | limited | no | 92.57 (−0.57) |
| Unary-opt [23] | 6-bit | RRAM | no | medium | no | 92.66 (−0.48) |
| Hybrid IMC [18] | 3-bit | RRAM and 100% SRAM | yes | limited | no | 92.97 (−0.17) |
| CorrectNet+ [24] | 4-bit | RRAM and 0.6% RRAM | yes | medium | no | 91.29 (−1.85) |
| Ours | 8-bit | 3D RRAM and 2% SRAM | yes | high | yes | 93.14 |
| ResNet-18 on CIFAR-10 | | | | | | |
| DVA [38] | 8-bit | RRAM | no | medium | no | 84.1 (−9.9) |
| Unary-opt [23] | 6-bit | RRAM | no | medium | no | 93.2 (−0.8) |
| Ours | 8-bit | 3D RRAM and 2% SRAM | yes | high | yes | 94.0 |

Values in parentheses indicate the accuracy difference relative to the proposed approach.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, W.; Nie, L.; Ma, C.; Wu, H.; Yuan, Y.; Zhang, S.; Liu, Q.; Zhang, F. Low-Rank Compensation in Hybrid 3D-RRAM/SRAM Computing-in-Memory System for Edge Computing. Eng 2025, 6, 332. https://doi.org/10.3390/eng6120332
