Bridging Architectures, Mapping, and Learning for DNN Acceleration with Processing-in-Memory and In-Memory Computing Systems
Abstract
1. Introduction
1.1. Tutorial Snapshot: PIM vs. IMC
1.2. Scope and Taxonomy of This Survey
1.3. Where Practice Stands: Opportunities and Frictions
1.4. Survey Positioning and Roadmap
2. Landscape of IMC/PIM Architectures (Hybrid)
2.1. Paradigm Shift: From Data Movement to Memory-Centric Compute
2.2. Digital PIM in DRAM/SRAM: Operations, Gains, and Exemplars
2.3. Device Taxonomy and Comparative Trade-Offs (Volatile vs. Non-Volatile; Analog vs. Digital)
2.4. Peripheral Overheads and System Bottlenecks
2.5. Device Non-Idealities and Cross-Layer Mitigation
2.6. Integration Challenges, Selectors, and 3D Stacking
2.7. Photonic IMC: Modulators, MAC Mechanics, Precision, and Energy
2.8. Practical Rule-of-Thumb Bridge to Mapping (Device-Aware Constraints)
2.9. Application Scope and Commercialization Pathways
3. Taxonomy and Pipeline for Mapping, Partitioning, and Scheduling on PIM/IMC (Hybrid, Combined)
3.1. Conventional → Advanced Strategy Spectrum
- Learning-/graph-based and hybrids. DRL/RL learns adaptive mapping/scheduling under non-stationary workloads (training cost is the key hurdle) [121,122,123,124,125,126]. Graph-based methods cast models as data flow graphs (DFGs) and apply partitioning/min-cut to reduce communication and balance load [127,128]. Hybrids mix rules with auto-tuned design space exploration (DSE) and profiling to co-optimize multi-objective trade-offs [102,129,130,131]. Current taxonomies emphasize metaheuristics for search efficiency [58,134,135], RL/DRL for runtime adaptability [106,107,112], and hybrid (ILP + ML, RL + graph) dominance at scale [102,104,105,106,107,112,113].
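To make the graph-based option concrete, the following minimal sketch partitions a toy layer-level DFG into two tiles by minimizing cut traffic under a load-balance constraint. It is an illustration only, not the method of any cited framework: the layer names, edge weights, per-layer costs, and the `max_imbalance` parameter are all hypothetical, and exhaustive search stands in for the scalable min-cut heuristics that real partitioners use.

```python
from itertools import combinations

# Hypothetical layer-level data flow graph: edge weights stand in for
# activation traffic between layers (arbitrary units, made up for illustration).
EDGES = {
    ("conv1", "conv2"): 8,
    ("conv2", "conv3"): 4,
    ("conv1", "conv3"): 1,  # skip connection
    ("conv3", "fc1"): 2,
    ("fc1", "fc2"): 1,
}
# Hypothetical per-layer compute cost used for load balancing.
COST = {"conv1": 3, "conv2": 3, "conv3": 2, "fc1": 1, "fc2": 1}


def cut_weight(part_a, edges):
    """Total weight of edges with exactly one endpoint in part_a."""
    return sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))


def balanced_min_cut(cost, edges, max_imbalance):
    """Exhaustively search two-way partitions; return (cut, part_a, part_b).

    Only partitions whose load difference is <= max_imbalance are considered.
    Returns None if no partition satisfies the balance constraint.
    Exponential in the number of nodes, so suitable for toy graphs only.
    """
    nodes = sorted(cost)
    total = sum(cost.values())
    best = None
    for k in range(1, len(nodes)):
        for subset in combinations(nodes, k):
            part_a = set(subset)
            load_a = sum(cost[n] for n in part_a)
            if abs(load_a - (total - load_a)) > max_imbalance:
                continue  # violates the load-balance constraint
            cut = cut_weight(part_a, edges)
            if best is None or cut < best[0]:
                best = (cut, part_a, set(nodes) - part_a)
    return best


cut, tile_a, tile_b = balanced_min_cut(COST, EDGES, max_imbalance=2)
print(cut, sorted(tile_a), sorted(tile_b))
```

In this toy graph the heavy `conv1`→`conv2` edge stays inside one tile, which is the behavior the communication-reduction objective is meant to produce; tightening `max_imbalance` trades cut traffic against load balance.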
3.2. Device-Aware and Cross-Layer Constraints (First-Class)
3.3. Pipeline Primitives and Design Knobs
- Operator mapping (layer → resource).
- Model partitioning.
- Task scheduling.
- Dataflow scheduling.
- End-to-end view.
3.4. Strategy-to-Flow Schematics (Fit and Roles)
3.5. Compiler Integration and Portability
3.6. Cross-Layer Co-Optimization and Feedback
3.7. Quantitative Evidence and Case Studies
3.8. Contributors Snapshot
3.9. Limits and Gaps
3.10. Takeaway
4. Chronological Evolution and Comparative Analysis of DNN-to-PIM Mapping Frameworks (2019–2025)
4.1. Early Years (2019–2020): Foundational Simulators and Rule-Based Flows
4.2. Maturity Phase (2021–2022): Hardware-Aware Optimization and ILP-Based Scheduling
4.3. Recent Innovations (2023–2025): Graph-Based, RL/DRL, and Hybrid Frameworks
4.4. Comparative Trade-Offs and Open Gaps
5. Software-Centric Approaches: Compilers and End-to-End Mapping Tools
5.1. Bridging Algorithm–Hardware Abstraction Gaps
5.2. Mapping Pipelines and Dataflow Strategies
5.3. Dynamic Optimizations and Latency Reduction
5.4. Framework Diversity: Open-Source vs. Proprietary
5.5. Unresolved Gaps and Future Directions
6. Benchmarking and Dataset Resources
6.1. Scope and Rationale
6.2. Taxonomy of Benchmarking Practices and Datasets
6.3. Public Ecosystem and Coverage Map
6.4. Standard Metrics and Normalization
6.5. Methodological Pitfalls (“Red-Flag Matrix”) and Best Practices
6.6. Chronology (2018–2025): From Ad Hoc Suites to Graph-Centric and Hardware-in-the-Loop
6.7. Integration with Software Stacks and Transition
7. Comprehensive Reference Card and Comparative Benchmarking of DNN-to-PIM Accelerators

7.1. Energy Efficiency vs. Latency: Trading Power for Real-Time Operation
7.2. Workload Compatibility and Generalization: Versatility vs. Specialization
7.3. Area Efficiency and Scaling Potential: Density Beyond TOPS/W
7.4. Latency–Accuracy–Energy Trifecta: Navigating the Pareto Frontier
7.5. Toward Unified Evaluation: Open Source, Real-World Workloads, and Toolchain Co-Design
7.6. Key Takeaways from Comparative Benchmarking
- Comparisons reflect reported operating points and workloads; when batch size, precision, or ADC range differ, we prioritize area-normalized throughput (TOPS/mm²) and show ranges rather than single peak numbers to avoid apples-to-oranges conclusions.
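As a small illustration of this reporting convention, the sketch below groups operating points by design and reports the minimum and maximum area-normalized throughput instead of a single peak. The design names and all numeric values are hypothetical placeholders, not figures from any surveyed accelerator, and the `OperatingPoint` record is an assumed structure introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One reported operating point; all values used below are hypothetical."""
    design: str
    tops: float       # throughput in TOPS
    area_mm2: float   # silicon area in mm^2
    precision: str    # e.g. "int8"; kept so the operating point is traceable


def density_ranges(points):
    """Map each design to its (min, max) area-normalized throughput in TOPS/mm^2."""
    ranges = {}
    for p in points:
        density = p.tops / p.area_mm2
        lo, hi = ranges.get(p.design, (density, density))
        ranges[p.design] = (min(lo, density), max(hi, density))
    return ranges


points = [
    OperatingPoint("design_a", tops=2.0, area_mm2=4.0, precision="int8"),
    OperatingPoint("design_a", tops=6.0, area_mm2=4.0, precision="int4"),
    OperatingPoint("design_b", tops=3.0, area_mm2=2.0, precision="int8"),
]
for design, (lo, hi) in sorted(density_ranges(points).items()):
    print(f"{design}: {lo:.2f} to {hi:.2f} TOPS/mm^2")
```

Reporting the pair rather than the maximum makes it visible when a headline number depends on a low-precision operating point that a competing design did not report.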
7.7. Trends, Variability, and Observations
7.8. Reference-Card Synthesis, Claims, and Open Questions
8. Graph-Based and Learning-Driven Approaches: Emerging Directions
8.1. Motivation and Summary
8.2. GNNs for Hardware-Aware Modeling
8.3. Beyond GNNs: RL, BO, Hybrids, and Predictors
8.4. Empirical Benchmarks and Scaling Behavior
8.5. Algorithmic Choices by Objective
8.6. Challenges and Limitations
8.7. End-to-End Frameworks and Integrated Pipelines
8.8. Outlook
9. Open Challenges and Future Directions
9.1. Standardization Needs
9.2. Hardware Heterogeneity
9.3. Integration Barriers
9.4. Community Collaboration
9.5. Actionable Roadmap: Future Directions and Vision
- (1) Architectures that adapt to workload shape: Design fabrics that natively accommodate transformers, large GNNs, and hybrid AI–physics models, not just CNNs. Concretely: heterogeneous memory planes co-packaged with reconfigurable compute islands; precision-morphing datapaths (binary → int4 → fp8) exposed via a thin control API; and telemetry hooks (utilization, queueing, bit-toggle activity) surfaced at μs–ms cadence for software to steer placement and precision online.
- (2) Mapping and scheduling as living systems: Elevate mapping from a one-time compile step to a continuous control loop: policy controllers (RL/GNN or rule-learned surrogates) that re-tile, re-pipeline, and re-place subgraphs as workload mix changes; fast-path reconfiguration primitives (hot-swap kernels, patchable tiling) with bounded replan latency; and multi-objective envelopes that balance latency, energy, and accuracy with explicit guardrails so service level objectives (SLOs) aren’t violated.
- (3) Compilers as the cross-layer glue: Make the compiler the shared language of the stack: a unified IR that encodes operator semantics, layout/precision constraints, and device capabilities in the same graph; online profiling passes that ingest hardware telemetry and update cost models at runtime; and auto-specialization that emits per-fabric kernels from a common schedule, with plug-in passes for quantization, sparsity, and dataflow.
- (4) Benchmarking as an open, evolving standard: Shift evaluation from one-off point metrics to portable, replayable execution graphs: a public suite pairing canonical AI models (ResNet, BERT, GNNs) and domain workloads (edge sensing, bio, scientific) with reference execution graphs; core metrics reported uniformly (EDP, energy/inference, latency at batch = 1/streaming, utilization, accuracy under target precision); and governance via versioned releases and community PRs so the suite evolves with models and hardware.
- (5) Staged roadmap with measurable deliverables: Near term (1–3 years): telemetry-first microarchitectures (per-op counters, queue depth, bit-toggle stats) with stable APIs; runtime-aware compiler passes (profile-guided tiling/placement) and policy stubs for limited online remapping; starter execution-graph packs (CNN/Transformer/GNN) with replay scripts for simulators and small silicon prototypes. Mid term (3–6 years): heterogeneous PIM fabrics that switch precision/dataflow at run time and accept mapping updates without recompilation; controllers upgraded to closed-loop mapping with verified guardrails (e.g., ≤2% accuracy drift, ≤5 ms replan); v2/v3 benchmarks with cross-vendor runners and mandatory utilization/EDP reporting. Long term (6–10 years): a self-optimizing ecosystem where compiler ↔ mapper ↔ fabric form a single adaptive system learning from workload traces; multi-fabric portability from the same model + IR + policy across DRAM/SRAM/eNVM PIM with bounded QoS deltas; and open governance so new models/devices slot in without redesign.

As summarized in Figure 7, the path toward practical DNN deployment on PIM/IMC systems can be organized as a staged progression from foundational benchmarking and architecture–mapping co-design to learning-augmented mapping, unified evaluation, and eventual deployment standardization. This roadmap emphasizes that meaningful progress depends not on isolated advances at a single layer, but on coordinated development across hardware design, mapping intelligence, compiler infrastructure, benchmarking practice, and industrial interoperability.
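The closed-loop guardrails mentioned for the mid-term stage (≤2% accuracy drift, ≤5 ms replan) can be read as an acceptance test applied before a controller commits a new mapping. The sketch below is one possible reading under stated assumptions: the `MappingReport` record, the `accept_remap` function, and the choice to treat "2%" as an absolute accuracy difference are all hypothetical, and the rule that a candidate must also lower EDP is an added illustration rather than a requirement stated in the roadmap.

```python
from dataclasses import dataclass

# Thresholds taken from the example guardrails in the roadmap text.
MAX_ACCURACY_DRIFT = 0.02  # <= 2% drift, interpreted as absolute accuracy (assumption)
MAX_REPLAN_MS = 5.0        # <= 5 ms replan latency


@dataclass(frozen=True)
class MappingReport:
    """Hypothetical summary of one mapping, e.g. from telemetry or a simulator."""
    accuracy: float    # task accuracy in [0, 1]
    energy_mj: float   # energy per inference in millijoules
    latency_ms: float  # latency per inference in milliseconds

    @property
    def edp(self):
        """Energy-delay product in mJ * ms."""
        return self.energy_mj * self.latency_ms


def accept_remap(current, candidate, replan_ms):
    """Accept a candidate mapping only if both guardrails hold and EDP improves."""
    if current.accuracy - candidate.accuracy > MAX_ACCURACY_DRIFT:
        return False  # accuracy guardrail violated
    if replan_ms > MAX_REPLAN_MS:
        return False  # replanning took too long to be applied online
    return candidate.edp < current.edp


current = MappingReport(accuracy=0.76, energy_mj=2.0, latency_ms=4.0)
candidate = MappingReport(accuracy=0.75, energy_mj=1.5, latency_ms=4.0)
print(accept_remap(current, candidate, replan_ms=3.0))
```

Keeping the guardrails as hard rejections and the EDP comparison as the objective mirrors the separation between constraints and optimization targets that the roadmap describes; a production controller would also need hysteresis and verification that this sketch omits.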
10. Conclusions
- 1. Fragmented evaluation standards. The absence of open, execution-graph benchmarks with shared metrics impedes reproducibility and cross-platform comparability.
- 2. Narrow workload coverage and limited training support. Most frameworks remain tuned to CNN inference, with insufficient optimization for transformers, large-scale GNNs, diffusion models, and on-device training, fine-tuning, and continual learning.
- 3. Simulation-dominated validation. Heavy reliance on idealized models inflates performance claims; only a minority of systems are validated on silicon under realistic non-idealities (thermal drift, device and process variation, fault resilience).
- 4. Weak cross-layer integration. Pruning and quantization are common but rarely co-designed with device-aware scheduling or adaptive runtime control, leaving efficiency and robustness unrealized.
- 1. Phase 1: Benchmark and reporting baselines. Establish open repositories of execution graphs and benchmarking scripts spanning vision, language, graph analytics, and scientific workloads, with explicit control of model variants, batch sizes, precision, and metrics to enable fair comparisons.
- 2. Phase 2: Non-ideality-aware evaluation. Integrate device non-idealities and environmental factors into evaluation pipelines, and standardize robustness and accuracy-stability reporting under drift, variation, and fault scenarios.
- 3. Phase 3: Open compiler foundations. Develop open-source compilers and runtimes with a common intermediate representation and modular backends for DRAM and SRAM PIM as well as RRAM, PCM, and FeFET-based IMC, enabling reusable optimization passes across substrates.
- 4. Phase 4: Cross-layer adaptive co-design. Unify mapping, scheduling, and precision management with algorithm-level techniques (e.g., pruning and quantization) in a device-aware manner, and support runtime reconfiguration based on workload phase behavior and system constraints.
- 5. Phase 5: Deployment at scale. Validate frameworks on heterogeneous, production-grade systems with mixed workloads and stress tests (including adversarial and corner-case conditions), and evaluate not only throughput and energy but also robustness, accuracy stability, maintainability, and lifecycle costs.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sebastian, A.; Gallo, M.L.; Khaddam-Aljameh, R.; Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 2020, 15, 529–544.
- Ielmini, D.; Wong, H. In-memory computing with resistive switching devices. Nat. Electron. 2018, 1, 333–343.
- Zou, X.; Xu, S.; Chen, X.; Yan, L.; Han, Y. Breaking the von Neumann bottleneck: Architecture-level processing-in-memory technology. Sci. China Inf. Sci. 2021, 64, 160404.
- Mutlu, O.; Ghose, S.; Gómez-Luna, J.; Ausavarungnirun, R. Processing Data Where It Makes Sense: Enabling In-Memory Computation. Microprocess. Microsyst. 2019, 67, 28–41.
- Gupta, S.; Imani, M.; Kaur, H.; Rosing, T. NNPIM: A Processing In-Memory Architecture for Neural Network Acceleration. IEEE Trans. Comput. 2019, 68, 1325–1337.
- Long, Y.; Kim, D.; Lee, E.; Saha, P.; Mudassar, B.; She, X.; Khan, A.; Mukhopadhyay, S. A Ferroelectric FET-Based Processing-in-Memory Architecture for DNN Acceleration. IEEE J. Explor. Solid-State Comput. Devices Circuits 2019, 5, 113–122.
- Sutradhar, P.R.; Bavikadi, S.; Connolly, M.; Prajapati, S.; Indovina, M.A.; Dinakarrao, S.M.P.; Ganguly, A. Look-up-Table Based Processing-in-Memory Architecture with Programmable Precision-Scaling for Deep Learning Applications. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 263–275.
- Kim, C.H.; Lee, W.; Paik, Y.; Kwon, K.; Kim, S.; Park, I.; Kim, S.W. Silent-PIM: Realizing the Processing-in-Memory Computing with Standard Memory Requests. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 251–262.
- Gómez-Luna, J.; Hajj, I.E.; Fernandez, I.; Giannoula, C.; Oliveira, G.F.; Mutlu, O. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access 2022, 10, 52565–52608.
- Jain, S.; Ranjan, A.; Roy, K.; Raghunathan, A. Computing in Memory with Spin-Transfer Torque Magnetic RAM. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 26, 470–483.
- Gallo, M.L.; Sebastian, A.; Mathis, R.; Manica, M.; Giefers, H.; Tůma, T.; Bekas, C.; Curioni, A.; Eleftheriou, E. Mixed-precision in-memory computing. Nat. Electron. 2017, 1, 246–253.
- Soliman, T.; Chatterjee, S.; Laleni, N.; Müller, F.; Kirchner, T.; Wehn, N.; Kämpfe, T.; Chauhan, Y.; Amrouch, H. First demonstration of in-memory computing crossbar using multi-level Cell FeFET. Nat. Commun. 2023, 14, 6348.
- Lu, Z.; Wang, X.; Arafin, M.T.; Yang, H.; Liu, Z.; Zhang, J.; Qu, G. An RRAM-Based Computing-in-Memory Architecture and Its Application in Accelerating Transformer Inference. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 485–496.
- Leitersdorf, O.; Ronen, R.; Kvatinsky, S. MultPIM: Fast Stateful Multiplication for Processing-in-Memory. IEEE Trans. Circuits Syst. II Express Briefs 2021, 69, 1647–1651.
- Kim, D.; Yu, C.; Xie, S.; Chen, Y.; Kim, J.Y.; Kim, B.; Kulkarni, J.; Kim, T.T. An Overview of Processing-in-Memory Circuits for Artificial Intelligence and Machine Learning. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 338–353.
- Ali, M.; Roy, S.; Saxena, U.; Sharma, T.; Raghunathan, A.; Roy, K. Compute-in-Memory Technologies and Architectures for Deep Learning Workloads. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 1615–1630.
- Mittal, S.; Verma, G.; Kaushik, B.K.; Khanday, F.A. A survey of SRAM-based in-memory computing techniques and applications. J. Syst. Archit. 2021, 119, 102276.
- Cheng, C.; Tiw, P.J.; Cai, Y.; Yan, X.; Yang, Y.; Huang, R. In-memory computing with emerging nonvolatile memory devices. Sci. China Inf. Sci. 2021, 64, 221402.
- Ahn, J.; Yoo, S.; Mutlu, O.; Choi, K. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA); Association for Computing Machinery: New York, NY, USA, 2015; pp. 336–348.
- Mittal, S. A Survey of ReRAM-Based Architectures for Processing-In-Memory and Neural Networks. Mach. Learn. Knowl. Extr. 2018, 1, 75–114.
- Wang, F.; Li, J.; Zhang, Z.; Ding, Y.; Xiong, Y.; Hou, X.; Chen, H.; Zhou, P. Multifunctional computing-in-memory SRAM cells based on two-surface-channel MoS₂ transistors. iScience 2021, 24, 103138.
- Chen, X.; Wang, X.; Jia, X.; Yang, J.; Qu, G.; Zhao, W. Accelerating Graph-Connected Component Computation with Emerging Processing-In-Memory Architecture. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5333–5342.
- Jonatan, G.; Cho, H.; Son, H.; Wu, X.; Livesay, N.; Mora, E.; Shivdikar, K.; Abellán, J.L.; Joshi, A.; Kaeli, D.R.; et al. Scalability Limitations of Processing-in-Memory using Real System Evaluations. Proc. ACM Meas. Anal. Comput. Syst. 2024, 8, 5.
- Lin, J.; Qu, H.; Ma, S.; Ji, X.; Li, H.; Li, X.; Song, C.; Zhang, W. SongC: A Compiler for Hybrid Near-Memory and In-Memory Many-Core Architecture. IEEE Trans. Comput. 2024, 73, 2420–2433.
- Rashed, M.; Thijssen, S.; Jha, S.K.; Ewetz, R. LOGIC: Logic Synthesis for Digital In-Memory Computing. ACM Trans. Des. Autom. Electron. Syst. 2025, 30, 25.
- Zhu, Z.; Sun, H.; Xie, T.; Zhu, Y.; Dai, G.; Xia, L.; Niu, D.; Chen, X.; Hu, X.S.; Cao, Y.; et al. MNSIM 2.0: A Behavior-Level Modeling Tool for Processing-In-Memory Architectures. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4112–4125.
- Forlin, B.E.; Santos, P.C.; Becker, A.E.; Alves, M.; Carro, L. Sim2PIM: A complete simulation framework for Processing-in-Memory. J. Syst. Archit. 2022, 128, 102528.
- Perach, B.; Ronen, R.; Kimelfeld, B.; Kvatinsky, S. Understanding Bulk-Bitwise Processing In-Memory Through Database Analytics. IEEE Trans. Emerg. Top. Comput. 2022, 12, 7–22.
- Xu, S.; Chen, X.; Wang, Y.; Han, Y.; Qian, X.; Li, X. PIMSim: A Flexible and Detailed Processing-in-Memory Simulator. IEEE Comput. Archit. Lett. 2019, 18, 6–9.
- Wu, N.; Xie, Y. A Survey of Machine Learning for Computer Architecture and Systems. ACM Comput. Surv. (CSUR) 2021, 55, 54.
- Asif, N.A.; Sarker, Y.; Chakrabortty, R.; Ryan, M.J.; Ahamed, M.; Saha, D.K.; Badal, F.; Das, S.; Ali, M.; Moyeen, S.I.; et al. Graph Neural Network: A Comprehensive Review on Non-Euclidean Space. IEEE Access 2021, 9, 60588–60606.
- Bilot, T.; Madhoun, N.E.; Agha, K.A.; Zouaoui, A. A Survey on Malware Detection with Graph Representation Learning. ACM Comput. Surv. 2023, 56, 278.
- Liu, F.; Zhao, W.; Wang, Z.; Zhao, Y.; Yang, T.; Chen, Y.; Jiang, L. IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5313–5326.
- Kierner, S.; Kucharski, J.; Kierner, Z. Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review. J. Biomed. Inform. 2023, 144, 104428.
- Han, R.; John, L.; Zhan, J. Benchmarking Big Data Systems: A Review. IEEE Trans. Serv. Comput. 2018, 11, 580–597.
- Bartolomeo, S.D.; Crnovrsanin, T.; Saffo, D.; Puerta, E.; Wilson, C.; Dunne, C. Evaluating Graph Layout Algorithms: A Systematic Review of Methods and Best Practices. Comput. Graph. Forum 2024, 43, e15073.
- Meng, J.; Shim, W.; Yang, L.; Yeo, I.; Fan, D.; Yu, S.; Seo, J.W. Temperature-Resilient RRAM-Based In-Memory Computing for DNN Inference. IEEE Micro 2022, 42, 89–98.
- Wang, Y.; Qin, Y.; Liu, L.; Wei, S.; Yin, S. SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor with Dynamic Sub-Structured Weight Pruning. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 4014–4027.
- Peng, X.; Huang, S.; Jiang, H.; Lu, A.; Yu, S. DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 40, 2306–2319.
- Zhu, J.; Zhang, T.; Yang, Y.; Huang, R. A comprehensive review on emerging artificial neuromorphic devices. Appl. Phys. Rev. 2020, 7, 011312.
- Bavikadi, S.; Sutradhar, P.R.; Khasawneh, K.N.; Ganguly, A.; Dinakarrao, S.M.P. A Review of In-Memory Computing Architectures for Machine Learning Applications. In Proceedings of the 2020 on Great Lakes Symposium on VLSI; Association for Computing Machinery: New York, NY, USA, 2020.
- Mannocci, P.; Farronato, M.; Lepri, N.; Cattaneo, L.; Glukhov, A.; Sun, Z.; Ielmini, D. In-memory computing with emerging memory devices: Status and outlook. APL Mach. Learn. 2023, 1, 010902.
- Ielmini, D.; Pedretti, G. Device and Circuit Architectures for In-Memory Computing. Adv. Intell. Syst. 2020, 2, 2000040.
- Hussain, H.; Tamizharasan, P.; Rahul, C.S. Design possibilities and challenges of DNN models: A review on the perspective of end devices. Artif. Intell. Rev. 2022, 55, 5109–5167.
- Yu, S. Neuro-Inspired Computing with Emerging Nonvolatile Memorys. Proc. IEEE 2018, 106, 260–285.
- Kwak, H.; Kim, N.; Jeon, S.; Kim, S.; Woo, J. Electrochemical random-access memory: Recent advances in materials, devices, and systems towards neuromorphic computing. Nano Converg. 2024, 11, 9.
- Kingra, S.K.; Parmar, V.; Chang, C.C.; Hudec, B.; Hou, T. SLIM: Simultaneous Logic-in-Memory Computing Exploiting Bilayer Analog OxRAM Devices. Sci. Rep. 2018, 10, 2567.
- Ling, Y.; Wang, Z.; Yang, Y.; Bao, L.; Bao, S.; Wang, Q.; Cai, Y.; Huang, R. An isolated symmetrical 2T2R cell enabling high precision and high density for RRAM-based in-memory computing. Sci. China Inf. Sci. 2024, 67, 152402.
- Ren, S.; Dong, A.W.; Yang, L.; Xue, Y.B.; Li, J.; Yu, Y.; Zhou, H.; Zuo, W.B.; Li, Y.; Cheng, W.M.; et al. Self-Rectifying Memristors for Three-Dimensional In-Memory Computing. Adv. Mater. 2023, 36, 2307218.
- Boniardi, M.; Baldo, M.; Allegra, M.; Redaelli, A. Phase Change Memory: A Review on Electrical Behavior and Use in Analog In-Memory-Computing (A-IMC) Applications. Adv. Electron. Mater. 2024, 10, 2400599.
- Kim, J.; Kang, S.; Lee, S.; Ro, Y.; Lee, S.; Wang, D.; Choi, J.; So, J.; Cho, Y.; Song, J.; et al. Aquabolt-XL HBM2-PIM, LPDDR5-PIM with In-Memory Processing, and AXDIMM with Acceleration Buffer. IEEE Micro 2022, 42, 20–30.
- Azarkhish, E.; Rossi, D.; Loi, I.; Benini, L. Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes. IEEE Trans. Parallel Distrib. Syst. 2017, 29, 420–434.
- Lee, W.; Kim, C.H.; Paik, Y.; Park, J.; Park, I.; Kim, S. Design of Processing-“Inside”-Memory Optimized for DRAM Behaviors. IEEE Access 2019, 7, 82633–82648.
- Kim, J.; Lee, J.; Lee, J.; Heo, J.; Kim, J.Y. Z-PIM: A Sparsity-Aware Processing-in-Memory Architecture with Fully Variable Weight Bit-Precision for Energy-Efficient Deep Neural Networks. IEEE J. Solid-State Circuits 2021, 56, 1093–1104.
- Mamdouh, A.; Geng, H.; Niemier, M.; Hu, X.S.; Reis, D. Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025.
- Kim, B.; Lee, C.; Kim, G.; Park, E. Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization. IEEE Comput. Archit. Lett. 2025, 24, 53–56.
- Kim, C.H.; Lee, W.; Paik, Y.; Kim, S.; Kim, S.W. BL-PIM: Varying the Burst Length to Realize the All-Bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM. IEEE Access 2023, 11, 81143–81156.
- Kim, S.; Kim, S.; Cho, K.; Shin, T.; Park, H.; Lho, D.; Park, S.; Son, K.; Park, G.; Jeong, S.; et al. Signal Integrity and Computing Performance Analysis of a Processing-In-Memory of High Bandwidth Memory (PIM-HBM) Scheme. IEEE Trans. Compon. Packag. Manuf. Technol. 2021, 11, 1955–1970.
- Zhang, B.; Yin, S.; Kim, M.; Saikia, J.; Kwon, S.C.; Myung, S.; Kim, H.; Kim, S.J.; Seo, J.-S.; Seok, M. PIMCA: A Programmable In-Memory Computing Accelerator for Energy-Efficient DNN Inference. IEEE J. Solid-State Circuits 2023, 58, 1436–1449.
- Gauchi, R.; Kooli, M.; Vivet, P.; Noël, J.; Beigné, E.; Mitra, S.; Charles, H. Memory Sizing of a Scalable SRAM In-Memory Computing Tile Based Architecture. In 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC); IEEE: Piscataway, NJ, USA, 2019; pp. 166–171.
- Zhou, W.; Farmakidis, N.; Feldmann, J.; Li, X.; Tan, J.Y.S.; He, Y.; Wright, C.; Pernice, W.; Bhaskaran, H. Phase-change materials for energy-efficient photonic memory and computing. MRS Bull. 2022, 47, 502–510.
- Gallo, M.L.; Khaddam-Aljameh, R.; Stanisavljevic, M.; Vasilopoulos, A.; Kersting, B.; Dazzi, M.; Karunaratne, G.; Braendli, M.; Singh, A.; Mueller, S.; et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 2022, 6, 680–693.
- Kim, G.; Ko, D.H.; Kim, T.; Lee, S.; Jung, M.; Lee, Y.K.; Lim, S.; Jo, M.; Eom, T.; Shin, H.; et al. Power-Delay Area-Efficient Processing-In-Memory Based on Nanocrystalline Hafnia Ferroelectric Field-Effect Transistors. ACS Appl. Mater. Interfaces 2022, 15, 1463–1474.
- Kim, M.; Lee, K.; Kim, S.; Lee, J.H.; Park, B.G.; Kwon, D. Double-Gated Ferroelectric-Gate Field-Effect-Transistor for Processing in Memory. IEEE Electron Device Lett. 2021, 42, 1607–1610.
- Lee, M.; Narayan, D.M.; Kim, J.H.; Le, D.N.; Shirodkar, S.; Park, S.C.; Kang, J.; Lee, S.; Ahn, Y.; Ryu, S.W.; et al. Hafnium Oxide-Based Ferroelectric Devices for In-Memory Computing: Resistive and Capacitive Approaches. ACS Appl. Electron. Mater. 2024, 6, 5391–5401.
- Jang, Y.; Kim, D.; Kim, Y.; Park, J. Big-Computing and Little-Storing STT-MRAM PIM Architecture with Charge Domain Based MAC Operation. IEEE Trans. Comput. 2025, 74, 1239–1252.
- Kim, T.; Jang, Y.; Kang, M.; Park, B.; Lee, K.J.; Park, J. SOT-MRAM Digital PIM Architecture with Extended Parallelism in Matrix Multiplication. IEEE Trans. Comput. 2022, 71, 2816–2828.
- Li, Y.; Bai, T.; Xu, X.; Zhang, Y.; Wu, B.; Cai, H.; Pan, B.; Zhao, W. A Survey of MRAM-Centric Computing: From Near Memory to In Memory. IEEE Trans. Emerg. Top. Comput. 2023, 11, 318–330.
- Ghazal, O.; Wang, W.; Kvatinsky, S.; Merchant, F.; Yakovlev, A.; Shafik, R. IMPACT: In-Memory ComPuting Architecture based on Y-FlAsh Technology for Coalesced Tsetlin machine inference. Philos. Trans. Ser. A Math. Phys. Eng. Sci. 2025, 383, 20230393.
- He, S.; Zhong, W.; Zhu, M.; Wu, S.; Xie, W.; Ouyang, Z.; Cheng, B.; Zhao, J. In-Memory Computing with Self-Rectification and Dynamic Logical Reconfiguration of 12 Algorithms in a Single Halide Perovskites. Adv. Funct. Mater. 2025, 35, 2424114.
- Ren, S.; Xue, Y.B.; Zhang, Y.; Li, Y.; Miao, X. 3D Vertical Self-Rectifying Memristor Arrays with Split-Cell Structure, Large Nonlinearity (>10⁴) and fJ-Level Switching Energy. IEEE Electron Device Lett. 2023, 44, 2059–2062.
- Chung, K.Y.; Kim, H.; An, Y.; Seong, K.; Shin, D.H.; Baek, K.H.; Shim, Y. 8T-SRAM Based Process-In-Memory (PIM) System with Current Mirror for Accurate MAC Operation. IEEE Access 2024, 12, 24254–24261.
- Duan, C.; Yang, J.; He, X.; Qi, Y.; Wang, Y.; Wang, Y.; He, Z.; Yan, B.; Wang, X.; Jia, X.; et al. DDC-PIM: Efficient Algorithm/Architecture Co-Design for Doubling Data Capacity of SRAM-Based Processing-in-Memory. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 906–918.
- Park, M.; Hwang, J.; Kim, S.; Shin, W.; Shim, W.; Bae, J.H.; Lee, J.; Cho, S. Charge-trap synaptic device with polycrystalline silicon channel for low power in-memory computing. Sci. Rep. 2024, 14, 29089.
- Kim, S.; Um, S.; Jo, W.; Lee, J.; Ha, S.; Li, Z.; Yoo, H.J. Scaling-CIM: EDRAM In-Memory-Computing Accelerator with Dynamic-Scaling ADC and Adaptive Analog Operation. IEEE J. Solid-State Circuits 2024, 59, 2694–2705.
- Azamat, A.; Asim, F.; Kim, J.; Lee, J. Partial Sum Quantization for Reducing ADC Size in ReRAM-Based Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4897–4908.
- Xu, J.; Liu, H.; Duan, Z.; Liao, X.; Jin, H.; Yang, X.; Li, H.; Liu, C.; Mao, F.; Zhang, Y. ReHarvest: An ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators. ACM Trans. Archit. Code Optim. 2024, 21, 63.
- Amin, M.H.; Elbtity, M.E.; Zand, R. Xbar-Partitioning: A Practical Way for Parasitics and Noise Tolerance in Analog IMC Circuits. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 867–877.
- Peng, X.; Liu, R.; Yu, S. Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-In-Memory Architectures. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 67, 1333–1343.
- Gong, N.; Idé, T.; Kim, S.; Boybat, I.; Sebastian, A.; Narayanan, V.; Ando, T. Signal and noise extraction from analog memory elements for neuromorphic computing. Nat. Commun. 2018, 9, 2102.
- Han, L.; Huang, P.; Wang, Y.; Zhou, Z.; Yang, H.; Chen, Y.; Liu, X.; Kang, J. Mitigating methodology of hardware non-ideal characteristics for non-volatile memory based neural networks. Sci. China Inf. Sci. 2025, 68, 122403.
- Saragada, P.K.; Das, B.P. Process-Variation-Aware In-Memory Computation with Improved Linearity Using On-Chip Configurable Current-Steering Thermometric DAC. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 4586–4596.
- Kneip, A.; Bol, D. Impact of Analog Non-Idealities on the Design Space of 6T-SRAM Current-Domain Dot-Product Operators for In-Memory Computing. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 1931–1944.
- Kim, Y.; Kwon, Y.J.; Kim, J.; An, C.H.; Park, T.; Kwon, D.; Woo, H.C.; Kim, H.; Yoon, J.; Hwang, C. Novel Selector-Induced Current-Limiting Effect through Asymmetry Control for High-Density One-Selector–One-Resistor Crossbar Arrays. Adv. Electron. Mater. 2019, 5, 1800806.
- Park, J.H.; Kim, D.; Kang, D.; Jeon, D.; Kim, T.G. Nanoscale 3D-Stackable Ag-doped HfOx-Based Selector Devices Fabricated through Low-Temperature Hydrogen Annealing. ACS Appl. Mater. Interfaces 2019, 11, 29408–29415.
- Lin, C.Y.; Tseng, Y.T.; Chen, P.H.; Chang, T.; Eshraghian, J.; Wang, Q.; Lin, Q.; Tan, Y.F.; Tai, M.; Hung, W.C.; et al. A high-speed MIM resistive memory cell with an inherent vanadium selector. Appl. Mater. Today 2020, 21, 100848.
- Upadhyay, N.; Sun, W.; Lin, P.; Joshi, S.; Midya, R.; Zhang, X.; Wang, Z.; Jiang, H.; Yoon, J.; Rao, M.; et al. A Memristor with Low Switching Current and Voltage for 1S1R Integration and Array Operation. Adv. Electron. Mater. 2020, 6, 1901411.
- Bae, Y.C.; Lee, A.R.; Baek, G.; Chung, J.B.; Kim, T.Y.; Park, J.G.; Hong, J.P. All oxide semiconductor-based bidirectional vertical p-n-p selectors for 3D stackable crossbar-array electronics. Sci. Rep. 2015, 5, 13362.
- Sun, L.; Zhang, Y.; Han, G.; Hwang, G.; Jiang, J.; Joo, B.; Watanabe, K.; Taniguchi, T.; Kim, Y.M.; Yu, W.; et al. Self-selective van der Waals heterostructures for large scale memory array. Nat. Commun. 2019, 10, 3161.
- Luo, Q.; Xu, X.; Liu, H.; Lv, H.; Gong, T.; Long, S.; Liu, Q.; Sun, H.; Banerjee, W.; Li, L.; et al. Super non-linear RRAM with ultra-low power for 3D vertical nano-crossbar arrays. Nanoscale 2016, 8, 15629–15636.
- Rao, M.; Song, W.; Kiani, F.; Asapu, S.; Zhuo, Y.; Midya, R.; Upadhyay, N.; Wu, Q.; Barnell, M.D.; Lin, P.; et al. Timing Selector: Using Transient Switching Dynamics to Solve the Sneak Path Issue of Crossbar Arrays. Small Sci. 2021, 2, 2100072.
- Li, S.; Pam, M.; Li, Y.; Chen, L.; Chien, Y.C.; Fong, X.; Chi, D.; Ang, K. Wafer-Scale 2D Hafnium Diselenide Based Memristor Crossbar Array for Energy-Efficient Neural Network Hardware. Adv. Mater. 2021, 34, 2103376.
- Jain, S.; Li, S.; Zheng, H.; Li, L.; Fong, X.; Ang, K. Heterogeneous integration of 2D memristor arrays and silicon selectors for compute-in-memory hardware in convolutional neural networks. Nat. Commun. 2025, 16, 2719.
- Filipovich, M.; Guo, Z.; Al-Qadasi, M.; Marquez, B.A.; Morison, H.; Sorger, V.; Prucnal, P.; Shekhar, S.; Shastri, B. Silicon Photonic Architecture for Training Deep Neural Networks with Direct Feedback Alignment. Optica 2021, 9, 1323–1332.
- Feldmann, J.; Youngblood, N.; Karpov, M.; Gehring, H.; Li, X.; Stappers, M.; Gallo, M.L.; Fu, X.; Lukashchuk, A.; Raja, A.; et al. Parallel convolutional processing using an integrated photonic tensor core. Nature 2021, 589, 52–58.
- Totović, A.; Dabos, G.; Passalis, N.; Tefas, A.; Pleros, N. Femtojoule per MAC Neuromorphic Photonics: An Energy and Technology Roadmap. IEEE J. Sel. Top. Quantum Electron. 2020, 26, 1–15.
- Nahmias, M.; de Lima, T.F.; Tait, A.; Peng, H.T.; Shastri, B.; Prucnal, P. Photonic Multiply-Accumulate Operations for Neural Networks. IEEE J. Sel. Top. Quantum Electron. 2020, 26, 8800115.
- Jain, S.; Sengupta, A.; Roy, K.; Raghunathan, A. RxNN: A Framework for Evaluating Deep Neural Networks on Resistive Crossbars. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 40, 326–338.
- Burrello, A.; Garofalo, A.; Bruschi, N.; Tagliavini, G.; Rossi, D.; Conti, F. DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs. IEEE Trans. Comput. 2020, 70, 1253–1268.
- Wang, Y.; Fong, X. Benchmarking DNN Mapping Methods for the in-Memory Computing Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2023, 13, 1040–1051. [Google Scholar] [CrossRef]
- Kwon, H.; Chatarasi, P.; Sarkar, V.; Krishna, T.; Pellauer, M.; Parashar, A. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. IEEE Micro 2020, 40, 20–29. [Google Scholar] [CrossRef]
- Wang, Y.; Zhao, Z.; Jin, X.; Zheng, H.; Nie, M.; Zou, Q.; Shi, C. AutoMap: Automatic Mapping of Neural Networks to Deep Learning Accelerators for Edge Devices. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 2994–3006. [Google Scholar] [CrossRef]
- Kim, S.; Lee, J.; Paik, Y.; Kim, C.H.; Lee, W.; Kim, S.W. Optimal Model Partitioning with Low-Overhead Profiling on the PIM-based Platform for Deep Learning Inference. ACM Trans. Des. Autom. Electron. Syst. 2023, 29, 28. [Google Scholar] [CrossRef]
- Wang, J.; Ge, M.; Ding, B.; Xu, Q.; Chen, S.; Kang, Y. NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3-D Stacked-DRAM. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 1456–1469. [Google Scholar] [CrossRef]
- Sun, X.; Wang, X.; Li, W.; Han, Y.; Chen, X. PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 44, 1745–1759. [Google Scholar] [CrossRef]
- Wang, X.; Zhou, M.; Rosing, T.S. Fast-OverlaPIM: A Fast Overlap-Driven Mapping Framework for Processing In-Memory Neural Network Acceleration. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 44, 130–143. [Google Scholar] [CrossRef]
- Dai, P.; Han, B.; Li, K.; Xu, X.; Xing, H.; Liu, K. Joint Optimization of Device Placement and Model Partitioning for Cooperative DNN Inference in Heterogeneous Edge Computing. IEEE Trans. Mob. Comput. 2025, 24, 210–226. [Google Scholar] [CrossRef]
- Jun, H.; Kim, T.; Kim, S.C.; Eom, Y.I. A Hierarchical Dispatcher for Scheduling Multiple Deep Neural Networks (DNNs) on Edge Devices. Sensors 2025, 25, 2243. [Google Scholar] [CrossRef] [PubMed]
- Ogbogu, C.; Narang, G.; Joardar, B.K.; Doppa, J.; Chakrabarty, K.; Pande, P. HuNT: Exploiting Heterogeneous PIM Devices to Design a 3-D Manycore Architecture for DNN Training. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3300–3311. [Google Scholar] [CrossRef]
- Zou, X.; Chen, C.; Lin, P.; Zhang, L.; Xu, Y.; Zhang, W. Scalable Heterogeneous Scheduling Based Model Parallelism for Real-Time Inference of Large-Scale Deep Neural Networks. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2962–2973. [Google Scholar] [CrossRef]
- Krishnan, G.; Wang, Z.; Yeo, I.; Meng, J.; Liehr, M.; Joshi, R.; Cady, N.; Fan, D.; Seo, J.-S.; Cao, Y. Hybrid RRAM/SRAM in-Memory Computing for Robust DNN Acceleration. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 4241–4252. [Google Scholar] [CrossRef]
- Huang, B.; Huang, X.; Liu, X.; Ding, C.; Yin, Y.; Deng, S. Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment. Comput. Commun. 2023, 215, 169–179. [Google Scholar] [CrossRef]
- Bai, C.; Wei, X.; Zhuo, Y.; Cai, Y.; Zheng, H.; Yu, B.; Xie, Y. Klotski v2: Improved DNN Model Orchestration Framework for Dataflow Architecture Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025, 44, 1045–1058. [Google Scholar] [CrossRef]
- Mirmahaleh, S.Y.H.; Reshadi, M.; Bagherzadeh, N.; Khademzadeh, A. Data scheduling and placement in deep learning accelerator. Clust. Comput. 2021, 24, 3651–3669. [Google Scholar] [CrossRef]
- Shi, L.; Xu, Z.; Sun, Y.; Shi, Y.; Fan, Y.; Ding, X. A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system. Peer-To-Peer Netw. Appl. 2021, 14, 4031–4045. [Google Scholar] [CrossRef]
- Li, J.; Liang, W.; Li, Y.; Xu, Z.; Jia, X.; Guo, S. Throughput Maximization of Delay-Aware DNN Inference in Edge Computing by Exploring DNN Model Partitioning and Inference Parallelism. IEEE Trans. Mob. Comput. 2023, 22, 3017–3030. [Google Scholar] [CrossRef]
- Chen, Z.; Hu, J.; Chen, X.; Hu, J.; Zheng, X.; Min, G. Computation Offloading and Task Scheduling for DNN-Based Applications in Cloud-Edge Computing. IEEE Access 2020, 8, 115537–115547. [Google Scholar] [CrossRef]
- Zhang, J.; Niu, G.; Dai, Q.; Li, H.; Wu, Z.; Dong, F.; Wu, Z. PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters. Neurocomputing 2023, 555, 126661. [Google Scholar] [CrossRef]
- Chen, X.; Zhang, J.; Lin, B.; Chen, Z.; Wolter, K.; Min, G. Energy-Efficient Offloading for DNN-Based Smart IoT Systems in Cloud-Edge Environments. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 683–697. [Google Scholar] [CrossRef]
- Sahoo, R.M.; Padhy, S. A novel algorithm for priority-based task scheduling on a multiprocessor heterogeneous system. Microprocess. Microsyst. 2022, 95, 104685. [Google Scholar] [CrossRef]
- Li, Y.; Ge, X.; Lei, B.; Zhang, X.; Wang, W. Joint Task Partitioning and Parallel Scheduling in Device-Assisted Mobile Edge Networks. IEEE Internet Things J. 2023, 11, 14058–14075. [Google Scholar] [CrossRef]
- Dong, F.; Wang, H.; Shen, D.; Huang, Z.; He, Q.; Zhang, J.; Wen, L.; Zhang, T. Multi-Exit DNN Inference Acceleration Based on Multi-Dimensional Optimization for Edge Intelligence. IEEE Trans. Mob. Comput. 2023, 22, 5389–5405. [Google Scholar] [CrossRef]
- Li, H.; Li, X.; Fan, Q.; He, Q.; Wang, X.; Leung, V.C.M. Distributed DNN Inference with Fine-Grained Model Partitioning in Mobile Edge Computing Networks. IEEE Trans. Mob. Comput. 2024, 23, 9060–9074. [Google Scholar] [CrossRef]
- Zhang, J.; Ma, S.; Yan, Z.; Huang, J. Joint DNN partitioning and task offloading in mobile edge computing via deep reinforcement learning. J. Cloud Comput. 2023, 12, 116. [Google Scholar] [CrossRef]
- Su, Y.; Fan, W.; Gao, L.; Qiao, L.; Liu, Y.; Wu, F. Joint DNN Partition and Resource Allocation Optimization for Energy-Constrained Hierarchical Edge-Cloud Systems. IEEE Trans. Veh. Technol. 2023, 72, 3930–3944. [Google Scholar] [CrossRef]
- Yuan, S.; Zhang, Z.; Li, Q.; Li, W.; Zhang, Y. Joint Optimization of DNN Partition and Continuous Task Scheduling for Digital Twin-Aided MEC Network with Deep Reinforcement Learning. IEEE Access 2023, 11, 27099–27110. [Google Scholar] [CrossRef]
- Zhang, S.; Li, Y.; Liu, X.; Guo, S.; Wang, W.; Wang, J.; Ding, B.; Wu, D. Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2020, 4, 69. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, J.J.; Wang, H.Z.; Liu, S.; Wu, Y.; Hu, S.; Yu, Q.; Liu, Z.; Chen, T.; Yin, Y.; et al. Braille recognition by E-skin system based on binary memristive neural network. Sci. Rep. 2023, 13, 5437. [Google Scholar] [CrossRef]
- Zeng, S.; Dai, G.; Zhang, N.; Yang, X.; Zhang, H.; Zhu, Z.; Yang, H.; Wang, Y. Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective. IEEE Trans. Comput. 2023, 72, 1314–1328. [Google Scholar] [CrossRef]
- Mei, L.; Houshmand, P.; Jain, V.; Giraldo, S.; Verhelst, M. ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Trans. Comput. 2021, 70, 1160–1174. [Google Scholar] [CrossRef]
- Heidari, S.; Ghasemi, M.; Kim, Y.G.; Wu, C.J.; Vrudhula, S. CAMDNN: Content-Aware Mapping of a Network of Deep Neural Networks on Edge MPSoCs. IEEE Trans. Comput. 2022, 71, 3191–3202. [Google Scholar] [CrossRef]
- Ren, W.; Qu, Y.; Dong, C.; Jing, Y.; Wu, Q.; Guo, S. A Survey on Collaborative DNN Inference for Edge Intelligence. Mach. Intell. Res. 2023, 20, 370–395. [Google Scholar] [CrossRef]
- Khaledian, N.; Voelp, M.; Azizi, S.; Shirvani, M.H. AI-based & heuristic workflow scheduling in cloud and fog computing: A systematic review. Clust. Comput. 2024, 27, 10265–10298. [Google Scholar] [CrossRef]
- Park, J.; Sung, H. XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory. IEEE Comput. Archit. Lett. 2023, 22, 61–64. [Google Scholar] [CrossRef]
- Akbari, M.; Rashidi, H. A multi-objectives scheduling algorithm based on cuckoo optimization for task allocation problem at compile time in heterogeneous systems. Expert Syst. Appl. 2016, 60, 234–248. [Google Scholar] [CrossRef]
- Zhang, Z.; Kouzani, A. Implementation of DNNs on IoT devices. Neural Comput. Appl. 2019, 32, 1327–1356. [Google Scholar] [CrossRef]
- Al-Maytami, B.A.; Fan, P.; Hussain, A.; Baker, T.; Liatsist, P. A Task Scheduling Algorithm with Improved Makespan Based on Prediction of Tasks Computation Time algorithm for Cloud Computing. IEEE Access 2019, 7, 160916–160926. [Google Scholar] [CrossRef]
- Chen, H.; Zhang, Z.; Chen, P.; Luo, X.; Li, S.; Liu, W. MARCO: A High-performance Task Mapping and Routing Co-optimization Framework for Point-to-Point NoC-based Heterogeneous Computing Systems. ACM Trans. Embed. Comput. Syst. (TECS) 2021, 20, 54. [Google Scholar] [CrossRef]
- Houssein, E.H.; Gad, A.G.; Wazery, Y.; Suganthan, P. Task Scheduling in Cloud Computing based on Meta-heuristics: Review, Taxonomy, Open Challenges, and Future Trends. Swarm Evol. Comput. 2021, 62, 100841. [Google Scholar] [CrossRef]
- Kojima, T.; Ohwada, A.; Amano, H. Mapping-Aware Kernel Partitioning Method for CGRAs Assisted by Deep Learning. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 1213–1230. [Google Scholar] [CrossRef]
- Wang, Z.; Zhao, W.; Pu, Y.; Chen, L.; Thong, W.W.; Sheng, W.; Ho, T.Y.; Yu, B. ParSGCN: Bridging the Gap Between Emulation Partitioning and Scheduling. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025, 44, 1180–1192. [Google Scholar] [CrossRef]
- Li, J.; Zhang, X.; Wei, J.; Ji, Z.; Wei, Z. GARLSched: Generative adversarial deep reinforcement learning task scheduling optimization for large-scale high performance computing systems. Future Gener. Comput. Syst. 2022, 135, 259–269. [Google Scholar] [CrossRef]
- Zhang, B.; Zeng, H.; Prasanna, V. GraphAGILE: An FPGA-Based Overlay Accelerator for Low-Latency GNN Inference. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 2580–2597. [Google Scholar] [CrossRef]
- Sharif, H.; Srivastava, P.; Huzaifa, M.; Kotsifakou, M.; Joshi, K.; Sarita, Y.; Zhao, N.; Adve, V.S.; Misailovic, S.; Adve, S. ApproxHPVM: A portable compiler IR for accuracy-aware optimizations. Proc. ACM Program. Lang. 2019, 3, 186. [Google Scholar] [CrossRef]
- Hamdi, M.A.; Daghero, F.; Sarda, G.M.; Delm, J.V.; Symons, A.; Benini, L.; Verhelst, M.; Pagliari, D.J.; Burrello, A. MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices. arXiv 2024, arXiv:2410.08855. [Google Scholar] [CrossRef]
- Anderson, L.; Adams, A.; Ma, K.; Li, T.M.; Jin, T.; Ragan-Kelley, J. Efficient automatic scheduling of imaging and vision pipelines for the GPU. Proc. ACM Program. Lang. 2020, 5, 109. [Google Scholar] [CrossRef]
- Lin, C.; Chen, Z.; Zhang, Z.; Liu, J. TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU. IEEE Trans. Parallel Distrib. Syst. 2025, 36, 266–281. [Google Scholar] [CrossRef]
- Xiao, Y.; Nazarian, S.; Bogdan, P. Self-Optimizing and Self-Programming Computing Systems: A Combined Compiler, Complex Networks, and Machine Learning Approach. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1416–1427. [Google Scholar] [CrossRef]
- Schmitz, A.; Burak, S.; Miller, J.; Müller, M.S. Parallel Pattern Compiler for Automatic Global Optimizations. Parallel Comput. 2024, 122, 103112. [Google Scholar] [CrossRef]
- Ma, Z.; Jin, Y.; Tang, S.; Wang, H.; Xue, W.; Zhai, J.D.; Zheng, W.M. Unified Programming Models for Heterogeneous High-Performance Computers. J. Comput. Sci. Technol. 2023, 38, 211–218. [Google Scholar] [CrossRef]
- De Andrade, H.S.; Schroeder, J.; Crnkovic, I. Software Deployment on Heterogeneous Platforms: A Systematic Mapping Study. IEEE Trans. Softw. Eng. 2019, 47, 1683–1707. [Google Scholar] [CrossRef]
- Liu, S.; Guo, B.; Fang, C.; Wang, Z.; Luo, S.; Zhou, Z.; Yu, Z. Enabling Resource-Efficient AIoT System with Cross-Level Optimization: A Survey. IEEE Commun. Surv. Tutor. 2023, 26, 389–427. [Google Scholar] [CrossRef]
- Kojima, T.; Doan, N.; Amano, H. GenMap: A Genetic Algorithmic Approach for Optimizing Spatial Mapping of Coarse-Grained Reconfigurable Architectures. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2383–2396. [Google Scholar] [CrossRef]
- Pudi, D.; Malviya, S.; Boppu, S.; Yang, Y.; Hemani, A.; Cenkeramaddi, L.R. Integer Linear Programming-Based Simultaneous Scheduling and Binding for SiLago Framework. IEEE Access 2024, 12, 124081–124094. [Google Scholar] [CrossRef]
- Tang, Z.; Jia, W.; Zhou, X.; Yang, W.; You, Y. Representation and Reinforcement Learning for Task Scheduling in Edge Computing. IEEE Trans. Big Data 2020, 8, 795–808. [Google Scholar] [CrossRef]
- Wang, C.; Yu, X.; Xu, L.; Wang, W. Energy-Efficient Task Scheduling Based on Traffic Mapping in Heterogeneous Mobile-Edge Computing: A Green IoT Perspective. IEEE Trans. Green Commun. Netw. 2023, 7, 972–982. [Google Scholar] [CrossRef]
- Krishnakumar, A.; Arda, S.E.; Goksoy, A.; Mandal, S.K.; Ogras, U.; Sartor, A.L.; Marculescu, R. Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 4064–4077. [Google Scholar] [CrossRef]
- Sulaiman, M.; Halim, Z.; Lebbah, M.; Waqas, M.; Tu, S. An Evolutionary Computing-Based Efficient Hybrid Task Scheduling Approach for Heterogeneous Computing Environment. J. Grid Comput. 2021, 19, 11. [Google Scholar] [CrossRef]
- Mu, S.; Zeng, Y.; Wang, B. Routability-Enhanced Scheduling for Application Mapping on CGRAs. IEEE Access 2021, 9, 92358–92366. [Google Scholar] [CrossRef]
- Tirelli, C.; Sapriza, J.; Álvarez, R.R.; Ferretti, L.; Denkinger, B.; Ansaloni, G.; Calero, J.A.M.; Atienza, D.; Pozzi, L. SAT-Based Exact Modulo Scheduling Mapping for Resource-Constrained CGRAs. ACM J. Emerg. Technol. Comput. Syst. 2024, 20, 8. [Google Scholar] [CrossRef]
- Liu, W.; Gu, Z.; Xu, J.; Wu, X.; Ye, Y. Satisfiability Modulo Graph Theory for Task Mapping and Scheduling on Multiprocessor Systems. IEEE Trans. Parallel Distrib. Syst. 2011, 22, 1382–1389. [Google Scholar] [CrossRef]
- Elaziz, M.A.; Attiya, I. An improved Henry gas solubility optimization algorithm for task scheduling in cloud computing. Artif. Intell. Rev. 2020, 54, 3599–3637. [Google Scholar] [CrossRef]
- Dave, S.; Balasubramanian, M.; Shrivastava, A. RAMP: Resource-Aware Mapping for CGRAs. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC); Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Balasubramanian, M.; Shrivastava, A. CRIMSON: Compute-Intensive Loop Acceleration by Randomized Iterative Modulo Scheduling and Optimized Mapping on CGRAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 3300–3310. [Google Scholar] [CrossRef]
- Hamzeh, M.; Shrivastava, A.; Vrudhula, S. REGIMap: Register-aware application mapping on Coarse-Grained Reconfigurable Architectures (CGRAs). In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC); Association for Computing Machinery: New York, NY, USA, 2013; pp. 1–10. [Google Scholar] [CrossRef]
- Li, B.; Doppa, J.; Pande, P.; Chakrabarty, K.; Qiu, J.X.; Li, H. 3D-ReG. ACM J. Emerg. Technol. Comput. Syst. (JETC) 2020, 16, 20. [Google Scholar] [CrossRef]
- Gu, P.; Xie, X.; Li, S.; Niu, D.; Zheng, H.; Malladi, K.T.; Xie, Y. DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 40, 1586–1599. [Google Scholar] [CrossRef]
- Dash, S.; Luo, Y.; Lu, A.; Yu, S.; Mukhopadhyay, S. Robust Processing-In-Memory with Multibit ReRAM Using Hessian-Driven Mixed-Precision Computation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 1006–1019. [Google Scholar] [CrossRef]
- Roy, S.; Ali, M.; Raghunathan, A. PIM-DRAM: Accelerating Machine Learning Workloads Using Processing in Commodity DRAM. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 701–710. [Google Scholar] [CrossRef]
- Sun, H.; Shen, J.; Zhang, T.; Tang, Z.; Zhang, C.; Li, Y.; Shi, Y.; Liu, H. FAMS: A FrAmework of Memory-Centric Mapping for DNNs on Systolic Array Accelerators. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2025, 33, 976–989. [Google Scholar] [CrossRef]
- Li, B.; Qu, S.; Wang, Y. An Automated Quantization Framework for High-Utilization RRAM-Based PIM. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 583–596. [Google Scholar] [CrossRef]
- Qu, S.; Li, B.; Wang, Y.; Xu, D.; Zhao, X.; Zhang, L. RaQu: An automatic high-utilization CNN quantization and mapping framework for general-purpose RRAM Accelerator. In 2020 57th ACM/IEEE Design Automation Conference (DAC); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar] [CrossRef]
- Zhang, Y.; Jia, Z.; Du, H.; Xue, R.; Shen, Z.; Shao, Z. A Practical Highly Paralleled ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 922–935. [Google Scholar] [CrossRef]
- Zhang, Y.; Wang, X.; Jiang, X.; Yang, Y.; Shen, Z.; Jia, Z. PQ-PIM: A pruning-quantization joint optimization framework for ReRAM-based processing-in-memory DNN accelerator. J. Syst. Archit. 2022, 127, 102531. [Google Scholar] [CrossRef]
- Sun, H.; Zhu, Z.; Wang, C.; Ning, X.; Dai, G.; Yang, H.; Wang, Y. Gibbon: An Efficient Co-Exploration Framework of NN Model and Processing-In-Memory Architecture. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4075–4089. [Google Scholar] [CrossRef]
- Wu, X.; Hanson, E.; Wang, N.; Zheng, Q.; Yang, X.; Yang, H.; Li, S.; Cheng, F.; Pande, P.; Doppa, J.; et al. Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-Based DNN Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 4558–4571. [Google Scholar] [CrossRef]
- Gao, X.; Wang, H.; Chen, Y.; Zhang, Y.; Shen, Z.; Ju, L. Static Scheduling of Weight Programming for DNN Acceleration with Resource Constrained PIM. ACM Trans. Embed. Comput. Syst. 2023, 23, 89. [Google Scholar] [CrossRef]
- Zhang, J.; Wang, X.; Ye, Y.; Lyu, D.; Xiong, G.; Xu, N.; Lian, Y.; He, G. M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 1864–1877. [Google Scholar] [CrossRef]
- Dorostkar, A.; Farbeh, H.; Zarandi, H.R. An Empirical Fault Vulnerability Exploration of ReRAM-Based Process-in-Memory CNN Accelerators. IEEE Trans. Reliab. 2025, 74, 2290–2304. [Google Scholar] [CrossRef]
- Wang, J.; Du, H.; Ding, B.; Xu, Q.; Chen, S.; Kang, Y. DDAM: Data Distribution-Aware Mapping of CNNs on Processing-In-Memory Systems. ACM Trans. Des. Autom. Electron. Syst. 2022, 28, 36. [Google Scholar] [CrossRef]
- Li, C.; Zhou, Z.; Wang, Y.; Yang, F.; Cao, T.; Yang, M.; Liang, Y.; Sun, G. PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; Association for Computing Machinery: New York, NY, USA, 2024; Volume 2, pp. 879–896. [Google Scholar] [CrossRef]
- Rhe, J.; Jeon, K.E.; Lee, J.; Jeong, S.; Ko, J.H. KERNTROL: Kernel Shape Control Toward Ultimate Memory Utilization for In-Memory Convolutional Weight Mapping. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 6138–6151. [Google Scholar] [CrossRef]
- Lee, Y.S.; Han, T. Task Parallelism-Aware Deep Neural Network Scheduling on Multiple Hybrid Memory Cube-Based Processing-in-Memory. IEEE Access 2021, 9, 68561–68572. [Google Scholar] [CrossRef]
- Giannoula, C.; Yang, P.; Fernandez, I.; Yang, J.; Durvasula, S.; Li, Y.X.; Sadrosadati, M.; Luna, J.G.; Mutlu, O.; Pekhimenko, G. PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proc. ACM Meas. Anal. Comput. Syst. 2024, 8, 43. [Google Scholar] [CrossRef]
- Liu, F.; Zhao, W.; Wang, Z.; Chen, Y.; Liang, X.; Jiang, L. ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator with Fine-Grained Bit-Level Sparsity. IEEE Trans. Comput. 2024, 73, 2320–2334. [Google Scholar] [CrossRef]
- Dhilleswararao, P.; Boppu, S.; Manikandan, M.; Cenkeramaddi, L.R. Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 2022, 10, 131788–131828. [Google Scholar] [CrossRef]
- Han, L.; Pan, R.; Zhou, Z.; Lu, H.; Chen, Y.; Yang, H.; Huang, P.; Sun, G.; Liu, X.; Kang, J. CoMN: Algorithm-Hardware Co-Design Platform for Nonvolatile Memory-Based Convolutional Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 2043–2056. [Google Scholar] [CrossRef]
- Krishnan, G.; Mandal, S.K.; Pannala, M.; Chakrabarti, C.; Seo, J.-S.; Ogras, Ü.Y.; Cao, Y. SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks. ACM Trans. Embed. Comput. Syst. (TECS) 2021, 20, 68. [Google Scholar] [CrossRef]
- Rakka, M.; Fouda, M.; Khargonekar, P.P.; Kurdahi, F.J. A Review of State-of-the-art Mixed-Precision Neural Network Frameworks. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7793–7812. [Google Scholar] [CrossRef] [PubMed]
- Liu, F.; Li, H.; Hu, W.; He, Y. Review of neural network model acceleration techniques based on FPGA platforms. Neurocomputing 2024, 610, 128511. [Google Scholar] [CrossRef]
- Prasad, N.S.; Sundar, S. Comprehensive Review on the Exploitation of Advanced Memory Optimization Strategies to Improve Performance for Convolutional and Spiking Neural Networks in Medical Imaging Using Hardware Accelerators. IEEE Access 2025, 13, 62449–62461. [Google Scholar] [CrossRef]
- Woźniak, S.; Pantazi, A.; Bohnstingl, T.; Eleftheriou, E. Deep learning incorporating biologically inspired neural dynamics and in-memory computing. Nat. Mach. Intell. 2020, 2, 325–336. [Google Scholar] [CrossRef]
- Xing, Y.; Liang, S.; Sui, L.; Jia, X.; Qiu, J.; Liu, X.; Wang, Y.; Wang, Y.; Shan, Y. DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-Based CNN Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 39, 2668–2681. [Google Scholar] [CrossRef]
- Yin, S.; Sun, X.; Yu, S.; Seo, J.-S. High-Throughput In-Memory Computing for Binary Deep Neural Networks with Monolithically Integrated RRAM and 90-nm CMOS. IEEE Trans. Electron Devices 2020, 67, 4185–4192. [Google Scholar] [CrossRef]
- Guo, K.; Han, S.; Yao, S.; Wang, Y.; Xie, Y.; Yang, H. Software-Hardware Codesign for Efficient Neural Network Acceleration. IEEE Micro 2017, 37, 18–25. [Google Scholar] [CrossRef]
- Yu, R.; Wang, Z.; Liu, Q.; Gao, B.; Hao, Z.; Guo, T.; Ding, S.; Zhang, J.; Qin, Q.; Wu, D.; et al. A full-stack memristor-based computation-in-memory system with software-hardware co-development. Nat. Commun. 2025, 16, 2123. [Google Scholar] [CrossRef]
- Mackin, C.; Rasch, M.; Chen, A.; Timcheck, J.; Bruce, R.L.; Li, N.; Narayanan, P.; Ambrogio, S.; Gallo, M.L.; Nandakumar, S.; et al. Optimised weight programming for analogue memory-based deep neural networks. Nat. Commun. 2022, 13, 3765. [Google Scholar] [CrossRef]
- Antolini, A.; Paolino, C.; Zavalloni, F.; Lico, A.; Scarselli, E.F.; Mangia, M.; Pareschi, F.; Setti, G.; Rovatti, R.; Torres, M.L.; et al. Combined HW/SW Drift and Variability Mitigation for PCM-Based Analog In-Memory Computing for Neural Network Applications. IEEE J. Emerg. Sel. Top. Circuits Syst. 2023, 13, 395–407. [Google Scholar] [CrossRef]
- Sze, V.; Chen, Y.-H.; Yang, T.-J.; Emer, J. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329. [Google Scholar] [CrossRef]
- Garofalo, A.; Ottavi, G.; Conti, F.; Karunaratne, G.; Boybat, I.; Benini, L.; Rossi, D. A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 422–435. [Google Scholar] [CrossRef]
- Oliveira, G.F.; Gómez-Luna, J.; Ghose, S.; Boroumand, A.; Mutlu, O. Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud. IEEE Micro 2022, 42, 25–38. [Google Scholar] [CrossRef]
- Zheng, Q.; Li, X.; Guan, Y.; Wang, Z.; Cai, Y.; Chen, Y.; Sun, G.; Huang, R. PIMulator-NN: An Event-Driven, Cross-Level Simulation Framework for Processing-In-Memory-Based Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5464–5475. [Google Scholar] [CrossRef]
- Jeon, W.; Lee, J.; Kang, D.; Kal, H.; Ro, W. PIMCaffe: Functional Evaluation of a Machine Learning Framework for In-Memory Neural Processing Unit. IEEE Access 2021, 9, 96629–96640. [Google Scholar] [CrossRef]
- Velasco-Montero, D.; Fernández-Berni, J.; Carmona-Galán, R.; Rodríguez-Vázquez, Á. Optimum Selection of DNN Model and Framework for Edge Inference. IEEE Access 2018, 6, 51680–51692. [Google Scholar] [CrossRef]
- Wess, M.; Schnöll, D.; Dallinger, D.; Bittner, M.; Jantsch, A. Conformal Prediction Based Confidence for Latency Estimation of DNN Accelerators: A Black-Box Approach. IEEE Access 2024, 12, 109847–109860. [Google Scholar] [CrossRef]
- Krishnan, G.; Sun, J.; Hazra, J.; Du, X.; Liehr, M.; Li, Z.; Beckmann, K.; Joshi, R.; Cady, N.; Fan, D.; et al. Exploring Model Stability of Deep Neural Networks for Reliable RRAM-Based In-Memory Acceleration. IEEE Trans. Comput. 2022, 71, 2740–2752. [Google Scholar] [CrossRef]
- Moitra, A.; Bhattacharjee, A.; Kuang, R.; Krishnan, G.; Cao, Y.; Panda, P. SpikeSim: An End-to-End Compute-in-Memory Hardware Evaluation Tool for Benchmarking Spiking Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 42, 3815–3828. [Google Scholar] [CrossRef]
- Geirhos, R.; Jacobsen, J.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673. [Google Scholar] [CrossRef]
- Lones, M. Avoiding common machine learning pitfalls. Patterns 2021, 5, 101046. [Google Scholar] [CrossRef] [PubMed]
- Maleki, F.; Ovens, K.; Gupta, R.; Reinhold, C.; Spatz, A.; Forghani, R. Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls. Radiol. Artif. Intell. 2022, 5, e220028. [Google Scholar] [CrossRef] [PubMed]
- Yan, Z.; Hu, X.; Shi, Y. Compute-in-Memory-Based Neural Network Accelerators for Safety-Critical Systems: Worst-Case Scenarios and Protections. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 2452–2464. [Google Scholar] [CrossRef]
- Shapira, G.; Chen, Y. Common Pitfalls of Benchmarking Big Data Systems. IEEE Trans. Serv. Comput. 2016, 9, 152–160. [Google Scholar] [CrossRef]
- Smagulova, K.; Fouda, M.; Kurdahi, F.; Salama, K.; Eltawil, A. Resistive Neural Hardware Accelerators. Proc. IEEE 2021, 111, 500–527. [Google Scholar] [CrossRef]
- Kriegel, H.; Schubert, E.; Zimek, A. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst. 2016, 52, 341–378. [Google Scholar] [CrossRef]
- Meng, J.; Yang, L.; Peng, X.; Yu, S.; Fan, D.; Seo, J.-S. Structured Pruning of RRAM Crossbars for Efficient In-Memory Computing Acceleration of Deep Neural Networks. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 1576–1580. [Google Scholar] [CrossRef]
- Camus, V.; Mei, L.; Enz, C.; Verhelst, M. Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 697–711. [Google Scholar] [CrossRef]
- Zhou, Y.; Dong, H.; Saddik, A.E. Deep Learning in Next-Frame Prediction: A Benchmark Review. IEEE Access 2020, 8, 69273–69283.
- Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep Learning Approach for Intelligent Intrusion Detection System. IEEE Access 2019, 7, 41525–41550.
- Lee, K.; Eo, M.; Jung, E.; Yoon, Y.; Rhee, W. Short-Term Traffic Prediction with Deep Neural Networks: A Survey. IEEE Access 2020, 9, 54739–54756.
- Cui, X.; Zheng, S.; Jia, T.; Ye, L.; Liang, Y. ARES: A Mapping Framework of DNNs Towards Diverse PIMs with General Abstractions. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD); IEEE: Piscataway, NJ, USA, 2023; pp. 1–9.
- Lyu, B.; Wang, S.; Wen, S.; Shi, K.; Yang, Y.; Zeng, L.; Huang, T. AutoGMap: Learning to Map Large-Scale Sparse Graphs on Memristive Crossbars. IEEE Trans. Neural Netw. Learn. Syst. 2021, 35, 12888–12898.
- Chen, X. Instruction Set Architecture (ISA) for Processing-in-Memory DNN Accelerators. arXiv 2023, arXiv:2308.06449.
- Negi, S.; Chakraborty, I.; Ankit, A.; Roy, K. NAX: Neural architecture and memristive xbar based accelerator co-design. In Proceedings of the 59th ACM/IEEE Design Automation Conference; Association for Computing Machinery: New York, NY, USA, 2022; pp. 451–456.
- Cao, W.; Zhao, Y.; Boloor, A.; Han, Y.; Zhang, X.; Jiang, L. Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals. IEEE Trans. Comput. 2022, 71, 2142–2155.
- Ghosh, S.K.; Raha, A.; Raghunathan, V. Energy-Efficient Approximate Edge Inference Systems. ACM Trans. Embed. Comput. Syst. 2023, 22, 77.
- Noh, S.H.; Lee, S.; Shin, B.; Park, S.; Jang, Y.; Kung, J. All-rounder: A flexible DNN accelerator with diverse data format support. arXiv 2023, arXiv:2310.16757.
- Zhang, X.; Ye, H.; Wang, J.; Lin, Y.; Xiong, J.; Hwu, W.-m.; Chen, D. DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator. In Proceedings of the 2020 IEEE/ACM International Conference on Computer Aided Design (ICCAD); Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–9.
- Wang, Z.; Sun, G.; Zhu, J.; Zhou, Z.; Guo, Y.; Yuan, Z. METRO: A Software-Hardware Co-Design of Interconnections for Spatial DNN Accelerators. arXiv 2021, arXiv:2108.10570.
- Liu, M.; Yin, M.; Han, K.; Demara, R.; Yuan, B.; Bai, Y. Algorithm and hardware co-design co-optimization framework for LSTM accelerator using quantized fully decomposed tensor train. Internet Things 2023, 22, 100680.
- Krishnan, G.; Mandal, S.K.; Chakrabarti, C.; Seo, J.-s.; Ogras, U.; Cao, Y. Interconnect-Aware Area and Energy Optimization for In-Memory Acceleration of DNNs. IEEE Des. Test 2020, 37, 79–87.
- Xu, Z.; Yang, D.; Yin, C.; Tang, J.; Wang, Y.; Xue, G. A Co-Scheduling Framework for DNN Models on Mobile and Edge Devices with Heterogeneous Hardware. IEEE Trans. Mob. Comput. 2021, 22, 1275–1288.
- Lee, E.; Han, T.; Seo, D.H.; Shin, G.; Kim, J.; Kim, S.; Jeong, S.; Rhe, J.; Park, J.; Ko, J.; et al. A Charge-Domain Scalable-Weight In-Memory Computing Macro with Dual-SRAM Architecture for Precision-Scalable DNN Accelerators. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 3305–3316.
- Houshmand, P.; Sarda, G.M.; Jain, V.; Ueyoshi, K.; Papistas, I.A.; Shi, M.; Zheng, Q.; Bhattacharjee, D.; Mallik, A.; Debacker, P.; et al. DIANA: An End-to-End Hybrid DIgital and ANAlog Neural Network SoC for the Edge. IEEE J. Solid-State Circuits 2023, 58, 203–215.
- Jia, H.; Ozatay, M.; Tang, Y.; Valavi, H.; Pathak, R.; Lee, J.; Verma, N. Scalable and Programmable Neural Network Inference Accelerator Based on In-Memory Computing. IEEE J. Solid-State Circuits 2021, 57, 198–211.
- Rasch, M.; Mackin, C.; Gallo, M.L.; Chen, A.; Fasoli, A.; Odermatt, F.; Li, N.; Nandakumar, S.; Narayanan, P.; Tsai, H.; et al. Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators. Nat. Commun. 2023, 14, 5282.
- Keller, B.; Venkatesan, R.; Dai, S.; Tell, S.; Zimmer, B.; Sakr, C.; Dally, W.; Gray, C.T.; Khailany, B. A 95.6-TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization in 5 nm. IEEE J. Solid-State Circuits 2023, 58, 1129–1141.
- Zimmer, B.; Venkatesan, R.; Shao, Y.; Clemons, J.; Fojtik, M.R.; Jiang, N.; Keller, B.; Klinefelter, A.; Pinckney, N.; Raina, P.; et al. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator with Ground-Referenced Signaling in 16 nm. IEEE J. Solid-State Circuits 2020, 55, 920–932.
- Long, Y.; Na, T.; Mukhopadhyay, S. ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 2781–2794.
- Liu, Z.; Dou, Y.; Jiang, J.; Xu, J.; Li, S.; Zhou, Y.; Xu, Y. Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks. ACM Trans. Reconfig. Technol. Syst. (TRETS) 2017, 10, 17.
- Nie, C.; Tang, C.; Lin, J.; Hu, H.; Lv, C.; Cao, T.; Zhang, W.; Jiang, L.; Liang, X.; Qian, W.; et al. VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations. IEEE Trans. Comput. 2024, 73, 2378–2390.
- Krishnan, G.; Mandal, S.K.; Chakrabarti, C.; Seo, J.-S.; Ogras, U.; Cao, Y. Impact of On-chip Interconnect on In-memory Acceleration of Deep Neural Networks. ACM J. Emerg. Technol. Comput. Syst. (JETC) 2021, 18, 34.
- Wei, Y.; Wang, Z.; Wang, Z.; Dai, Y.; Ou, G.; Gao, H.; Yang, H.; Wang, Y.; Cao, C.C.; Weng, L.; et al. Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models. IEEE Trans. Vis. Comput. Graph. 2023, 30, 3915–3929.
- Wang, Y.; Chen, W.; Yang, J.; Li, T. Exploiting Parallelism for CNN Applications on 3D Stacked Processing-In-Memory Architecture. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 589–600.
- Angizi, S.; He, Z.; Awad, A.; Fan, D. MRIMA: An MRAM-Based In-Memory Accelerator. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 1123–1136.
- Soliman, T.; Laleni, N.; Kirchner, T.; Müller, F.; Shrivastava, A.; Kämpfe, T.; Guntoro, A.; Wehn, N. FELIX: A Ferroelectric FET Based Low Power Mixed-Signal In-Memory Architecture for DNN Acceleration. ACM Trans. Embed. Comput. Syst. 2022, 21, 84.
- Xiao, T.; Bennett, C.; Feinberg, B.; Agarwal, S.; Marinella, M. Analog architectures for neural network acceleration based on non-volatile memory. Appl. Phys. Rev. 2020, 7, 031301.
- Kraidia, I.; Ghenai, A.; Belhaouari, S. Defense against adversarial attacks: Robust and efficient compressed optimized neural networks. Sci. Rep. 2024, 14, 6420.
- Ogbogu, C.; Arka, A.I.; Pfromm, L.; Joardar, B.K.; Doppa, J.; Chakrabarty, K.; Pande, P. Accelerating Graph Neural Network Training on ReRAM-Based PIM Architectures via Graph and Model Pruning. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 2703–2716.
- Ogbogu, C.; Joardar, B.K.; Chakrabarty, K.; Doppa, J.; Pande, P. Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators. ACM Trans. Des. Autom. Electron. Syst. 2024, 29, 72.
- Hamdia, K.M.; Zhuang, X.; Rabczuk, T. An efficient optimization approach for designing machine learning models based on genetic algorithm. Neural Comput. Appl. 2020, 33, 1923–1933.
- Guo, F.; Han, D.; Kim, N. Multi-Objectives Optimization of Plastic Injection Molding Process Parameters Based on Numerical DNN-GA-MCS Strategy. Polymers 2024, 16, 2247.
- Goerigk, M.; Kurtz, J. Data-driven robust optimization using deep neural networks. Comput. Oper. Res. 2022, 151, 106087.
- Katz, J.; Pappas, I.; Avraamidou, S.; Pistikopoulos, E. Integrating deep learning models and multiparametric programming. Comput. Chem. Eng. 2020, 136, 106801.
- Lin, H.; Yan, M.; Ye, X.; Fan, D.; Pan, S.; Chen, W.; Xie, Y. A Comprehensive Survey on Distributed Training of Graph Neural Networks. Proc. IEEE 2022, 111, 1572–1606.
- Besta, M.; Hoefler, T. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 2584–2606.
- Zhang, S.; Yi, X.; Diao, L.; Wu, C.; Wang, S.; Lin, W. Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 1281–1293.
- Yang, T.; Li, D.; Ma, F.; Song, Z.; Zhao, Y.; Zhang, J.; Liu, F.; Jiang, L. PASGCN: An ReRAM-Based PIM Design for GCN with Adaptively Sparsified Graphs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 150–163.
- Raman, S.R.S.; John, L.; Kulkarni, J.P. NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks. ACM Trans. Archit. Code Optim. 2024, 21, 39.
- Ghasemi, S.A.; Jahannia, B.; Farbeh, H. GraphA: An efficient ReRAM-based architecture to accelerate large scale graph processing. J. Syst. Archit. 2022, 133, 102755.
- Dai, G.; Huang, T.; Chi, Y.; Zhao, J.; Sun, G.; Liu, Y.; Wang, Y.; Xie, Y.; Yang, H. GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 640–653.
- Wei, Y.; Wang, X.; Zhang, S.; Yang, J.; Jia, X.; Wang, Z.; Qu, G.; Zhao, W. IMGA: Efficient In-Memory Graph Convolution Network Aggregation with Data Flow Optimizations. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4695–4705.
- Wang, T.; Zheng, X.; Zhang, L.; Cui, Z.; Xu, C. A graph-based interpretability method for deep neural networks. Neurocomputing 2023, 555, 126651.
- Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2019, 32, 4–24.
- Jin, H.; Chen, D.; Zheng, L.; Huang, Y.; Yao, P.; Zhao, J.; Liao, X.; Jiang, W. Accelerating Graph Convolutional Networks Through a PIM-Accelerated Approach. IEEE Trans. Comput. 2023, 72, 2628–2640.
- Black, J.E.; Kueper, J.K.; Williamson, T. An introduction to machine learning for classification and prediction. Fam. Pract. 2022.
- Hüttenrauch, M.; Neumann, G. Robust Black-Box Optimization for Stochastic Search and Episodic Reinforcement Learning. J. Mach. Learn. Res. 2024, 25, 1–44.
- Kim, S.; Li, Z.; Um, S.; Jo, W.; Ha, S.; Lee, J.; Kim, S.; Han, D.; Yoo, H.J. DynaPlasia: An eDRAM In-Memory Computing-Based Reconfigurable Spatial Accelerator with Triple-Mode Cell. IEEE J. Solid-State Circuits 2024, 59, 102–115.
- Lin, C.T.; Wang, D.; Zhang, B.; Chen, G.K.; Knag, P.C.; Krishnamurthy, R.; Seok, M. DIMCA: An Area-Efficient Digital In-Memory Computing Macro Featuring Approximate Arithmetic Hardware in 28 nm. IEEE J. Solid-State Circuits 2024, 59, 960–971.






| Technology | Device Type | Integration Challenge | Typical Application | Notable Example |
|---|---|---|---|---|
| DRAM | Volatile | Volatility, refresh overhead | AI, HPC, general compute | HBM2-PIM, Smart Memory Cube [51,52] |
| SRAM | Volatile | Area/cost scaling | Edge AI, embedded, cache | PIMCA, IMC-Sort [41,59,60] |
| RRAM (ReRAM) | Non-volatile | Device variability, endurance | AI accelerators, neuromorphic | SLIM, 2T2R RRAM [47,48,49] |
| PCM | Non-volatile | Drift, write energy | Analog IMC, photonic IMC | PCM-AIMC, Photonic PCM [50,61,62] |
| FeFET | Non-volatile | Endurance, process control | Low-power PIM, logic-in-mem | FeFET-PIM, DG-FeFET [63,64,65] |
| STT-MRAM | Non-volatile | Write energy, integration | Digital IMC, NVM cache | BCLS-SP, SOT-MRAM [66,67,68] |
| Photonic Memristor | Non-volatile | Integration with electronics | Photonic IMC, neuromorphic | Photonic PCM [61] |
| Y-Flash | Non-volatile | Process maturity | ML inference, logic-in-mem | IMPACT [69] |
| Halide Perovskite | Non-volatile | Stability, scalability | Reconfigurable logic-in-mem | Perovskite IMC [70] |
| 3D Self-Rectifying Memristor | Non-volatile | 3D integration, sneak paths | 3D IMC, high-density arrays | 3D SRM [49,71] |

| Method Type | When to Use | Core Mechanism | Strength | Key Limitation | Example System/Paper |
|---|---|---|---|---|---|
| Rule-Based | Well-known workload, simple hardware | Predefined rules | Predictable, low-cost | Inflexible, not adaptive | [100,102,114] |
| Heuristic | Scalable, time-constrained, moderate complexity | Greedy/DP/Graph | Fast, scalable | May miss optimum, local optima | [103,115,116,117,118] |
| Metaheuristic | Large search space, complex/heterogeneous environments | Genetic algorithms (GA)/swarm methods, e.g., particle swarm optimization (PSO) | Finds good solutions in complex space | High computation cost, slow | [104,112,119,120] |
| Learning-Based | Dynamic, non-stationary, or highly variable workloads | Deep reinforcement learning (DRL)/policy gradient/actor-critic | Adaptive, handles dynamics | Requires training, data-hungry | [121,122,123,124,125,126] |
| Graph-Based | Irregular DNNs, complex dependencies, DAG topologies | Graph partitioning/min-cut | Captures structure, flexible | High overhead, complex tuning | [103,126,127,128] |
| Hybrid/Automated | Heterogeneous hardware, multi-objective optimization | Auto-tuning, DSE frameworks | Balances trade-offs, flexible | Complexity, may need profiling | [102,129,130,131] |
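
To make the "Heuristic" row of the table above concrete, the following minimal sketch assigns layers to PIM tiles with a greedy longest-processing-time-first rule. It is an illustration only: the layer names, the cost values, and the `greedy_map` helper are hypothetical and are not taken from any of the cited frameworks, and the sketch ignores crossbar capacity, inter-layer communication, and device constraints that real mappers must handle.

```python
import heapq


def greedy_map(layer_costs, num_tiles):
    """Greedy load balancing: place the heaviest remaining layer on the
    currently least-loaded tile (longest-processing-time-first rule)."""
    # Min-heap of (accumulated load, tile index).
    heap = [(0.0, tile) for tile in range(num_tiles)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(layer_costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in ordered:
        load, tile = heapq.heappop(heap)
        assignment[name] = tile
        heapq.heappush(heap, (load + cost, tile))
    # Recompute per-tile loads from the final assignment.
    loads = [0.0] * num_tiles
    for name, tile in assignment.items():
        loads[tile] += layer_costs[name]
    return assignment, loads


# Hypothetical per-layer cost estimates (arbitrary units), for illustration only.
layer_costs = {"conv1": 4.0, "conv2": 9.0, "conv3": 7.0, "fc1": 3.0, "fc2": 1.0}
assignment, loads = greedy_map(layer_costs, num_tiles=2)
print(assignment)
print(loads)
```

The rule runs in O(n log n) time and is predictable, which matches the "fast, scalable" strength listed for heuristics; because it never revisits a placement, it can also miss the optimum, which is the limitation noted in the same row.
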

| Framework | Year | Best Benchmark | Key Innovation |
|---|---|---|---|
| NNPIM [5] | 2019 | 48.2× speedup, 131.5× energy efficiency vs. GPU | Crossbar memory, weight sharing, parallel in-memory compute |
| Weight Mapping [79] | 2019 | 2.03× speedup, 1.4× throughput/energy vs. prior mapping | Kernel division, spatial PE assignment, optimal pipeline |
| 3D-ReG [166] | 2020 | 5.64× training speedup, 3.56× energy efficiency vs. GPU + DRAM | Heterogeneous GPU + PIM, 3D integration, task-mapping schemes |
| DLUX [167] | 2020 | 6.3× speedup, 42× energy efficiency vs. Tesla V100 GPU | In-DRAM LUT, near-bank mapping, loop tiling, layout transposition |
| ZigZag [130] | 2021 | Up to 64% more energy-efficient vs. prior DSE frameworks | Uneven mapping, nested-for-loop DSE, mapping search engines |
| Robust PIM [166,168] | 2021 | Maintains accuracy under process variation (ResNet/MobileNet) | Hessian-driven mixed-precision, sensitivity-based quantization |
| PIM-DRAM [169] | 2021 | Up to 19.5× speedup vs. NVIDIA Titan Xp GPU | DRAM-based PIM primitive, intra-bank accumulation, data mapping |
| FAMS [170] | 2021 | 29.7% lower latency, 42.4% higher throughput vs. TENET | Memory-centric mapping, cycle-accurate simulation, hardware constraints |
| IVQ [33] | 2022 | 19.7–91.7× speedup, 17.7–541× energy vs. ISAAC/CASCADE/ASIC/FPGA/GPU | Varied quantization, spatial mapping, temporal scheduling, pipeline |
| RaQu [171,172] | 2022 | 29.2–37.4% resource-utilization improvement, 1.8–3.3% accuracy improvement vs. prior quantization | AutoML-based quantization, hardware-aware mapping |
| PattPIM [173] | 2022 | Significant performance, energy, and resource efficiency | Weight pattern reuse, WPR-aware mapping, PE pipeline |
| PQ-PIM [174] | 2022 | 1.74× performance, 62% energy saving vs. prior ReRAM | Patch-wise pruning-quantization, mixed OU-based engine |
| NicePIM [104] | 2023 | 37% latency and 28% energy reduction vs. baseline | PIM-Tuner, PIM-Mapper, data-scheduler, deep kernel learning |
| Gibbon [175] | 2023 | 9.8–48.2× speedup, 5.96× EDP reduction vs. prior work | Evolutionary search, multilevel joint simulator, co-exploration |
| BWQ [176] | 2023 | 6.08× speedup, 17.47× energy saving vs. prior ReRAM | Block-wise mixed-precision, precision-aware mapping |
| Benchmarking IMC [100] | 2023 | 30% minimum execution-time reduction vs. prior mapping | PE-level mapping, hybrid mapping, benchmarking framework |
| Static Scheduling [177] | 2023 | Significant latency reduction vs. prior ReRAM mapping | Static scheduling, weight-to-OU (output unit) mapping, latency model |
| Fast-OverlaPIM [106] | 2024 | 4.6–18.1× faster mapping vs. prior overlap-based framework | Overlap-driven mapping, analytical overlap analysis, transformation mechanism |
| ReHarvest [77] | 2024 | 3.5× speedup, 3.1× resource reduction vs. FORMS | ADC-crossbar decoupling, multi-tile mapping, bus-based multicast |
| M2M [178] | 2024 | 7.18–61.09% latency reduction vs. prior multi-DNN mapping | Fine-grained mapping, temporal/spatial scheduling, QoS for NoP |
| PIMCOMP [105] | 2024 | Throughput, latency, energy improved vs. prior PIM | End-to-end compiler, multilevel optimization, weight-layout mapping |
| HuNT [109] | 2024 | 10× energy, 8× compute efficiency vs. homogeneous PIM | Heterogeneous 3D PIM, neural layer mapping, tier configuration |
| Empirical Fault Framework [179] | 2025 | Fault impact characterization, model-specific vulnerabilities | Fault-injection, layer/location-aware mapping, reliability analysis |

| Framework/Tool Name | Year | Hardware Type | Supported Models/Workloads | Open Source? | Primary Validation |
|---|---|---|---|---|---|
| Fast-OverlaPIM [106] | 2024 | Generic PIM | CNNs (full networks) | No | Simulation |
| NicePIM [104] | 2023 | 3D Stacked DRAM-PIM | CNNs (various), ResNet, VGG | No | Simulation |
| Optimized Weight Mapping [79] | 2019 | RRAM-PIM | CNNs (ResNet-34), ImageNet | No | Simulation |
| DDAM [180] | 2022 | Generic PIM | CNNs (various) | No | Simulation |
| Gibbon [175] | 2023 | Memristor PIM | DNNs (various), CIFAR-10 | No | Simulation |
| HuNT [109] | 2024 | Hybrid (ReRAM, FeFET, PCM, MRAM, SRAM) | DNNs (training/inference) | No | Simulation |
| IVQ [33] | 2022 | Crossbar PIM | DNNs (varied quantization) | No | Simulation |
| PIMCOMP [105] | 2024 | Generic PIM | CNNs, DNNs (ResNet, VGG, etc.) | Yes (GitHub) | Simulation |
| MAESTRO [101] | 2020 | Simulator/Model | CNNs, DNNs (various) | Yes (GitHub) | Simulation |
| ZigZag [130] | 2021 | Simulator/Framework | CNNs, DNNs (various) | Yes (GitHub) | Simulation |
| DNN + NeuroSim V2.0 [39] | 2020 | SRAM/ReRAM/Analog PIM | VGG-8, CIFAR-10, DNNs | Yes (GitHub) | Simulation |
| PIM-DRAM [169] | 2021 | Commodity DRAM-PIM | AlexNet, VGG16, ResNet18 | No | Simulation |
| FAMS [170] | 2025 | Systolic Array (SRAM/DRAM) | DNNs (various) | No | Simulation |
| PattPIM [173] | 2022 | ReRAM-PIM | DNNs (6 models) | No | Simulation |
| PQ-PIM [174] | 2022 | ReRAM-PIM | DNNs (various) | No | Simulation |
| Block-Wise Mixed-Precision Quantization (BWQ) [176] | 2023 | ReRAM-PIM | DNNs (various) | No | Simulation |
| SDP [99] | 2023 | SRAM-PIM (Digital) | Sparse NNs, DNNs | No | Simulation |
| ReHarvest [77] | 2024 | ReRAM-PIM | DNNs (various) | No | Simulation |
| Klotski v2 [113] | 2025 | Dataflow Accelerator | DNNs (various) | No | Simulation |
| Spiking DNN Mapping [101] | 2015 | Neuromorphic/Spike-based | CNN → SNN, CIFAR-10, Neovision2 | No | Simulation; hardware mapping |

| Benchmark/Framework | Device/Arch/System Coverage | Supported Models | Dataset Type | Main Metric(s) | Major Limitation |
|---|---|---|---|---|---|
| DNN + NeuroSim V2.0 [39] | Device, Circuit, Architecture, System | VGG-8, flexible via PyTorch | CIFAR-10 | Area, Energy, Throughput, Accuracy | Limited to SRAM/eNVM; focus on on-chip training |
| SIAM [188] | Device, Circuit, Architecture, System | ResNet-50, wide DNN support | CIFAR-10, CIFAR-100, ImageNet | Energy Efficiency, Throughput | Focused on chiplet-based IMC; calibration needed |
| Benchmarking DNN Mapping [100] | Architecture, PE-level | Convolutional DNNs | Public DNN benchmarks | Area-Efficiency, Scalability | Emphasis on mapping, not full system |
| NicePIM [104] | Architecture, System | DNNs (various, mapping-focused) | Not specified | Latency, Energy | 3D-DRAM PIM focus; dataset coverage unclear |
| MNSIM 2.0 [26] | Device, Architecture, System | Large-scale NNs, Digital/Analog PIM | Case studies, fabricated macros | Accuracy, Performance, Modeling Error | Generalized, but not all real-world datasets |
| Gibbon [175] | Device, Architecture, System | Memristor-based DNNs | Not specified | Accuracy, Energy-Delay Product | Focus on co-exploration, not dataset diversity |
| Heterogeneous IMC Cluster [200] | System, Architecture | MobileNetV2, TinyML tasks | Real-world IoT tasks | Latency, Energy, Area | System-level, limited model diversity |
| Accelerating NN Inference with PIM [201] | Device, Architecture, System | UPMEM, Mensa, SIMDRAM (various NNs) | Matrix–vector kernels, Google edge models | Performance, Energy Efficiency | Focused on DRAM-based PIM, not all model types |
| PIMulator-NN [202] | Circuit, Architecture, System | Several PIM designs, templates | Not specified | Area, Latency, Energy | Lacks standardized dataset integration |
| PIMCaffe [203] | Device, Architecture, System | Recommendation, AlexNet, ResNet-50 | Not specified | Speedup vs. CPU | Prototype neural processing unit (NPU), limited model/dataset coverage |
| RaQu [171] | Device, Architecture | CNNs (various, quantization focus) | Not specified | Resource Utilization, Accuracy | RRAM-specific, not broad dataset coverage |
| BWQ [176] | Device, Architecture | DNNs (block-wise quantization) | Not specified | Speedup, Energy Saving | ReRAM focus, limited to quantization studies |

| Pitfall | Severity | Concrete Example | Real Impact | Solution/Best Practice | Literature Ref. |
|---|---|---|---|---|---|
| Shortcut learning and overfitting to benchmarks | High | DNNs perform well on standard benchmarks but fail in real-world deployment | Inflated performance claims, poor generalization, unreliable hardware evaluation | Use diverse, real-world datasets; test under adversarial/noisy conditions; robust cross-validation | [208,209,210] |
| Ignoring hardware non-idealities (noise, variation) | High | Benchmarking IMC/PIM without modeling device variation, quantization, or faults | Overestimation of accuracy and efficiency; missed reliability issues in deployment | Incorporate device-level noise, variation, and quantization effects in benchmarks and simulations | [39,78,168,179,206,211] |
| Apples-to-oranges comparisons | Medium | Comparing PIM/IMC results with different DNN models, batch sizes, or metrics | Misleading performance/efficiency claims; unfair hardware comparisons | Standardize benchmarks, report all parameters, use common datasets and metrics | [199,212,213,214] |
| Unrealistic or unrepresentative workloads | Medium | Using synthetic or toy datasets not reflective of target applications | Results do not translate to real-world performance; misguides design choices | Benchmark with application-relevant, large-scale, and diverse workloads | [9,104,199,212] |
| Poor reproducibility and lack of transparency | Low | Missing code, incomplete reporting of hardware/software stack | Results cannot be verified or built upon; slows progress and trust in the field | Open-source code, detailed reporting of hardware/software, use public benchmarks | [9,199,209,214] |
| Not testing at scale or under realistic conditions | Low | Evaluating on small models or idealized hardware, ignoring edge cases | Overlooks bottlenecks, scalability issues, or failure modes | Test at production scale, include stress tests, adversarial and worst-case scenarios | [188,208,211,212] |
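
The best practice listed for the "Ignoring hardware non-idealities" pitfall can be sketched in a few lines: perturb the stored weights with device variation and measure how far the matrix-vector product drifts from the ideal result. The multiplicative Gaussian noise model, the `sigma` values, and all function names below are assumptions chosen for illustration; real studies calibrate the noise model to measured device data.

```python
import random


def matvec(weights, x):
    """Ideal matrix-vector product, one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def perturb(weights, sigma, rng):
    """Assumed variation model: each weight becomes w * (1 + N(0, sigma))."""
    return [[w * (1.0 + rng.gauss(0.0, sigma)) for w in row] for row in weights]


def mean_relative_error(weights, x, sigma, trials=200, seed=0):
    """Average relative L2 error of the noisy product over several trials."""
    rng = random.Random(seed)
    ideal = matvec(weights, x)
    norm = sum(v * v for v in ideal) ** 0.5
    total = 0.0
    for _ in range(trials):
        noisy = matvec(perturb(weights, sigma, rng), x)
        err = sum((a - b) ** 2 for a, b in zip(noisy, ideal)) ** 0.5
        total += err / norm
    return total / trials


# Hypothetical 2x3 weight block and input vector, for illustration only.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.6]]
x = [1.0, 2.0, 3.0]
for sigma in (0.0, 0.05, 0.10):
    print(sigma, mean_relative_error(W, x, sigma))
```

Reporting accuracy as a function of `sigma`, rather than only at `sigma = 0`, is one simple way to keep the overestimation described in the table out of a benchmark.
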

| Metric | Definition | Units | Notes |
|---|---|---|---|
| TOPS/W | Tera-operations per second per watt | TOPS/W | Higher is better |
| pJ/op | Energy per operation (MAC) | pJ/op | Lower is better |
| Throughput | Operations/inferences per second | TOPS, GOPS, FPS | Essential for real-time tasks |
| Throughput per Area | Throughput normalized by silicon area | TOPS/mm² | Indicates computational density |
| Energy per Inference | Energy consumed per full inference | μJ/inference | End-to-end efficiency measure |
| Energy-Delay Product (EDP) | Energy–time trade-off metric | pJ·s, nJ·s | Critical for balancing performance |
| Latency | Time per inference/operation | ns, μs, ms | Lower is better |
| Accuracy | Model accuracy under constraints | % | Must accompany efficiency |
| Utilization | Hardware resource efficiency | % | Higher indicates efficient mapping |
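
Several of the metrics above are related by simple unit conversions, which the sketch below spells out. Since 1 TOPS/W equals 10^12 operations per joule, energy per operation in pJ is the reciprocal of efficiency in TOPS/W. The function names and the example workload (operation count, efficiency, latency) are hypothetical, and the calculation assumes that the reported efficiency is sustained for the whole inference and that operations are counted with the same convention on both sides.

```python
def pj_per_op(tops_per_watt):
    """Energy per operation in pJ: 1 TOPS/W = 1e12 ops/J, so pJ/op = 1 / (TOPS/W)."""
    return 1.0 / tops_per_watt


def energy_per_inference_uj(ops_per_inference, tops_per_watt):
    """Energy per inference in microjoules, assuming sustained efficiency."""
    joules = ops_per_inference / (tops_per_watt * 1e12)
    return joules * 1e6


def edp(energy_j, latency_s):
    """Energy-delay product in joule-seconds."""
    return energy_j * latency_s


# Hypothetical accelerator and workload, for illustration only.
efficiency = 10.0   # TOPS/W
ops = 2e9           # operations per inference
latency = 1e-3      # seconds per inference

energy_uj = energy_per_inference_uj(ops, efficiency)
print(pj_per_op(efficiency))
print(energy_uj)
print(edp(energy_uj * 1e-6, latency))
```

For these assumed numbers, 10 TOPS/W corresponds to 0.1 pJ/op and roughly 200 μJ per inference; peripheral, data-movement, and idle energy would push a measured energy per inference above this idealized figure.
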

| Paper (Cite) | Hardware Target | Mapping/Scheduling Strategy | Performance/Efficiency Gains |
|---|---|---|---|
| [220] | Multi-tenant DNN accelerators | Co-optimization of mapping and scheduling for multi-tenancy | Improved resource utilization and multi-tenant efficiency |
| [104] | 3D-stacked DRAM PIM | PIM-Tuner, PIM-Mapper, data-scheduler (ILP-based) | 37% latency, 28% energy reduction vs. baseline |
| [149] | Diverse PIMs | General abstraction-based mapping | Generalized mapping for diverse PIMs |
| [105] | Crossbar-based PIM DNN accelerators | Universal compilation framework | Universal mapping, improved portability |
| [221] | Memristive crossbars | RL-based dynamic sparsity-aware mapping on crossbars | Area reduction to 43% (small), 22.5% & 17.1% (large datasets) |
| [222] | PIM DNN accelerators | ISA-level mapping | Improved programmability |
| [129] | FPGAs (cloud) | Joint architecture, scheduling, mapping | Up to 9× EDP reduction vs. SOTA |
| [130] | DNN accelerators (general) | Joint architecture–mapping DSE, uneven mapping | Up to 64% more energy-efficient solutions |
| [223] | Memristive crossbar systems | Co-design of NN and hardware | Improved co-design for memristive systems |
| [224] | PIM with neural approximation | Approximate peripherals for PIM | Improved efficiency via approximation |
| [225] | Edge inference systems | Approximate computing strategies | Energy-efficient edge inference |
| [226] | Flexible DNN accelerator | Diverse data-format support | Flexibility for multiple data formats |
| [100] | In-memory computing accelerators | Benchmarking DNN mapping methods | Comparative benchmarking |
| [177] | Resource-constrained PIM | Static scheduling of weight programming | Improved scheduling for resource constraints |
| [109] | Heterogeneous PIM devices (3D manycore) | Exploiting heterogeneity for DNN training | Improved training efficiency |
| [227] | FPGA-based DNN accelerator | Modeling and exploration framework | Up to 4.2× performance, 4.4× DSP efficiency vs. SOTA |
| [228] | Spatial DNN accelerators | SW–HW co-design of interconnections | Improved interconnect efficiency |
| [229] | LSTM accelerator | Algorithm–hardware co-design, quantized tensor train | Co-optimized LSTM acceleration |
| [230] | In-memory DNN accelerators | Interconnect-aware area/energy optimization | Area and energy optimization |
| [231] | Mobile/edge devices (heterogeneous HW) | Co-scheduling framework | Latency and energy improvements |

| Platform/Technology (Cite) | Workload | Energy Efficiency (TOPS/W) | Throughput per Area (TOPS/mm²) | Latency (ms/Inference) |
|---|---|---|---|---|
| Binary/Ternary Reconfigurable IMC (SRAM) [138] | Binary/Ternary DNNs (ImageNet, CIFAR-10) | 2.33 | 0.35 | 0.71 (batch-1, ImageNet) |
| SRAM & eNVM-based CIM [39] | VGG-8 (CIFAR-10) | 10–100 (SRAM), 20–200 (eNVM, simulated + silicon) | 0.1–0.5 | 0.5–2 (batch-1) |
| UPMEM DRAM-PIM [9] | DNNs, Graph, Analytics (PrIM suite) | 0.1–0.5 | 0.01–0.05 | 10–100 (varies by workload) |
| Dual-SRAM Charge-Domain IMC [232] | DNNs (CIFAR-10, VGG-8) | 18.4–119.2 | 0.2–0.8 | 0.2–1.0 |
| 10T1C SRAM-based IMC [59] | VGG9, ResNet-18 (CIFAR-10) | 437 | 1.2 | 0.15 (batch-1) |
| Hybrid Analog IMC + Digital [233] | CIFAR-10, ImageNet | 600 (AIMC), 14 (Digital) | 0.5 | 0.12 (CIFAR-10) |
| Capacitor-based Analog IMC [234] | 11-layer CNN (CIFAR-10), ResNet-50 (ImageNet) | 30–51.5 | 0.3 | 0.13 (CIFAR-10) |
| Analog IMC (various) [235] | ConvNets, RNNs, Transformers (CIFAR-10, ImageNet, NLP) | 10–100 (varies) | 0.1–0.5 | 0.2–1.0 |
| 5 nm Digital Accelerator [236] | BERT-Base, ResNet-50 | 38.6–95.6 | 1.5 | 0.08 (BERT, batch-1) |
| 16 nm Multi-Chip-Module [237] | ResNet-50 (ImageNet) | 9.5 | 1.29 | 0.52 (batch-1) |
| RRAM-based Analog IMC + RISC-V [200] | MobileNetV2 (CIFAR-10) | 9.5 | 0.2 | 0.11 |
| ReRAM-based PIM [238] | RNNs (LSTM, GRU) | 10–80 | 0.1–0.3 | 0.5–2.0 |
| FPGA-based [239] | LeNet, AlexNet, VGG-S | 0.5–1.2 | 0.05–0.1 | 0.8–2.0 |
| Digital SRAM-PIM [240] | GEMM, ResNet, VGG | 5–8.9 | 0.2 | 0.3–1.0 |
| 3D Stacked DRAM-PIM [104] | ResNet-18, VGG-16 (ImageNet) | 2–10 | 0.1 | 0.6–1.5 |
| Hybrid Memory Cube (HMC) PIM [115] | DCNNs (ImageNet) | 1.5–3.0 | 0.05 | 1.0–2.0 |
| LUT-based DRAM-PIM [7] | AlexNet, VGG | 2.5–12 | 0.1 | 0.5–1.5 |
| SRAM/ReRAM IMC [241] | VGG-19, ResNet | 3–6 | 0.1 | 0.4–1.2 |
| ReRAM-based PIM [176] | ResNet-18, VGG-16 | 6–17.5 | 0.2 | 0.3–1.0 |
| IMC (SRAM, RRAM) [100] | ResNet, VGG (CIFAR-10, ImageNet) | 10–30 | 0.1–0.3 | 0.2–1.0 |

| ML/GNN Task | Typical Input | Prediction Target | Example Use Case | Main Benefit | Notable Limitation | Ref. |
|---|---|---|---|---|---|---|
| Hardware-Aware DNN Mapping | DNN architecture, hardware specs | Optimal mapping/configuration | Mapping DNN layers to PIM nodes for latency/energy optimization | Significant speedup and energy savings | Balancing hardware and model constraints is complex | [7,104,169,184] |
| Graph-Based Resource Scheduling | Device topology, computation graph | Scheduling strategy | Assigning DNN tasks to PIM cores in distributed/heterogeneous systems | Better utilization, reduced communication | Scalability and communication overhead | [175,210,232,239] |
| Model Compression & Quantization | DNN weights, hardware constraints | Compressed/quantized parameters | Deploying DNNs on RRAM/DRAM PIM with limited resources | Lower footprint, faster inference | Potential accuracy loss, HW mismatch | [7,171,184,247] |
| Data/Model Pruning for GNNs | Graph data, GNN model | Pruned graph/model | Accelerating GNN training on ReRAM-based PIM | Faster, energy-efficient training | Risk of over-pruning, info loss | [248,249] |
| Automated Architecture Optimization | DNN/graph structure, performance data | Optimized architecture/config | AutoML-driven design for DNN-PIM deployment | Automation, high-quality solutions | Search space explosion, compute cost | [104,171,250,251] |
| Robust Optimization via Deep Learning | Historical data, uncertainty models | Robust solution/uncertainty set | Reliable DNN-PIM deployment under variable conditions | Improved robustness, generalization | NP-hardness, integration complexity | [252,253] |
| Distributed GNN Training | Large-scale graph data, cluster info | Training workflow/strategy | Scaling GNN training across PIM-enabled clusters | Scalability, efficient distributed use | Communication bottlenecks | [254,255,256] |
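
As an illustration of the "Model Compression & Quantization" row, the sketch below applies plain symmetric uniform quantization to a weight vector, which is the basic step before weights are stored in limited-precision memory cells. This is a generic textbook scheme, not the mixed-precision or AutoML-based methods of the cited works; the function names and the example weights are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero vector, where any scale is valid.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.9, -0.5, 0.1, 0.0]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
print(codes)
print([abs(a - b) for a, b in zip(weights, restored)])
```

The rounding error of each weight is bounded by half the step size `scale`, so fewer bits reduce the memory footprint at the cost of a larger worst-case error, which is the accuracy trade-off flagged in the table.
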

| Focus Area | Key Statement | Strength of Support | Brief Justification | Literature Reference |
|---|---|---|---|---|
| Hardware & Devices | DRAM and SRAM are most mature and widely integrated IMC/PIM tech | Strong | Commercial deployment, extensive research, robust performance | [4,51,59,60,201] |
| | RRAM and PCM enable high energy efficiency/density for AI/neuromorphic | Strong | Multiple experimental/prototype systems; some integration/endurance challenges | [42,47,48,50,61,62] |
| | Analog IMC faces precision/variation limitations | Moderate | Device non-idealities, process variation impact accuracy | [42,82] |
| | Integration/process compatibility are major barriers for emerging IMC/PIM | Moderate | Key challenge in reviews and experimental reports | [47,48,49,50,61,62,63,65,67,68,71] |
| | Photonic/3D memristor IMC promise ultra-high density/efficiency | Moderate | Early-stage research, mostly prototypes; integration hurdles | [49,61,71] |
| | FeFET, STT-MRAM offer high endurance and low power for emerging PIM | Moderate | Promising in recent studies, but limited large-scale deployment | [63,64,65,66,67,68] |
| Mapping & Algorithms | All methods face trade-offs between efficiency, adaptability, complexity | Moderate | Context-dependent; highlighted in comparative analyses | [42,51,54,55,56,57,58,129,130,131] |
| | Graph-based/hybrid methods excel in irregular DNNs, multi-objective | Moderate | Effective for DAGs and heterogeneous systems, but complex to implement | [60,63,74,102,103,127,128] |
| | Rule-based methods predictable but inflexible | Moderate | Low-overhead, not adaptive | [100,102,114] |
| | Metaheuristics outperform heuristics in large, complex search spaces | Moderate | Outperform heuristics in complex/heterogeneous settings | [104,112,119,120] |
| | Heuristic methods are fast and scalable for moderate complexity | Strong | Supported by many studies and benchmarks | [103,115,116,117,118] |
| | Learning-based methods adapt to dynamic, heterogeneous environments | Strong | Demonstrated in dynamic scheduling, partitioning | [121,122,123,124,125,126] |
| Software/Tools | PIMCOMP enables fully automated, modular DNN deployment on PIM/IMC | Strong | Demonstrated end-to-end automation, modular passes, broad hardware support | [105] |
| | Open-source availability accelerates research and adoption | Moderate | Open tools cited in multiple studies/benchmarks | [105,192] |
| | Benchmarking and standardization are ongoing challenges | Moderate | Few tools provide unified benchmarks across hardware/applications | [105,192,193,194,195] |
| | DNN + NeuroSim provides hierarchical evaluation and device modeling | Strong | Widely used for device-level simulation and partial automation | [192] |
| | Heterogeneous hardware support remains limited | Moderate | Most tools focus on specific memory types/platforms | [193,194,195] |
| | Proprietary tools limit reproducibility and community improvement | Moderate | Closed/partially accessible frameworks hinder impact | [193,195] |
| Benchmarking & Datasets | Lack of standardized benchmarks/diverse hardware complicates comparison | Weak | Heterogeneous setups limit cross-framework comparability | [5,39,104,175,180,184,185,224,257] |
| Leading frameworks provide multi-level hardware coverage | Strong | Multiple frameworks benchmark device to system level | [26,39,175,188] | |
| Some frameworks enable flexible, extensible benchmarking | Moderate | Subset support flexible model/hardware integration | [26,39,175,188] | |
| Lack of end-to-end system-level evaluation with real-world datasets | Moderate | Few frameworks benchmark full systems with diverse, real-world tasks | [26,175,200,202,203] | |
| Open access/standardized benchmarks are inconsistently available | Moderate | Few are public, most lack open access/dataset integration | [39,175,200,202,203] | |
| Most frameworks are tailored to specific hardware or tasks | Moderate | Limits generalizability and cross-platform comparison | [100,104,171,176,202] | |
| Dataset/model diversity is limited in most frameworks | Strong | Most focus on specific models, lack real-world dataset integration | [100,104,171,175,176,200,202,203] | |
| Pitfalls & Methodological Gaps | Ignoring hardware non-idealities leads to unreliable results | Strong | Device noise/variation significantly affect real-world performance | [39,78,168,179,206,211] |
| Apples-to-oranges comparisons mislead hardware evaluation | Strong | Lack of standardization skews results | [100,199,212,213] | |
| Shortcut learning/overfitting to benchmarks are common and severe | Strong | DNNs exploit spurious correlations in benchmarks | [208,209,210] | |
| Evaluation & Reproducibility | Unrealistic workloads undermine generalizability | Moderate | Synthetic/toy datasets do not reflect real application performance | [9,104,199,212] |
| Poor reproducibility slows progress | Moderate | Missing code/incomplete reporting prevent verification/reuse | [9,199,209,214] | |
| Not testing at scale or under stress misses bottlenecks | Moderate | Small-scale tests overlook scalability/failure modes | [188,208,211,212] | |
| Learning- Driven DNN-to-PIM | ML/GNN-based mapping significantly improves DNN-PIM performance | Strong | Multiple studies show large speedup and energy savings | [7,104,169,184] |
| Model compression/ quantization enables efficient PIM deployment | Strong | Consistent improvements in memory use, inference speed | [7,171,184,247] | |
| Automated optimization (AutoML) finds high-quality DNN-PIM configs | Moderate | Good results, but search space and cost are challenges | [104,171,250,251] | |
| Distributed GNN training on PIM is scalable but faces bottlenecks | Moderate | Scalability shown, but communication/workflow divergence issues | [254,255,256] | |
| Robust optimization via deep learning improves reliability | Moderate | Some evidence for better generalization; integration is complex | [252,253] | |
| Data/model pruning accelerates GNN training on PIM | Moderate | Speed/energy gains, but risk of over-pruning | [248,249] | |
| Frameworks & Benchmarking Trends | Lack of standardized benchmarks/diverse hardware complicates comparison | Weak | Heterogeneous setups limit cross-framework comparability | [5,39,104,175,180,184,185,224,257] |
| Modern DNN-to-PIM frameworks achieve 2–48× speedup, 28–131× energy efficiency | Strong | Consistent, large improvements in throughput/energy over CPU/GPU | [5,104,175,180,184,257] | |
| Real-system benchmarks (e.g., PyGim) are essential for validating frameworks | Moderate | Provide real hardware results, highlight deployment issues | [5,184] | |
| Most frameworks focus on inference; training/GNN support underexplored | Moderate | Few studies address training/support for new model types | [39,81,257] | |
| Integer/dynamic programming effective for mapping/partitioning | Moderate | Improved data locality, reduced comm. overhead, but high complexity | [104,180] | |
| Co-design approaches yield better optimization than hardware/software only | Strong | Jointly optimizing NN and PIM architectures gives superior results | [104,175,257] | |
| Performance Metrics | Energy efficiency (TOPS/W), throughput (TOPS) are primary metrics | Strong | Consistently reported; definitions standardized | [26,54,59,79,185,234,267,268] |
| Standardized workloads (CIFAR-10, ImageNet) enable fair comparison | Strong | Most comparative studies use these datasets | [54,59,79,267,268] | |
| Some platforms report inflated efficiency via aggressive quantization | Moderate | Accuracy must be reported alongside efficiency | [54,59,268] | |
| Secondary metrics (EDP, area, utilization) less standardized/varied | Moderate | Some report these, but usage is inconsistent | [54,79,185] | |
| Variability in reporting granularity hinders direct comparison | Moderate | Some report only partial results, complicating meta-analysis | [54,79,185] | |
| Reporting mean ± std. over ≥5 runs is now common | Strong | Multiple papers report statistical results for reliability | [59,267,268] |
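
The performance-metric rows above name TOPS/W, EDP, and mean ± std. over at least five runs. The following is a minimal sketch of how those quantities relate to raw measurements. The helper names (`tops_per_watt`, `energy_delay_product`, `summarize`) and all numeric values are hypothetical illustrations, not taken from any cited framework or paper, and EDP is taken here in its usual sense of energy times delay.

```python
from statistics import mean, stdev


def tops_per_watt(ops, latency_s, power_w):
    """Energy efficiency in TOPS/W: throughput (ops per second) divided by power, in tera-ops."""
    return ops / latency_s / power_w / 1e12


def energy_delay_product(power_w, latency_s):
    """EDP in joule-seconds: energy (power * latency) multiplied by latency."""
    return power_w * latency_s * latency_s


def summarize(values, min_runs=5):
    """Return (mean, sample standard deviation) over repeated runs.

    Raises ValueError when fewer than `min_runs` measurements are supplied,
    mirroring the ">= 5 runs" reporting practice noted in the table.
    """
    if len(values) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(values)}")
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Hypothetical measurements: operations per inference, average power, five latencies.
    ops_per_inference = 2e9
    power_w = 0.5
    latencies_s = [1.00e-3, 1.02e-3, 0.98e-3, 1.01e-3, 0.99e-3]

    efficiencies = [tops_per_watt(ops_per_inference, t, power_w) for t in latencies_s]
    edps = [energy_delay_product(power_w, t) for t in latencies_s]

    eff_mean, eff_std = summarize(efficiencies)
    edp_mean, edp_std = summarize(edps)
    print(f"TOPS/W: {eff_mean:.3f} +/- {eff_std:.3f}")
    print(f"EDP (J*s): {edp_mean:.3e} +/- {edp_std:.3e}")
```

Because the table also notes that aggressive quantization can inflate efficiency figures, a report built on such a helper would normally list task accuracy next to each TOPS/W value.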
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Marium, S.M.; Chen, S. Bridging Architectures, Mapping, and Learning for DNN Acceleration with Processing-in-Memory and In-Memory Computing Systems. Microelectronics 2026, 2, 10. https://doi.org/10.3390/microelectronics2020010
