Bridging Architectures, Mapping, and Learning for DNN Acceleration with Processing-in-Memory and In-Memory Computing Systems

Marium, Syeda Munazza; Chen, Song

doi:10.3390/microelectronics2020010

Open Access Review


by Syeda Munazza Marium and Song Chen *

School of Microelectronics, University of Science and Technology of China, Hefei 230029, China

* Author to whom correspondence should be addressed.

Microelectronics 2026, 2(2), 10; https://doi.org/10.3390/microelectronics2020010

Submission received: 3 February 2026 / Revised: 31 March 2026 / Accepted: 14 May 2026 / Published: 10 June 2026


Abstract

Processing-in-memory and in-memory computing (PIM/IMC) are increasingly explored to mitigate the von Neumann data-movement bottleneck that limits deep neural network (DNN) performance and energy efficiency. Progress, however, remains fragmented across device substrates, architectural prototypes, mapping and scheduling methods, compiler toolchains, and benchmarking practices, making results hard to compare and slowing deployment. This survey synthesizes developments from 2019–2025 along four coupled axes: (i) memory substrates and architectural design, (ii) mapping, partitioning, and scheduling, including learning- and graph-based strategies, (iii) compilers and end-to-end deployment flows, and (iv) benchmarking datasets, metrics, and reporting norms. Drawing on over twenty representative platforms spanning static random-access memory (SRAM) and dynamic random-access memory (DRAM), emerging non-volatile, capacitive, and photonic substrates, we clarify the trade-offs separating analog/charge-domain IMC from digital SRAM/DRAM-centric PIM, including reported peaks up to 600 TOPS/W and 1.5 TOPS/mm². We organize mapping frameworks into a unified reference taxonomy, identify recurrent evaluation pitfalls that undermine reproducibility, and highlight persistent gaps in training support, robustness under non-idealities, and coverage of large-scale GNN workloads. Finally, we outline a five-phase roadmap from benchmark standardization to industrial validation toward compiler-integrated, GNN-informed PIM/IMC systems validated on production-scale workloads.

Keywords: processing-in-memory; in-memory computing; deep neural networks; mapping; benchmarking

1. Introduction

Deep neural networks (DNNs) have scaled from millions of parameters to hundreds of billions, intensifying the “memory wall” imposed by von Neumann data shuttling between compute and memory [1,2,3,4]. Processing-in-memory (PIM) and in-memory computing (IMC) address this bottleneck by relocating computation to, or directly inside, memory arrays. For DNN workloads, the reported gains frequently fall in the range of 10–100× in energy efficiency and 4–50× in speed/latency versus conventional CPU/GPU baselines [5,6,7,8,9,10,11,12,13,14].

1.1. Tutorial Snapshot: PIM vs. IMC

In a conventional system, the CPU must fetch, operate on, and write back data, paying bandwidth and energy at each hop. PIM reduces this transfer by integrating simple compute near or within memory chips, while IMC realizes operations (e.g., analog/digital matrix–vector products) directly in the memory fabric, executing “in place” to minimize movement and delay [1,2,15,16,17,18]. Early PIM-instruction proposals exposed near-data operations to the instruction set architecture (ISA) and runtime [19]. Figure 1 summarizes the interactions between devices, mapping strategies, co-design challenges, and benchmarking practices.

Terminology note. We treat PIM as computation integrated within a memory module or stack (including logic-near-memory and in/near-chip support), whereas IMC denotes operations realized inside the memory-array fabric (bitline/wordline/cell level) in SRAM, DRAM, or emerging non-volatile memories. These definitions are used for consistency rather than to impose a community-wide standard.

1.2. Scope and Taxonomy of This Survey

We cover the broader PIM/IMC landscape but emphasize DRAM-based PIM systems as the main deployment thread (including DRAM-array primitives and 3D-stacked DRAM with logic layers), using SRAM- and NVM-centric IMC primarily for context and contrast. Throughout, we categorize systems using two orthogonal cues: (i) where computation occurs (logic-near-memory vs. inside-array), and (ii) how it is realized (digital/bit-serial primitives vs. analog/charge- or conductance-domain accumulation). We synthesize the field across four tightly coupled axes:

1. Devices and architectures: SRAM, DRAM, and non-volatile memories (ReRAM/PCM/FeFET/STT-RAM) as substrates for both logic-near-memory PIM and inside-array IMC, including their integration and scalability trade-offs [1,2,6,12,15,20,21].
2. Mapping, partitioning, and scheduling: from rule-based/ILP to learning- and graph-based strategies tuned to device constraints and hierarchies [5,13,22,23].
3. Compilers and toolchains: programming models, compilers, and simulation frameworks enabling end-to-end exploration, including SongC, LOGIC, MNSIM 2.0, and Sim2PIM as representative prior works rather than software packages used in this study [24,25,26,27].
4. Benchmarking datasets and evaluation practices: standardized workloads/metrics to enable fair comparison and reproducibility [7,8,9,28,29].

To improve methodological transparency, this survey focuses on representative studies reported primarily during 2019–2025 and selects them to span the principal design axes of the field: memory substrate, compute modality, mapping/scheduling strategy, software-stack maturity, and benchmarking practice. We prioritize works that provide quantitative evaluation and clear architectural context while avoiding direct performance ranking when workload, precision, batch size, technology node, or validation basis are insufficiently disclosed. Accordingly, comparative conclusions in this review are intended to identify consistent trends and limitations rather than to claim strict cross-paper equivalence.

1.3. Where Practice Stands: Opportunities and Frictions

Graph-centric modeling, especially graph neural networks (GNNs), captures DNN structure for intelligent mapping to heterogeneous memory fabrics, while machine learning (ML) methods increasingly automate hardware-aware scheduling and co-design [30,31,32,33,34]. Benchmarking frameworks are also expanding beyond raw throughput toward big-data and graph-centric metrics [35,36]. However, mismatches persist between prototype claims and deployment realities: academic IMC demonstrations on 90 nm RRAM arrays show accuracy sensitivities to environmental conditions that are often abstracted away in idealized simulation [37], while industrial-scale designs such as a 28 nm sparse-DNN chip highlight the need for dynamic sparsity and on-device training capabilities that are still underrepresented across surveys [38]. End-to-end pipelines like DNN+NeuroSim V2.0 partially close the loop by modeling the stack from device non-idealities up through architecture, yet they are rarely integrated into actionable cross-stack design playbooks [39]. As a result, reported PIM/IMC efficiency can vary by >10× under ostensibly similar DNN workloads due to inconsistent datasets, metrics, and tooling, limiting rigorous comparison and cumulative progress [7,8,9,28,29].

1.4. Survey Positioning and Roadmap

This survey advances a unified view linking device realities to mapping policies, compiler support, and benchmarking standards. We spotlight graph- and learning-driven mappers that move beyond static rules or single-objective integer linear programming (ILP), targeting adaptive partitioning and scheduling under heterogeneous resource and precision constraints [13,22,23]. We also enumerate the software infrastructure (SongC, LOGIC, MNSIM 2.0, and Sim2PIM) needed to make these methods practical at scale [24,25,26,27]. Throughout, we emphasize reproducible evaluation and open interfaces so that results translate from academic prototypes into robust, deployment-ready systems [8,9,28,29,30,31,32,33,34,35,36,39]. By offering an integrative roadmap, this survey aims to catalyze the next phase of intelligent, graph-aware PIM/IMC systems. Section 2 details the architectural landscape and emerging memory technologies that form the hardware substrate for DNN deployment.

2. Landscape of IMC/PIM Architectures (Hybrid)

Processing-in-memory (PIM) and in-memory computing (IMC) reduce data movement by bringing computation into memory modules/stacks or directly into the array fabric. We treat them as a continuum: PIM covers logic-near-memory integration, while IMC covers in-array operations realized in SRAM, DRAM, or non-volatile devices, spanning digital primitives to analog accumulation for bandwidth-efficient, energy-aware DNN deployment [4,15,40,41,42,43,44,45,46,47,48,49,50,51,52].

As illustrated in Figure 2, these paradigms reshape the hardware stack by integrating computation closer to memory, reducing the overhead of frequent transfers. This section synthesizes the architectural shift, device taxonomy and quantitative trade-offs, peripheral bottlenecks and their mitigation, integration and 3D stacking, and the near-/long-term application outlook, grounding the discussion in concrete digital-PIM performance and photonic-IMC specifics. Our aim is tutorial-plus-review: explain why the choices matter, show how they perform, and tie device-level realities to system-level consequences.

2.1. Paradigm Shift: From Data Movement to Memory-Centric Compute

Eight decades of system design have progressively relaxed strict logic–memory separation. As shown in Figure 3, the historical trajectory highlights a steady dismantling of the von Neumann bottleneck. IMC performs analog/digital operations directly in arrays using RRAM, PCM, FeFET, and related devices, exposing locality that conventional accelerators leave on the table [4,15,40,41,42,43,45,46,47,48,49,50]. PIM, often realized in DRAM/SRAM, co-locates digital compute with banks to improve integration, reliability, and scalability while preserving mature toolchains and process flows [44,51,52]. The resulting spectrum balances energy, precision, process compatibility, programmability, and system complexity, with hybrids increasingly blurring analog–digital and memory–compute boundaries to meet application constraints and deployment timelines.

2.2. Digital PIM in DRAM/SRAM: Operations, Gains, and Exemplars

Modern digital PIM already exposes a rich on-die operation set: arithmetic (add, mul, multiply-accumulate (MAC), reduction, bitwise, LUT), data movement (row/column copy, gather/scatter, concurrent compute-transfer), and programmable precision from 1 to 16 bits. It does so with measured system-level benefits [7,53,54,55,56]. Quantitatively, DRAM-PIM delivers up to 406% faster matrix–vector multiplication than host burst reads, demonstrating that near-array compute can outrun host bandwidth ceilings [53]. Shared-PIM reports 5× latency reduction and 1.2× energy improvement versus low-cost inter-linked subarrays (LISA) by eliminating redundant transfers [55]. BL-PIM achieves 28.9× CPU single-thread and 12× CPU multi-thread speedup on transformer workloads, indicating that attention-heavy models benefit from row-stationary dataflows near DRAM [57]. In stacked memories, high-bandwidth memory (HBM)-PIM integrates compute in logic layers, yielding 53% higher performance and 10.4% better energy efficiency than GPU-HBM systems at comparable capacities [58]. Commercial DRAM-PIM (UPMEM) scales from 640 to 2556 DRAM processing units (DPUs) at 0.4–2.5 TOPS/W with 1.5–2.5× lower latency than CPUs across common kernels, showing that production software stacks can already target PIM primitives [9]. On SRAM, Z-PIM spans 0.31–49.12 TOPS/W via sparsity-aware, variable-precision execution that trades bit-width for utilization on the fly [54]. These comparative system-level outcomes are consolidated in Table 1, which highlights representative memory technologies and their integration trade-offs. Collectively, these outcomes position digital PIM as a deployable class of accelerators rather than a lab curiosity, with roadmaps that naturally dovetail with existing DRAM/SRAM ecosystems [7,9,53,54,55,56,57,58].

2.3. Device Taxonomy and Comparative Trade-Offs (Volatile vs. Non-Volatile; Analog vs. Digital)

A relational view across volatility and computational modality clarifies deployment niches [15,51]. Volatile devices (DRAM, SRAM) dominate near-term systems: DRAM offers ∼1–8 effective bits and $10^{14}$–$10^{16}$ endurance cycles but pays ∼10–100 pJ/op for writes/reads, with refresh shaping scheduling and energy envelopes [15,51]. SRAM provides up to 16-bit precision with sub-nanosecond switching and effectively unlimited endurance, yet density and leakage constrain scale in area-bound deployments [54,59,72,73]. Non-volatile families (FeFET, MRAM, charge-trap devices) trade density and instant-on operation against drift, endurance, and process variability, which shifts where calibration complexity sits in the stack [63,74]. In practice, the maturity of DRAM/SRAM underpins scalable PIM, e.g., LUT-based DRAM PIM, Z-PIM, and HBM-PIM, while emerging NVMs feed analog/digital IMC prototypes that explore precision–energy frontiers [7,54,58].

Concrete exemplars illustrate these niches in practice. DRAM systems such as HBM2-PIM and the Smart Memory Cube pair stacked bandwidth with near-array compute to amortize movement costs [51,52]. SRAM-centric designs including PIMCA and IMC-Sort target low-latency control and predictable timing for embedded inference [41,59,60]. RRAM-based platforms (SLIM, 2T2R RRAM) and PCM variants (PCM-AIMC, Photonic PCM) probe analog accumulation and multi-level storage for dense MACs [47,48,49,50,61,62]. FeFET lines (e.g., FeFET-PIM, DG-FeFET) and MRAM families (BCLS-SP, SOT-MRAM) examine retention, write energy, and selector co-design under realistic process windows [34,64,65,66,67,68]. These comparative insights are visualized in Figure 4, which contrasts technologies along precision, endurance, energy, and maturity dimensions, and detailed further in Table 1 for concrete examples such as DRAM-PIM, PIMCA, RRAM SLIM, and PCM-AIMC.

2.4. Peripheral Overheads and System Bottlenecks

Array-internal efficiency is necessary but not sufficient: ADCs, DACs, and periphery often dominate system budgets. Converter chains can consume 50–95% of energy, occupy 10–50% of area, be 10–12× larger than a 128 × 128 array, and dictate a latency that is frequently orders of magnitude above MAC time when left unoptimized [75,76,77]. Effective systems respond with quantization-aware training, mixed-precision and architected converters, converter sharing across tiles, and accumulation schemes that keep precision where it matters, reclaiming energy/area while preserving accuracy under deployment constraints [75,76,77,78,79].
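The leverage of converter optimization can be illustrated with an Amdahl-style estimate. The sketch below is a back-of-the-envelope model, not a method from the cited works: it assumes that only the converter share of total energy shrinks while everything else stays fixed. The function name `system_energy_gain` and the 4× converter reduction are illustrative assumptions; the 50–95% shares are the range quoted above.

```python
def system_energy_gain(converter_share: float, converter_reduction: float) -> float:
    """Overall energy improvement when only converter energy is reduced.

    converter_share: fraction of total energy spent in ADC/DAC chains (0..1).
    converter_reduction: factor by which converter energy shrinks (>= 1).
    """
    if not 0.0 <= converter_share <= 1.0:
        raise ValueError("converter_share must lie in [0, 1]")
    if converter_reduction < 1.0:
        raise ValueError("converter_reduction must be >= 1")
    # Energy left after optimization, normalized to the original total.
    remaining = (1.0 - converter_share) + converter_share / converter_reduction
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical 4x cheaper converters at three converter energy shares.
    for share in (0.5, 0.8, 0.95):
        gain = system_energy_gain(share, converter_reduction=4.0)
        print(f"converter share {share:.0%}: system-level gain {gain:.2f}x")
```

Under this simplified model, the same converter improvement yields a larger system-level gain the more the converters dominate the budget, which is why converter sharing and reduced-resolution accumulation are attractive in converter-bound designs.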

2.5. Device Non-Idealities and Cross-Layer Mitigation

Non-volatile/analog arrays face IR-drop, variability, resistance drift, write disturbance, and sneak paths; unmodeled, these effects erode accuracy and mask energy benefits [45,80,81,82,83]. Cross-layer strategies restore lost performance: selector devices and iterative write-verify tighten state distributions; quantization-aware training (QAT) and noise injection at training time desensitize models to analog error; and converter-linearity calibration limits systematic bias at the boundary between analog arrays and digital logic [45,80,81,82,83]. The practical lesson is simple: mitigation must be budgeted alongside peak TOPS/W, because accuracy lost to variation rarely returns without explicit cross-layer investment.
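Training-time noise injection can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the scheme of any cited work: it models analog weight error as independent multiplicative Gaussian noise on each weight during a forward matrix–vector product. The function name `noisy_matvec` and the noise model are assumptions chosen for simplicity; real devices also show drift, IR-drop, and correlated variation that this sketch ignores.

```python
import random


def noisy_matvec(weights, x, sigma, rng):
    """Matrix-vector product with per-weight multiplicative Gaussian noise.

    weights: list of rows (each a list of floats), shape (n_out, n_in).
    x: input vector of length n_in.
    sigma: relative standard deviation of the weight perturbation.
    rng: a random.Random instance, so runs are reproducible.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            # Each stored weight is perturbed independently on every read.
            acc += w * (1.0 + rng.gauss(0.0, sigma)) * xi
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    print("ideal:", noisy_matvec(W, x, sigma=0.0, rng=random.Random(0)))
    print("noisy:", noisy_matvec(W, x, sigma=0.05, rng=random.Random(0)))
```

Using such a perturbed forward pass during training exposes the model to the error it will see at inference time, which is the intuition behind the desensitization described above.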

2.6. Integration Challenges, Selectors, and 3D Stacking

CMOS-proximate scaling diverges by device family. DRAM/SRAM continue to scale with well-understood design rules, whereas RRAM/PCM variability and FeFET process control place a premium on layout discipline and verification [63,64,65,66,67,68]. Selector technologies (OTS devices, 1S1R stacks, and self-selective elements) suppress sneak paths and enable larger arrays without prohibitive leakage [84,85,86,87,88]. Recent prototypes demonstrate 3D self-rectifying memristors for high-density readout and wafer-scale 2D/3D stacks surpassing 8 TOPS/W at >93% inference accuracy, underscoring that vertical integration can lift both density and energy envelopes when periphery keeps pace [89,90,91,92,93]. In practice, thermal crosstalk, interconnect resistance, and converter limits, not cell density alone, bound scaling trajectories and dictate where system architects should spend their margin [84,85,86,87,88,89,90,91,92,93].

2.7. Photonic IMC: Modulators, MAC Mechanics, Precision, and Energy

Photonic IMC exploits optical interference and wavelength-division multiplexing for massively parallel MAC. Weights are realized with microring resonators or PCM-based photonics; inputs are encoded optically and summed interferometrically or at photodiodes, with the electrical boundary set by source and detector characteristics [94,95,96,97]. Typical precision sits at 4–6 bits, limited by shot noise, nonlinearity, and thermal drift; pushing higher raises required photon counts and energy per MAC [94,97]. Reported energies span sub-pJ/MAC in silicon photonics, with projections to a few fJ/MAC in advanced platforms, positioning photonics toward ultra-fast, spatially parallel compute where quantization–energy trade-offs can be managed by algorithmic tolerance and mixed-signal partitioning [94,95,96,97].
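The link between precision and photon count can be made concrete with an idealized shot-noise estimate. The sketch below assumes Poisson photon statistics only: a full-scale signal of N detected photons has noise of about √N, so resolving 2^b levels with a step no smaller than the noise requires N ≥ 4^b. The function names and the 1550 nm wavelength are illustrative assumptions; optical losses, thermal noise, nonlinearity, and the electrical energy of sources, modulators, and converters are all ignored, so the result is only an idealized floor on detected optical energy.

```python
PLANCK_J_S = 6.62607015e-34      # Planck constant (J*s)
LIGHT_SPEED_M_S = 2.99792458e8   # speed of light in vacuum (m/s)


def min_photons_for_bits(bits: int) -> int:
    """Photons needed so that one LSB is at least the shot-noise std.

    Full scale N photons, noise sqrt(N), LSB = N / 2**bits.
    Requiring LSB >= sqrt(N) gives N >= 4**bits.
    """
    if bits < 1:
        raise ValueError("bits must be >= 1")
    return 4 ** bits


def detected_energy_joules(bits: int, wavelength_m: float = 1.55e-6) -> float:
    """Idealized detected optical energy per full-scale result."""
    photon_energy = PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m
    return min_photons_for_bits(bits) * photon_energy


if __name__ == "__main__":
    for bits in (4, 6, 8):
        n = min_photons_for_bits(bits)
        e = detected_energy_joules(bits)
        print(f"{bits} bits: {n} photons, {e:.3e} J detected")
```

In this model every additional bit quadruples the required photon count, which is consistent with the observation above that higher precision raises energy per MAC.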

2.8. Practical Rule-of-Thumb Bridge to Mapping (Device-Aware Constraints)

Because Section 3 builds on device-aware mapping, we summarize the immediate implications here. When I/O bandwidth limits throughput, output- or row-stationary flows keep activations local and favor large GEMMs and attention blocks; when arrays are shallow or local SRAM is tight, deeper pipelines with double-buffering maintain utilization. Short retention argues for periodic refresh or remapping schedules; low ADC resolution motivates more frequent digital accumulation to preserve accuracy; and large or high-precision layers should be tiled to respect array and converter limits so that timing closure is routine rather than heroic [39,44,98,99,100,101]. These rules keep the architecture–mapping handshake within endurance, retention, precision, array-size, and converter envelopes that real hardware enforces.
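The tiling rule can be sketched as follows. This is a minimal functional model under stated assumptions, not a description of any particular accelerator: a layer's weight matrix is cut into tiles no larger than a hypothetical array size, each tile produces a partial sum over its slice of the inputs, and the partial sums are accumulated digitally. The names `tile_counts` and `tiled_matvec`, and the parameters `tile_out` and `tile_in`, are illustrative.

```python
def tile_counts(n_out: int, n_in: int, tile_out: int, tile_in: int):
    """Number of tiles along the output and input dimensions (ceiling division)."""
    return (-(-n_out // tile_out), -(-n_in // tile_in))


def tiled_matvec(weights, x, tile_out: int, tile_in: int):
    """Matrix-vector product computed tile by tile.

    weights: list of rows, shape (n_out, n_in); x: input vector of length n_in.
    Each (tile_out x tile_in) tile stands in for one memory array.
    """
    n_out, n_in = len(weights), len(x)
    y = [0.0] * n_out
    for o0 in range(0, n_out, tile_out):
        for i0 in range(0, n_in, tile_in):
            i1 = min(i0 + tile_in, n_in)
            for o in range(o0, min(o0 + tile_out, n_out)):
                # Partial dot product produced inside one array.
                partial = sum(weights[o][i] * x[i] for i in range(i0, i1))
                # Digital accumulation of partial sums across input tiles.
                y[o] += partial
    return y


if __name__ == "__main__":
    print(tile_counts(300, 200, 128, 128))
    print(tiled_matvec([[1, 2, 3], [4, 5, 6]], [1, 1, 1], tile_out=1, tile_in=2))
```

Smaller input tiles mean more partial sums per output, which is where the trade-off between ADC resolution and the frequency of digital accumulation enters.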

2.9. Application Scope and Commercialization Pathways

Near-term deployments anchor in DRAM/SRAM (e.g., HBM-PIM, Smart Memory Cube, PIMCA, IMC-Sort) for AI/HPC, where software stacks and packaging already support memory-proximate compute [41,51,52,59,60]. Mid-term directions leverage manufacturable NVMs as endurance and switching characteristics improve, allowing more aggressive mixed-signal designs with calibrated variation [71]. Long-term trajectories point to photonic and novel logic-in-memory substrates for neuromorphic and ultra-fast inference, provided that converter-aware co-design, selector-enabled scaling, and robust cross-layer mitigation remain first-class constraints rather than afterthoughts [50,61,62]. Across these paths, commercial success tracks the teams that integrate device physics, periphery design, and compiler/mapping choices into a single, testable story that holds under real-world noise, drift, and workload diversity. In summary, the PIM/IMC hardware landscape is both technologically diverse and rapidly evolving. Mature platforms offer immediate scalability, while emerging memory technologies promise disruptive efficiency gains pending solutions to longstanding device and integration hurdles. This interplay of trade-offs sets the stage for the next section, where we examine how these architectures influence the design of DNN mapping strategies, scheduling algorithms, and partitioning heuristics.

3. Taxonomy and Pipeline for Mapping, Partitioning, and Scheduling on PIM/IMC (Hybrid, Combined)

Deploying DNNs on heterogeneous PIM/IMC platforms demands mapping strategies that respect device realities (precision, endurance, variability), memory hierarchies, and system constraints. Rule-based and heuristic methods remain useful in well-characterized settings; metaheuristics open larger design spaces. Recent work adds RL/DRL, graph-based methods, and hybrids that integrate ILP/heuristics with learning-guided search for scalability and adaptivity [100,102,103,104,105,106,107,108,109,110]. These approaches account for non-idealities (RRAM variability, PCM drift, DRAM refresh), improving cross-layer robustness [26,37,111], and often beat traditional flows on latency, energy, and scalability, especially for large or heterogeneous deployments [106,107,110,112]. Hybrid combinations (ILP + ML, RL + graph) and compiler-integrated toolchains further boost adaptability and end-to-end usability [102,104,105,106,107,112,113]. The spectrum of conventional and advanced methods is consolidated in Table 2, which compares their mechanisms, strengths, and limitations.

3.1. Conventional → Advanced Strategy Spectrum

Rule-based. Predefined rules tied to layer cues (e.g., kernel size, reuse) offer simplicity and low cost but generalize poorly to heterogeneous hardware and evolving topologies [100,102,114,132].
Heuristics. Greedy/DP/graph-guided methods add scalable cost models (latency, memory, energy) yet can miss global optima under intertwined constraints [103,115,116,117,118,133].
Metaheuristics. GA/PSO/simulated annealing explore high-dimensional spaces and heterogeneous constraints but incur compute overheads that limit real-time use [104,112,119].
Learning-/graph-based and hybrids. DRL/RL learns adaptive mapping/scheduling under non-stationary workloads (training cost is the key hurdle) [121,122,123,124,125,126]. Graph-based methods cast models as data flow graphs (DFGs) and apply partitioning/min-cut to reduce communication and balance load [127,128]. Hybrids mix rules with auto-tuned design space exploration (DSE) and profiling to co-optimize multi-objective trade-offs [102,129,130,131]. Current taxonomies emphasize metaheuristics for search efficiency [58,134,135], RL/DRL for runtime adaptability [106,107,112], and hybrid (ILP + ML, RL + graph) dominance at scale [102,104,105,106,107,112,113].

3.2. Device-Aware and Cross-Layer Constraints (First-Class)

Mapping must respect analog-to-digital/digital-to-analog converter (ADC/DAC) precision and bandwidth limits, retention/endurance envelopes, array/tiling limits, local SRAM capacity, and DRAM/SRAM/NVM bandwidth/reuse; advanced strategies integrate these to preserve accuracy and throughput [26,37,111].

3.3. Pipeline Primitives and Design Knobs

Operator mapping (layer → resource).

Deterministic rule-based schemes work when profiling is stable and hardware well characterized [132,136]; under heterogeneity or tight capacity, learned/hybrid policies improve locality and prevent overruns.

Model partitioning.

Layer-based, tensor/spatial, pipeline/temporal, graph-cut, and hybrid (layer + tensor) options are used [105,118]. Weight-layout-guided hybrid partitioning (PIMCOMP) yields 32% latency reduction and 1.7× throughput, while profiled graph-cut achieves 18% execution-time reduction vs. manual strategies [103,105].

Task scheduling.

Static compile-time ILP (NicePIM) gives strong resource/communication optimality but less elasticity (37% latency and 28% energy gains reported); profile-guided and RL-based runtime schedulers adapt to system-state changes and heterogeneity [103,104,105]. Formally, the schedule length (makespan) across heterogeneous resources can be expressed as:

$$T = \max_{r \in R} \sum_{t \in T(r)} d_{t} \tag{1}$$

where $d_{t}$ denotes the duration of task t, $T(r)$ the set of tasks assigned to resource r, and R the set of all resources. This definition provides the mathematical basis for reporting “improved makespan” results in runtime scheduling studies [137,138].
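Equation (1) translates directly into code. The sketch below is a literal evaluation of that definition; like the equation, it assumes that tasks on the same resource run back to back and it ignores inter-task dependencies and communication. The function name `makespan` and the resource and task labels are hypothetical.

```python
def makespan(assignment, durations):
    """Schedule length per Equation (1).

    assignment: dict mapping each resource to the list of tasks placed on it.
    durations: dict mapping each task to its duration.
    Returns the largest per-resource sum of task durations.
    """
    per_resource = (
        sum(durations[task] for task in tasks) for tasks in assignment.values()
    )
    return max(per_resource, default=0)


if __name__ == "__main__":
    durations = {"conv1": 2, "conv2": 3, "fc": 4}
    balanced = {"bank0": ["conv1", "conv2"], "bank1": ["fc"]}
    skewed = {"bank0": ["conv1", "conv2", "fc"], "bank1": []}
    print(makespan(balanced, durations), makespan(skewed, durations))
```

Comparing the two placements shows why load balance matters: the total work is identical, but the schedule length is set by the most heavily loaded resource.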

Dataflow scheduling.

Weight-/output-/row-stationary patterns trade bandwidth efficiency and reuse to minimize latency/energy in memory-centric pipelines [136].
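The reuse trade-off between stationary patterns can be seen in a toy loop-nest model. The sketch below computes the same batched matrix–vector product under a weight-stationary and an output-stationary loop order and counts how often weights are fetched and outputs are touched. It assumes a single holding register and no caches or array-level parallelism, so the counts are illustrative only; the function names and counter labels are hypothetical.

```python
def weight_stationary(W, X):
    """Hold each weight in place and stream the whole batch past it."""
    n_out, n_in, batch = len(W), len(W[0]), len(X)
    Y = [[0.0] * n_out for _ in range(batch)]
    counts = {"weight_loads": 0, "output_updates": 0}
    for o in range(n_out):
        for i in range(n_in):
            w = W[o][i]                      # fetched once, reused for all inputs
            counts["weight_loads"] += 1
            for b in range(batch):
                Y[b][o] += w * X[b][i]       # partial sum read-modify-write
                counts["output_updates"] += 1
    return Y, counts


def output_stationary(W, X):
    """Hold each output accumulator in place and stream weights past it."""
    n_out, n_in, batch = len(W), len(W[0]), len(X)
    Y = [[0.0] * n_out for _ in range(batch)]
    counts = {"weight_loads": 0, "output_updates": 0}
    for b in range(batch):
        for o in range(n_out):
            acc = 0.0                        # accumulator stays local
            for i in range(n_in):
                acc += W[o][i] * X[b][i]
                counts["weight_loads"] += 1  # weight re-fetched per input
            Y[b][o] = acc                    # written back once
            counts["output_updates"] += 1
    return Y, counts


if __name__ == "__main__":
    W = [[1, 2], [3, 4]]
    X = [[1, 0], [0, 1], [1, 1]]
    print(weight_stationary(W, X))
    print(output_stationary(W, X))
```

In this model the weight-stationary order minimizes weight traffic at the cost of more partial-sum traffic, and the output-stationary order does the opposite, which is the bandwidth-versus-reuse trade described above.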

End-to-end view.

Mapping decisions feed hardware execution via a feedback co-design loop, exposing partitioning/scheduling to compiler/runtime for iterative improvement. These elements are synthesized in Figure 5, which illustrates the complete deployment pipeline from high-level mapping strategies to hardware execution and feedback-driven co-design.

3.4. Strategy-to-Flow Schematics (Fit and Roles)

A modern taxonomy spans rule-based/ILP/metaheuristics/RL/GNN/hybrids with characteristic flows and targets [135,139,140,141,142,143,144,145,146,147]. Typical schematics: loop-tiling pipelines (heuristic/metaheuristic tiling → ILP/RL scheduling → GNN-guided mapping → code generation) [145,146]; DAG scheduling (ILP/metaheuristics for order, GNN/RL for placement) [140,141,142]; RL feedback loops (state → agent → action → reward → policy) [142,147,148].
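The RL feedback loop can be illustrated with a deliberately reduced example. The sketch below is a stateless epsilon-greedy bandit rather than a full RL agent: the action is a tile size, the reward is the negative of a latency estimate, and the policy is the running value table. The latency model `toy_latency`, the candidate tile sizes, and the hyperparameters are all hypothetical and are not taken from any cited framework.

```python
import random


def toy_latency(tile: int) -> float:
    """Hypothetical cost: fewer, larger tiles cut passes but raise per-tile overhead."""
    return 1024 / tile + 0.25 * tile


def train_bandit(latency_of, actions, episodes=200, epsilon=0.1, seed=0):
    """Epsilon-greedy loop: choose action, observe reward, update value estimate."""
    rng = random.Random(seed)
    value, count = {}, {}
    for a in actions:                      # try every action once to initialize
        value[a] = -latency_of(a)
        count[a] = 1
    for _ in range(episodes):
        if rng.random() < epsilon:
            a = rng.choice(actions)        # explore
        else:
            a = max(actions, key=value.get)  # exploit current policy
        reward = -latency_of(a)            # lower latency means higher reward
        count[a] += 1
        value[a] += (reward - value[a]) / count[a]  # incremental mean update
    return max(actions, key=value.get)


if __name__ == "__main__":
    best = train_bandit(toy_latency, [16, 32, 64, 128])
    print("selected tile size:", best)
```

A scheduler of the kind surveyed here would add a state (for example, resource occupancy) and a learned policy network; the loop structure of action, reward, and policy update is the part this sketch is meant to show.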

3.5. Compiler Integration and Portability

Integration with TVM, Halide, XLA is advancing, yet end-to-end RL/GNN/hybrid scheduling remains uneven due to application programming interface (API) fragmentation, reproducibility gaps, and portability barriers [134,144,145,149,150,151]. Portable intermediate representations (IRs) (e.g., ApproxHPVM) and model-based layers (e.g., MATCH) decouple policies from hardware specifics, improving transfer [144,145,150].

3.6. Cross-Layer Co-Optimization and Feedback

Co-optimization unifies mapping/routing with device-level constraints (energy, quantization, wire length) [138,152,153]. Compiler ↔ device feedback uses runtime telemetry (sensors, counters) to adapt schedules under variability and endurance limits [140,143,148,152,154]. At the edge, event-driven re-tiling, kernel hot-swaps, and latency-aware control are increasingly common [142,152,155,156,157].

3.7. Quantitative Evidence and Case Studies

RL/GNN schedulers often surpass ILP/heuristics on large models (e.g., ResNets, Transformers) when exact solvers become intractable [141,142,143,145,147]. Examples: GraphAGILE (compiler + GNN): 47.1× latency improvement vs. CPU, 2.9× vs. FPGA [143]; MARCO (RL + Tabu): 44–50% shorter schedules [138]; RL + GNN (TOP): throughput/concurrency gains over SOTA [147]; TVM-based model-aware mapping: up to 60.9× vs. TVM [145]. These complement PIMCOMP and graph-cut scheduling gains above [103,105].

3.8. Contributors Snapshot

The literature spans CGRA mapping (REGIMap/CRIMSON/RAMP), SAT-/SMT-based scheduling, evolutionary/hyper-heuristics, and GA-based spatial mapping [137,153,158,159,160,161,162,163,164,165]. Frequently appearing contributors include Aviral Shrivastava [162,163,164], Takuya Kojima [140,153], and L. Pozzi [160,165].

3.9. Limits and Gaps

Conventional pipelines can under-utilize parallelism under sparse/bursty workloads, struggle with variability, and lack deep device-aware co-optimization and learning-centric control. Embedding runtime adaptivity and graph-based intelligence via automated compilers and GNN-augmented tools offers a path forward [118,126,131]. Standardized APIs, portable IRs, and reproducible benchmarks are still required for broad industrial uptake [134,144,145,149,150,151].

3.10. Takeaway

Rule-based/heuristic pipelines remain baselines, but metaheuristics, RL/DRL, graph-based, and hybrid designs are now the main levers for scalable, device-aware mapping on modern PIM/IMC platforms [102,104,105,106,107,112,113], often surpassing classical ILP/heuristic flows under heterogeneity and runtime variability [106,107,110,112]. Building on this taxonomy, the following section provides a chronological walkthrough of how mapping frameworks have evolved from rule-based pipelines to graph- and learning-driven systems, marking key innovations from 2019 to 2025.

4. Chronological Evolution and Comparative Analysis of DNN-to-PIM Mapping Frameworks (2019–2025)

The deployment of deep neural networks (DNNs) onto processing-in-memory (PIM) and in-memory computing (IMC) platforms has evolved rapidly, reflecting a transition from rule-based heuristics to advanced, learning-driven and graph-based co-design strategies. This trajectory is best understood by situating individual contributions within a taxonomy of frameworks, namely heuristic, ILP/optimization, metaheuristic, reinforcement learning (RL/DRL), graph-based, and hybrid approaches, and tracing their chronological progression. Table 3 provides an overview of representative frameworks spanning 2019 to 2025, with their innovations and benchmark outcomes.

4.1. Early Years (2019–2020): Foundational Simulators and Rule-Based Flows

The first wave of frameworks demonstrated the feasibility of mapping convolutional workloads onto crossbar and 3D-stacked memory systems. NNPIM [5] pioneered weight-sharing and parallel execution across analog crossbars, achieving up to 48.2× speedup and 131.5× energy efficiency relative to GPUs. DLUX [167] exploited in-DRAM look-up tables with near-bank mapping and loop tiling, while 3D-ReG [166] integrated GPUs with 3D-stacked PIM, reporting 5.64× training speedup and 3.56× energy savings. These efforts emphasized that mapping is inherently a cross-stack optimization problem. In parallel, the release of DNN+NeuroSim V2.0 [39] introduced an end-to-end simulator that captured device non-idealities, circuit-level dynamics, and system-level performance, laying the foundation for reproducible benchmarking.

From the perspective of the new taxonomy, these frameworks exemplify heuristic and rule-based strategies: static scheduling and tiling flows that optimize for locality but struggle to scale across heterogeneous architectures [101,104,106,109,167,170,175,180,181,182,183].

4.2. Maturity Phase (2021–2022): Hardware-Aware Optimization and ILP-Based Scheduling

As devices scaled, mapping frameworks adopted optimization-driven and quantization-aware approaches. ZigZag [130] introduced uneven mapping and nested-for-loop design space exploration, improving energy efficiency by up to 64% compared with prior DSE flows. Robust PIM [168] integrated Hessian-driven quantization and sensitivity analysis to maintain inference accuracy under variability. PIM-DRAM [169] leveraged intra-bank accumulation in DRAM-PIM primitives, demonstrating up to 19.5× speedup over GPU baselines.

A key inflection point arrived with DDAM [180] and NicePIM [104]. DDAM applied a dynamic programming and traveling salesman problem (TSP) formulation to partition CNNs efficiently, reducing inter-layer communication costs. NicePIM employed a tri-layer flow (PIM-Tuner, Mapper, and Scheduler) integrating ILP-based scheduling and deep kernel learning for joint optimization of placement and configuration.

These advances align with the ILP/optimization category, where mapping is treated as a constrained mathematical problem to maximize throughput, minimize latency, or balance communication overheads [64,104,177]. Complementary works such as IVQ [33] and RaQu [171,172] incorporated quantization- and AutoML-driven strategies, delivering multi-fold gains in energy efficiency and accuracy. Their integration of hardware-aware quantization, hybrid scheduling, and temporal–spatial mapping signaled a growing emphasis on automated optimization [105,106,130,175].

Formally, ILP-based mapping can be expressed as the following constrained optimization problem:

\begin{matrix} min_{x, s} & α T + β E + γ C s . t . \sum_{l} mem (l) x_{l, p} \leq {Cap}_{p}, \\ s_{j} \geq s_{i} + d_{i} \forall (i \to j), x_{l, p} \in {0, 1} \end{matrix}

(2)

Here, $T$, $E$, and $C$ denote the total latency, energy consumption, and communication cost, respectively, while $\alpha$, $\beta$, and $\gamma$ are weighting factors that balance these objectives. The variable $x_{l,p}$ is a binary decision indicating whether layer $l$ is assigned to processing unit $p$, subject to the memory capacity $\mathrm{Cap}_{p}$. The scheduling constraint ensures that a dependent task $j$ can begin only after task $i$ completes, shifted by its delay $d_{i}$ [104,177,180].
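As a minimal sketch of Equation (2), the following Python snippet enumerates every layer-to-unit assignment for a toy three-layer chain and keeps the feasible assignment with the lowest weighted cost. The layer footprints, per-unit latency and energy figures, the unit communication cost, and the exhaustive search (used here in place of a real ILP solver) are illustrative assumptions, not values or methods taken from the cited frameworks.

```python
from itertools import product

# Hypothetical toy instance (all numbers are illustrative assumptions).
MEM = [4, 3, 5]                    # mem(l): memory footprint of each layer
CAP = [8, 8]                       # Cap_p: capacity of each processing unit
LAT = [[2, 3], [4, 2], [3, 3]]     # d_l: latency of layer l on unit p
ENERGY = [[1, 2], [2, 1], [2, 2]]  # energy of layer l on unit p
EDGES = [(0, 1), (1, 2)]           # dependencies i -> j (a simple chain)
COMM = 1                           # cost charged when an edge crosses units
ALPHA, BETA, GAMMA = 1.0, 0.5, 2.0


def evaluate(assign):
    """Return (cost, start_times) for an assignment, or None if infeasible."""
    # Capacity constraint: sum_l mem(l) * x_{l,p} <= Cap_p for every unit p.
    for p, cap in enumerate(CAP):
        if sum(MEM[l] for l, q in enumerate(assign) if q == p) > cap:
            return None
    # Precedence constraint s_j >= s_i + d_i, met by as-soon-as-possible starts.
    start = [0] * len(assign)
    for i, j in EDGES:  # edges are listed in topological order
        start[j] = max(start[j], start[i] + LAT[i][assign[i]])
    n = len(assign)
    total_latency = max(start[l] + LAT[l][assign[l]] for l in range(n))
    total_energy = sum(ENERGY[l][assign[l]] for l in range(n))
    total_comm = sum(COMM for i, j in EDGES if assign[i] != assign[j])
    cost = ALPHA * total_latency + BETA * total_energy + GAMMA * total_comm
    return cost, start


def solve():
    """Exhaustively search all assignments and return the cheapest feasible one."""
    best = None
    for assign in product(range(len(CAP)), repeat=len(MEM)):
        result = evaluate(assign)
        if result is not None and (best is None or result[0] < best[0]):
            best = (result[0], assign, result[1])
    return best


best_cost, best_assign, best_start = solve()
print(best_cost, best_assign, best_start)  # -> 11.0 (0, 1, 1) [0, 2, 4]
```

Exhaustive enumeration grows exponentially with the number of layers, which is why practical flows hand the same objective and constraints to an ILP solver or fall back on heuristics.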

4.3. Recent Innovations (2023–2025): Graph-Based, RL/DRL, and Hybrid Frameworks

The most recent frameworks demonstrate a decisive shift toward adaptability, scalability, and graph-structural modeling. Gibbon [175] leveraged evolutionary co-exploration of DNN architectures and PIM hardware, reducing energy–delay product by up to 5.96× [175,178]. NICE-PIM’s successors extended ILP scheduling with runtime adaptivity. Reinforcement learning–based frameworks [175] introduced dynamic policy adaptation, learning scheduling rules robust to workload variability.

Graph-partitioning objectives can be formalized through a min-cut optimization that balances partition loads while minimizing communication:

$$
\min_{\pi} \sum_{(i,j) \in E} w_{ij}\, \mathbf{1}[\pi(i) \neq \pi(j)] \quad \text{s.t.} \quad \sum_{v : \pi(v) = k} w_{v} \approx \frac{1}{K} \sum_{v} w_{v}, \quad \forall k
\tag{3}
$$

In this formulation, $\pi(v)$ denotes the partition assignment of node $v$, and $w_{ij}$ is the communication weight of edge $(i, j)$. The indicator $\mathbf{1}[\pi(i) \neq \pi(j)]$ counts edges that cross partitions, while the balance constraint enforces that each partition $k$ receives approximately an equal share of the total node weight $\sum_{v} w_{v}$. This captures the dual goal of minimizing communication cost while preventing load imbalance across partitions [105,130].
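To make the two terms of Equation (3) concrete, the sketch below evaluates the cut cost and the load imbalance of a given partition on a small hypothetical graph. The node names, weights, and the choice of "maximum relative deviation from the ideal share" as the imbalance measure are assumptions made for illustration; the cited frameworks may quantify balance differently.

```python
def cut_cost(edges, part):
    """Sum w_ij over edges whose endpoints fall in different partitions."""
    return sum(w for (i, j), w in edges.items() if part[i] != part[j])


def load_imbalance(node_weights, part, k):
    """Largest relative deviation of a partition load from the ideal 1/K share."""
    ideal = sum(node_weights.values()) / k
    loads = [0.0] * k
    for v, w in node_weights.items():
        loads[part[v]] += w
    return max(abs(load - ideal) for load in loads) / ideal


# Hypothetical four-operator graph with equal compute weights.
node_weights = {"a": 2, "b": 2, "c": 2, "d": 2}
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 1}

good = {"a": 0, "b": 0, "c": 1, "d": 1}  # keeps the heavy edges internal
bad = {"a": 0, "b": 1, "c": 0, "d": 1}   # cuts every edge

print(cut_cost(edges, good), load_imbalance(node_weights, good, 2))  # -> 2 0.0
print(cut_cost(edges, bad), load_imbalance(node_weights, bad, 2))    # -> 11 0.0
```

Both partitions are perfectly balanced, yet their communication cost differs by more than 5×, which is the gap a min-cut partitioner is meant to close.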

PyGim [184] enabled real GNN workloads to be mapped directly onto PIM platforms, outperforming PyTorch CPU baselines (Python 3.8) by up to 3.44×. PASGCN [185] further advanced GNN inference on crossbar accelerators, achieving up to 16,455× speedup alongside significant energy reduction.

Hybrid methods combined heuristics with metaheuristics (e.g., genetic algorithms, simulated annealing) or RL-based refinement. These approaches balance scalability and mapping optimality [104,105,130,175,178]. For instance, PIMCOMP [105] consolidated these principles into an end-to-end compiler, embedding weight-layout guided partitioning and multi-objective optimization across latency, throughput, and energy.

In RL/DRL-based scheduling, the reward function often integrates latency, energy, and communication cost as follows:

$$
r_{t} = -\left(\lambda_{L} L_{t} + \lambda_{E} E_{t} + \lambda_{C} C_{t}\right), \qquad J(\theta) = \mathbb{E}\left[\sum_{t} \gamma^{t} r_{t}\right]
\tag{4}
$$

Here, $L_{t}$, $E_{t}$, and $C_{t}$ represent the latency, energy, and communication cost observed at time step $t$, weighted by coefficients $\lambda_{L}$, $\lambda_{E}$, and $\lambda_{C}$. The immediate reward $r_{t}$ is accumulated into a long-term return $J(\theta)$, where $\gamma$ is the discount factor and $\theta$ parameterizes the agent’s policy. This formulation explicitly encodes the multi-objective trade-offs faced by RL-based schedulers [142,147,148,175].
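A short sketch of Equation (4) follows: it computes the per-step reward from hypothetical latency, energy, and communication measurements and then the discounted return of one trajectory. The coefficient values and the three-step trace are assumptions chosen for easy arithmetic, and the return of a single trajectory is only one Monte Carlo sample of the expectation $J(\theta)$.

```python
def reward(latency, energy, comm, lam_l=1.0, lam_e=0.5, lam_c=0.25):
    """r_t = -(lambda_L * L_t + lambda_E * E_t + lambda_C * C_t)."""
    return -(lam_l * latency + lam_e * energy + lam_c * comm)


def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t by accumulating from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical (latency, energy, communication) observations for three steps.
trace = [(2, 2, 4), (1, 2, 0), (1, 0, 0)]
rewards = [reward(*step) for step in trace]

print(rewards)                                # -> [-4.0, -2.0, -1.0]
print(discounted_return(rewards, gamma=0.5))  # -> -5.25
```

Raising one coefficient relative to the others shifts which schedules the agent prefers, so the weights act as the tuning knob for the latency, energy, and communication trade-off.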

Policy-gradient methods then update the scheduling policy according to:

$$
\nabla_{\theta} J(\theta) = \mathbb{E}\left[\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, A_{t}\right]
\tag{5}
$$

In this expression, $\pi_{\theta}(a_{t} \mid s_{t})$ denotes the probability of selecting action $a_{t}$ under state $s_{t}$ given parameters $\theta$. The advantage term $A_{t}$ measures how much better (or worse) the chosen action is relative to a baseline, guiding updates to the policy. The expectation over trajectories captures the stochastic nature of reinforcement learning in dynamic scheduling environments [142,147,148,175]. From the taxonomy viewpoint, these frameworks represent the convergence of graph-based [105,130], RL/DRL [175], and hybrid [104,105,130,175] paradigms, where adaptability to dynamic conditions is increasingly prioritized over static efficiency.
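The update in Equation (5) can be illustrated with a deliberately small example: a softmax policy choosing among three candidate mappings in a single-state setting, trained with sampled policy-gradient steps. The mapping costs, learning rate, step count, and the use of the policy's expected reward as the baseline are all assumptions for illustration; real schedulers condition on a state $s_{t}$ and typically use a learned value function.

```python
import math
import random

COSTS = [3.0, 1.0, 2.0]  # hypothetical cost of three candidate mappings


def softmax(theta):
    """Convert logits into action probabilities (numerically stable)."""
    peak = max(theta)
    exps = [math.exp(t - peak) for t in theta]
    norm = sum(exps)
    return [e / norm for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """Run sampled policy-gradient updates and return the final probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(COSTS)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(COSTS)), weights=probs)[0]
        reward = -COSTS[action]
        # Baseline: expected reward under the current policy.
        baseline = sum(p * -c for p, c in zip(probs, COSTS))
        advantage = reward - baseline
        for a in range(len(theta)):
            # For a softmax policy, d log pi(action) / d theta_a = 1[a == action] - pi(a).
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * advantage * grad_log_pi
    return softmax(theta)


probs = train()
print([round(p, 3) for p in probs])
```

After training, the policy should place most of its probability on the lowest-cost mapping (index 1); the exact values depend on the seed and step count.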

4.4. Comparative Trade-Offs and Open Gaps

The chronological trajectory underscores three trends. First, there is a clear shift from static heuristics to dynamic, learning- and graph-informed scheduling, improving adaptability under non-stationary workloads. ILP/metaheuristics deliver higher mapping quality but incur runtime costs, while heuristics are scalable but suboptimal [104,170,175,178,180]. RL and hybrid methods improve generality but add design complexity [105,130,175].

Second, frameworks now embed device-aware modeling. Mixed-precision quantization, robustness-aware pruning, and training-time noise injection mitigate variability and drift in RRAM, PCM, and FeFET arrays [168,171,174,176]. Yet, DRAM, SRAM, and photonic-specific non-idealities remain underexplored [109,167,169,177,182].

Third, there is a growing emphasis on benchmark-driven reproducibility. Integrated pipelines such as DNN + NeuroSim [39] and PyGim [184] provide standardized comparisons across diverse substrates. Yet, even large-scale benchmarks like MLPerf and OGB remain insufficiently calibrated for device-level non-idealities, leaving compilers under-validated for real PIM contexts. Benchmarking protocols also remain fragmented, limiting cross-framework comparability. While this evolution reveals increasing automation and architectural awareness, a critical enabler of such progress lies in the compilers and end-to-end mapping tools discussed next, which bridge software abstractions and memory-centric hardware execution.

Open gaps remain: most frameworks target inference only, leaving training support largely absent. Support for non-CNN workloads such as transformers and GNNs remains piecemeal, underscoring the limited generality of current frameworks. Research must therefore pursue generalizable, compiler-integrated solutions that unify device-aware modeling, graph-based scheduling, and adaptive mapping validated on production-scale hardware [33,170].

5. Software-Centric Approaches: Compilers and End-to-End Mapping Tools

5.1. Bridging Algorithm–Hardware Abstraction Gaps

Compilers and end-to-end mapping systems form the critical interface between high-level DNN descriptions and deployment on processing-in-memory (PIM) and in-memory computing (IMC) architectures. They coordinate pruning, quantization, and memory scheduling to satisfy hardware constraints while balancing energy, throughput, and accuracy trade-offs [44,136]. Mainstream ecosystems such as TensorFlow and PyTorch dominate model development yet remain largely agnostic to memory-centric execution; specialized compilers and mapping frameworks therefore address the abstraction and hardware-alignment gaps unique to IMC/PIM systems [44,136,186].

This ecosystem spans DRAM, SRAM, RRAM, PCM, FeFET, and photonic PIM/IMC, and can be organized into a taxonomy of rule-based, graph-transformation, ML-assisted, co-design, and hybrid human + ML frameworks [24,104,105,106,175,187]. Early rule-based flows such as Timeloop relied on static heuristics, whereas graph-level compilers like SongC and ML-assisted frameworks such as NicePIM and Gibbon enable cross-layer optimization [24,104,106,175]. End-to-end co-design tools like CoMN and benchmarking flows such as DNN + NeuroSim illustrate the spectrum from modular automation to full-stack device-aware deployment [39,187]. Simulator-centric validation has accelerated prototyping but can inflate expectations when real-world effects such as process variation, temperature drift, and circuit-level noise are omitted.

5.2. Mapping Pipelines and Dataflow Strategies

Mapping pipelines partition DNN workloads across memory arrays or PIM cores under constraints such as array dimensions, interconnect bandwidth, and memory compatibility; operators are assigned to resources, and scheduling engines refine execution to minimize latency and maximize data reuse [44,136]. Dataflow strategies align model operations with physical layouts (for example, weight-stationary for RRAM and row-stationary for DRAM), while memory allocation and parallelization exploit the spatial and temporal hierarchy [44,136]. Frameworks such as Fast-OverlaPIM and HuNT explicitly target tiling and heterogeneous multi-device training, and SIAM demonstrates chiplet-based scalability, together highlighting the diversity of execution strategies across heterogeneous devices [106,109,188].
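As a rough illustration of how array dimensions constrain a weight-stationary mapping, the sketch below counts the crossbar arrays needed to hold one unrolled weight matrix and reports the resulting cell utilization. The 128×128 array size, 2-bit cells, column-wise bit slicing, and the example convolution shape are assumptions chosen for illustration and are not taken from any framework cited here.

```python
import math


def crossbar_tiles(rows, cols, weight_bits,
                   array_rows=128, array_cols=128, cell_bits=2):
    """Count the crossbar arrays needed for a rows x cols weight matrix.

    Each weight is bit-sliced across ceil(weight_bits / cell_bits) adjacent
    columns, one common (but not universal) weight-stationary layout.
    """
    cells_per_weight = math.ceil(weight_bits / cell_bits)
    physical_cols = cols * cells_per_weight
    row_tiles = math.ceil(rows / array_rows)
    col_tiles = math.ceil(physical_cols / array_cols)
    tiles = row_tiles * col_tiles
    utilization = (rows * physical_cols) / (tiles * array_rows * array_cols)
    return tiles, utilization


# Hypothetical 3x3 convolution, 64 input and 128 output channels, 8-bit weights.
# Unrolling the kernel gives 3 * 3 * 64 = 576 rows and 128 logical columns.
tiles, utilization = crossbar_tiles(rows=3 * 3 * 64, cols=128, weight_bits=8)
print(tiles, round(utilization, 3))  # -> 20 0.9
```

The unused 10% of cells comes from the last row tile being only partly filled, which is the kind of fragmentation that tiling-aware mappers try to reduce.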

5.3. Dynamic Optimizations and Latency Reduction

Dynamic techniques extend compiler capabilities. Mixed-precision quantization adjusts bit-widths adaptively to support analog or low-resolution digital operations while limiting accuracy loss [189]. Workload distribution across arrays increases parallelism, although photonic PIM and other non-traditional variants impose connectivity and throughput limits that compilers must respect [44,62,136]. To curb data-movement bottlenecks, modern flows integrate compression, tiling, and hierarchical memory management [173,174].

Quantitative evaluations consistently report 3–18× speedups and substantial energy/resource savings over manual or heuristic mapping [39,104,105,106,175,188]. Pruning and quantization remain effective for footprint reduction, with the largest gains realized when coupled to hardware-specific scheduling and mapping [186,189,190]. Comparative studies further confirm that IMC/PIM, when paired with intelligent compilers, outperforms conventional accelerators in energy and latency efficiency [186,190,191].

5.4. Framework Diversity: Open-Source vs. Proprietary

Open-source frameworks such as PIMCOMP and DNN + NeuroSim promote transparency and reproducibility, though they often lack the large-scale industrial calibration of proprietary flows like DeePhi, REDI, and MATIC [105,192]. Proprietary toolchains typically deliver optimized performance but remain closed, limiting comparability and reproducibility. Tools such as MNSIM 2.0 extend modeling fidelity, while benchmarking frameworks including DNN + NeuroSim and SIAM reflect ongoing efforts toward evaluation standardization [26,39,188]. An illustrative taxonomy (Table 4) situates these frameworks within rule-based, ML-assisted, and co-design paradigms, underscoring both their diversity and fragmentation.

5.5. Unresolved Gaps and Future Directions

Despite progress, fully automated deployment flows remain elusive; most frameworks still require manual adjustment or hardware-specific customization [44,186]. Emerging trends emphasize hardware-aware passes, cross-layer co-optimization, and ML-based autotuning for design-space exploration, yet support beyond CNN inference (covering transformers, GNNs, and multimodal pipelines) remains piecemeal [44,176]. The lack of unified APIs, common IR formats, and co-simulation environments further fragments the stack, even as modular tools (e.g., PIMCOMP) and hierarchical device-to-system models (e.g., DNN + NeuroSim) push in complementary directions [105,189,192,193,194,195]. A central challenge is devising a unified intermediate representation that enables portability across compilers and heterogeneous PIM/IMC hardware, alongside strategies that mitigate device non-idealities (e.g., variability and drift) to close the gap between simulation and deployment. In parallel, hybrid human + ML scheduling, where expert intuition guides automated search, remains a promising but underexplored opportunity for complex workloads [24,36,187].

Ultimately, progress depends on unifying fragmented flows into modular, open-source compilers that bridge device-level variability, workload heterogeneity, and system-scale deployment. The influence of contributors such as Sun, Yu, and Wang, and the visibility of venues including IEEE TCAD, Nature Communications, and JETCAS, underscore the vitality of this field; these evolving platforms underpin intelligent DNN-to-PIM deployment and lay the groundwork for the standardized metrics and curated datasets discussed next [39,105,113,196,197,198].

6. Benchmarking and Dataset Resources

6.1. Scope and Rationale

Benchmarking and dataset resources underpin rigorous evaluation of DNN deployment on memory-centric hardware (specifically PIM and IMC) by revealing algorithm–hardware interactions and making energy–latency–area trade-offs comparable across heterogeneous platforms. The field has moved from ad hoc tests to standardized, graph-aware, and hardware-in-the-loop suites, and this section distills that evolution while aligning metrics, coverage, and reporting so that apples-to-oranges comparisons give way to reproducible, decision-relevant evidence [9,39,104,188,199].

6.2. Taxonomy of Benchmarking Practices and Datasets

In practice, evaluation now spans two poles and their meeting point. Synthetic kernels isolate stress points under controlled conditions, enabling repeatable micro-studies that expose bottlenecks; real-world workloads, in contrast, surface end-to-end constraints and system effects that ultimately determine deployment viability. As mapping becomes graph- and GNN-aware, graph-based datasets (e.g., OGB) probe structural irregularity and memory/interconnect pressure, while hybrid pipelines fuse synthetic and real/graph sources with cross-layer metrics to balance experimental control with realism and interpretability [9,39,104,188]. Representative coverage across benchmarking frameworks and datasets is consolidated in Table 5, which highlights device/architecture scope, supported models, dataset types, and major limitations.

6.3. Public Ecosystem and Coverage Map

Across the stack, tools play complementary roles rather than acting as substitutes. DNN + NeuroSim V2.0 couples device/circuit modeling with higher-level performance and accuracy across SRAM/eNVM and is widely used for on-chip training [69], whereas SIAM emphasizes chiplet-scale network-on-chip (NoC)/DRAM/crossbar co-modeling that captures interconnect bottlenecks. NeuroSim abstracts these effects and therefore requires careful calibration for robust claims [188]. Benchmarking DNN Mapping concentrates on PE-level architectural mapping and area efficiency instead of full system breadth [100], while NicePIM demonstrates latency and energy gains in 3D-DRAM PIM even as its dataset breadth remains unclear [104]. MNSIM 2.0 covers digital and analog PIM, includes fabricated macros, and reports modeling errors explicitly [26], and Gibbon co-explores accuracy and energy–delay product (EDP) for memristor-based DNNs [175]. A heterogeneous IMC cluster extends coverage to TinyML and mobile tasks, grounding system, area, and latency metrics in realistic use [200]; in parallel, UPMEM, Mensa, and SIMDRAM empirically characterize DRAM-PIM across diverse networks with a strong energy/performance focus [9,111]. PIMulator-NN contributes cross-level templates but lacks a community-adopted dataset interface, limiting plug-and-play comparability [202]. Taken together, these efforts, summarized in Table 5, tile device-through-system coverage without fully overlapping; no single framework yet spans both calibration-faithful device variation and production-scale workload traffic [9,26,39,100,104,175,188,200,201,202].

6.4. Standard Metrics and Normalization

To keep results portable and fair, we adopt a single metric card with explicit experimental context: TOPS/W, picojoules per operation (pJ/op), throughput, TOPS/mm², energy per inference, EDP, latency, accuracy, and utilization; together with model and dataset, batch size, quantization/precision, and mapping choices (tiling, dataflow, sparsity handling, analog or digital mode). Nominally identical TOPS/W or pJ/op can diverge when ADC range or batch size differs, so these parameters must be disclosed. Reporting 95% confidence intervals over ≥3 runs and releasing a compact reproducibility pack (configs, seeds, calibration manifests) further ensures that results are both defensible and transferable across platforms [9,58,186,204,205].
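The reporting rule above (95% confidence intervals over at least three runs) and the relationship between TOPS/W and pJ/op can be sketched as follows. The run values are hypothetical; the interval uses Student's t critical values hard-coded for small sample sizes, which assumes approximately normal run-to-run variation.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(samples):
    """Return (mean, half_width) of a 95% confidence interval for 3 to 10 runs."""
    dof = len(samples) - 1
    if dof not in T_CRIT_95:
        raise ValueError("this sketch supports between 3 and 10 runs")
    spread = statistics.stdev(samples)  # sample standard deviation
    half_width = T_CRIT_95[dof] * spread / math.sqrt(len(samples))
    return statistics.mean(samples), half_width


def pj_per_op(tops_per_watt):
    """Convert TOPS/W to pJ/op: 1 TOPS/W is 1e12 op/J, i.e. 1 pJ per op."""
    return 1.0 / tops_per_watt


# Hypothetical TOPS/W measurements from three repeated runs.
runs = [10.0, 12.0, 14.0]
mean, half_width = mean_ci95(runs)
print(f"{mean:.1f} +/- {half_width:.2f} TOPS/W")  # -> 12.0 +/- 4.97 TOPS/W
print(pj_per_op(0.5))                             # -> 2.0
```

With only three runs the interval is wide, which is one reason the minimum run count and the interval itself should both be disclosed next to the headline figure.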

To capture component-wise contributions, the total inference energy is often decomposed as:

$$
E_{\mathrm{total}} = \sum_{l} \left( N_{l}^{\mathrm{MAC}} e_{\mathrm{MAC}} + N_{l}^{\mathrm{mem}} e_{\mathrm{mem}} + N_{l}^{\mathrm{comm}} e_{\mathrm{comm}} \right)
\tag{6}
$$

where $N_{l}^{\mathrm{MAC}}$, $N_{l}^{\mathrm{mem}}$, and $N_{l}^{\mathrm{comm}}$ denote the number of MAC, memory access, and communication operations at layer $l$, respectively, and $e_{\mathrm{MAC}}$, $e_{\mathrm{mem}}$, and $e_{\mathrm{comm}}$ are their corresponding per-operation energy costs. This roll-up formulation aligns with common reporting practices in benchmark tables that track pJ/MAC and communication overhead [104,105,130,206,207].
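The roll-up in Equation (6) is simple enough to compute directly. In the sketch below, the per-layer operation counts and the per-operation energies (in pJ) are hypothetical values chosen so the arithmetic is easy to follow; they are not measurements from any cited platform.

```python
def energy_breakdown(layers, e_mac, e_mem, e_comm):
    """Return total inference energy and its per-component contributions (pJ)."""
    mac = sum(layer["mac"] * e_mac for layer in layers)
    mem = sum(layer["mem"] * e_mem for layer in layers)
    comm = sum(layer["comm"] * e_comm for layer in layers)
    return {"mac": mac, "mem": mem, "comm": comm, "total": mac + mem + comm}


# Hypothetical operation counts for a two-layer network.
layers = [
    {"mac": 1000, "mem": 200, "comm": 10},
    {"mac": 500, "mem": 100, "comm": 5},
]

# Assumed per-operation energies in pJ.
breakdown = energy_breakdown(layers, e_mac=1.0, e_mem=10.0, e_comm=100.0)
print(breakdown)
# -> {'mac': 1500.0, 'mem': 3000.0, 'comm': 1500.0, 'total': 6000.0}
```

In this toy case memory accesses account for half of the total even though MACs dominate the operation count, mirroring the data-movement argument that motivates PIM/IMC.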

6.5. Methodological Pitfalls (“Red-Flag Matrix”) and Best Practices

The most common failure modes are well known: shortcut learning and benchmark overfitting that exploit dataset idiosyncrasies and yield gains that do not generalize [208,209,210]; neglect of non-idealities (device noise/variation, quantization, and fault behaviors), which overstates accuracy or efficiency when these effects are left unmodeled [39,78,168,179,206,211]; and apples-to-oranges setups that mix models, batch sizes, or metric definitions and obscure true architectural merit [199,212,213,214]. Add to this the problem of toy workloads and the reproducibility gaps created by missing code, seeds, or metadata [9,104,188,203,205]. The corrective is straightforward and non-negotiable: evaluate diverse real-world datasets alongside targeted synthetic probes; model adversarial noise and calibrated device variation; use rigorous cross-validation; report parameters explicitly, including model, workload class, batch size, precision, technology node, mapping policy, and validation basis; and release open-source, end-to-end pipelines. In cross-paper comparisons, metrics such as throughput, TOPS/W, TOPS/mm², latency, and accuracy should be interpreted together with workload class, model scale, precision, batch size, technology node, and validation basis, because headline values reported under heterogeneous conditions are not directly interchangeable. These pitfalls and corresponding corrective measures are synthesized in Table 6, which provides a “red-flag matrix” mapping pitfalls to severity, real-world impact, and recommended solutions.

6.6. Chronology (2018–2025): From Ad Hoc Suites to Graph-Centric and Hardware-in-the-Loop

From 2018 to 2019, operator-centric baselines established the micro-foundations of evaluation; by 2020–2022, standardized suites and cross-level co-modeling tied device variation and interconnect bottlenecks to end metrics, with DNN + NeuroSim emphasizing device/circuit phenomena and SIAM exposing NoC/DRAM/crossbar behavior [9,39,104,188,199,200,206,207]. Between 2023 and 2025, graph-scale datasets and learning-driven mapping frameworks made benchmarking explicitly variation- and interconnect-sensitive, with OGB-class datasets (millions of nodes/edges) stressing mapping, hierarchy, and communication patterns beyond small suites, while PIMBench and bespoke IMC/PIM sets offered device-specific control at smaller scales and therefore demanded especially careful reporting for reproducibility [9,39,104]. Representative studies retained here include end-to-end benchmarking for compute-in-memory accelerators [39]; real DRAM-PIM (UPMEM) system benchmarking [9]; chiplet-based IMC (SIAM) [188]; variation-aware RRAM stability [206]; spiking neural network (SNN) benchmarking on IMC hardware (SpikeSim) [207]; structured pruning for RRAM crossbars [215]; automated quantization for high-utilization RRAM-PIM [171]; mapping-method benchmarking for IMC [100]; precision-scalable MAC reviews [216]; broader ML benchmarks across prediction, time-series, and security that motivate dataset diversity [217,218,219]; and black-box latency confidence estimation for accelerators [205].

6.7. Integration with Software Stacks and Transition

Ultimately, benchmarks matter only insofar as they align with compiler/mapping passes and platform constraints. Standardized reporting paired with hardware-aware toolchains (e.g., PIMCOMP, DNN + NeuroSim) and open interfaces and IRs makes results portable across memory technologies and model classes. The datasets, metrics, and protocols consolidated above anchor the cross-platform comparisons that follow and feed directly into the next compiler/mapping study, enabling apples-to-apples evaluation across memory technologies [105,186,189,190,192,193,194,195].

The development of open execution graph datasets, realistic non-ideality modeling, and formalized evaluation protocols are essential next steps. Table 5 summarizes current benchmarks and datasets, highlighting both coverage and gaps in diversity and generality, while Table 6 presents common pitfalls and best practices to avoid misleading or non-reproducible results. Together, these tables capture the available resources and methodological gaps. Having examined the tools, datasets, and pitfalls shaping benchmarking infrastructure, we now turn to a comparative analysis of real-world PIM accelerators across energy, latency, and throughput metrics.

7. Comprehensive Reference Card and Comparative Benchmarking of DNN-to-PIM Accelerators

As DNN-to-PIM architectures advance toward practical deployment, the need for fair and unified benchmarking becomes critical. The wide diversity of memory substrates, mapping strategies, and model choices makes direct comparison challenging, yet establishing common ground is essential for evaluating real-world feasibility. To this end, we consolidate benchmarking results from recent studies into a structured reference framework that goes beyond isolated peak numbers to emphasize deployment viability (energy efficiency, latency under system-level constraints, and throughput normalized by area) while capturing architectural trade-offs across SRAM, DRAM, RRAM/ReRAM, embedded non-volatile memories, capacitor-based schemes, and hybrid analog–digital domains [9,58,186,204,205]. In parallel, we adopt the comprehensive reference card viewpoint from recent work, which standardizes the reporting of device/architecture features, mapping frameworks, supported workloads, and metrics (TOPS/W, bandwidth, latency, scalability, programmability) to enable reproducibility and design-space exploration [39,101,105,166,201]. The resulting perspective, illustrated conceptually in Figure 6, highlights a field moving away from raw peak performance toward balanced metrics that integrate efficiency, scalability, and practicality. Table 7 summarizes the unified metric definitions, while Table 8 and Table 9 compile comparative benchmarking results and cross-platform mapping/scheduling context, respectively, clarifying trends, bottlenecks, and standardization needs. Each point in Figure 6 is derived from Table 9: the marker is the midpoint of the reported range, and the whiskers indicate the min–max values from the source papers. Figure 6b uses throughput normalized by silicon area (TOPS/mm²) exactly as defined in the metric card (Table 7), ensuring apples-to-apples density comparisons.

Figure 6. Comparative benchmarking of PIM/IMC platforms. (a) Energy–latency trade-off across platform families, showing energy efficiency as a function of inference latency. (b) Energy–density trade-off across platform families, showing energy efficiency as a function of throughput per area. The color-coded platform labels correspond to the benchmark entries summarized in the plot, including SRAM-IMC platforms such as 10T1C SRAM-based IMC [59], binary/ternary reconfigurable SRAM IMC [138], Dual-SRAM charge-domain IMC [232], mixed SRAM/RRAM IMC [100], and SRAM/ReRAM IMC [241]; DRAM-PIM platforms including 3D-stacked DRAM-PIM [104], Hybrid Memory Cube PIM [115], LUT-based DRAM-PIM [7], and UPMEM DRAM-PIM [9]; ReRAM-based platforms including RRAM-based analog IMC with RISC-V [200], ReRAM-based PIM for RNN workloads [238], and ReRAM-based PIM for ResNet/VGG workloads [176]; analog and mixed-signal IMC platforms including hybrid analog IMC plus digital acceleration [233], capacitor-based analog IMC [234], and analog IMC platforms [235]; and digital/other accelerators including a 5 nm digital accelerator [236], FPGA-based acceleration [239], a 16 nm multi-chip-module accelerator [237], SRAM/eNVM-based CIM [39], and digital SRAM-PIM [240].

7.1. Energy Efficiency vs. Latency: Trading Power for Real-Time Operation

A central pattern is the nonlinear trade-off between energy efficiency and latency. This pattern is evident in Figure 6a as clusters spanning roughly 0.1–10 ms with order-of-magnitude spreads in TOPS/W. Hybrid Analog IMC + Digital platforms deliver extreme efficiency (e.g., up to 600 TOPS/W) with competitive latency (≈0.12 ms on CIFAR-10) yet face precision and recalibration constraints across models and nodes [233]. 10T1C SRAM-based IMC achieves 437 TOPS/W and 1.2 TOPS/mm² on VGG/ResNet, ranking among the densest and most efficient charge-domain designs [59]. By contrast, digital accelerators such as 5 nm CMOS PIMs and 16 nm MCM report moderate efficiency (≈38.6–95.6 TOPS/W and 9.5 TOPS/W, respectively) but offer deterministic low-latency inference (≈0.08–0.52 ms) on BERT and ResNet-50, favoring latency-critical AI tasks [236,237]. In practice, edge inference often benefits from analog and SRAM charge-domain systems [232,234], whereas datacenter-grade deployments prioritize deterministic throughput and tight compiler/runtime control [104,236,237].

7.2. Workload Compatibility and Generalization: Versatility vs. Specialization

General-purpose vs. specialized capability remains a defining axis. UPMEM DRAM-PIM supports broad workloads (DNNs, graph analytics) but exhibits low energy efficiency (≈0.1–0.5 TOPS/W) and longer latency (≈10–100 ms) due to DRAM access overheads and coarse primitives, offset by runtime adaptability and host offloading that suit heterogeneous pipelines [133]. Capacitor-Based Analog IMC and ReRAM-PIM hybrids deliver higher efficiency (≈10–80 TOPS/W) and sub-ms inference but are frequently tailored to CNN/RNN workloads due to dataflow rigidity and peripheral bottlenecks [176,234,238]. SRAM-based FPGA-backed designs and digital PIMs occupy the middle ground (≈0.5–8.9 TOPS/W) when paired with structured models [239,240]. As emphasized in Figure 6b, throughput-per-area (TOPS/mm²) distinguishes dense small-die designs from larger monolithic chips and avoids misleading raw-throughput comparisons. From the reference-card lens, clarifying supported model classes (CNNs, RNNs, transformers, emerging GNNs) alongside programmability and precision scaling is essential for apples-to-apples comparison [6,39,101,105,109,166,185,201].

7.3. Area Efficiency and Scaling Potential: Density Beyond TOPS/W

While TOPS/W has become canonical, TOPS/mm² exposes deeper architectural trade-offs. Digital platforms at 5 nm lead with ≈1.5 TOPS/mm², but capacitor-based IMC and 10T1C SRAM-PIM (≈1.2 TOPS/mm²) demonstrate that analog and charge-domain schemes remain competitive in compute density [59,234,236]. In contrast, DRAM-PIM and HMC-based PIM trail (<0.1 TOPS/mm²) due to interconnect overheads and host dependence [9,104,115]. These density-normalized trends align with the reference-card imperative to report throughput-per-area and energy per inference alongside peak efficiencies [39,105,201].

7.4. Latency–Accuracy–Energy Trifecta: Navigating the Pareto Frontier

Dual-SRAM charge-domain IMC attains ≈18–119 TOPS/W with ≤0.3 ms latency on VGG-8/CIFAR-10 while maintaining near-baseline accuracy [139]. Analog IMC ensembles can require retraining or calibration to mitigate quantization noise on deeper models (ResNet-50, Transformer-NLP) [43,235]. SRAM/ReRAM hybrids mitigate this via tunable quantization and hybrid phases, sustaining ≲30 TOPS/W with ≈0.4–1.2 ms latency and improved generalizability [241]. Compiler-aware stacks (ZigZag, Optimized Weight Mapping, FAMS) prove pivotal, aligning quantization, pruning, and scheduling to architectural constraints and reporting >90% hardware utilization and EDP reductions under sparse workloads [104,129,225]. These observations reinforce the point that realized performance is software-stack-limited, requiring robust compilers and mapping frameworks [39,101,105]. Using these standardized metrics, Figure 6 visualizes the entries of Table 9: markers are the midpoints of the reported ranges with min–max whiskers; both panels use log–log axes; Figure 6b employs throughput per area (TOPS/mm²) for fair density comparison; and colors/labels group platforms by family as in the right-hand legend. Latency values are reported for batch = 1 unless otherwise specified, and Table 8 provides the platform/context mapping.

7.5. Toward Unified Evaluation: Open Source, Real-World Workloads, and Toolchain Co-Design

Community momentum is shifting toward standardized benchmarks and open toolchains (e.g., ZigZag, MAERI, NeuroSim V2.0, FAMS) to enable reproducible cross-hardware comparisons [129,130,225]. However, several analog-centric prototypes (e.g., [233,234]) still lack public validation, complicating independent verification under stress tests and unseen workloads. To close this gap, Table 7 provides a Unified Metric Reference Card (definitions and units), Table 8 compiles comparative results from leading platforms, and Table 9 ties hardware targets to mapping/scheduling strategies [9,58,186,204,205,220,221,222,223,225,226,227,228,229]. Complementing these, the reference-card viewpoint advocates a living, MLPerf-like repository and reference baseline accelerators, with continual updates to capture evolving models (transformers, foundation/sparse models, GNNs) and heterogeneous integration [39,101,105,166,201].

7.6. Key Takeaways from Comparative Benchmarking

Analog IMC and SRAM-based designs achieve the highest energy efficiency (up to ≈600 TOPS/W) but face calibration and migration challenges [59,233].
Digital PIMs and DRAM-based platforms provide predictable latency and broader workload compatibility, critical for industrial deployment [9,104,236,237].
Workload sensitivity remains high: general-purpose DRAM-PIMs excel in adaptability yet lag in density/efficiency; specialized analog/charge-domain designs lead efficiency but narrow flexibility [9,59,234].
Area-normalized throughput (TOPS/mm²) is a more realistic measure of scalable deployment than peak TOPS/W alone [59,234,236].
Open toolchains and standardized datasets are essential; fewer than half of leading platforms provide public code/pipelines [129,130,225].
Peripheral/ADC costs dominate many analog IMC systems; quantization, resource sharing, and mixed-precision converters are decisive levers [75,76,77,78].
Comparisons reflect reported operating points and workloads; when batch size, precision, or ADC range differ, we prioritize area-normalized throughput (TOPS/mm²) and show ranges rather than single peak numbers to avoid apples-to-oranges conclusions.

7.7. Trends, Variability, and Observations

Recent studies indicate upward trends in energy efficiency and throughput density, particularly for SRAM and analog IMC, while real DRAM systems like UPMEM demonstrate competitive scalability and strong runtime adaptability [9,59,192,242]. Reporting variability is typically within ≈10% for mature digital/charge-domain systems, though analog platforms show higher sensitivity due to device-level non-idealities (drift, mismatch, IR-drop) [39,59,100,205,242,243]. Dominant error sources and ADC/DAC/peripheral overheads (often 50–95% of system energy, with area ratios up to 10–12× vs. crossbars) motivate hardware-aware training, quantization-aware training (QAT), calibration, and write-verify to recover accuracy and reduce system cost [75,76,77,78,79]. Together, these platform-level insights set the stage for learning-driven and graph-based compilation that pushes DNN-to-PIM toward automation, adaptability, and real-time intelligence.

7.8. Reference-Card Synthesis, Claims, and Open Questions

Adopting the reference-card perspective formalizes device class, architectural style (array-centric vs. near-memory), software stack maturity, and supported workloads, enabling fair comparison across DRAM-, SRAM-, ReRAM-, PCM-, MRAM-, FeFET-, and hybrid/3D families [6,39,101,105,109,167,185,201,244]. Evidence from recent surveys and prototypes supports the following: (i) standardized datasheets and community repositories improve comparability and DSE; (ii) hybrid/3D architectures offer superior scalability/efficiency; (iii) compiler maturity is a major bottleneck; (iv) inconsistent metrics/data hinder comparability; and (v) analog NVMs face reliability and peripheral-overhead limits [9,39,101,105,109,166,167,185,199,201,224,245,246].

Open questions include: building a living reference card with community governance; integrating foundation/sparse models on heterogeneous PIM systems; and closing the HW/SW realizability gap through co-design toolchains and standardized evaluation protocols [39,101,105,166,201]. These platform-level insights set the context for the next section, where we examine how graph-based and learning-driven methods push DNN-to-PIM deployment toward automation, adaptability, and real-time intelligence.

8. Graph-Based and Learning-Driven Approaches: Emerging Directions

8.1. Motivation and Summary

Graph-based and machine learning approaches are emerging as important tools for deploying deep neural networks on processing-in-memory (PIM) and in-memory computing (IMC) architectures, particularly when static heuristics become difficult to scale across heterogeneous hardware and workloads. By exploiting structure in computation graphs and regularities in performance data, these methods can support mapping, scheduling, compression, and hardware-aware adaptation [9,22,24,70,184,224,230,231,245,246]. However, their reported advantages should be interpreted with care because the available evidence remains uneven across hardware classes, workload families, and evaluation settings. Table 10 summarizes representative tasks, target objectives, and reported benefits from recent studies.

8.2. GNNs for Hardware-Aware Modeling

Graph neural networks (GNNs) naturally encode hierarchical and topological dependencies in DNN execution graphs and hardware fabrics. By representing data movement, compute–memory co-locality, and inter-core communication dynamics, GNNs capture properties that conventional ML models overlook, a fit that is particularly relevant to PIM where computation is tightly coupled with memory [184].

$$h_{v}^{(k+1)} = \sigma\left(W_{s} h_{v}^{(k)} + \sum_{u \in N(v)} W_{n} h_{u}^{(k)}\right) \tag{7}$$

Here, $h_{v}^{(k)}$ is the feature vector of node $v$ at layer $k$, $N(v)$ is its set of neighbors, and $W_{s}$ and $W_{n}$ are learnable weight matrices for self- and neighbor-contributions, respectively. The nonlinear activation $\sigma$ enables expressive aggregation of structural and workload-aware features, forming the foundation of learning-guided mapping and scheduling in PIM frameworks [9,24,26,103,184,257,258,259,260,261].
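The update in Equation (7) can be illustrated with a small, dependency-free sketch. The three-node operator graph, the feature values, the weight matrices, and the choice of ReLU for σ are all assumptions made for illustration; a real mapper would learn W_s and W_n from profiling data.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(z):
    return max(0.0, z)


def gnn_layer(h, neighbors, W_s, W_n, act=relu):
    """Apply one step of Equation (7) to every node.

    h:         dict mapping node -> feature vector h_v^(k)
    neighbors: dict mapping node -> list of neighbor nodes N(v)
    Returns a dict mapping node -> h_v^(k+1).
    """
    out = {}
    for v, h_v in h.items():
        total = matvec(W_s, h_v)              # self contribution W_s h_v
        for u in neighbors.get(v, []):
            msg = matvec(W_n, h[u])           # neighbor contribution W_n h_u
            total = [t + m for t, m in zip(total, msg)]
        out[v] = [act(t) for t in total]      # nonlinearity sigma
    return out


# Hypothetical operator graph: conv -- pool -- fc, with 2-dimensional features.
h0 = {"conv": [1.0, 0.0], "pool": [0.0, 1.0], "fc": [1.0, 1.0]}
nbrs = {"conv": ["pool"], "pool": ["conv", "fc"], "fc": ["pool"]}
W_s = [[1.0, 0.0], [0.0, 1.0]]   # identity, for readability
W_n = [[0.5, 0.0], [0.0, 0.5]]   # neighbors contribute at half weight

h1 = gnn_layer(h0, nbrs, W_s, W_n)
print(h1)
```

Stacking several such layers lets each node's embedding reflect operators several hops away, which is what allows placement decisions to account for downstream data movement.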

Current systems are limited by the lack of large, annotated DNN-to-PIM execution graph datasets, which constrains generalization and scalability. Frameworks such as PyGim illustrate the synergy between GNN workloads and PIM acceleration: autotuning of graph convolution kernels for memory-centric architectures improves cache utilization and enables kernel fusion, yielding substantial performance and energy gains [184]. Distributed GNN training, typically bottlenecked by communication and load imbalance, benefits from PIM-enabled parallelism, enabling scalable analytics for large citation, recommender, and biological graphs [254,262].

8.3. Beyond GNNs: RL, BO, Hybrids, and Predictors

Learning-driven optimization extends to reinforcement learning, Bayesian optimization, and GNN + RL hybrids, which adapt to unseen workloads and hardware configurations while maintaining strong generalization [9,24,26,257,258,259,260,261]. Supervised models and regression-based predictors trained on annotated mappings and profiling data forecast resource allocation, operator placement, and tiling strategies that minimize latency and energy, offering speed and repeatability as DNNs evolve toward hybrid, sparse, or quantized formats [103,104,107,125,263]. Case studies reinforce this trajectory: AutoGMap combines reinforcement learning with recurrent models to select sparsity-aware mappings for memristive crossbars, reducing area and communication while outperforming hand-crafted policies [122]. Complementary efforts such as gPIM use graph-induced partitioning to lower inter-core traffic in graph convolutions, and Graph introduces graph reordering under ReRAM-aware constraints to raise processing efficiency on large datasets [221,264]. At the design-space level, NicePIM integrates ML-based configuration tuning with systematic exploration to find Pareto-optimal mappings, and MAESTRO provides analytical evaluation of dataflows and mappings that is frequently embedded within broader learning loops [101,104].
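A regression-based predictor of the kind described here can be sketched, in its simplest form, as an ordinary least-squares line from a single workload feature to latency. The feature (millions of MACs per layer), the profiled latencies, and the function names are hypothetical; practical predictors use many features and richer models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical profiling data: MACs (millions) per layer and measured latency (us).
macs_m = [10.0, 20.0, 40.0, 80.0]
latency_us = [25.0, 45.0, 85.0, 165.0]

model = fit_line(macs_m, latency_us)
estimate = predict(model, 60.0)
print(f"slope={model[0]:.3f} us/MMAC, intercept={model[1]:.3f} us, "
      f"predicted latency at 60 MMACs={estimate:.1f} us")
```

Once fitted, such a model can rank candidate tilings or placements without running a simulator for each one, which is where the speed and repeatability come from.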

8.4. Empirical Benchmarks and Scaling Behavior

Compared with ILP solvers that scale only to small workloads (fewer than 10 layers), learning-driven methods exhibit linear-to-quadratic scaling and generalize to larger models with 50–100 layers [9,24,26,257,258,259,260,261]. Benchmarks enriched with graph statistics (average degree, clustering coefficient, and modularity) support quantitative comparisons across thousands of workloads and hardware instances, enabling fair generalization to unseen architectures [258,259,260]. Examples include a near-memory GNN accelerator reaching up to 230× performance and over 1000× energy efficiency gains over prior designs, and SOT-MRAM-based in-memory GCN aggregation achieving 2523× speedup with six orders of magnitude energy reduction versus CPU baselines [258,261].
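The three graph statistics named above can be computed directly from an edge list. The sketch below uses the standard definitions (mean degree, mean local clustering coefficient, and Newman modularity for a given partition) on a hypothetical undirected graph of two triangles joined by one edge; the graph and the partition are illustrative only.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def average_clustering(adj):
    """Mean local clustering coefficient (0 for nodes with degree < 2)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def modularity(adj, communities):
    """Newman modularity Q = sum_c [L_c / m - (d_c / 2m)^2]."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for community in communities:
        c = set(community)
        intra = sum(1 for u in c for v in adj[u] if v in c) / 2  # edges inside c
        degree_sum = sum(len(adj[u]) for u in c)
        q += intra / m - (degree_sum / (2 * m)) ** 2
    return q


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) bridged by edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = build_adj(edges)
avg_deg = average_degree(adj)
avg_cc = average_clustering(adj)
q = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
print(f"avg degree={avg_deg:.3f}, clustering={avg_cc:.3f}, modularity={q:.3f}")
```

Attaching these values to each benchmark entry gives a compact descriptor of workload structure that can be compared across suites.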

8.5. Algorithmic Choices by Objective

Different ML paradigms align with different objectives and granularities. Regression models predict continuous metrics such as per-layer energy or latency; classification models choose best-fit operator placements; reinforcement learning enables adaptive mappings driven by runtime rewards; and black-box search, including evolutionary and genetic methods, outperforms random search and expert heuristics in non-differentiable landscapes. Together, these methods enhance adaptability, efficiency, and robustness without relying on exhaustive hand tuning [250,265,266].
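To make the black-box option concrete, the following is a minimal sketch of an evolutionary search over layer-to-core assignments. The cost function (heaviest core load plus a fixed penalty per cross-core hop between consecutive layers), the per-layer work values, and all hyperparameters are assumptions for illustration and do not reproduce any cited framework.

```python
import random


def cost(assign, work, n_cores, comm_penalty=1.0):
    """Toy cost: heaviest core load plus a penalty per cross-core hop."""
    load = [0.0] * n_cores
    for layer, core in enumerate(assign):
        load[core] += work[layer]
    hops = sum(1 for a, b in zip(assign, assign[1:]) if a != b)
    return max(load) + comm_penalty * hops


def evolve(work, n_cores, pop_size=20, generations=50, mut_rate=0.2, seed=0):
    """Elitist genetic search; returns the best assignment found."""
    rng = random.Random(seed)
    n = len(work)
    # Seed the population with the trivial single-core mapping so the
    # result can never be worse than that baseline.
    pop = [[0] * n]
    pop += [[rng.randrange(n_cores) for _ in range(n)]
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda a: cost(a, work, n_cores))
        elite = pop[: pop_size // 2]            # best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [rng.randrange(n_cores) if rng.random() < mut_rate else g
                     for g in child]            # per-gene mutation
            children.append(child)
        pop = elite + children
    return min(pop, key=lambda a: cost(a, work, n_cores))


# Hypothetical per-layer work (arbitrary units) for an 8-layer network.
work = [4.0, 2.0, 6.0, 3.0, 5.0, 1.0, 7.0, 2.0]
n_cores = 4
best = evolve(work, n_cores)
baseline = cost([0] * len(work), work, n_cores)
print("best assignment:", best)
print("best cost:", cost(best, work, n_cores), "baseline cost:", baseline)
```

Because the search only ever calls `cost`, the same loop works when that function is replaced by a simulator or a learned surrogate, which is why such methods suit non-differentiable objectives.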

8.6. Challenges and Limitations

Training overhead can hinder deployment at PIM scale, and generalization to unseen architectures remains imperfect, often motivating probabilistic models or ensembles for robustness [252,253]. Compression and quantization, essential for analog or non-volatile backends, can degrade accuracy unless co-designed with the hardware, and aggressive pruning risks sparsity inefficiencies or the removal of critical connections, especially at the edge [7,171,184,247,248,249]. Interpretability also lags: while GNNs may reduce energy by 30%, they can incur up to 2× training cost with limited transparency into decision processes [9,24,26,103,184,257,258,259,260,261].

8.7. End-to-End Frameworks and Integrated Pipelines

Recent frameworks demonstrate end-to-end maturity. PIMCOMP abstracts diverse accelerators via configurable templates and pseudo-instruction sets to automate compilation and deployment, reporting up to 2.1× throughput and 2.3× energy-efficiency gains across multiple DNNs [107]. TAG (Topology-Aware GNN deployment) co-optimizes DNN computation graphs with hardware topology, achieving 4.56× speedups in distributed training through gradient compression and task assignment [256]. Evidence across integrated pipelines combining GNN abstraction, reinforcement-learning policies, Bayesian optimization, and AutoML exploration shows predictive accuracies above 90%, 2–3× cycle reductions, and energy-efficiency improvements exceeding 20% [9,24,26,103,184,257,258,259,260,261].

8.8. Outlook

Graph-based and learning-driven methods have emerged as a promising direction for PIM/IMC optimization. Their strengths in automation, adaptability, and performance span mapping, scheduling, pruning, and quantization, but realizing their full potential depends on progress in benchmark standardization, interpretability, robustness under uncertainty, and integration with hybrid algorithms. In particular, learned schedulers and mappers often falter when exposed to unseen hardware or workload variability, suggesting a need for transfer learning, domain adaptation, or runtime uncertainty estimation to ensure reliability beyond training distributions. These priorities define a roadmap toward scalable, reliable, and automated DNN deployment on next-generation memory-centric architectures [101,103,104,184,256,258,262]. Building on these insights, the next section identifies key open challenges and future research directions for DNN deployment across next-generation PIM architectures.

9. Open Challenges and Future Directions

Despite significant progress in learning-driven DNN-to-PIM optimization, several open challenges continue to hinder the realization of scalable, robust, and generalizable deployment frameworks. These challenges span the need for standardized evaluation methodologies, managing architectural heterogeneity, integration across system stacks, and community-driven collaboration. Addressing these fronts is essential to unlock the full potential of ML-enabled design automation in memory-centric computing.

9.1. Standardization Needs

The absence of standardized datasets, benchmark tasks, and evaluation protocols poses a foundational limitation in advancing reproducible and comparable machine learning (ML)-driven DNN-to-PIM research. While individual studies present strong performance gains, the lack of unified baselines impairs meaningful cross-framework comparisons and impedes fair evaluation. Frameworks such as DNN+NeuroSim [39] provide hierarchical evaluation capabilities and support algorithm-to-hardware mapping, yet they are often hardware-specific or lack the annotation granularity required for robust ML model development. Even recent benchmarking efforts, such as those focusing on processing element-level granularity [100], remain fragmented and underutilized across the community. This fragmentation results in inconsistent training datasets, variable metric definitions, and uncalibrated workloads, all of which limit the effective training, evaluation, and deployment of ML models. The development of domain-specific yet generalizable benchmarking standards, akin to ImageNet in computer vision or GLUE in NLP, is critical for DNN-to-PIM systems. Such resources must encompass diverse architectures (RRAM, DRAM, SRAM, 3D-stacked), support multiple inference and training use cases, and provide rich annotations for supervised and reinforcement learning models. Without community-endorsed standards, data-driven advances risk being siloed, non-transferable, and ultimately non-replicable [39,100].

9.2. Hardware Heterogeneity

The vast diversity in PIM architectures, ranging from analog RRAM crossbars to digital SRAM macros, presents another formidable barrier to creating portable and reliable mapping strategies. RRAM-, SRAM-, and DRAM-based systems exhibit markedly different dataflow, compute locality, endurance characteristics, and parallelism capabilities. Techniques optimized for one hardware substrate, for instance sparse-aware mapping tailored for RRAM, often degrade in performance or fail entirely when transferred to DRAM platforms, where memory access patterns and latency constraints differ substantially [79,104]. Future frameworks must therefore either incorporate hardware-specific tuning modules or adopt meta-learning capabilities to adapt to new devices. In practice, the field may need to embrace calibration loops, such as quick fine-tuning or auto-retraining of mapping policies when porting to a different memory technology, to ensure consistent performance. The open question is how to achieve such adaptivity without restarting the optimization process from scratch for each new platform. Optimizing the optimizer is also emerging as a critical need: although learning-based mappers show strong potential, their own inference overhead can be prohibitive. Moreover, bitline-level variability and retention errors may affect accuracy differently across memory types, requiring mapping strategies to be both error-resilient and quantization-adaptive [129]. Advanced frameworks like NicePIM attempt to bridge this gap by leveraging ML models to explore large configuration spaces and recommend robust mappings [33,79,104]. However, these models often require retraining or extensive fine-tuning when exposed to unseen architectures or workloads. Achieving robustness to both structural and stochastic hardware variations remains an unsolved problem, calling for hybrid solutions that blend data-driven generalization with hardware-aware calibration layers.

9.3. Integration Barriers

Even when effective mapping and scheduling strategies are identified, integrating them seamlessly into real-world toolchains and execution environments remains difficult. ML-based optimization approaches, whether relying on supervised learning, imitation learning, or black-box search, often introduce significant computational overhead and compatibility constraints when applied to heterogeneous systems. These challenges span runtime support, compiler interfacing, and hardware abstraction. For example, frameworks such as CAMDNN [60] demonstrate effective ML-based scheduling in edge multiprocessor systems-on-chip (MPSoCs), yet require intricate coordination between local task assignment and global policy learning modules. Similarly, NicePIM’s design-space exploration engine [104] depends on precise alignment with both front-end (DNN parsing) and back-end (memory array tiling and routing) stages, demanding cohesive compiler and hardware co-design. However, the lack of standardized APIs or plug-in interfaces for integrating such ML components severely hinders adoption in existing toolflows. Furthermore, runtime adaptivity is limited by latency and energy budgets, as many learned models are not optimized for fast inference in resource-constrained environments. This bottleneck underscores the need for lean inference models, hardware-aware model distillation, and minimal-overhead control loops. Overcoming integration challenges will require a new class of compiler-aware ML interfaces and hardware-agnostic toolchains that support modular deployment and real-time control with minimal disruption to existing design flows [104,131].

9.4. Community Collaboration

The final and arguably most urgent barrier lies in the need for broad-based community engagement. The rapid pace of DNN-to-PIM research has produced a highly fragmented ecosystem, with individual groups developing proprietary tools, narrowly scoped datasets, and isolated optimization strategies. This lack of shared infrastructure severely limits reproducibility and inhibits the emergence of consensus best practices. While platforms like DNN+NeuroSim have made progress toward open-source, modular benchmarking [39,104], they often remain constrained to fixed hardware models or inference-only scenarios. To foster scalable and verifiable innovation, the field must prioritize collaborative initiatives modeled after successful community challenges in other AI domains. These may include open competitions for mapping or scheduling under resource constraints, shared GraphFlowPIM-like repositories for annotated training data, and benchmark suites that support diverse DNN architectures and PIM configurations. Additionally, publishing standardized evaluation scripts and releasing model checkpoints would enable robust head-to-head comparison across optimization algorithms and deployment strategies. Crucially, these efforts must be accompanied by metadata-rich reporting standards that include hardware assumptions, training protocols, runtime environments, and error tolerances. Only through such transparency and openness can DNN-to-PIM research achieve the level of maturity and reproducibility necessary for real-world translation. Notably, some of the most difficult challenges in this domain (reproducibility, integration, and fragmentation) may be cultural rather than purely technical. Shifting toward open-source toolchains, shared datasets, and collaborative mapping competitions could catalyze breakthroughs more effectively than isolated development.
To consolidate the evidence behind these challenges and opportunities, Table 11 synthesizes key takeaways across devices, algorithms, software toolchains, benchmarking practices, and performance metrics, serving as a quick-reference matrix for gaps and next steps.
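One way to picture the metadata-rich reporting called for in this subsection is a small machine-readable record plus a check that flags which controlled settings differ between two results. The `RunReport` fields, the `CONTROL_FIELDS` tuple, and the example values are hypothetical and are not an existing community schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunReport:
    """Illustrative reporting record for one benchmark result."""
    model_variant: str      # e.g. the exact network variant evaluated
    batch_size: int
    precision: str          # e.g. "int8"
    substrate: str          # hardware assumption, e.g. "SRAM" or "RRAM"
    runtime_env: str        # simulator or silicon, with version if known
    error_tolerance: float  # accepted accuracy loss (fraction)
    metric_name: str
    metric_value: float


# Settings that must match before two metric values are compared directly.
CONTROL_FIELDS = ("model_variant", "batch_size", "precision", "metric_name")


def mismatches(a, b):
    """List the controlled fields on which two reports disagree."""
    da, db = asdict(a), asdict(b)
    return [name for name in CONTROL_FIELDS if da[name] != db[name]]


# Hypothetical reports, for illustration only.
r1 = RunReport("ResNet-18", 1, "int8", "SRAM", "simulator", 0.01, "TOPS/W", 10.0)
r2 = RunReport("ResNet-18", 16, "int4", "RRAM", "simulator", 0.01, "TOPS/W", 40.0)

diff = mismatches(r1, r2)
print("comparable" if not diff else f"not directly comparable, differs in: {diff}")
```

A shared repository could reject or annotate submissions whose controlled fields do not match the benchmark definition, rather than leaving that check to readers.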

9.5. Actionable Roadmap: Future Directions and Vision

Over the next decade, progress in DNN deployment on PIM/IMC systems will hinge on cross-layer intelligence: architectures, mapping, compilers, and benchmarking moving in lockstep via shared signals and objectives. The goal is a self-optimizing, model-aware, hardware-agnostic pipeline that learns from execution and improves over time.

(1): Architectures that adapt to workload shape: Design fabrics that natively accommodate transformers, large GNNs, and hybrid AI–physics models, not just CNNs. Concretely: heterogeneous memory planes co-packaged with reconfigurable compute islands; precision-morphing datapaths (binary → int4 → fp8) exposed via a thin control API; and telemetry hooks (utilization, queueing, bit-toggle activity) surfaced at μs–ms cadence for software to steer placement and precision online.
(2): Mapping and scheduling as living systems: Elevate mapping from a one-time compile step to a continuous control loop: policy controllers (RL/GNN or rule-learned surrogates) that re-tile, re-pipeline, and re-place subgraphs as workload mix changes; fast-path reconfiguration primitives (hot-swap kernels, patchable tiling) with bounded replan latency; and multi-objective envelopes that balance latency, energy, and accuracy with explicit guardrails so service level objectives (SLOs) are not violated.
(3): Compilers as the cross-layer glue: Make the compiler the shared language of the stack: a unified IR that encodes operator semantics, layout/precision constraints, and device capabilities in the same graph; online profiling passes that ingest hardware telemetry and update cost models at runtime; and auto-specialization that emits per-fabric kernels from a common schedule, with plug-in passes for quantization, sparsity, and dataflow.
(4): Benchmarking as an open, evolving standard: Shift evaluation from one-off point metrics to portable, replayable execution graphs: a public suite pairing canonical AI models (ResNet, BERT, GNNs) and domain workloads (edge sensing, bio, scientific) with reference execution graphs; core metrics reported uniformly (EDP, energy/inference, latency at batch = 1/streaming, utilization, accuracy under target precision); and governance via versioned releases and community PRs so the suite evolves with models and hardware.
(5): Staged roadmap with measurable deliverables: Near term (1–3 years): telemetry-first microarchitectures (per-op counters, queue depth, bit-toggle stats) with stable APIs; runtime-aware compiler passes (profile-guided tiling/placement) and policy stubs for limited online remapping; starter execution-graph packs (CNN/Transformer/GNN) with replay scripts for simulators and small silicon prototypes.
Mid term (3–6 years): heterogeneous PIM fabrics that switch precision/dataflow at run time and accept mapping updates without recompilation; controllers upgraded to closed-loop mapping with verified guardrails (e.g., ≤2% accuracy drift, ≤5 ms replan); v2/v3 benchmarks with cross-vendor runners and mandatory utilization/EDP reporting.
Long term (6–10 years): a self-optimizing ecosystem where compiler ↔ mapper ↔ fabric form a single adaptive system learning from workload traces; multi-fabric portability from the same model + IR + policy across DRAM/SRAM/eNVM PIM with bounded QoS deltas; and open governance so new models/devices slot in without redesign.
As summarized in Figure 7, the path toward practical DNN deployment on PIM/IMC systems can be organized as a staged progression from foundational benchmarking and architecture–mapping co-design to learning-augmented mapping, unified evaluation, and eventual deployment standardization. This roadmap emphasizes that meaningful progress depends not on isolated advances at a single layer, but on coordinated development across hardware design, mapping intelligence, compiler infrastructure, benchmarking practice, and industrial interoperability.
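The uniform metrics in item (4) and the guardrails in the mid-term stage of item (5) can be made concrete with a short sketch. The thresholds (2% accuracy drift, 5 ms replan latency) come from the roadmap text; treating drift as an absolute drop in top-1 accuracy expressed as a fraction is an assumption, as are the function names and example values.

```python
def edp(energy_j, latency_s):
    """Energy-delay product in joule-seconds."""
    return energy_j * latency_s


def accept_remap(acc_before, acc_after, replan_ms,
                 max_drift=0.02, max_replan_ms=5.0):
    """Accept a proposed remapping only if both guardrails hold.

    Drift is interpreted here (an assumption) as the absolute accuracy
    drop, so an accuracy improvement never triggers a rejection.
    """
    drift = acc_before - acc_after
    return drift <= max_drift and replan_ms <= max_replan_ms


# Hypothetical measurements for three candidate remappings.
ok = accept_remap(acc_before=0.760, acc_after=0.745, replan_ms=3.0)
too_lossy = accept_remap(acc_before=0.760, acc_after=0.730, replan_ms=3.0)
too_slow = accept_remap(acc_before=0.760, acc_after=0.755, replan_ms=6.0)

score = edp(energy_j=2e-3, latency_s=5e-3)
print(ok, too_lossy, too_slow, f"EDP={score:.2e} J*s")
```

A closed-loop mapper would evaluate a check like `accept_remap` before committing each replan, and log EDP alongside it so that guardrail decisions and efficiency gains can be audited together.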

10. Conclusions

Deep neural network deployment on processing-in-memory (PIM) and in-memory computing (IMC) platforms is moving through a critical transition. Since 2019, the field has progressed from device-specific prototypes and static, rule-based mappers to compiler-integrated frameworks that combine graph-aware scheduling, evolutionary search, and learning-driven design space exploration. This survey synthesizes that evolution across four tightly coupled axes: hardware substrates, mapping and scheduling strategies, compiler toolchains, and benchmarking infrastructure, drawing insights from more than twenty representative platforms reported between 2019 and 2025. While the trajectory signals growing maturity, four persistent barriers continue to limit scalable, production-grade adoption:

1.: Fragmented evaluation standards. The absence of open, execution-graph benchmarks with shared metrics impedes reproducibility and cross-platform comparability.
2.: Narrow workload coverage and limited training support. Most frameworks remain tuned to CNN inference, with insufficient optimization for transformers, large-scale GNNs, diffusion models, and on-device training, fine-tuning, and continual learning.
3.: Simulation-dominated validation. Heavy reliance on idealized models inflates performance claims; only a minority of systems are validated on silicon under realistic non-idealities (thermal drift, device and process variation, fault resilience).
4.: Weak cross-layer integration. Pruning and quantization are common but rarely co-designed with device-aware scheduling or adaptive runtime control, leaving efficiency and robustness unrealized.

These issues are compounded by fragmented datasets and inconsistent reporting (e.g., mismatched model variants, batch sizes, and metrics), which undermine transparency and slow technology transfer.

To address these challenges, we adopt the five-phase roadmap summarized in Figure 7:

1.: Phase 1: Benchmark and reporting baselines. Establish open repositories of execution graphs and benchmarking scripts spanning vision, language, graph analytics, and scientific workloads, with explicit control of model variants, batch sizes, precision, and metrics to enable fair comparisons.
2.: Phase 2: Non-ideality-aware evaluation. Integrate device non-idealities and environmental factors into evaluation pipelines, and standardize robustness and accuracy-stability reporting under drift, variation, and fault scenarios.
3.: Phase 3: Open compiler foundations. Develop open-source compilers and runtimes with a common intermediate representation and modular backends for DRAM and SRAM PIM as well as RRAM, PCM, and FeFET-based IMC, enabling reusable optimization passes across substrates.
4.: Phase 4: Cross-layer adaptive co-design. Unify mapping, scheduling, and precision management with algorithm-level techniques (e.g., pruning and quantization) in a device-aware manner, and support runtime reconfiguration based on workload phase behavior and system constraints.
5.: Phase 5: Deployment at scale. Validate frameworks on heterogeneous, production-grade systems with mixed workloads and stress tests (including adversarial and corner-case conditions), and evaluate not only throughput and energy but also robustness, accuracy stability, maintainability, and lifecycle costs.

Looking ahead, meta-compiler architectures are poised to be transformative: self-optimizing toolchains that learn from deployment telemetry, retune data placement and parallelism online, and enforce device constraints in closed-loop control. Such adaptability at the hardware–software boundary can move PIM and IMC accelerators beyond static flows toward workload-agnostic, context-aware co-processors that support both inference and efficient on-device adaptation.

Realizing this vision requires more than architectural ingenuity; it calls for a community commitment to open science through open-sourcing code and datasets, adopting reproducible evaluation pipelines, and building shared infrastructure that bridges academic prototypes and deployable systems. By harmonizing architectural innovation, intelligent mapping, and reproducible benchmarking into a unified, data-driven deployment ecosystem, PIM and IMC platforms can advance from promising demonstrations to trusted, robust, and energy-efficient AI across edge devices, data centers, and emerging neuromorphic systems.

Author Contributions

S.M.M. prepared the manuscript, collected and analyzed the literature, organized the figures and tables, and wrote the original draft. S.C. supervised the work, provided guidance on the manuscript structure and scientific content, and reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 92473114.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sebastian, A.; Gallo, M.L.; Khaddam-Aljameh, R.; Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 2020, 15, 529–544. [Google Scholar] [CrossRef]
Ielmini, D.; Wong, H. In-memory computing with resistive switching devices. Nat. Electron. 2018, 1, 333–343. [Google Scholar] [CrossRef]
Zou, X.; Xu, S.; Chen, X.; Yan, L.; Han, Y. Breaking the von Neumann bottleneck: Architecture-level processing-in-memory technology. Sci. China Inf. Sci. 2021, 64, 160404. [Google Scholar] [CrossRef]
Mutlu, O.; Ghose, S.; Gómez-Luna, J.; Ausavarungnirun, R. Processing Data Where It Makes Sense: Enabling In-Memory Computation. Microprocess. Microsyst. 2019, 67, 28–41. [Google Scholar] [CrossRef]
Gupta, S.; Imani, M.; Kaur, H.; Rosing, T. NNPIM: A Processing In-Memory Architecture for Neural Network Acceleration. IEEE Trans. Comput. 2019, 68, 1325–1337. [Google Scholar] [CrossRef]
Long, Y.; Kim, D.; Lee, E.; Saha, P.; Mudassar, B.; She, X.; Khan, A.; Mukhopadhyay, S. A Ferroelectric FET-Based Processing-in-Memory Architecture for DNN Acceleration. IEEE J. Explor. Solid-State Comput. Devices Circuits 2019, 5, 113–122. [Google Scholar] [CrossRef]
Sutradhar, P.R.; Bavikadi, S.; Connolly, M.; Prajapati, S.; Indovina, M.A.; Dinakarrao, S.M.P.; Ganguly, A. Look-up-Table Based Processing-in-Memory Architecture with Programmable Precision-Scaling for Deep Learning Applications. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 263–275. [Google Scholar] [CrossRef]
Kim, C.H.; Lee, W.; Paik, Y.; Kwon, K.; Kim, S.; Park, I.; Kim, S.W. Silent-PIM: Realizing the Processing-in-Memory Computing with Standard Memory Requests. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 251–262. [Google Scholar] [CrossRef]
Gómez-Luna, J.; Hajj, I.E.; Fernandez, I.; Giannoula, C.; Oliveira, G.F.; Mutlu, O. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access 2022, 10, 52565–52608. [Google Scholar] [CrossRef]
Jain, S.; Ranjan, A.; Roy, K.; Raghunathan, A. Computing in Memory with Spin-Transfer Torque Magnetic RAM. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 26, 470–483. [Google Scholar] [CrossRef]
Gallo, M.L.; Sebastian, A.; Mathis, R.; Manica, M.; Giefers, H.; Tůma, T.; Bekas, C.; Curioni, A.; Eleftheriou, E. Mixed-precision in-memory computing. Nat. Electron. 2017, 1, 246–253. [Google Scholar] [CrossRef]
Soliman, T.; Chatterjee, S.; Laleni, N.; Müller, F.; Kirchner, T.; Wehn, N.; Kämpfe, T.; Chauhan, Y.; Amrouch, H. First demonstration of in-memory computing crossbar using multi-level Cell FeFET. Nat. Commun. 2023, 14, 6348. [Google Scholar] [CrossRef] [PubMed]
Lu, Z.; Wang, X.; Arafin, M.T.; Yang, H.; Liu, Z.; Zhang, J.; Qu, G. An RRAM-Based Computing-in-Memory Architecture and Its Application in Accelerating Transformer Inference. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 485–496. [Google Scholar] [CrossRef]
Leitersdorf, O.; Ronen, R.; Kvatinsky, S. MultPIM: Fast Stateful Multiplication for Processing-in-Memory. IEEE Trans. Circuits Syst. II Express Briefs 2021, 69, 1647–1651. [Google Scholar] [CrossRef]
Kim, D.; Yu, C.; Xie, S.; Chen, Y.; Kim, J.Y.; Kim, B.; Kulkarni, J.; Kim, T.T. An Overview of Processing-in-Memory Circuits for Artificial Intelligence and Machine Learning. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 338–353. [Google Scholar] [CrossRef]
Ali, M.; Roy, S.; Saxena, U.; Sharma, T.; Raghunathan, A.; Roy, K. Compute-in-Memory Technologies and Architectures for Deep Learning Workloads. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 1615–1630. [Google Scholar] [CrossRef]
Mittal, S.; Verma, G.; Kaushik, B.K.; Khanday, F.A. A survey of SRAM-based in-memory computing techniques and applications. J. Syst. Archit. 2021, 119, 102276. [Google Scholar] [CrossRef]
Cheng, C.; Tiw, P.J.; Cai, Y.; Yan, X.; Yang, Y.; Huang, R. In-memory computing with emerging nonvolatile memory devices. Sci. China Inf. Sci. 2021, 64, 221402. [Google Scholar] [CrossRef]
Ahn, J.; Yoo, S.; Mutlu, O.; Choi, K. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA); Association for Computing Machinery: New York, NY, USA, 2015; pp. 336–348. [Google Scholar] [CrossRef]
Mittal, S. A Survey of ReRAM-Based Architectures for Processing-In-Memory and Neural Networks. Mach. Learn. Knowl. Extr. 2018, 1, 75–114. [Google Scholar] [CrossRef]
Wang, F.; Li, J.; Zhang, Z.; Ding, Y.; Xiong, Y.; Hou, X.; Chen, H.; Zhou, P. Multifunctional computing-in-memory SRAM cells based on two-surface-channel MoS₂ transistors. iScience 2021, 24, 103138. [Google Scholar] [CrossRef]
Chen, X.; Wang, X.; Jia, X.; Yang, J.; Qu, G.; Zhao, W. Accelerating Graph-Connected Component Computation with Emerging Processing-In-Memory Architecture. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5333–5342. [Google Scholar] [CrossRef]
Jonatan, G.; Cho, H.; Son, H.; Wu, X.; Livesay, N.; Mora, E.; Shivdikar, K.; Abellán, J.L.; Joshi, A.; Kaeli, D.R.; et al. Scalability Limitations of Processing-in-Memory using Real System Evaluations. Proc. ACM Meas. Anal. Comput. Syst. 2024, 8, 5. [Google Scholar] [CrossRef]
Lin, J.; Qu, H.; Ma, S.; Ji, X.; Li, H.; Li, X.; Song, C.; Zhang, W. SongC: A Compiler for Hybrid Near-Memory and In-Memory Many-Core Architecture. IEEE Trans. Comput. 2024, 73, 2420–2433. [Google Scholar] [CrossRef]
Rashed, M.; Thijssen, S.; Jha, S.K.; Ewetz, R. LOGIC: Logic Synthesis for Digital In-Memory Computing. ACM Trans. Des. Autom. Electron. Syst. 2025, 30, 25. [Google Scholar] [CrossRef]
Zhu, Z.; Sun, H.; Xie, T.; Zhu, Y.; Dai, G.; Xia, L.; Niu, D.; Chen, X.; Hu, X.S.; Cao, Y.; et al. MNSIM 2.0: A Behavior-Level Modeling Tool for Processing-In-Memory Architectures. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4112–4125. [Google Scholar] [CrossRef]
Forlin, B.E.; Santos, P.C.; Becker, A.E.; Alves, M.; Carro, L. Sim2PIM: A complete simulation framework for Processing-in-Memory. J. Syst. Archit. 2022, 128, 102528. [Google Scholar] [CrossRef]
Perach, B.; Ronen, R.; Kimelfeld, B.; Kvatinsky, S. Understanding Bulk-Bitwise Processing In-Memory Through Database Analytics. IEEE Trans. Emerg. Top. Comput. 2022, 12, 7–22. [Google Scholar] [CrossRef]
Xu, S.; Chen, X.; Wang, Y.; Han, Y.; Qian, X.; Li, X. PIMSim: A Flexible and Detailed Processing-in-Memory Simulator. IEEE Comput. Archit. Lett. 2019, 18, 6–9. [Google Scholar] [CrossRef]
Wu, N.; Xie, Y. A Survey of Machine Learning for Computer Architecture and Systems. ACM Comput. Surv. 2021, 55, 54. [Google Scholar] [CrossRef]
Asif, N.A.; Sarker, Y.; Chakrabortty, R.; Ryan, M.J.; Ahamed, M.; Saha, D.K.; Badal, F.; Das, S.; Ali, M.; Moyeen, S.I.; et al. Graph Neural Network: A Comprehensive Review on Non-Euclidean Space. IEEE Access 2021, 9, 60588–60606. [Google Scholar] [CrossRef]
Bilot, T.; El Madhoun, N.; Al Agha, K.; Zouaoui, A. A Survey on Malware Detection with Graph Representation Learning. ACM Comput. Surv. 2023, 56, 278. [Google Scholar] [CrossRef]
Liu, F.; Zhao, W.; Wang, Z.; Zhao, Y.; Yang, T.; Chen, Y.; Jiang, L. IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5313–5326. [Google Scholar] [CrossRef]
Kierner, S.; Kucharski, J.; Kierner, Z. Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review. J. Biomed. Inform. 2023, 144, 104428. [Google Scholar] [CrossRef] [PubMed]
Han, R.; John, L.; Zhan, J. Benchmarking Big Data Systems: A Review. IEEE Trans. Serv. Comput. 2018, 11, 580–597. [Google Scholar] [CrossRef]
Di Bartolomeo, S.; Crnovrsanin, T.; Saffo, D.; Puerta, E.; Wilson, C.; Dunne, C. Evaluating Graph Layout Algorithms: A Systematic Review of Methods and Best Practices. Comput. Graph. Forum 2024, 43, e15073. [Google Scholar] [CrossRef]
Meng, J.; Shim, W.; Yang, L.; Yeo, I.; Fan, D.; Yu, S.; Seo, J.W. Temperature-Resilient RRAM-Based In-Memory Computing for DNN Inference. IEEE Micro 2022, 42, 89–98. [Google Scholar] [CrossRef]
Wang, Y.; Qin, Y.; Liu, L.; Wei, S.; Yin, S. SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor with Dynamic Sub-Structured Weight Pruning. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 4014–4027. [Google Scholar] [CrossRef]
Peng, X.; Huang, S.; Jiang, H.; Lu, A.; Yu, S. DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 40, 2306–2319. [Google Scholar] [CrossRef]
Zhu, J.; Zhang, T.; Yang, Y.; Huang, R. A comprehensive review on emerging artificial neuromorphic devices. Appl. Phys. Rev. 2020, 7, 011312. [Google Scholar] [CrossRef]
Bavikadi, S.; Sutradhar, P.R.; Khasawneh, K.N.; Ganguly, A.; Dinakarrao, S.M.P. A Review of In-Memory Computing Architectures for Machine Learning Applications. In Proceedings of the 2020 Great Lakes Symposium on VLSI; Association for Computing Machinery: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
Mannocci, P.; Farronato, M.; Lepri, N.; Cattaneo, L.; Glukhov, A.; Sun, Z.; Ielmini, D. In-memory computing with emerging memory devices: Status and outlook. APL Mach. Learn. 2023, 1, 010902. [Google Scholar] [CrossRef]
Ielmini, D.; Pedretti, G. Device and Circuit Architectures for In-Memory Computing. Adv. Intell. Syst. 2020, 2, 2000040. [Google Scholar] [CrossRef]
Hussain, H.; Tamizharasan, P.; Rahul, C.S. Design possibilities and challenges of DNN models: A review on the perspective of end devices. Artif. Intell. Rev. 2022, 55, 5109–5167. [Google Scholar] [CrossRef]
Yu, S. Neuro-Inspired Computing with Emerging Nonvolatile Memorys. Proc. IEEE 2018, 106, 260–285. [Google Scholar] [CrossRef]
Kwak, H.; Kim, N.; Jeon, S.; Kim, S.; Woo, J. Electrochemical random-access memory: Recent advances in materials, devices, and systems towards neuromorphic computing. Nano Converg. 2024, 11, 9. [Google Scholar] [CrossRef] [PubMed]
Kingra, S.K.; Parmar, V.; Chang, C.C.; Hudec, B.; Hou, T. SLIM: Simultaneous Logic-in-Memory Computing Exploiting Bilayer Analog OxRAM Devices. Sci. Rep. 2020, 10, 2567. [Google Scholar] [CrossRef] [PubMed]
Ling, Y.; Wang, Z.; Yang, Y.; Bao, L.; Bao, S.; Wang, Q.; Cai, Y.; Huang, R. An isolated symmetrical 2T2R cell enabling high precision and high density for RRAM-based in-memory computing. Sci. China Inf. Sci. 2024, 67, 152402. [Google Scholar] [CrossRef]
Ren, S.; Dong, A.W.; Yang, L.; Xue, Y.B.; Li, J.; Yu, Y.; Zhou, H.; Zuo, W.B.; Li, Y.; Cheng, W.M.; et al. Self-Rectifying Memristors for Three-Dimensional In-Memory Computing. Adv. Mater. 2023, 36, 2307218. [Google Scholar] [CrossRef]
Boniardi, M.; Baldo, M.; Allegra, M.; Redaelli, A. Phase Change Memory: A Review on Electrical Behavior and Use in Analog In-Memory-Computing (A-IMC) Applications. Adv. Electron. Mater. 2024, 10, 2400599. [Google Scholar] [CrossRef]
Kim, J.; Kang, S.; Lee, S.; Ro, Y.; Lee, S.; Wang, D.; Choi, J.; So, J.; Cho, Y.; Song, J.; et al. Aquabolt-XL HBM2-PIM, LPDDR5-PIM with In-Memory Processing, and AXDIMM with Acceleration Buffer. IEEE Micro 2022, 42, 20–30. [Google Scholar] [CrossRef]
Azarkhish, E.; Rossi, D.; Loi, I.; Benini, L. Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes. IEEE Trans. Parallel Distrib. Syst. 2017, 29, 420–434. [Google Scholar] [CrossRef]
Lee, W.; Kim, C.H.; Paik, Y.; Park, J.; Park, I.; Kim, S. Design of Processing-“Inside”-Memory Optimized for DRAM Behaviors. IEEE Access 2019, 7, 82633–82648. [Google Scholar] [CrossRef]
Kim, J.; Lee, J.; Lee, J.; Heo, J.; Kim, J.Y. Z-PIM: A Sparsity-Aware Processing-in-Memory Architecture with Fully Variable Weight Bit-Precision for Energy-Efficient Deep Neural Networks. IEEE J. Solid-State Circuits 2021, 56, 1093–1104. [Google Scholar] [CrossRef]
Mamdouh, A.; Geng, H.; Niemier, M.; Hu, X.S.; Reis, D. Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025. [Google Scholar] [CrossRef]
Kim, B.; Lee, C.; Kim, G.; Park, E. Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization. IEEE Comput. Archit. Lett. 2025, 24, 53–56. [Google Scholar] [CrossRef]
Kim, C.H.; Lee, W.; Paik, Y.; Kim, S.; Kim, S.W. BL-PIM: Varying the Burst Length to Realize the All-Bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM. IEEE Access 2023, 11, 81143–81156. [Google Scholar] [CrossRef]
Kim, S.; Kim, S.; Cho, K.; Shin, T.; Park, H.; Lho, D.; Park, S.; Son, K.; Park, G.; Jeong, S.; et al. Signal Integrity and Computing Performance Analysis of a Processing-In-Memory of High Bandwidth Memory (PIM-HBM) Scheme. IEEE Trans. Compon. Packag. Manuf. Technol. 2021, 11, 1955–1970. [Google Scholar] [CrossRef]
Zhang, B.; Yin, S.; Kim, M.; Saikia, J.; Kwon, S.C.; Myung, S.; Kim, H.; Kim, S.J.; Seo, J.-S.; Seok, M. PIMCA: A Programmable In-Memory Computing Accelerator for Energy-Efficient DNN Inference. IEEE J. Solid-State Circuits 2023, 58, 1436–1449. [Google Scholar] [CrossRef]
Gauchi, R.; Kooli, M.; Vivet, P.; Noël, J.; Beigné, E.; Mitra, S.; Charles, H. Memory Sizing of a Scalable SRAM In-Memory Computing Tile Based Architecture. In Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC); IEEE: Piscataway, NJ, USA, 2019; pp. 166–171. [Google Scholar] [CrossRef]
Zhou, W.; Farmakidis, N.; Feldmann, J.; Li, X.; Tan, J.Y.S.; He, Y.; Wright, C.; Pernice, W.; Bhaskaran, H. Phase-change materials for energy-efficient photonic memory and computing. MRS Bull. 2022, 47, 502–510. [Google Scholar] [CrossRef]
Le Gallo, M.; Khaddam-Aljameh, R.; Stanisavljevic, M.; Vasilopoulos, A.; Kersting, B.; Dazzi, M.; Karunaratne, G.; Braendli, M.; Singh, A.; Mueller, S.; et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 2022, 6, 680–693. [Google Scholar] [CrossRef]
Kim, G.; Ko, D.H.; Kim, T.; Lee, S.; Jung, M.; Lee, Y.K.; Lim, S.; Jo, M.; Eom, T.; Shin, H.; et al. Power-Delay Area-Efficient Processing-In-Memory Based on Nanocrystalline Hafnia Ferroelectric Field-Effect Transistors. ACS Appl. Mater. Interfaces 2022, 15, 1463–1474. [Google Scholar] [CrossRef]
Kim, M.; Lee, K.; Kim, S.; Lee, J.H.; Park, B.G.; Kwon, D. Double-Gated Ferroelectric-Gate Field-Effect-Transistor for Processing in Memory. IEEE Electron Device Lett. 2021, 42, 1607–1610. [Google Scholar] [CrossRef]
Lee, M.; Narayan, D.M.; Kim, J.H.; Le, D.N.; Shirodkar, S.; Park, S.C.; Kang, J.; Lee, S.; Ahn, Y.; Ryu, S.W.; et al. Hafnium Oxide-Based Ferroelectric Devices for In-Memory Computing: Resistive and Capacitive Approaches. ACS Appl. Electron. Mater. 2024, 6, 5391–5401. [Google Scholar] [CrossRef]
Jang, Y.; Kim, D.; Kim, Y.; Park, J. Big-Computing and Little-Storing STT-MRAM PIM Architecture with Charge Domain Based MAC Operation. IEEE Trans. Comput. 2025, 74, 1239–1252. [Google Scholar] [CrossRef]
Kim, T.; Jang, Y.; Kang, M.; Park, B.; Lee, K.J.; Park, J. SOT-MRAM Digital PIM Architecture with Extended Parallelism in Matrix Multiplication. IEEE Trans. Comput. 2022, 71, 2816–2828. [Google Scholar] [CrossRef]
Li, Y.; Bai, T.; Xu, X.; Zhang, Y.; Wu, B.; Cai, H.; Pan, B.; Zhao, W. A Survey of MRAM-Centric Computing: From Near Memory to In Memory. IEEE Trans. Emerg. Top. Comput. 2023, 11, 318–330. [Google Scholar] [CrossRef]
Ghazal, O.; Wang, W.; Kvatinsky, S.; Merchant, F.; Yakovlev, A.; Shafik, R. IMPACT: In-Memory ComPuting Architecture based on Y-FlAsh Technology for Coalesced Tsetlin machine inference. Philos. Trans. Ser. A Math. Phys. Eng. Sci. 2025, 383, 20230393. [Google Scholar] [CrossRef]
He, S.; Zhong, W.; Zhu, M.; Wu, S.; Xie, W.; Ouyang, Z.; Cheng, B.; Zhao, J. In-Memory Computing with Self-Rectification and Dynamic Logical Reconfiguration of 12 Algorithms in a Single Halide Perovskites. Adv. Funct. Mater. 2025, 35, 2424114. [Google Scholar] [CrossRef]
Ren, S.; Xue, Y.B.; Zhang, Y.; Li, Y.; Miao, X. 3D Vertical Self-Rectifying Memristor Arrays with Split-Cell Structure, Large Nonlinearity (>10⁴) and fJ-Level Switching Energy. IEEE Electron Device Lett. 2023, 44, 2059–2062. [Google Scholar] [CrossRef]
Chung, K.Y.; Kim, H.; An, Y.; Seong, K.; Shin, D.H.; Baek, K.H.; Shim, Y. 8T-SRAM Based Process-In-Memory (PIM) System with Current Mirror for Accurate MAC Operation. IEEE Access 2024, 12, 24254–24261. [Google Scholar] [CrossRef]
Duan, C.; Yang, J.; He, X.; Qi, Y.; Wang, Y.; Wang, Y.; He, Z.; Yan, B.; Wang, X.; Jia, X.; et al. DDC-PIM: Efficient Algorithm/Architecture Co-Design for Doubling Data Capacity of SRAM-Based Processing-in-Memory. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 906–918. [Google Scholar] [CrossRef]
Park, M.; Hwang, J.; Kim, S.; Shin, W.; Shim, W.; Bae, J.H.; Lee, J.; Cho, S. Charge-trap synaptic device with polycrystalline silicon channel for low power in-memory computing. Sci. Rep. 2024, 14, 29089. [Google Scholar] [CrossRef]
Kim, S.; Um, S.; Jo, W.; Lee, J.; Ha, S.; Li, Z.; Yoo, H.J. Scaling-CIM: EDRAM In-Memory-Computing Accelerator with Dynamic-Scaling ADC and Adaptive Analog Operation. IEEE J. Solid-State Circuits 2024, 59, 2694–2705. [Google Scholar] [CrossRef]
Azamat, A.; Asim, F.; Kim, J.; Lee, J. Partial Sum Quantization for Reducing ADC Size in ReRAM-Based Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4897–4908. [Google Scholar] [CrossRef]
Xu, J.; Liu, H.; Duan, Z.; Liao, X.; Jin, H.; Yang, X.; Li, H.; Liu, C.; Mao, F.; Zhang, Y. ReHarvest: An ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators. ACM Trans. Archit. Code Optim. 2024, 21, 63. [Google Scholar] [CrossRef]
Amin, M.H.; Elbtity, M.E.; Zand, R. Xbar-Partitioning: A Practical Way for Parasitics and Noise Tolerance in Analog IMC Circuits. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 867–877. [Google Scholar] [CrossRef]
Peng, X.; Liu, R.; Yu, S. Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-In-Memory Architectures. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 67, 1333–1343. [Google Scholar] [CrossRef]
Gong, N.; Idé, T.; Kim, S.; Boybat, I.; Sebastian, A.; Narayanan, V.; Ando, T. Signal and noise extraction from analog memory elements for neuromorphic computing. Nat. Commun. 2018, 9, 2102. [Google Scholar] [CrossRef]
Han, L.; Huang, P.; Wang, Y.; Zhou, Z.; Yang, H.; Chen, Y.; Liu, X.; Kang, J. Mitigating methodology of hardware non-ideal characteristics for non-volatile memory based neural networks. Sci. China Inf. Sci. 2025, 68, 122403. [Google Scholar] [CrossRef]
Saragada, P.K.; Das, B.P. Process-Variation-Aware In-Memory Computation with Improved Linearity Using On-Chip Configurable Current-Steering Thermometric DAC. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 4586–4596. [Google Scholar] [CrossRef]
Kneip, A.; Bol, D. Impact of Analog Non-Idealities on the Design Space of 6T-SRAM Current-Domain Dot-Product Operators for In-Memory Computing. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 1931–1944. [Google Scholar] [CrossRef]
Kim, Y.; Kwon, Y.J.; Kim, J.; An, C.H.; Park, T.; Kwon, D.; Woo, H.C.; Kim, H.; Yoon, J.; Hwang, C. Novel Selector-Induced Current-Limiting Effect through Asymmetry Control for High-Density One-Selector–One-Resistor Crossbar Arrays. Adv. Electron. Mater. 2019, 5, 1800806. [Google Scholar] [CrossRef]
Park, J.H.; Kim, D.; Kang, D.; Jeon, D.; Kim, T.G. Nanoscale 3D-Stackable Ag-doped HfOx-Based Selector Devices Fabricated through Low-Temperature Hydrogen Annealing. ACS Appl. Mater. Interfaces 2019, 11, 29408–29415. [Google Scholar] [CrossRef]
Lin, C.Y.; Tseng, Y.T.; Chen, P.H.; Chang, T.; Eshraghian, J.; Wang, Q.; Lin, Q.; Tan, Y.F.; Tai, M.; Hung, W.C.; et al. A high-speed MIM resistive memory cell with an inherent vanadium selector. Appl. Mater. Today 2020, 21, 100848. [Google Scholar] [CrossRef]
Upadhyay, N.; Sun, W.; Lin, P.; Joshi, S.; Midya, R.; Zhang, X.; Wang, Z.; Jiang, H.; Yoon, J.; Rao, M.; et al. A Memristor with Low Switching Current and Voltage for 1S1R Integration and Array Operation. Adv. Electron. Mater. 2020, 6, 1901411. [Google Scholar] [CrossRef]
Bae, Y.C.; Lee, A.R.; Baek, G.; Chung, J.B.; Kim, T.Y.; Park, J.G.; Hong, J.P. All oxide semiconductor-based bidirectional vertical p-n-p selectors for 3D stackable crossbar-array electronics. Sci. Rep. 2015, 5, 13362. [Google Scholar] [CrossRef] [PubMed]
Sun, L.; Zhang, Y.; Han, G.; Hwang, G.; Jiang, J.; Joo, B.; Watanabe, K.; Taniguchi, T.; Kim, Y.M.; Yu, W.; et al. Self-selective van der Waals heterostructures for large scale memory array. Nat. Commun. 2019, 10, 3161. [Google Scholar] [CrossRef]
Luo, Q.; Xu, X.; Liu, H.; Lv, H.; Gong, T.; Long, S.; Liu, Q.; Sun, H.; Banerjee, W.; Li, L.; et al. Super non-linear RRAM with ultra-low power for 3D vertical nano-crossbar arrays. Nanoscale 2016, 8, 15629–15636. [Google Scholar] [CrossRef] [PubMed]
Rao, M.; Song, W.; Kiani, F.; Asapu, S.; Zhuo, Y.; Midya, R.; Upadhyay, N.; Wu, Q.; Barnell, M.D.; Lin, P.; et al. Timing Selector: Using Transient Switching Dynamics to Solve the Sneak Path Issue of Crossbar Arrays. Small Sci. 2021, 2, 2100072. [Google Scholar] [CrossRef] [PubMed]
Li, S.; Pam, M.; Li, Y.; Chen, L.; Chien, Y.C.; Fong, X.; Chi, D.; Ang, K. Wafer-Scale 2D Hafnium Diselenide Based Memristor Crossbar Array for Energy-Efficient Neural Network Hardware. Adv. Mater. 2021, 34, 2103376. [Google Scholar] [CrossRef]
Jain, S.; Li, S.; Zheng, H.; Li, L.; Fong, X.; Ang, K. Heterogeneous integration of 2D memristor arrays and silicon selectors for compute-in-memory hardware in convolutional neural networks. Nat. Commun. 2025, 16, 2719. [Google Scholar] [CrossRef]
Filipovich, M.; Guo, Z.; Al-Qadasi, M.; Marquez, B.A.; Morison, H.; Sorger, V.; Prucnal, P.; Shekhar, S.; Shastri, B. Silicon Photonic Architecture for Training Deep Neural Networks with Direct Feedback Alignment. Optica 2021, 9, 1323–1332. [Google Scholar] [CrossRef]
Feldmann, J.; Youngblood, N.; Karpov, M.; Gehring, H.; Li, X.; Stappers, M.; Le Gallo, M.; Fu, X.; Lukashchuk, A.; Raja, A.; et al. Parallel convolutional processing using an integrated photonic tensor core. Nature 2021, 589, 52–58. [Google Scholar] [CrossRef] [PubMed]
Totović, A.; Dabos, G.; Passalis, N.; Tefas, A.; Pleros, N. Femtojoule per MAC Neuromorphic Photonics: An Energy and Technology Roadmap. IEEE J. Sel. Top. Quantum Electron. 2020, 26, 1–15. [Google Scholar] [CrossRef]
Nahmias, M.; de Lima, T.F.; Tait, A.; Peng, H.T.; Shastri, B.; Prucnal, P. Photonic Multiply-Accumulate Operations for Neural Networks. IEEE J. Sel. Top. Quantum Electron. 2020, 26, 8800115. [Google Scholar] [CrossRef]
Jain, S.; Sengupta, A.; Roy, K.; Raghunathan, A. RxNN: A Framework for Evaluating Deep Neural Networks on Resistive Crossbars. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 40, 326–338. [Google Scholar] [CrossRef]
Burrello, A.; Garofalo, A.; Bruschi, N.; Tagliavini, G.; Rossi, D.; Conti, F. DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs. IEEE Trans. Comput. 2020, 70, 1253–1268. [Google Scholar] [CrossRef]
Wang, Y.; Fong, X. Benchmarking DNN Mapping Methods for the in-Memory Computing Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2023, 13, 1040–1051. [Google Scholar] [CrossRef]
Kwon, H.; Chatarasi, P.; Sarkar, V.; Krishna, T.; Pellauer, M.; Parashar, A. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. IEEE Micro 2020, 40, 20–29. [Google Scholar] [CrossRef]
Wang, Y.; Zhao, Z.; Jin, X.; Zheng, H.; Nie, M.; Zou, Q.; Shi, C. AutoMap: Automatic Mapping of Neural Networks to Deep Learning Accelerators for Edge Devices. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 2994–3006. [Google Scholar] [CrossRef]
Kim, S.; Lee, J.; Paik, Y.; Kim, C.H.; Lee, W.; Kim, S.W. Optimal Model Partitioning with Low-Overhead Profiling on the PIM-based Platform for Deep Learning Inference. ACM Trans. Des. Autom. Electron. Syst. 2023, 29, 28. [Google Scholar] [CrossRef]
Wang, J.; Ge, M.; Ding, B.; Xu, Q.; Chen, S.; Kang, Y. NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3-D Stacked-DRAM. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 1456–1469. [Google Scholar] [CrossRef]
Sun, X.; Wang, X.; Li, W.; Han, Y.; Chen, X. PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 44, 1745–1759. [Google Scholar] [CrossRef]
Wang, X.; Zhou, M.; Rosing, T.S. Fast-OverlaPIM: A Fast Overlap-Driven Mapping Framework for Processing In-Memory Neural Network Acceleration. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 44, 130–143. [Google Scholar] [CrossRef]
Dai, P.; Han, B.; Li, K.; Xu, X.; Xing, H.; Liu, K. Joint Optimization of Device Placement and Model Partitioning for Cooperative DNN Inference in Heterogeneous Edge Computing. IEEE Trans. Mob. Comput. 2025, 24, 210–226. [Google Scholar] [CrossRef]
Jun, H.; Kim, T.; Kim, S.C.; Eom, Y.I. A Hierarchical Dispatcher for Scheduling Multiple Deep Neural Networks (DNNs) on Edge Devices. Sensors 2025, 25, 2243. [Google Scholar] [CrossRef] [PubMed]
Ogbogu, C.; Narang, G.; Joardar, B.K.; Doppa, J.; Chakrabarty, K.; Pande, P. HuNT: Exploiting Heterogeneous PIM Devices to Design a 3-D Manycore Architecture for DNN Training. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3300–3311. [Google Scholar] [CrossRef]
Zou, X.; Chen, C.; Lin, P.; Zhang, L.; Xu, Y.; Zhang, W. Scalable Heterogeneous Scheduling Based Model Parallelism for Real-Time Inference of Large-Scale Deep Neural Networks. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2962–2973. [Google Scholar] [CrossRef]
Krishnan, G.; Wang, Z.; Yeo, I.; Meng, J.; Liehr, M.; Joshi, R.; Cady, N.; Fan, D.; Seo, J.-S.; Cao, Y. Hybrid RRAM/SRAM in-Memory Computing for Robust DNN Acceleration. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 4241–4252. [Google Scholar] [CrossRef]
Huang, B.; Huang, X.; Liu, X.; Ding, C.; Yin, Y.; Deng, S. Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment. Comput. Commun. 2023, 215, 169–179. [Google Scholar] [CrossRef]
Bai, C.; Wei, X.; Zhuo, Y.; Cai, Y.; Zheng, H.; Yu, B.; Xie, Y. Klotski v2: Improved DNN Model Orchestration Framework for Dataflow Architecture Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025, 44, 1045–1058. [Google Scholar] [CrossRef]
Mirmahaleh, S.Y.H.; Reshadi, M.; Bagherzadeh, N.; Khademzadeh, A. Data scheduling and placement in deep learning accelerator. Clust. Comput. 2021, 24, 3651–3669. [Google Scholar] [CrossRef]
Shi, L.; Xu, Z.; Sun, Y.; Shi, Y.; Fan, Y.; Ding, X. A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system. Peer-To-Peer Netw. Appl. 2021, 14, 4031–4045. [Google Scholar] [CrossRef]
Li, J.; Liang, W.; Li, Y.; Xu, Z.; Jia, X.; Guo, S. Throughput Maximization of Delay-Aware DNN Inference in Edge Computing by Exploring DNN Model Partitioning and Inference Parallelism. IEEE Trans. Mob. Comput. 2023, 22, 3017–3030. [Google Scholar] [CrossRef]
Chen, Z.; Hu, J.; Chen, X.; Hu, J.; Zheng, X.; Min, G. Computation Offloading and Task Scheduling for DNN-Based Applications in Cloud-Edge Computing. IEEE Access 2020, 8, 115537–115547. [Google Scholar] [CrossRef]
Zhang, J.; Niu, G.; Dai, Q.; Li, H.; Wu, Z.; Dong, F.; Wu, Z. PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters. Neurocomputing 2023, 555, 126661. [Google Scholar] [CrossRef]
Chen, X.; Zhang, J.; Lin, B.; Chen, Z.; Wolter, K.; Min, G. Energy-Efficient Offloading for DNN-Based Smart IoT Systems in Cloud-Edge Environments. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 683–697. [Google Scholar] [CrossRef]
Sahoo, R.M.; Padhy, S. A novel algorithm for priority-based task scheduling on a multiprocessor heterogeneous system. Microprocess. Microsyst. 2022, 95, 104685. [Google Scholar] [CrossRef]
Li, Y.; Ge, X.; Lei, B.; Zhang, X.; Wang, W. Joint Task Partitioning and Parallel Scheduling in Device-Assisted Mobile Edge Networks. IEEE Internet Things J. 2023, 11, 14058–14075. [Google Scholar] [CrossRef]
Dong, F.; Wang, H.; Shen, D.; Huang, Z.; He, Q.; Zhang, J.; Wen, L.; Zhang, T. Multi-Exit DNN Inference Acceleration Based on Multi-Dimensional Optimization for Edge Intelligence. IEEE Trans. Mob. Comput. 2023, 22, 5389–5405. [Google Scholar] [CrossRef]
Li, H.; Li, X.; Fan, Q.; He, Q.; Wang, X.; Leung, V.C.M. Distributed DNN Inference with Fine-Grained Model Partitioning in Mobile Edge Computing Networks. IEEE Trans. Mob. Comput. 2024, 23, 9060–9074. [Google Scholar] [CrossRef]
Zhang, J.; Ma, S.; Yan, Z.; Huang, J. Joint DNN partitioning and task offloading in mobile edge computing via deep reinforcement learning. J. Cloud Comput. 2023, 12, 116. [Google Scholar] [CrossRef]
Su, Y.; Fan, W.; Gao, L.; Qiao, L.; Liu, Y.; Wu, F. Joint DNN Partition and Resource Allocation Optimization for Energy-Constrained Hierarchical Edge-Cloud Systems. IEEE Trans. Veh. Technol. 2023, 72, 3930–3944. [Google Scholar] [CrossRef]
Yuan, S.; Zhang, Z.; Li, Q.; Li, W.; Zhang, Y. Joint Optimization of DNN Partition and Continuous Task Scheduling for Digital Twin-Aided MEC Network with Deep Reinforcement Learning. IEEE Access 2023, 11, 27099–27110. [Google Scholar] [CrossRef]
Zhang, S.; Li, Y.; Liu, X.; Guo, S.; Wang, W.; Wang, J.; Ding, B.; Wu, D. Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2020, 4, 69. [Google Scholar] [CrossRef]
Liu, Y.; Wang, J.J.; Wang, H.Z.; Liu, S.; Wu, Y.; Hu, S.; Yu, Q.; Liu, Z.; Chen, T.; Yin, Y.; et al. Braille recognition by E-skin system based on binary memristive neural network. Sci. Rep. 2023, 13, 5437. [Google Scholar] [CrossRef]
Zeng, S.; Dai, G.; Zhang, N.; Yang, X.; Zhang, H.; Zhu, Z.; Yang, H.; Wang, Y. Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective. IEEE Trans. Comput. 2023, 72, 1314–1328. [Google Scholar] [CrossRef]
Mei, L.; Houshmand, P.; Jain, V.; Giraldo, S.; Verhelst, M. ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Trans. Comput. 2021, 70, 1160–1174. [Google Scholar] [CrossRef]
Heidari, S.; Ghasemi, M.; Kim, Y.G.; Wu, C.J.; Vrudhula, S. CAMDNN: Content-Aware Mapping of a Network of Deep Neural Networks on Edge MPSoCs. IEEE Trans. Comput. 2022, 71, 3191–3202. [Google Scholar] [CrossRef]
Ren, W.; Qu, Y.; Dong, C.; Jing, Y.; Wu, Q.; Guo, S. A Survey on Collaborative DNN Inference for Edge Intelligence. Mach. Intell. Res. 2023, 20, 370–395. [Google Scholar] [CrossRef]
Khaledian, N.; Voelp, M.; Azizi, S.; Shirvani, M.H. AI-based & heuristic workflow scheduling in cloud and fog computing: A systematic review. Clust. Comput. 2024, 27, 10265–10298. [Google Scholar] [CrossRef]
Park, J.; Sung, H. XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory. IEEE Comput. Archit. Lett. 2023, 22, 61–64. [Google Scholar] [CrossRef]
Akbari, M.; Rashidi, H. A multi-objectives scheduling algorithm based on cuckoo optimization for task allocation problem at compile time in heterogeneous systems. Expert Syst. Appl. 2016, 60, 234–248. [Google Scholar] [CrossRef]
Zhang, Z.; Kouzani, A. Implementation of DNNs on IoT devices. Neural Comput. Appl. 2019, 32, 1327–1356. [Google Scholar] [CrossRef]
Al-Maytami, B.A.; Fan, P.; Hussain, A.; Baker, T.; Liatsis, P. A Task Scheduling Algorithm with Improved Makespan Based on Prediction of Tasks Computation Time Algorithm for Cloud Computing. IEEE Access 2019, 7, 160916–160926. [Google Scholar] [CrossRef]
Chen, H.; Zhang, Z.; Chen, P.; Luo, X.; Li, S.; Liu, W. MARCO: A High-performance Task Mapping and Routing Co-optimization Framework for Point-to-Point NoC-based Heterogeneous Computing Systems. ACM Trans. Embed. Comput. Syst. 2021, 20, 54. [Google Scholar] [CrossRef]
Houssein, E.H.; Gad, A.G.; Wazery, Y.; Suganthan, P. Task Scheduling in Cloud Computing based on Meta-heuristics: Review, Taxonomy, Open Challenges, and Future Trends. Swarm Evol. Comput. 2021, 62, 100841. [Google Scholar] [CrossRef]
Kojima, T.; Ohwada, A.; Amano, H. Mapping-Aware Kernel Partitioning Method for CGRAs Assisted by Deep Learning. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 1213–1230. [Google Scholar] [CrossRef]
Wang, Z.; Zhao, W.; Pu, Y.; Chen, L.; Thong, W.W.; Sheng, W.; Ho, T.Y.; Yu, B. ParSGCN: Bridging the Gap Between Emulation Partitioning and Scheduling. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025, 44, 1180–1192. [Google Scholar] [CrossRef]
Li, J.; Zhang, X.; Wei, J.; Ji, Z.; Wei, Z. GARLSched: Generative adversarial deep reinforcement learning task scheduling optimization for large-scale high performance computing systems. Future Gener. Comput. Syst. 2022, 135, 259–269. [Google Scholar] [CrossRef]
Zhang, B.; Zeng, H.; Prasanna, V. GraphAGILE: An FPGA-Based Overlay Accelerator for Low-Latency GNN Inference. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 2580–2597. [Google Scholar] [CrossRef]
Sharif, H.; Srivastava, P.; Huzaifa, M.; Kotsifakou, M.; Joshi, K.; Sarita, Y.; Zhao, N.; Adve, V.S.; Misailovic, S.; Adve, S. ApproxHPVM: A portable compiler IR for accuracy-aware optimizations. Proc. ACM Program. Lang. 2019, 3, 186. [Google Scholar] [CrossRef]
Hamdi, M.A.; Daghero, F.; Sarda, G.M.; Van Delm, J.; Symons, A.; Benini, L.; Verhelst, M.; Pagliari, D.J.; Burrello, A. MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices. arXiv 2024, arXiv:2410.08855. [Google Scholar] [CrossRef]
Anderson, L.; Adams, A.; Ma, K.; Li, T.M.; Jin, T.; Ragan-Kelley, J. Efficient automatic scheduling of imaging and vision pipelines for the GPU. Proc. ACM Program. Lang. 2020, 5, 109. [Google Scholar] [CrossRef]
Lin, C.; Chen, Z.; Zhang, Z.; Liu, J. TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU. IEEE Trans. Parallel Distrib. Syst. 2025, 36, 266–281. [Google Scholar] [CrossRef]
Xiao, Y.; Nazarian, S.; Bogdan, P. Self-Optimizing and Self-Programming Computing Systems: A Combined Compiler, Complex Networks, and Machine Learning Approach. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1416–1427. [Google Scholar] [CrossRef]
Schmitz, A.; Burak, S.; Miller, J.; Müller, M.S. Parallel Pattern Compiler for Automatic Global Optimizations. Parallel Comput. 2024, 122, 103112. [Google Scholar] [CrossRef]
Ma, Z.; Jin, Y.; Tang, S.; Wang, H.; Xue, W.; Zhai, J.D.; Zheng, W.M. Unified Programming Models for Heterogeneous High-Performance Computers. J. Comput. Sci. Technol. 2023, 38, 211–218. [Google Scholar] [CrossRef]
De Andrade, H.S.; Schroeder, J.; Crnkovic, I. Software Deployment on Heterogeneous Platforms: A Systematic Mapping Study. IEEE Trans. Softw. Eng. 2019, 47, 1683–1707. [Google Scholar] [CrossRef]
Liu, S.; Guo, B.; Fang, C.; Wang, Z.; Luo, S.; Zhou, Z.; Yu, Z. Enabling Resource-Efficient AIoT System with Cross-Level Optimization: A Survey. IEEE Commun. Surv. Tutor. 2023, 26, 389–427. [Google Scholar] [CrossRef]
Kojima, T.; Doan, N.; Amano, H. GenMap: A Genetic Algorithmic Approach for Optimizing Spatial Mapping of Coarse-Grained Reconfigurable Architectures. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2383–2396. [Google Scholar] [CrossRef]
Pudi, D.; Malviya, S.; Boppu, S.; Yang, Y.; Hemani, A.; Cenkeramaddi, L.R. Integer Linear Programming-Based Simultaneous Scheduling and Binding for SiLago Framework. IEEE Access 2024, 12, 124081–124094. [Google Scholar] [CrossRef]
Tang, Z.; Jia, W.; Zhou, X.; Yang, W.; You, Y. Representation and Reinforcement Learning for Task Scheduling in Edge Computing. IEEE Trans. Big Data 2020, 8, 795–808. [Google Scholar] [CrossRef]
Wang, C.; Yu, X.; Xu, L.; Wang, W. Energy-Efficient Task Scheduling Based on Traffic Mapping in Heterogeneous Mobile-Edge Computing: A Green IoT Perspective. IEEE Trans. Green Commun. Netw. 2023, 7, 972–982. [Google Scholar] [CrossRef]
Krishnakumar, A.; Arda, S.E.; Goksoy, A.; Mandal, S.K.; Ogras, U.; Sartor, A.L.; Marculescu, R. Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 4064–4077. [Google Scholar] [CrossRef]
Sulaiman, M.; Halim, Z.; Lebbah, M.; Waqas, M.; Tu, S. An Evolutionary Computing-Based Efficient Hybrid Task Scheduling Approach for Heterogeneous Computing Environment. J. Grid Comput. 2021, 19, 11. [Google Scholar] [CrossRef]
Mu, S.; Zeng, Y.; Wang, B. Routability-Enhanced Scheduling for Application Mapping on CGRAs. IEEE Access 2021, 9, 92358–92366. [Google Scholar] [CrossRef]
Tirelli, C.; Sapriza, J.; Álvarez, R.R.; Ferretti, L.; Denkinger, B.; Ansaloni, G.; Calero, J.A.M.; Atienza, D.; Pozzi, L. SAT-Based Exact Modulo Scheduling Mapping for Resource-Constrained CGRAs. ACM J. Emerg. Technol. Comput. Syst. 2024, 20, 8. [Google Scholar] [CrossRef]
Liu, W.; Gu, Z.; Xu, J.; Wu, X.; Ye, Y. Satisfiability Modulo Graph Theory for Task Mapping and Scheduling on Multiprocessor Systems. IEEE Trans. Parallel Distrib. Syst. 2011, 22, 1382–1389. [Google Scholar] [CrossRef]
Abd Elaziz, M.; Attiya, I. An improved Henry gas solubility optimization algorithm for task scheduling in cloud computing. Artif. Intell. Rev. 2020, 54, 3599–3637. [Google Scholar] [CrossRef]
Dave, S.; Balasubramanian, M.; Shrivastava, A. RAMP: Resource-Aware Mapping for CGRAs. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC); Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–6. [Google Scholar] [CrossRef]
Balasubramanian, M.; Shrivastava, A. CRIMSON: Compute-Intensive Loop Acceleration by Randomized Iterative Modulo Scheduling and Optimized Mapping on CGRAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 3300–3310. [Google Scholar] [CrossRef]
Hamzeh, M.; Shrivastava, A.; Vrudhula, S. REGIMap: Register-aware application mapping on Coarse-Grained Reconfigurable Architectures (CGRAs). In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC); Association for Computing Machinery: New York, NY, USA, 2013; pp. 1–10. [Google Scholar] [CrossRef]
Li, B.; Doppa, J.; Pande, P.; Chakrabarty, K.; Qiu, J.X.; Li, H. 3D-ReG. ACM J. Emerg. Technol. Comput. Syst. 2020, 16, 20.
Gu, P.; Xie, X.; Li, S.; Niu, D.; Zheng, H.; Malladi, K.T.; Xie, Y. DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 40, 1586–1599.
Dash, S.; Luo, Y.; Lu, A.; Yu, S.; Mukhopadhyay, S. Robust Processing-In-Memory with Multibit ReRAM Using Hessian-Driven Mixed-Precision Computation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 1006–1019.
Roy, S.; Ali, M.; Raghunathan, A. PIM-DRAM: Accelerating Machine Learning Workloads Using Processing in Commodity DRAM. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 701–710.
Sun, H.; Shen, J.; Zhang, T.; Tang, Z.; Zhang, C.; Li, Y.; Shi, Y.; Liu, H. FAMS: A FrAmework of Memory-Centric Mapping for DNNs on Systolic Array Accelerators. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2025, 33, 976–989.
Li, B.; Qu, S.; Wang, Y. An Automated Quantization Framework for High-Utilization RRAM-Based PIM. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 583–596.
Qu, S.; Li, B.; Wang, Y.; Xu, D.; Zhao, X.; Zhang, L. RaQu: An automatic high-utilization CNN quantization and mapping framework for general-purpose RRAM Accelerator. In 2020 57th ACM/IEEE Design Automation Conference (DAC); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
Zhang, Y.; Jia, Z.; Du, H.; Xue, R.; Shen, Z.; Shao, Z. A Practical Highly Paralleled ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 922–935.
Zhang, Y.; Wang, X.; Jiang, X.; Yang, Y.; Shen, Z.; Jia, Z. PQ-PIM: A pruning-quantization joint optimization framework for ReRAM-based processing-in-memory DNN accelerator. J. Syst. Archit. 2022, 127, 102531.
Sun, H.; Zhu, Z.; Wang, C.; Ning, X.; Dai, G.; Yang, H.; Wang, Y. Gibbon: An Efficient Co-Exploration Framework of NN Model and Processing-In-Memory Architecture. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4075–4089.
Wu, X.; Hanson, E.; Wang, N.; Zheng, Q.; Yang, X.; Yang, H.; Li, S.; Cheng, F.; Pande, P.; Doppa, J.; et al. Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-Based DNN Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 4558–4571.
Gao, X.; Wang, H.; Chen, Y.; Zhang, Y.; Shen, Z.; Ju, L. Static Scheduling of Weight Programming for DNN Acceleration with Resource Constrained PIM. ACM Trans. Embed. Comput. Syst. 2023, 23, 89.
Zhang, J.; Wang, X.; Ye, Y.; Lyu, D.; Xiong, G.; Xu, N.; Lian, Y.; He, G. M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 1864–1877.
Dorostkar, A.; Farbeh, H.; Zarandi, H.R. An Empirical Fault Vulnerability Exploration of ReRAM-Based Process-in-Memory CNN Accelerators. IEEE Trans. Reliab. 2025, 74, 2290–2304.
Wang, J.; Du, H.; Ding, B.; Xu, Q.; Chen, S.; Kang, Y. DDAM: Data Distribution-Aware Mapping of CNNs on Processing-In-Memory Systems. ACM Trans. Des. Autom. Electron. Syst. 2022, 28, 36.
Li, C.; Zhou, Z.; Wang, Y.; Yang, F.; Cao, T.; Yang, M.; Liang, Y.; Sun, G. PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; Association for Computing Machinery: New York, NY, USA, 2024; Volume 2, pp. 879–896.
Rhe, J.; Jeon, K.E.; Lee, J.; Jeong, S.; Ko, J.H. KERNTROL: Kernel Shape Control Toward Ultimate Memory Utilization for In-Memory Convolutional Weight Mapping. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 6138–6151.
Lee, Y.S.; Han, T. Task Parallelism-Aware Deep Neural Network Scheduling on Multiple Hybrid Memory Cube-Based Processing-in-Memory. IEEE Access 2021, 9, 68561–68572.
Giannoula, C.; Yang, P.; Fernandez, I.; Yang, J.; Durvasula, S.; Li, Y.X.; Sadrosadati, M.; Luna, J.G.; Mutlu, O.; Pekhimenko, G. PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proc. ACM Meas. Anal. Comput. Syst. 2024, 8, 43.
Liu, F.; Zhao, W.; Wang, Z.; Chen, Y.; Liang, X.; Jiang, L. ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator with Fine-Grained Bit-Level Sparsity. IEEE Trans. Comput. 2024, 73, 2320–2334.
Dhilleswararao, P.; Boppu, S.; Manikandan, M.; Cenkeramaddi, L.R. Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 2022, 10, 131788–131828.
Han, L.; Pan, R.; Zhou, Z.; Lu, H.; Chen, Y.; Yang, H.; Huang, P.; Sun, G.; Liu, X.; Kang, J. CoMN: Algorithm-Hardware Co-Design Platform for Nonvolatile Memory-Based Convolutional Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 2043–2056.
Krishnan, G.; Mandal, S.K.; Pannala, M.; Chakrabarti, C.; Seo, J.-S.; Ogras, Ü.Y.; Cao, Y. SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks. ACM Trans. Embed. Comput. Syst. 2021, 20, 68.
Rakka, M.; Fouda, M.; Khargonekar, P.P.; Kurdahi, F.J. A Review of State-of-the-art Mixed-Precision Neural Network Frameworks. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7793–7812.
Liu, F.; Li, H.; Hu, W.; He, Y. Review of neural network model acceleration techniques based on FPGA platforms. Neurocomputing 2024, 610, 128511.
Prasad, N.S.; Sundar, S. Comprehensive Review on the Exploitation of Advanced Memory Optimization Strategies to Improve Performance for Convolutional and Spiking Neural Networks in Medical Imaging Using Hardware Accelerators. IEEE Access 2025, 13, 62449–62461.
Woźniak, S.; Pantazi, A.; Bohnstingl, T.; Eleftheriou, E. Deep learning incorporating biologically inspired neural dynamics and in-memory computing. Nat. Mach. Intell. 2020, 2, 325–336.
Xing, Y.; Liang, S.; Sui, L.; Jia, X.; Qiu, J.; Liu, X.; Wang, Y.; Wang, Y.; Shan, Y. DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-Based CNN Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 39, 2668–2681.
Yin, S.; Sun, X.; Yu, S.; Seo, J.-S. High-Throughput In-Memory Computing for Binary Deep Neural Networks with Monolithically Integrated RRAM and 90-nm CMOS. IEEE Trans. Electron Devices 2020, 67, 4185–4192.
Guo, K.; Han, S.; Yao, S.; Wang, Y.; Xie, Y.; Yang, H. Software-Hardware Codesign for Efficient Neural Network Acceleration. IEEE Micro 2017, 37, 18–25.
Yu, R.; Wang, Z.; Liu, Q.; Gao, B.; Hao, Z.; Guo, T.; Ding, S.; Zhang, J.; Qin, Q.; Wu, D.; et al. A full-stack memristor-based computation-in-memory system with software-hardware co-development. Nat. Commun. 2025, 16, 2123.
Mackin, C.; Rasch, M.; Chen, A.; Timcheck, J.; Bruce, R.L.; Li, N.; Narayanan, P.; Ambrogio, S.; Gallo, M.L.; Nandakumar, S.; et al. Optimised weight programming for analogue memory-based deep neural networks. Nat. Commun. 2022, 13, 3765.
Antolini, A.; Paolino, C.; Zavalloni, F.; Lico, A.; Scarselli, E.F.; Mangia, M.; Pareschi, F.; Setti, G.; Rovatti, R.; Torres, M.L.; et al. Combined HW/SW Drift and Variability Mitigation for PCM-Based Analog In-Memory Computing for Neural Network Applications. IEEE J. Emerg. Sel. Top. Circuits Syst. 2023, 13, 395–407.
Sze, V.; Chen, Y.-H.; Yang, T.-J.; Emer, J. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329.
Garofalo, A.; Ottavi, G.; Conti, F.; Karunaratne, G.; Boybat, I.; Benini, L.; Rossi, D. A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 422–435.
Oliveira, G.F.; Gómez-Luna, J.; Ghose, S.; Boroumand, A.; Mutlu, O. Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud. IEEE Micro 2022, 42, 25–38.
Zheng, Q.; Li, X.; Guan, Y.; Wang, Z.; Cai, Y.; Chen, Y.; Sun, G.; Huang, R. PIMulator-NN: An Event-Driven, Cross-Level Simulation Framework for Processing-In-Memory-Based Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5464–5475.
Jeon, W.; Lee, J.; Kang, D.; Kal, H.; Ro, W. PIMCaffe: Functional Evaluation of a Machine Learning Framework for In-Memory Neural Processing Unit. IEEE Access 2021, 9, 96629–96640.
Velasco-Montero, D.; Fernández-Berni, J.; Carmona-Galán, R.; Rodríguez-Vázquez, Á. Optimum Selection of DNN Model and Framework for Edge Inference. IEEE Access 2018, 6, 51680–51692.
Wess, M.; Schnöll, D.; Dallinger, D.; Bittner, M.; Jantsch, A. Conformal Prediction Based Confidence for Latency Estimation of DNN Accelerators: A Black-Box Approach. IEEE Access 2024, 12, 109847–109860.
Krishnan, G.; Sun, J.; Hazra, J.; Du, X.; Liehr, M.; Li, Z.; Beckmann, K.; Joshi, R.; Cady, N.; Fan, D.; et al. Exploring Model Stability of Deep Neural Networks for Reliable RRAM-Based In-Memory Acceleration. IEEE Trans. Comput. 2022, 71, 2740–2752.
Moitra, A.; Bhattacharjee, A.; Kuang, R.; Krishnan, G.; Cao, Y.; Panda, P. SpikeSim: An End-to-End Compute-in-Memory Hardware Evaluation Tool for Benchmarking Spiking Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 42, 3815–3828.
Geirhos, R.; Jacobsen, J.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673.
Lones, M. Avoiding common machine learning pitfalls. Patterns 2021, 5, 101046.
Maleki, F.; Ovens, K.; Gupta, R.; Reinhold, C.; Spatz, A.; Forghani, R. Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls. Radiol. Artif. Intell. 2022, 5, e220028.
Yan, Z.; Hu, X.; Shi, Y. Compute-in-Memory-Based Neural Network Accelerators for Safety-Critical Systems: Worst-Case Scenarios and Protections. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 2452–2464.
Shapira, G.; Chen, Y. Common Pitfalls of Benchmarking Big Data Systems. IEEE Trans. Serv. Comput. 2016, 9, 152–160.
Smagulova, K.; Fouda, M.; Kurdahi, F.; Salama, K.; Eltawil, A. Resistive Neural Hardware Accelerators. Proc. IEEE 2021, 111, 500–527.
Kriegel, H.; Schubert, E.; Zimek, A. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst. 2016, 52, 341–378.
Meng, J.; Yang, L.; Peng, X.; Yu, S.; Fan, D.; Seo, J.-S. Structured Pruning of RRAM Crossbars for Efficient In-Memory Computing Acceleration of Deep Neural Networks. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 1576–1580.
Camus, V.; Mei, L.; Enz, C.; Verhelst, M. Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 697–711.
Zhou, Y.; Dong, H.; Saddik, A.E. Deep Learning in Next-Frame Prediction: A Benchmark Review. IEEE Access 2020, 8, 69273–69283.
Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep Learning Approach for Intelligent Intrusion Detection System. IEEE Access 2019, 7, 41525–41550.
Lee, K.; Eo, M.; Jung, E.; Yoon, Y.; Rhee, W. Short-Term Traffic Prediction with Deep Neural Networks: A Survey. IEEE Access 2020, 9, 54739–54756.
Cui, X.; Zheng, S.; Jia, T.; Ye, L.; Liang, Y. ARES: A Mapping Framework of DNNs Towards Diverse PIMs with General Abstractions. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD); IEEE: Piscataway, NJ, USA, 2023; pp. 1–9.
Lyu, B.; Wang, S.; Wen, S.; Shi, K.; Yang, Y.; Zeng, L.; Huang, T. AutoGMap: Learning to Map Large-Scale Sparse Graphs on Memristive Crossbars. IEEE Trans. Neural Netw. Learn. Syst. 2021, 35, 12888–12898.
Chen, X. Instruction Set Architecture (ISA) for Processing-in-Memory DNN Accelerators. arXiv 2023, arXiv:2308.06449.
Negi, S.; Chakraborty, I.; Ankit, A.; Roy, K. NAX: Neural architecture and memristive xbar based accelerator co-design. In Proceedings of the 59th ACM/IEEE Design Automation Conference; Association for Computing Machinery: New York, NY, USA, 2022; pp. 451–456.
Cao, W.; Zhao, Y.; Boloor, A.; Han, Y.; Zhang, X.; Jiang, L. Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals. IEEE Trans. Comput. 2022, 71, 2142–2155.
Ghosh, S.K.; Raha, A.; Raghunathan, V. Energy-Efficient Approximate Edge Inference Systems. ACM Trans. Embed. Comput. Syst. 2023, 22, 77.
Noh, S.H.; Lee, S.; Shin, B.; Park, S.; Jang, Y.; Kung, J. All-rounder: A flexible DNN accelerator with diverse data format support. arXiv 2023, arXiv:2310.16757.
Zhang, X.; Ye, H.; Wang, J.; Lin, Y.; Xiong, J.; Hwu, W.-m.; Chen, D. DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator. In Proceedings of the 2020 IEEE/ACM International Conference on Computer Aided Design (ICCAD); Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–9.
Wang, Z.; Sun, G.; Zhu, J.; Zhou, Z.; Guo, Y.; Yuan, Z. METRO: A Software-Hardware Co-Design of Interconnections for Spatial DNN Accelerators. arXiv 2021, arXiv:2108.10570.
Liu, M.; Yin, M.; Han, K.; Demara, R.; Yuan, B.; Bai, Y. Algorithm and hardware co-design co-optimization framework for LSTM accelerator using quantized fully decomposed tensor train. Internet Things 2023, 22, 100680.
Krishnan, G.; Mandal, S.K.; Chakrabarti, C.; Seo, J.-S.; Ogras, U.; Cao, Y. Interconnect-Aware Area and Energy Optimization for In-Memory Acceleration of DNNs. IEEE Des. Test 2020, 37, 79–87.
Xu, Z.; Yang, D.; Yin, C.; Tang, J.; Wang, Y.; Xue, G. A Co-Scheduling Framework for DNN Models on Mobile and Edge Devices with Heterogeneous Hardware. IEEE Trans. Mob. Comput. 2021, 22, 1275–1288.
Lee, E.; Han, T.; Seo, D.H.; Shin, G.; Kim, J.; Kim, S.; Jeong, S.; Rhe, J.; Park, J.; Ko, J.; et al. A Charge-Domain Scalable-Weight In-Memory Computing Macro with Dual-SRAM Architecture for Precision-Scalable DNN Accelerators. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 3305–3316.
Houshmand, P.; Sarda, G.M.; Jain, V.; Ueyoshi, K.; Papistas, I.A.; Shi, M.; Zheng, Q.; Bhattacharjee, D.; Mallik, A.; Debacker, P.; et al. DIANA: An End-to-End Hybrid DIgital and ANAlog Neural Network SoC for the Edge. IEEE J. Solid-State Circuits 2023, 58, 203–215.
Jia, H.; Ozatay, M.; Tang, Y.; Valavi, H.; Pathak, R.; Lee, J.; Verma, N. Scalable and Programmable Neural Network Inference Accelerator Based on In-Memory Computing. IEEE J. Solid-State Circuits 2021, 57, 198–211.
Rasch, M.; Mackin, C.; Gallo, M.L.; Chen, A.; Fasoli, A.; Odermatt, F.; Li, N.; Nandakumar, S.; Narayanan, P.; Tsai, H.; et al. Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators. Nat. Commun. 2023, 14, 5282.
Keller, B.; Venkatesan, R.; Dai, S.; Tell, S.; Zimmer, B.; Sakr, C.; Dally, W.; Gray, C.T.; Khailany, B. A 95.6-TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization in 5 nm. IEEE J. Solid-State Circuits 2023, 58, 1129–1141.
Zimmer, B.; Venkatesan, R.; Shao, Y.; Clemons, J.; Fojtik, M.R.; Jiang, N.; Keller, B.; Klinefelter, A.; Pinckney, N.; Raina, P.; et al. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator with Ground-Referenced Signaling in 16 nm. IEEE J. Solid-State Circuits 2020, 55, 920–932.
Long, Y.; Na, T.; Mukhopadhyay, S. ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 2781–2794.
Liu, Z.; Dou, Y.; Jiang, J.; Xu, J.; Li, S.; Zhou, Y.; Xu, Y. Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks. ACM Trans. Reconfig. Technol. Syst. 2017, 10, 17.
Nie, C.; Tang, C.; Lin, J.; Hu, H.; Lv, C.; Cao, T.; Zhang, W.; Jiang, L.; Liang, X.; Qian, W.; et al. VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations. IEEE Trans. Comput. 2024, 73, 2378–2390.
Krishnan, G.; Mandal, S.K.; Chakrabarti, C.; Seo, J.-S.; Ogras, U.; Cao, Y. Impact of On-chip Interconnect on In-memory Acceleration of Deep Neural Networks. ACM J. Emerg. Technol. Comput. Syst. 2021, 18, 34.
Wei, Y.; Wang, Z.; Wang, Z.; Dai, Y.; Ou, G.; Gao, H.; Yang, H.; Wang, Y.; Cao, C.C.; Weng, L.; et al. Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models. IEEE Trans. Vis. Comput. Graph. 2023, 30, 3915–3929.
Wang, Y.; Chen, W.; Yang, J.; Li, T. Exploiting Parallelism for CNN Applications on 3D Stacked Processing-In-Memory Architecture. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 589–600.
Angizi, S.; He, Z.; Awad, A.; Fan, D. MRIMA: An MRAM-Based In-Memory Accelerator. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 1123–1136.
Soliman, T.; Laleni, N.; Kirchner, T.; Müller, F.; Shrivastava, A.; Kämpfe, T.; Guntoro, A.; Wehn, N. FELIX: A Ferroelectric FET Based Low Power Mixed-Signal In-Memory Architecture for DNN Acceleration. ACM Trans. Embed. Comput. Syst. 2022, 21, 84.
Xiao, T.; Bennett, C.; Feinberg, B.; Agarwal, S.; Marinella, M. Analog architectures for neural network acceleration based on non-volatile memory. Appl. Phys. Rev. 2020, 7, 031301.
Kraidia, I.; Ghenai, A.; Belhaouari, S. Defense against adversarial attacks: Robust and efficient compressed optimized neural networks. Sci. Rep. 2024, 14, 6420.
Ogbogu, C.; Arka, A.I.; Pfromm, L.; Joardar, B.K.; Doppa, J.; Chakrabarty, K.; Pande, P. Accelerating Graph Neural Network Training on ReRAM-Based PIM Architectures via Graph and Model Pruning. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 2703–2716.
Ogbogu, C.; Joardar, B.K.; Chakrabarty, K.; Doppa, J.; Pande, P. Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators. ACM Trans. Des. Autom. Electron. Syst. 2024, 29, 72.
Hamdia, K.M.; Zhuang, X.; Rabczuk, T. An efficient optimization approach for designing machine learning models based on genetic algorithm. Neural Comput. Appl. 2020, 33, 1923–1933.
Guo, F.; Han, D.; Kim, N. Multi-Objectives Optimization of Plastic Injection Molding Process Parameters Based on Numerical DNN-GA-MCS Strategy. Polymers 2024, 16, 2247.
Goerigk, M.; Kurtz, J. Data-driven robust optimization using deep neural networks. Comput. Oper. Res. 2022, 151, 106087.
Katz, J.; Pappas, I.; Avraamidou, S.; Pistikopoulos, E. Integrating deep learning models and multiparametric programming. Comput. Chem. Eng. 2020, 136, 106801.
Lin, H.; Yan, M.; Ye, X.; Fan, D.; Pan, S.; Chen, W.; Xie, Y. A Comprehensive Survey on Distributed Training of Graph Neural Networks. Proc. IEEE 2022, 111, 1572–1606.
Besta, M.; Hoefler, T. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 2584–2606.
Zhang, S.; Yi, X.; Diao, L.; Wu, C.; Wang, S.; Lin, W. Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 1281–1293.
Yang, T.; Li, D.; Ma, F.; Song, Z.; Zhao, Y.; Zhang, J.; Liu, F.; Jiang, L. PASGCN: An ReRAM-Based PIM Design for GCN with Adaptively Sparsified Graphs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 150–163.
Raman, S.R.S.; John, L.; Kulkarni, J.P. NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks. ACM Trans. Archit. Code Optim. 2024, 21, 39.
Ghasemi, S.A.; Jahannia, B.; Farbeh, H. GraphA: An efficient ReRAM-based architecture to accelerate large scale graph processing. J. Syst. Archit. 2022, 133, 102755.
Dai, G.; Huang, T.; Chi, Y.; Zhao, J.; Sun, G.; Liu, Y.; Wang, Y.; Xie, Y.; Yang, H. GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 640–653.
Wei, Y.; Wang, X.; Zhang, S.; Yang, J.; Jia, X.; Wang, Z.; Qu, G.; Zhao, W. IMGA: Efficient In-Memory Graph Convolution Network Aggregation with Data Flow Optimizations. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4695–4705.
Wang, T.; Zheng, X.; Zhang, L.; Cui, Z.; Xu, C. A graph-based interpretability method for deep neural networks. Neurocomputing 2023, 555, 126651.
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2019, 32, 4–24.
Jin, H.; Chen, D.; Zheng, L.; Huang, Y.; Yao, P.; Zhao, J.; Liao, X.; Jiang, W. Accelerating Graph Convolutional Networks Through a PIM-Accelerated Approach. IEEE Trans. Comput. 2023, 72, 2628–2640.
Black, J.E.; Kueper, J.K.; Williamson, T. An introduction to machine learning for classification and prediction. Fam. Pract. 2022.
Hüttenrauch, M.; Neumann, G. Robust Black-Box Optimization for Stochastic Search and Episodic Reinforcement Learning. J. Mach. Learn. Res. 2024, 25, 1–44.
Kim, S.; Li, Z.; Um, S.; Jo, W.; Ha, S.; Lee, J.; Kim, S.; Han, D.; Yoo, H.J. DynaPlasia: An eDRAM In-Memory Computing-Based Reconfigurable Spatial Accelerator with Triple-Mode Cell. IEEE J. Solid-State Circuits 2024, 59, 102–115.
Lin, C.T.; Wang, D.; Zhang, B.; Chen, G.K.; Knag, P.C.; Krishnamurthy, R.; Seok, M. DIMCA: An Area-Efficient Digital In-Memory Computing Macro Featuring Approximate Arithmetic Hardware in 28 nm. IEEE J. Solid-State Circuits 2024, 59, 960–971.

Figure 1. Conceptual overview of the integration of graph-based DNN mapping, machine learning-driven optimization, and benchmarking frameworks in PIM/IMC systems.

Figure 2. Comparison of von Neumann, PIM, and IMC architectures, highlighting data flow and control structures.

Figure 3. Evolution of computing paradigms from the von Neumann architecture toward neuromorphic systems, highlighting increasing energy efficiency and integration trends across decades.

Figure 4. Radar chart comparison of key IMC/PIM memory technologies across five attributes: precision, endurance, energy efficiency, technology maturity, and integration.

Figure 5. End-to-end DNN-to-PIM/IMC deployment pipeline illustrating mapping, hardware execution, and feedback co-design loop.

Figure 7. Progressive roadmap from benchmarking and co-design to learning-based mapping, unified evaluation, and industrial standardization for PIM/IMC systems.

Table 1. Comparative feature matrix: IMC/PIM memory technologies.

Technology	Device Type	Integration Challenge	Typical Application	Notable Example
DRAM	Volatile	Volatility, refresh overhead	AI, HPC, general compute	HBM2-PIM, Smart Memory Cube [51,52]
SRAM	Volatile	Area/cost scaling	Edge AI, embedded, cache	PIMCA, IMC-Sort [41,59,60]
RRAM (ReRAM)	Non-volatile	Device variability, endurance	AI accelerators, neuromorphic	SLIM, 2T2R RRAM [47,48,49]
PCM	Non-volatile	Drift, write energy	Analog IMC, photonic IMC	PCM-AIMC, Photonic PCM [50,61,62]
FeFET	Non-volatile	Endurance, process control	Low-power PIM, logic-in-memory	FeFET-PIM, DG-FeFET [63,64,65]
STT-MRAM	Non-volatile	Write energy, integration	Digital IMC, NVM cache	BCLS-SP, SOT-MRAM [66,67,68]
Photonic Memristor	Non-volatile	Integration with electronics	Photonic IMC, neuromorphic	Photonic PCM [61]
Y-Flash	Non-volatile	Process maturity	ML inference, logic-in-memory	IMPACT [69]
Halide Perovskite	Non-volatile	Stability, scalability	Reconfigurable logic-in-memory	Perovskite IMC [70]
3D Self-Rectifying Memristor	Non-volatile	3D integration, sneak paths	3D IMC, high-density arrays	3D SRM [49,71]

Table 2. Decision taxonomy: Conventional DNN mapping, partitioning, and scheduling methods.

Method Type	When to Use	Core Mechanism	Strength	Key Limitation	Example System/Paper
Rule-Based	Well-known workload, simple hardware	Predefined rules	Predictable, low-cost	Inflexible, not adaptive	[100,102,114]
Heuristic	Scalable, time-constrained, moderate complexity	Greedy/DP/Graph	Fast, scalable	May miss optimum, local optima	[103,115,116,117,118]
Metaheuristic	Large search space, complex/heterogeneous environments	Genetic algorithm (GA)/swarm methods such as particle swarm optimization (PSO)	Finds good solutions in complex space	High computation cost, slow	[104,112,119,120]
Learning-Based	Dynamic, non-stationary, or highly variable workloads	Deep reinforcement learning (DRL)/policy gradient/actor-critic	Adaptive, handles dynamics	Requires training, data-hungry	[121,122,123,124,125,126]
Graph-Based	Irregular DNNs, complex dependencies, DAG topologies	Graph partitioning/min-cut	Captures structure, flexible	High overhead, complex tuning	[103,126,127,128]
Hybrid/Automated	Heterogeneous hardware, multi-objective optimization	Auto-tuning, DSE frameworks	Balances trade-offs, flexible	Complexity, may need profiling	[102,129,130,131]
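
As a rough illustration of the "Heuristic" row in Table 2, the following minimal sketch applies a first-fit-decreasing greedy rule to place layer weight matrices onto PIM tiles. It is not taken from any of the cited frameworks: the `Tile` class, the 128 × 128 crossbar size, the per-tile capacity, and the layer shapes are all hypothetical, chosen only to show why a greedy rule is fast but can miss a better packing.

```python
from dataclasses import dataclass, field
from math import ceil


@dataclass
class Tile:
    """A hypothetical PIM tile holding a fixed number of crossbars."""
    capacity: int
    layers: list = field(default_factory=list)
    used: int = 0


def crossbars_needed(rows, cols, xbar=128):
    """Crossbars required to store a rows x cols weight matrix (assumed xbar x xbar arrays)."""
    return ceil(rows / xbar) * ceil(cols / xbar)


def greedy_map(layers, num_tiles, capacity, xbar=128):
    """First-fit-decreasing: place the largest layers first on the first tile with room."""
    tiles = [Tile(capacity) for _ in range(num_tiles)]
    demand = {name: crossbars_needed(r, c, xbar) for name, (r, c) in layers.items()}
    for name in sorted(demand, key=demand.get, reverse=True):
        for tile in tiles:
            if tile.used + demand[name] <= tile.capacity:
                tile.layers.append(name)
                tile.used += demand[name]
                break
        else:
            # No tile had enough free crossbars; a real flow would split or time-multiplex.
            raise ValueError(f"layer {name!r} does not fit on any tile")
    return tiles


if __name__ == "__main__":
    # Hypothetical unrolled weight-matrix shapes (rows, cols) for three layers.
    example = {"conv1": (27, 64), "conv2": (576, 128), "fc": (512, 10)}
    for i, t in enumerate(greedy_map(example, num_tiles=2, capacity=6)):
        print(i, t.layers, t.used)
```

The rule runs in one pass over the sorted layers, which matches the "fast, scalable" strength listed in the table, while the lack of backtracking reflects the "may miss optimum" limitation.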

Table 3. Chronological landscape of DNN-to-PIM mapping frameworks (2019–2025).

Framework	Year	Best Benchmark	Key Innovation
NNPIM [5]	2019	48.2× speedup, 131.5× energy efficiency vs. GPU	Crossbar memory, weight sharing, parallel in-memory compute
Weight Mapping [79]	2019	2.03× speedup, 1.4× throughput/energy vs. prior mapping	Kernel division, spatial PE assignment, optimal pipeline
3D-ReG [166]	2020	5.64× training speedup, 3.56× energy efficiency vs. GPU + DRAM	Heterogeneous GPU + PIM, 3D integration, task-mapping schemes
DLUX [167]	2020	6.3× speedup, 42× energy efficiency vs. Tesla V100 GPU	In-DRAM LUT, near-bank mapping, loop tiling, layout transposition
ZigZag [130]	2021	Up to 64% more energy-efficient vs. prior DSE frameworks	Uneven mapping, nested-for-loop DSE, mapping search engines
Robust PIM [166,168]	2021	Maintains accuracy under process variation (ResNet/MobileNet)	Hessian-driven mixed-precision, sensitivity-based quantization
PIM-DRAM [169]	2021	Up to 19.5× speedup vs. NVIDIA Titan Xp GPU	DRAM-based PIM primitive, intra-bank accumulation, data mapping
IVQ [33]	2022	19.7–91.7× speedup, 17.7–541× energy vs. ISAAC/CASCADE/ASIC/FPGA/GPU	Varied quantization, spatial mapping, temporal scheduling, pipeline
RaQu [171,172]	2022	29.2–37.4% resource utilization, 1.8–3.3% accuracy vs. prior quantization	AutoML-based quantization, hardware-aware mapping
PattPIM [173]	2022	Significant performance, energy, and resource efficiency	Weight pattern reuse, WPR-aware mapping, PE pipeline
PQ-PIM [174]	2022	1.74× performance, 62% energy saving vs. prior ReRAM	Patch-wise pruning-quantization, mixed OU-based engine
NicePIM [104]	2023	37% latency and 28% energy reduction vs. baseline	PIM-Tuner, PIM-Mapper, data-scheduler, deep kernel learning
Gibbon [175]	2023	9.8–48.2× speedup, 5.96× EDP reduction vs. prior work	Evolutionary search, multilevel joint simulator, co-exploration
BWQ [176]	2023	6.08× speedup, 17.47× energy saving vs. prior ReRAM	Block-wise mixed-precision, precision-aware mapping
Benchmarking IMC [100]	2023	30% minimum execution-time reduction vs. prior mapping	PE-level mapping, hybrid mapping, benchmarking framework
Static Scheduling [177]	2023	Significant latency reduction vs. prior ReRAM mapping	Static scheduling, weight-to-output-unit (OU) mapping, latency model
Fast-OverlaPIM [106]	2024	4.6–18.1× faster mapping vs. prior overlap-based framework	Overlap-driven mapping, analytical overlap analysis, transformation mechanism
ReHarvest [77]	2024	3.5× speedup, 3.1× resource reduction vs. FORMS	ADC-crossbar decoupling, multi-tile mapping, bus-based multicast
M2M [178]	2024	7.18–61.09% latency reduction vs. prior multi-DNN mapping	Fine-grained mapping, temporal/spatial scheduling, QoS for NoP
PIMCOMP [105]	2024	Throughput, latency, energy improved vs. prior PIM	End-to-end compiler, multilevel optimization, weight-layout mapping
HuNT [109]	2024	10× energy, 8× compute efficiency vs. homogeneous PIM	Heterogeneous 3D PIM, neural layer mapping, tier configuration
Empirical Fault Framework [179]	2025	Fault impact characterization, model-specific vulnerabilities	Fault-injection, layer/location-aware mapping, reliability analysis
FAMS [170]	2025	29.7% lower latency, 42.4% higher throughput vs. TENET	Memory-centric mapping, cycle-accurate simulation, hardware constraints

Table 4. Software landscape dashboard: Compilers and mapping tools for DNN-to-PIM/IMC deployment.

Framework/Tool Name	Year	Hardware Type	Supported Models/Workloads	Open Source?	Primary Validation
Fast-OverlaPIM [106]	2024	Generic PIM	CNNs (full networks)	No	Simulation
NicePIM [104]	2023	3D Stacked DRAM-PIM	CNNs (various), ResNet, VGG	No	Simulation
Optimized Weight Mapping [79]	2019	RRAM-PIM	CNNs (ResNet-34), ImageNet	No	Simulation
DDAM [180]	2022	Generic PIM	CNNs (various)	No	Simulation
Gibbon [175]	2023	Memristor PIM	DNNs (various), CIFAR-10	No	Simulation
HuNT [109]	2024	Hybrid (ReRAM, FeFET, PCM, MRAM, SRAM)	DNNs (training/inference)	No	Simulation
IVQ [33]	2022	Crossbar PIM	DNNs (varied quantization)	No	Simulation
PIMCOMP [105]	2024	Generic PIM	CNNs, DNNs (ResNet, VGG, etc.)	Yes (GitHub)	Simulation
MAESTRO [101]	2020	Simulator/Model	CNNs, DNNs (various)	Yes (GitHub)	Simulation
ZigZag [130]	2021	Simulator/Framework	CNNs, DNNs (various)	Yes (GitHub)	Simulation
DNN + NeuroSim V2.0 [39]	2020	SRAM/ReRAM/Analog PIM	VGG-8, CIFAR-10, DNNs	Yes (GitHub)	Simulation
PIM-DRAM [169]	2021	Commodity DRAM-PIM	AlexNet, VGG16, ResNet18	No	Simulation
FAMS [170]	2025	Systolic Array (SRAM/DRAM)	DNNs (various)	No	Simulation
PattPIM [173]	2022	ReRAM-PIM	DNNs (6 models)	No	Simulation
PQ-PIM [174]	2022	ReRAM-PIM	DNNs (various)	No	Simulation
Block-Wise Mixed-Precision Quantization (BWQ) [176]	2023	ReRAM-PIM	DNNs (various)	No	Simulation
SDP [99]	2023	SRAM-PIM (Digital)	Sparse NNs, DNNs	No	Simulation
ReHarvest [77]	2024	ReRAM-PIM	DNNs (various)	No	Simulation
Klotski v2 [113]	2025	Dataflow Accelerator	DNNs (various)	No	Simulation
Spiking DNN Mapping [101]	2015	Neuromorphic/Spike-based	CNN → SNN, CIFAR-10, Neovision2	No	Simulation; hardware mapping

Table 5. Coverage map: Public benchmarks and datasets for DNN-PIM/IMC research.

Benchmark/Framework	Device/Arch/System Coverage	Supported Models	Dataset Type	Main Metric(s)	Major Limitation
DNN + NeuroSim V2.0 [39]	Device, Circuit, Architecture, System	VGG-8, flexible via PyTorch	CIFAR-10	Area, Energy, Throughput, Accuracy	Limited to SRAM/eNVM; focus on on-chip training
SIAM [188]	Device, Circuit, Architecture, System	ResNet-50, wide DNN support	CIFAR-10, CIFAR-100, ImageNet	Energy Efficiency, Throughput	Focused on chiplet-based IMC; calibration needed
Benchmarking DNN Mapping [100]	Architecture, PE-level	Convolutional DNNs	Public DNN benchmarks	Area-Efficiency, Scalability	Emphasis on mapping, not full system
NicePIM [104]	Architecture, System	DNNs (various, mapping-focused)	Not specified	Latency, Energy	3D-DRAM PIM focus; dataset coverage unclear
MNSIM 2.0 [26]	Device, Architecture, System	Large-scale NNs, Digital/Analog PIM	Case studies, fabricated macros	Accuracy, Performance, Modeling Error	Generalized, but not all real-world datasets
Gibbon [175]	Device, Architecture, System	Memristor-based DNNs	Not specified	Accuracy, Energy-Delay Product	Focus on co-exploration, not dataset diversity
Heterogeneous IMC Cluster [200]	System, Architecture	MobileNetV2, TinyML tasks	Real-world IoT tasks	Latency, Energy, Area	System-level, limited model diversity
Accelerating NN Inference with PIM [201]	Device, Architecture, System	UPMEM, Mensa, SIMDRAM (various NNs)	Matrix–vector kernels, Google edge models	Performance, Energy Efficiency	Focused on DRAM-based PIM, not all model types
PIMulator-NN [202]	Circuit, Architecture, System	Several PIM designs, templates	Not specified	Area, Latency, Energy	Lacks standardized dataset integration
PIMCaffe [203]	Device, Architecture, System	Recommendation, AlexNet, ResNet-50	Not specified	Speedup vs. CPU	Prototype neural processing unit (NPU), limited model/dataset coverage
RaQu [171]	Device, Architecture	CNNs (various, quantization focus)	Not specified	Resource Utilization, Accuracy	RRAM-specific, not broad dataset coverage
BWQ [176]	Device, Architecture	DNNs (block-wise quantization)	Not specified	Speedup, Energy Saving	ReRAM focus, limited to quantization studies

Table 6. Red-Flag Matrix: Pitfalls and Solutions in Benchmarking DNN-to-PIM/IMC.

Pitfall	Severity	Concrete Example	Real Impact	Solution/Best Practice	Literature Ref.
Shortcut learning and overfitting to benchmarks	High	DNNs perform well on standard benchmarks but fail in real-world deployment	Inflated performance claims, poor generalization, unreliable hardware evaluation	Use diverse, real-world datasets; test under adversarial/noisy conditions; robust cross-validation	[208,209,210]
Ignoring hardware non-idealities (noise, variation)	High	Benchmarking IMC/PIM without modeling device variation, quantization, or faults	Overestimation of accuracy and efficiency; missed reliability issues in deployment	Incorporate device-level noise, variation, and quantization effects in benchmarks and simulations	[39,78,168,179,206,211]
Apples-to-oranges comparisons	Medium	Comparing PIM/IMC results with different DNN models, batch sizes, or metrics	Misleading performance/efficiency claims; unfair hardware comparisons	Standardize benchmarks, report all parameters, use common datasets and metrics	[199,212,213,214]
Unrealistic or unrepresentative workloads	Medium	Using synthetic or toy datasets not reflective of target applications	Results do not translate to real-world performance; misguides design choices	Benchmark with application-relevant, large-scale, and diverse workloads	[9,104,199,212]
Poor reproducibility and lack of transparency	Low	Missing code, incomplete reporting of hardware/software stack	Results cannot be verified or built upon; slows progress and trust in the field	Open-source code, detailed reporting of hardware/software, use public benchmarks	[9,199,209,214]
Not testing at scale or under realistic conditions	Low	Evaluating on small models or idealized hardware, ignoring edge cases	Overlooks bottlenecks, scalability issues, or failure modes	Test at production scale, include stress tests, adversarial and worst-case scenarios	[188,208,211,212]

Table 7. Unified metric reference card [9,58,185,204,205].

Metric	Definition	Units	Notes
TOPS/W	Tera-operations per second per watt	TOPS/W	Higher is better
pJ/op	Energy per operation (MAC)	pJ/op	Lower is better
Throughput	Operations/inferences per second	TOPS, GOPS, FPS	Essential for real-time tasks
Throughput per Area	Throughput normalized by silicon area	TOPS/mm²	Indicates computational density
Energy per Inference	Energy consumed per full inference	μJ/inference	End-to-end efficiency measure
Energy-Delay Product (EDP)	Energy–time trade-off metric	pJ·s, nJ·s	Critical for balancing performance
Latency	Time per inference/operation	ns, μs, ms	Lower is better
Accuracy	Model accuracy under constraints	%	Must accompany efficiency
Utilization	Hardware resource efficiency	%	Higher indicates efficient mapping
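
The metrics in Table 7 are linked by simple unit arithmetic, which the minimal Python sketch below illustrates. The function names and the example workload size (1e9 operations per inference) are hypothetical and are not taken from any surveyed platform. The conversion also assumes that TOPS/W and pJ/op count the same unit of "operation"; a MAC is often counted as two operations, which would change the result by a factor of two.

```python
def tops_per_watt_to_pj_per_op(tops_per_watt: float) -> float:
    """Convert energy efficiency (TOPS/W) to energy per operation (pJ/op).

    1 TOPS/W = 1e12 operations per joule, i.e. 1e-12 J (1 pJ) per operation,
    so the value in pJ/op is the reciprocal of the value in TOPS/W.
    """
    if tops_per_watt <= 0:
        raise ValueError("TOPS/W must be positive")
    return 1.0 / tops_per_watt


def energy_per_inference_uj(ops_per_inference: float, tops_per_watt: float) -> float:
    """Energy per full inference in microjoules, assuming constant efficiency."""
    if tops_per_watt <= 0:
        raise ValueError("TOPS/W must be positive")
    joules = ops_per_inference / (tops_per_watt * 1e12)  # ops / (ops per joule)
    return joules * 1e6  # J -> uJ


def energy_delay_product_js(energy_j: float, latency_s: float) -> float:
    """Energy-delay product in joule-seconds (lower is better)."""
    return energy_j * latency_s


def throughput_per_area(tops: float, area_mm2: float) -> float:
    """Throughput normalized by silicon area, in TOPS/mm^2."""
    if area_mm2 <= 0:
        raise ValueError("area must be positive")
    return tops / area_mm2


if __name__ == "__main__":
    # Hypothetical workload: 1e9 operations per inference at 10 TOPS/W, 1 ms latency.
    energy_uj = energy_per_inference_uj(1e9, 10.0)
    edp = energy_delay_product_js(energy_uj * 1e-6, 1e-3)
    print(tops_per_watt_to_pj_per_op(10.0), energy_uj, edp)
```

Under this convention, an efficiency of 600 TOPS/W corresponds to roughly 0.0017 pJ per operation, which shows why the operation-counting rule and the precision should be stated alongside any such figure.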

Table 8. Comparative benchmarking with hardware targets and mapping/scheduling strategies.

Paper (Cite)	Hardware Target	Mapping/Scheduling Strategy	Performance/Efficiency Gains
[220]	Multi-tenant DNN accelerators	Co-optimization of mapping and scheduling for multi-tenancy	Improved resource utilization and multi-tenant efficiency
[104]	3D-stacked DRAM PIM	PIM-Tuner, PIM-Mapper, data-scheduler (ILP-based)	37% latency, 28% energy reduction vs. baseline
[149]	Diverse PIMs	General abstraction-based mapping	Generalized mapping for diverse PIMs
[105]	Crossbar-based PIM DNN accelerators	Universal compilation framework	Universal mapping, improved portability
[221]	Memristive crossbars	RL-based dynamic sparsity-aware mapping on crossbars	Area reduction to 43% (small), 22.5% & 17.1% (large datasets)
[222]	PIM DNN accelerators	ISA-level mapping	Improved programmability
[129]	FPGAs (cloud)	Joint architecture, scheduling, mapping	Up to 9× EDP reduction vs. SOTA
[130]	DNN accelerators (general)	Joint architecture–mapping DSE, uneven mapping	Up to 64% more energy-efficient solutions
[223]	Memristive crossbar systems	Co-design of NN and hardware	Improved co-design for memristive systems
[224]	PIM with neural approximation	Approximate peripherals for PIM	Improved efficiency via approximation
[225]	Edge inference systems	Approximate computing strategies	Energy-efficient edge inference
[226]	Flexible DNN accelerator	Diverse data-format support	Flexibility for multiple data formats
[100]	In-memory computing accelerators	Benchmarking DNN mapping methods	Comparative benchmarking
[177]	Resource-constrained PIM	Static scheduling of weight programming	Improved scheduling for resource constraints
[109]	Heterogeneous PIM devices (3D manycore)	Exploiting heterogeneity for DNN training	Improved training efficiency
[227]	FPGA-based DNN accelerator	Modeling and exploration framework	Up to 4.2× performance, 4.4× DSP efficiency vs. SOTA
[228]	Spatial DNN accelerators	SW–HW co-design of interconnections	Improved interconnect efficiency
[229]	LSTM accelerator	Algorithm–hardware co-design, quantized tensor train	Co-optimized LSTM acceleration
[230]	In-memory DNN accelerators	Interconnect-aware area/energy optimization	Area and energy optimization
[231]	Mobile/edge devices (heterogeneous HW)	Co-scheduling framework	Latency and energy improvements

Table 9. Unified comparative benchmarking results from leading DNN-to-PIM accelerators.

Platform/Technology (Cite)	Workload	Energy Efficiency (TOPS/W)	Throughput per Area (TOPS/mm²)	Latency (ms/Inference)
Binary/Ternary Reconfigurable IMC (SRAM) [138]	Binary/Ternary DNNs (ImageNet, CIFAR-10)	2.33	0.35	0.71 (batch-1, ImageNet)
SRAM & eNVM-based CIM [39]	VGG-8 (CIFAR-10)	10–100 (SRAM), 20–200 (eNVM, simulated + silicon)	0.1–0.5	0.5–2 (batch-1)
UPMEM DRAM-PIM [9]	DNNs, Graph, Analytics (PrIM suite)	0.1–0.5	0.01–0.05	10–100 (varies by workload)
Dual-SRAM Charge-Domain IMC [232]	DNNs (CIFAR-10, VGG-8)	18.4–119.2	0.2–0.8	0.2–1.0
10T1C SRAM-based IMC [59]	VGG9, ResNet-18 (CIFAR-10)	437	1.2	0.15 (batch-1)
Hybrid Analog IMC + Digital [233]	CIFAR-10, ImageNet	600 (AIMC), 14 (Digital)	0.5	0.12 (CIFAR-10)
Capacitor-based Analog IMC [234]	11-layer CNN (CIFAR-10), ResNet-50 (ImageNet)	30–51.5	0.3	0.13 (CIFAR-10)
Analog IMC (various) [235]	ConvNets, RNNs, Transformers (CIFAR-10, ImageNet, NLP)	10–100 (varies)	0.1–0.5	0.2–1.0
5 nm Digital Accelerator [236]	BERT-Base, ResNet-50	38.6–95.6	1.5	0.08 (BERT, batch-1)
16 nm Multi-Chip-Module [237]	ResNet-50 (ImageNet)	9.5	1.29	0.52 (batch-1)
RRAM-based Analog IMC + RISC-V [200]	MobileNetV2 (CIFAR-10)	9.5	0.2	0.11
ReRAM-based PIM [238]	RNNs (LSTM, GRU)	10–80	0.1–0.3	0.5–2.0
FPGA-based [239]	LeNet, AlexNet, VGG-S	0.5–1.2	0.05–0.1	0.8–2.0
Digital SRAM-PIM [240]	GEMM, ResNet, VGG	5–8.9	0.2	0.3–1.0
3D Stacked DRAM-PIM [104]	ResNet-18, VGG-16 (ImageNet)	2–10	0.1	0.6–1.5
Hybrid Memory Cube (HMC) PIM [115]	DCNNs (ImageNet)	1.5–3.0	0.05	1.0–2.0
LUT-based DRAM-PIM [7]	AlexNet, VGG	2.5–12	0.1	0.5–1.5
SRAM/ReRAM IMC [241]	VGG-19, ResNet	3–6	0.1	0.4–1.2
ReRAM-based PIM [176]	ResNet-18, VGG-16	6–17.5	0.2	0.3–1.0
IMC (SRAM, RRAM) [100]	ResNet, VGG (CIFAR-10, ImageNet)	10–30	0.1–0.3	0.2–1.0

Note: Cross-paper values are reported as published and should be interpreted in the context of workload, precision, batch size, technology node, and evaluation setting.
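
As a purely mechanical illustration of how energy efficiency and latency can be examined together, the sketch below applies a Pareto filter to the four rows of Table 9 that report single values for both columns ([138], [59], [237], [200]). The `Entry` type and function names are hypothetical. Because these rows differ in workload, precision, and technology node, the output shows how the filter works and should not be read as a ranking of the platforms.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    ref: str
    tops_per_watt: float  # higher is better
    latency_ms: float     # lower is better


# Single-valued rows copied from Table 9 (values as published).
ENTRIES = [
    Entry("[138]", 2.33, 0.71),
    Entry("[59]", 437.0, 0.15),
    Entry("[237]", 9.5, 0.52),
    Entry("[200]", 9.5, 0.11),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.tops_per_watt >= b.tops_per_watt and a.latency_ms <= b.latency_ms
    better = a.tops_per_watt > b.tops_per_watt or a.latency_ms < b.latency_ms
    return no_worse and better


def pareto_front(entries: List[Entry]) -> List[Entry]:
    """Keep the entries that no other entry dominates, preserving input order."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


if __name__ == "__main__":
    for entry in pareto_front(ENTRIES):
        print(entry.ref, entry.tops_per_watt, entry.latency_ms)
```

With these four rows, [59] and [200] remain on the front: the first has the highest reported TOPS/W and the second the lowest reported latency. A fair comparison would first restrict the rows to a common workload, precision, and batch size.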

Table 10. Task–application matrix: ML and GNN for DNN–PIM optimization.

ML/GNN Task	Typical Input	Prediction Target	Example Use Case	Main Benefit	Notable Limitation	Ref.
Hardware-Aware DNN Mapping	DNN architecture, hardware specs	Optimal mapping/configuration	Mapping DNN layers to PIM nodes for latency/energy optimization	Significant speedup and energy savings	Balancing hardware and model constraints is complex	[7,104,169,184]
Graph-Based Resource Scheduling	Device topology, computation graph	Scheduling strategy	Assigning DNN tasks to PIM cores in distributed/heterogeneous systems	Better utilization, reduced communication	Scalability and communication overhead	[175,210,232,239]
Model Compression & Quantization	DNN weights, hardware constraints	Compressed/quantized parameters	Deploying DNNs on RRAM/DRAM PIM with limited resources	Lower footprint, faster inference	Potential accuracy loss, hardware mismatch	[7,171,184,247]
Data/Model Pruning for GNNs	Graph data, GNN model	Pruned graph/model	Accelerating GNN training on ReRAM-based PIM	Faster, energy-efficient training	Risk of over-pruning, info loss	[248,249]
Automated Architecture Optimization	DNN/graph structure, performance data	Optimized architecture/config	AutoML-driven design for DNN-PIM deployment	Automation, high-quality solutions	Search space explosion, compute cost	[104,171,250,251]
Robust Optimization via Deep Learning	Historical data, uncertainty models	Robust solution/uncertainty set	Reliable DNN-PIM deployment under variable conditions	Improved robustness, generalization	NP-hardness, integration complexity	[252,253]
Distributed GNN Training	Large-scale graph data, cluster info	Training workflow/strategy	Scaling GNN training across PIM-enabled clusters	Scalability, efficient distributed use	Communication bottlenecks	[254,255,256]

Table 11. Evidence matrix for DNN-to-PIM mapping research landscape.

Focus Area	Key Statement	Strength of Support	Brief Justification	Literature Reference
Hardware & Devices	DRAM and SRAM are most mature and widely integrated IMC/PIM tech	Strong	Commercial deployment, extensive research, robust performance	[4,51,59,60,201]
	RRAM and PCM enable high energy efficiency/density for AI/neuromorphic	Strong	Multiple experimental/prototype systems; some integration/endurance challenges	[42,47,48,50,61,62]
	Analog IMC faces precision/variation limitations	Moderate	Device non-idealities, process variation impact accuracy	[42,82]
	Integration/process compatibility are major barriers for emerging IMC/PIM	Moderate	Key challenge in reviews and experimental reports	[47,48,49,50,61,62,63,65,67,68,71]
	Photonic/3D memristor IMC promise ultra-high density/efficiency	Moderate	Early-stage research, mostly prototypes; integration hurdles	[49,61,71]
	FeFET, STT-MRAM offer high endurance and low power for emerging PIM	Moderate	Promising in recent studies, but limited large-scale deployment	[63,64,65,66,67,68]
Mapping & Algorithms	All methods face trade-offs between efficiency, adaptability, complexity	Moderate	Context-dependent; highlighted in comparative analyses	[42,51,54,55,56,57,58,129,130,131]
	Graph-based/hybrid methods excel in irregular DNNs, multi-objective	Moderate	Effective for DAGs and heterogeneous systems, but complex to implement	[60,63,74,102,103,127,128]
	Rule-based methods predictable but inflexible	Moderate	Low-overhead, not adaptive	[100,102,114]
	Metaheuristics outperform heuristics in large, complex search spaces	Moderate	Outperform heuristics in complex/heterogeneous settings	[104,112,119,120]
	Heuristic methods are fast and scalable for moderate complexity	Strong	Supported by many studies and benchmarks	[103,115,116,117,118]
	Learning-based methods adapt to dynamic, heterogeneous environments	Strong	Demonstrated in dynamic scheduling, partitioning	[121,122,123,124,125,126]
Software/Tools	PIMCOMP enables fully automated, modular DNN deployment on PIM/IMC	Strong	Demonstrated end-to-end automation, modular passes, broad hardware support	[105]
	Open-source availability accelerates research and adoption	Moderate	Open tools cited in multiple studies/benchmarks	[105,192]
	Benchmarking and standardization are ongoing challenges	Moderate	Few tools provide unified benchmarks across hardware/applications	[105,192,193,194,195]
	DNN + NeuroSim provides hierarchical evaluation and device modeling	Strong	Widely used for device-level simulation and partial automation	[192]
	Heterogeneous hardware support remains limited	Moderate	Most tools focus on specific memory types/platforms	[193,194,195]
	Proprietary tools limit reproducibility and community improvement	Moderate	Closed/partially accessible frameworks hinder impact	[193,195]
Benchmarking & Datasets	Lack of standardized benchmarks/diverse hardware complicates comparison	Weak	Heterogeneous setups limit cross-framework comparability	[5,39,104,175,180,184,185,224,257]
	Leading frameworks provide multi-level hardware coverage	Strong	Multiple frameworks benchmark device to system level	[26,39,175,188]
	Some frameworks enable flexible, extensible benchmarking	Moderate	Subset support flexible model/hardware integration	[26,39,175,188]
	Lack of end-to-end system-level evaluation with real-world datasets	Moderate	Few frameworks benchmark full systems with diverse, real-world tasks	[26,175,200,202,203]
	Open access/standardized benchmarks are inconsistently available	Moderate	Few are public, most lack open access/dataset integration	[39,175,200,202,203]
	Most frameworks are tailored to specific hardware or tasks	Moderate	Limits generalizability and cross-platform comparison	[100,104,171,176,202]
	Dataset/model diversity is limited in most frameworks	Strong	Most focus on specific models, lack real-world dataset integration	[100,104,171,175,176,200,202,203]
Pitfalls & Methodological Gaps	Ignoring hardware non-idealities leads to unreliable results	Strong	Device noise/variation significantly affect real-world performance	[39,78,168,179,206,211]
	Apples-to-oranges comparisons mislead hardware evaluation	Strong	Lack of standardization skews results	[100,199,212,213]
	Shortcut learning/overfitting to benchmarks are common and severe	Strong	DNNs exploit spurious correlations in benchmarks	[208,209,210]
Evaluation & Reproducibility	Unrealistic workloads undermine generalizability	Moderate	Synthetic/toy datasets do not reflect real application performance	[9,104,199,212]
	Poor reproducibility slows progress	Moderate	Missing code/incomplete reporting prevent verification/reuse	[9,199,209,214]
	Not testing at scale or under stress misses bottlenecks	Moderate	Small-scale tests overlook scalability/failure modes	[188,208,211,212]
Learning-Driven DNN-to-PIM	ML/GNN-based mapping significantly improves DNN-PIM performance	Strong	Multiple studies show large speedup and energy savings	[7,104,169,184]
	Model compression/quantization enables efficient PIM deployment	Strong	Consistent improvements in memory use, inference speed	[7,171,184,247]
	Automated optimization (AutoML) finds high-quality DNN-PIM configs	Moderate	Good results, but search space and cost are challenges	[104,171,250,251]
	Distributed GNN training on PIM is scalable but faces bottlenecks	Moderate	Scalability shown, but communication/workflow divergence issues	[254,255,256]
	Robust optimization via deep learning improves reliability	Moderate	Some evidence for better generalization; integration is complex	[252,253]
	Data/model pruning accelerates GNN training on PIM	Moderate	Speed/energy gains, but risk of over-pruning	[248,249]
Frameworks & Benchmarking Trends	Lack of standardized benchmarks/diverse hardware complicates comparison	Weak	Heterogeneous setups limit cross-framework comparability	[5,39,104,175,180,184,185,224,257]
	Modern DNN-to-PIM frameworks achieve 2–48× speedup, 28–131× energy efficiency	Strong	Consistent, large improvements in throughput/energy over CPU/GPU	[5,104,175,180,184,257]
	Real-system benchmarks (e.g., PyGim) are essential for validating frameworks	Moderate	Provide real hardware results, highlight deployment issues	[5,184]
	Most frameworks focus on inference; training/GNN support underexplored	Moderate	Few studies address training/support for new model types	[39,81,257]
	Integer/dynamic programming effective for mapping/partitioning	Moderate	Improved data locality, reduced comm. overhead, but high complexity	[104,180]
	Co-design approaches yield better optimization than hardware/software only	Strong	Jointly optimizing NN and PIM architectures gives superior results	[104,175,257]
Performance Metrics	Energy efficiency (TOPS/W), throughput (TOPS) are primary metrics	Strong	Consistently reported; definitions standardized	[26,54,59,79,185,234,267,268]
	Standardized workloads (CIFAR-10, ImageNet) enable fair comparison	Strong	Most comparative studies use these datasets	[54,59,79,267,268]
	Some platforms report inflated efficiency via aggressive quantization	Moderate	Accuracy must be reported alongside efficiency	[54,59,268]
	Secondary metrics (EDP, area, utilization) less standardized/varied	Moderate	Some report these, but usage is inconsistent	[54,79,185]
	Variability in reporting granularity hinders direct comparison	Moderate	Some report only partial results, complicating meta-analysis	[54,79,185]
	Reporting mean ± std. over ≥5 runs is now common	Strong	Multiple papers report statistical results for reliability	[59,267,268]
