Article

A Binary Convolution Accelerator Based on Compute-in-Memory

Beijing Smart-Chip Microelectronics Technology Company Ltd., Beijing 100192, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 117; https://doi.org/10.3390/electronics15010117
Submission received: 30 October 2025 / Revised: 4 December 2025 / Accepted: 23 December 2025 / Published: 25 December 2025

Abstract

As AI workloads move to edge devices, the von Neumann architecture is hindered by memory- and power-wall limitations. We present an SRAM-based compute-in-memory binary convolution accelerator that stores and transports only 1-bit weights and activations, maps MACs to bitwise XNOR–popcount, and fuses BatchNorm, HardTanh, and binarization into a single affine-and-threshold unit. Residual paths are handled by in-accumulator summation to minimize data movement. FPGA validation shows 87.6% CIFAR-10 accuracy, consistent with a bit-accurate software reference, and a compute-only latency of 2.93 ms per 32 × 32 image at 50 MHz at a total on-chip power of only 1.52 W. These results demonstrate an efficient and practical path to deploying edge models under tight power and memory budgets.

Graphical Abstract

1. Introduction

1.1. Background

The rapid advancement of artificial intelligence has fueled the proliferation of intelligent devices, particularly at the edge. These devices now appear not only in the consumer market but also across a wide range of industries, where they play increasingly important roles. Although the fast progress of deep learning provides strong algorithmic support, edge devices must often be placed close to data sources to fully exploit edge intelligence, and their constrained resources impose stringent power limits on on-device computation.
For decades, guided by Moore’s law, reductions in transistor size and architectural innovations have delivered substantial improvements in computing performance [1]. In recent years, however, data volumes have grown explosively [2], making the efficient storage, movement, and processing of massive datasets a central challenge in information technology. Against this backdrop, the limitations of the von Neumann architecture have become increasingly pronounced [3]. On the one hand, memory access speed lags far behind processor execution speed, so system performance is limited by data-transfer bandwidth; as a result, realized compute throughput falls well below the theoretical peak, failing to meet the fast and accurate response demanded by big data applications, the so-called memory wall [4]. On the other hand, the separation of storage and computation in the von Neumann paradigm necessitates frequent data transfers between memory and processors (Figure 1), incurring substantial energy overhead, the so-called power wall [5].
Compute-in-memory (CIM) [6] has been proposed to address these von Neumann bottlenecks. The core idea is to integrate computation with memory access by performing data processing directly within the memory itself, thereby eliminating data movement at its source and unifying storage and computation. CIM thus holds promise for overcoming both the memory wall and the power wall and has emerged as a leading solution for edge-intelligent computing [7].
While compute-in-memory (CIM) has exhibited strong potential for energy-efficient inference across mainstream neural network architectures, its advantages are increasingly stressed by the trend toward deploying ever-larger models at the edge. As model size and precision scale up, the storage footprint and data-movement demand grow rapidly, and high-bitwidth formats (e.g., FP32/FP16 or even INT8 in some settings) place untenable pressure on scarce on-chip memory resources [8]. Moreover, high-precision arithmetic inflates circuit complexity and power, particularly for multipliers and wide datapaths. These constraints motivate a shift toward highly optimized binary neural networks for edge deployment [9]. By binarizing both weights and activations to a single bit, the memory footprint is reduced by up to orders of magnitude and the bandwidth requirement is minimized [10]. Simultaneously, multiply-accumulate operations are mapped to bitwise XNOR followed by population count performed within or near the memory array, eliminating multipliers and significantly lowering switching activity, latency, and power [11,12]. Building on this rationale, we propose a CIM-based binary convolution accelerator tailored for edge devices, which leverages tightly integrated in-memory bitwise computation to deliver high energy efficiency under stringent on-chip resource budgets.
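The XNOR–popcount mapping mentioned above can be sketched in a few lines of Python (an illustrative model, not the accelerator's RTL): a dot product over {+1, −1} vectors equals 2 × popcount(XNOR) − N over their {1, 0} bit encodings.

```python
# Illustrative sketch of the binary MAC: with the encoding 1 -> +1, 0 -> -1,
#   dot(a, b) = 2 * popcount(XNOR(a_bits, b_bits)) - N
# so no multiplier is needed, only bitwise logic and a population count.

def binary_mac(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {+1,-1} vectors packed as n-bit ints."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask       # 1 wherever the two bits agree
    return 2 * bin(xnor).count("1") - n    # matches = +1, mismatches = -1

def reference_mac(a_bits: int, b_bits: int, n: int) -> int:
    """Bit-accurate reference in the software {+1, -1} convention."""
    total = 0
    for i in range(n):
        a = 1 if (a_bits >> i) & 1 else -1
        b = 1 if (b_bits >> i) & 1 else -1
        total += a * b
    return total
```

Checking the two against each other on random 64-bit stripes (the datapath's granularity) mirrors the bit-accurate software/hardware equivalence argument used later in Section 3.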

1.2. Computing-in-Memory

In terms of implementation, CIM broadly follows two paths: digital CIM and analog CIM [13,14]. Digital CIM integrates storage and computation using digital circuits (commonly SRAM-based arrays), offering high numerical accuracy, strong noise immunity, and considerable architectural flexibility. Its energy efficiency, however, is partially constrained by digital switching activity and peripheral logic. By contrast, analog CIM leverages physical laws to perform multiply-accumulate (MAC) operations directly within the array, yielding intrinsically high energy efficiency and massive parallelism.
Analog CIM also faces notable challenges [15,16], especially when operating on binarized data. The limited signal dynamic range and small differential margins used to represent binary states make computations more susceptible to device mismatch, noise, and parasitic effects, which can accumulate into noticeable errors. In practice, the low signal-to-noise ratio and the calibration/compensation needed to stabilize accuracy diminish the nominal energy-efficiency gains. By comparison, digital CIM maps naturally to binary operations and exhibits strong robustness to process/voltage/temperature variations, making it a pragmatic choice for reliable edge deployments under tight resource and power budgets [17].
Among digital CIM options, SRAM-based techniques have become predominant owing to their process maturity [18], compatibility with standard CMOS flows, and robustness. Relative to conventional memory hierarchies, SRAM-based digital CIM processes data in situ with markedly lower latency, which is advantageous for workloads requiring frequent, low-latency memory access—such as image processing and machine learning inference. A typical SRAM digital CIM architecture comprises three components: (i) storage arrays that provide fast, low-latency read/write operations; (ii) in/near-memory compute elements that execute operations directly on wordlines/bitlines to avoid round-trip data movement; and (iii) interconnect and control circuitry that coordinates wordline activation, bitline sensing, and dataflow to sustain high throughput and parallelism.
By tightly coupling data storage with computation, SRAM-based digital CIM demonstrates clear benefits for data-intensive applications. In deep learning, where large matrix/vector operations dominate, it can accelerate inference (and, in some cases, parts of training) while reducing reliance on general-purpose CPUs/GPUs [19]. Owing to its low latency and strong energy efficiency, SRAM digital CIM is also well-suited to caches and embedded systems that must meet stringent performance-per-watt targets. Recent academic and industrial advances continue to extend SRAM-based CIM, underscoring its growing role in compute-intensive edge workloads [7].

2. Purely Binary Convolution CIM Design

2.1. Overview

Building on the above advantages, this work develops a high-efficiency binary convolution accelerator that targets convolution layers in binary neural networks (BNNs). Through a set of architectural and dataflow optimizations, the accelerator efficiently handles diverse convolutional structures, including single-layer, multi-layer, and residual convolutions. A key design principle is that only binary data are ever stored in the memory arrays and on-chip buffers. Any higher-bitwidth intermediates produced within the datapath are immediately reduced to 1 bit via lightweight combinational logic before being written back. This policy tightly matches storage bandwidth to computational bandwidth, markedly improving utilization and throughput.
In BNN inference, convolution outputs are typically followed by BatchNorm, HardTanh, and binarization. We fuse these three inference steps into a single affine-and-threshold stage, thereby eliminating redundant operations and memory traffic and further improving efficiency. For residual connections, the design leverages dataflow control to perform the summation of two convolution results directly inside the accumulator, avoiding additional adders and scratchpad storage and reducing cycle count. To enhance locality, we adopt output reuse: convolution results are forwarded directly as inputs to the subsequent layer, minimizing intermediate buffering and write-back overhead and accelerating layer-to-layer handoff.
From a hardware-software co-design perspective, layer parameters are aligned to powers of two to maximize bandwidth utilization and simplify control (e.g., aligning channel counts, tile sizes, and strides to array widths), which further boosts effective throughput.
In summary, the proposed binary convolution accelerator not only circumvents the inherent limitations of the von Neumann paradigm through compute-in-memory integration but also offers a practical remedy to the storage and compute costs associated with high-bitwidth formats under edge deployment of large models. By combining CIM with aggressively binarized networks, the design delivers notable gains in storage efficiency, energy efficiency, and computational performance, providing robust hardware support for intelligent terminals and edge AI scenarios, and offering a viable direction for post-Moore computing architectures. Our fully digital SRAM-based architecture makes a strategic trade-off, prioritizing computational robustness and scalability over the higher physical density of analog in-memory computing (AIMC). This is critical, as AIMC’s density advantage can be undermined by practical implementation overheads; for instance, complex data mapping schemes required for multi-bit operations can compromise effective cell utilization [20]. More fundamentally, our digital paradigm avoids the device non-linearity and temporal drift challenges endemic to analog approaches—issues that persist even with sophisticated hardware compensation schemes in PCM-based systems [21]. By mapping multiply-accumulate operations to deterministic, bitwise logic, we eliminate the need for area- and power-intensive ADCs, which are inherent bottlenecks to both precision and array-level scalability, thus providing a more tractable pathway for high-fidelity edge inference.

2.2. Computing Framework

As illustrated in Figure 2, the proposed accelerator adopts a fully digital compute-in-memory organization tailored to binary convolution, and is composed of seven tightly coordinated modules. The overall pipeline is designed so that only 1-bit data are stored and transported across the memory hierarchy and compute fabric. Any higher-bitwidth intermediates generated within the datapath are immediately reduced to a single bit before being written back. This policy minimizes bandwidth demands, reduces switching activity, and improves the match between storage bandwidth and computational throughput. Functionally, the front end accepts instructions and parameters from the host, the memory subsystem stages inputs, weights, and per-channel post-processing coefficients, and the compute subsystem executes bitwise convolutions followed by a fused post-processing stage. The framework natively supports single-layer, multi-layer, and residual convolutional structures, and is optimized for chaining layers without round-trips to off-chip memory. The proposed framework is inherently scalable to support deeper BNN models and more complex datasets. Architectural extension is straightforward: expanding on-chip SRAM capacity for larger models and increasing the number of parallel ComputeCore instances to maintain throughput. This modularity ensures the control logic remains simple. While classification accuracy is ultimately determined by the BNN model’s representational capacity, our accelerator is designed to provide the robust hardware foundation necessary to execute these larger and more complex network topologies.
The front end comprises an AMBA AHB–Lite compatible slave, an Instruction Register, and a centralized controller (Ctrl). The AHB slave exposes memory mapped regions for input/output buffers, the weight store, and the parameter store, along with control and status registers; upon layer completion it asserts an interrupt to the host. The Instruction Register holds the configuration for the current layer (e.g., kernel geometry, tiling, buffer selection, residual mode, and write back policy). The Ctrl unit decodes these fields into a micro schedule that sequences buffer toggling, input and weight fetches, compute core activation, residual accumulation, and post-processing. To sustain high utilization, input feature maps are double buffered across two on-chip SRAM banks (SRAMa and SRAMb) operating in a ping-pong manner: while one bank streams operands to the compute cores, the other is refilled with the next spatial tile, thereby overlapping I/O and computation. Detailed roles and interactions of the individual modules are elaborated in Section 2.3.

2.3. Functions of the Accelerator Modules

The InBuf module serves as a near array staging buffer that caches the inputs to be processed and selects the data source (SRAMa or SRAMb) via write enable and select controls. To support residual connections robustly, InBuf includes an auxiliary register file that temporarily latches the first read of the residual branch input from SRAMa. On the second pass, the residual input is fetched directly from this temporary store rather than from SRAM, preventing the original feature map from being overwritten by the first pass’s outputs. This mechanism stabilizes the data supply from SRAM and enables two computation modes: (i) sequential convolution, and (ii) residual convolution, where two convolutions are computed in parallel from inputs CONV1 (in SRAMa) and CONV2 (in SRAMb), their results are summed on the fly in the accumulator, and the combined sum is then forwarded to the SIMD post-processing for fused BatchNorm + HardTanh + binarization, as illustrated in Figure 3.
SRAMa and SRAMb are a matched pair of input/output buffers that alternate roles per tile or layer, thereby realizing classical ping-pong operation. In a typical dataflow, each layer entails three stages—read, compute, and write. With a single buffer these stages serialize, forcing the compute array to idle while new data are fetched or written back. The ping-pong organization removes this bottleneck: while the compute fabric consumes data from one bank, the other bank is simultaneously filled with the next tile. Upon completion, roles swap without bubbles. This overlap pipelines input, compute, and output, eliminates stalls in the dataflow, and significantly reduces end to end latency and data movement overhead.
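The ping-pong schedule can be sketched as a toy Python model (names are illustrative, not the RTL's): each step, one bank feeds the compute fabric while the other is refilled with the next tile, and the roles then swap without a bubble.

```python
# Toy model of the SRAMa/SRAMb ping-pong schedule: consume from one bank
# while the other is refilled, then swap roles each tile.

def ping_pong_schedule(tiles):
    """Return (consumed_tile, bank_used) pairs for a stream of tiles."""
    banks = {"SRAMa": None, "SRAMb": None}
    compute, fill = "SRAMa", "SRAMb"
    banks[compute] = tiles[0]                 # prime the first bank
    results = []
    for i, _ in enumerate(tiles):
        if i + 1 < len(tiles):
            banks[fill] = tiles[i + 1]        # refill overlaps compute
        results.append((banks[compute], compute))
        compute, fill = fill, compute         # swap roles, no bubble
    return results
```

Running it on a four-tile stream shows tiles consumed in order while the two banks strictly alternate, which is the pipelining property the text describes.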
The ComputeCore Group constitutes the main compute fabric and integrates the weight store and the binary convolution datapath. It encapsulates 64 parameterized ComputeCore instances operating in parallel. Each core is bound to a single convolution kernel (i.e., one output channel) for the duration of a tile and executes weight storage, bitwise multiplication, and accumulation. WeightSRAM provides the kernel parameters with concurrent read/write capability. A small WeightBuffer in each core refreshes its local registers when enabled and streams weight bits to the binary multiplier, implemented as a bitwise XOR (equivalently XNOR under complementary encoding). The partial products feed a balanced AdderTree to realize the population count, and results accumulate in an Accumulator that supports four instruction-controlled modes: clear (00), load (01), accumulate (10), and hold (11). All streams are aligned at 64-bit granularity to maximize effective bandwidth. A representative flow proceeds as follows (Figure 4): a 32 × 32 × 64 bit input feature map is loaded from SRAMa and preprocessed by a 64-bit input buffer; it is convolved with 64 × 3 × 3 × 64 binarized weights stored in a 64 × 64 × 38 CIM SRAM; the 64 accumulators (ACC) aggregate the results; finally, the BatchNorm module normalizes and binarizes the outputs, which are emitted as a 32 × 32 × 64 bit feature map into SRAMb. This design keeps all in-flight operands strictly binary, thereby avoiding high bitwidth write backs and improving energy proportionality.
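The four instruction-controlled accumulator modes can be sketched as a tiny Python model (class and method names are ours, not the RTL's; the mode encodings are those given above).

```python
# Sketch of the accumulator's four modes from the text:
# 00 clear, 01 load, 10 accumulate, 11 hold.

class Accumulator:
    CLEAR, LOAD, ACC, HOLD = 0b00, 0b01, 0b10, 0b11

    def __init__(self):
        self.value = 0

    def step(self, mode: int, operand: int = 0) -> int:
        if mode == self.CLEAR:
            self.value = 0
        elif mode == self.LOAD:
            self.value = operand        # overwrite with a new partial sum
        elif mode == self.ACC:
            self.value += operand       # e.g., add the residual branch
        # HOLD: retain the current value
        return self.value
```

The accumulate mode is what lets the residual branch be summed in place, as described for the residual convolution flow.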
The SIMD module implements the fused BatchNorm, HardTanh, and binarization, and is provisioned to store the per channel parameters for multiple BN layers while processing many channels concurrently. In the standard form [22], batch normalization performs the following calculations during the inference process:
y = γ × (x − μ)/σ + β,
where γ, β, μ, and σ are trained floating-point constants. Directly realizing this expression in hardware would require substantial parameter storage and floating-point units. Instead, we algebraically fold BatchNorm, HardTanh, and the subsequent binarization into a simple affine-and-threshold form a(x − b). For BNNs, only the sign of a and the threshold b are needed at inference: a is a 1-bit sign per output channel, and b is a per-channel threshold constrained to 12 bits, matching the dynamic range of the accumulator input x. This 12-bit range is empirically sufficient for over 99% of all feature channels across the network, so clipping is a statistically rare event; the cases where the derived b exceeds 12 bits are almost exclusively channels whose BatchNorm scaling factor γ is exceptionally small, indicating minimal saliency and a negligible contribution to the network's output. Consequently, clipping b (and representing it as an integer) has no measurable impact on inference accuracy or on robustness under domain shift, since resilience to perturbations such as noise or quantization mismatch is dominated by the numerous, more salient channels that operate well within the unclipped range (Algorithm 1). Operationally, the SIMD module stores two column vectors per layer: an a vector (1 bit per channel) and a b vector (12 bits per channel, length equal to the layer's channel count). For each channel, SIMD compares x with b and applies the sign from a to emit the final 1-bit activation, collapsing normalization, activation, and binarization into a single, low-latency step.
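The folding can be sketched in Python (helper names are ours; the hardware additionally rounds b to a 12-bit integer and stores a as a sign bit, both omitted here; tie-breaking at exactly x = b is a hardware convention and is not exercised below).

```python
# Sketch of folding BatchNorm followed by sign binarization into a(x - b):
# sign(gamma*(x - mu)/sigma + beta) == a * sign(x - b)
# with a = sign(gamma) and b = mu - beta*sigma/gamma (sigma > 0, gamma != 0).

def fold_bn(gamma: float, beta: float, mu: float, sigma: float):
    """Return the per-channel sign a and threshold b."""
    a = 1 if gamma >= 0 else -1
    b = mu - beta * sigma / gamma
    return a, b

def bn_then_sign(x, gamma, beta, mu, sigma):
    """Reference path: full BatchNorm, then binarize."""
    y = gamma * (x - mu) / sigma + beta
    return 1 if y >= 0 else -1

def threshold_sign(x, a, b):
    """Hardware-style path: one compare and one sign flip."""
    return a * (1 if x >= b else -1)
```

A randomized check confirms the two paths agree, which is why only a compare per channel survives into the SIMD stage.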
The remaining system modules coordinate instruction delivery, control, and host interaction. The Ctrl module parses the instruction words and issues the requisite enables, address sequences, and mode signals to orchestrate the accelerator through load, compute, residual accumulation, post-processing, and writeback. The AHB slave interfaces with the external AHB bus to perform data marshaling, address mapping for the various on-chip memories, and interrupt signaling (set/clear) upon events such as tile or layer completion.
Algorithm 1 SIMD Computing
RESET_HW_STATE()
(OPCODE, K_SIZE, IN_HW_LOG, IN_C_LOG, OUT_C_LOG, S1, S2, W_BASE, BN_BASE) = DECODE(iInstruction)
OUT_H = (1 << IN_HW_LOG) / S1
OUT_W = (1 << IN_HW_LOG) / S1

for out_pixel_idx = 0 to (OUT_H * OUT_W) - 1 do
  for out_ch_group_idx = 0 to ((1 << OUT_C_LOG) / 64) - 1 do
    ACCUMULATORS = ZERO_ACCUMULATORS()
    ACCUMULATORS = ACCUMULATORS + CALCULATE_CONV_SUM(out_pixel_idx, out_ch_group_idx, K_SIZE, (1 << IN_C_LOG), W_BASE, S1, S2)
    if OPCODE == 2'b01 then
      ACCUMULATORS = ACCUMULATORS + CALCULATE_RES_SUM(out_pixel_idx, out_ch_group_idx, W_BASE, (1 << IN_C_LOG), (1 << OUT_C_LOG), S2)
    end if
    BN_PARAMS = FETCH_BN_PARAMS(BN_BASE, out_ch_group_idx)
    BINARIZED_OUTPUT = SIMD_BINARIZE(ACCUMULATORS, BN_PARAMS)
    STORE_OUTPUT_FEATURES(out_pixel_idx, out_ch_group_idx, BINARIZED_OUTPUT)
  end for
end for

SWAP_IO_BUFFERS()
SIGNAL_CALCULATION_COMPLETE()
RETURN PROCESSED_DATA_REFERENCE()
To streamline programming, an Instruction Register module implements a lightweight instruction set tailored to the accelerator, covering storage and transfer operations (Figure 5). It reads instruction addresses and payloads from the AHB slave and dispatches them to Ctrl in program order, observing a simple handshake—for example, it writes iWriteAddr/iWriteData into the instruction memory when iReady indicates the accelerator can accept a new command; it monitors iComputeDone to detect completion; and it raises oDone while enabling the controller via oCtrlnCe to advance execution. This organization decouples host traffic from real-time control and ensures predictable, low overhead sequencing of layer execution.
Together, these modules realize a strictly binary, compute-in-memory convolution engine that supports conventional and residual topologies with high utilization. By confining storage and transport to 1-bit representations and reducing post-processing to per channel compares, the design minimizes bandwidth and switching energy while maintaining a streamlined programming model and robust system integration.

3. FPGA-Based Accelerator Validation

3.1. Simulation of the Purely Binary Accelerator

In the proposed accelerator, both inputs and weights are represented in binary. The software and hardware differ only in representation, not in arithmetic precision: software adopts the {+1, −1} convention, whereas hardware uses {1, 0}. Because these encodings are isomorphic, we can construct a bit-accurate software reference and perform layer-wise equivalence checks. Concretely, we implement a network whose topology and binarization policy match the hardware design, export each layer’s parameters and input tensors, and feed the same inputs to the binary convolution accelerator. By comparing the per-layer outputs from software and hardware, we validate both the correctness of the accelerator’s compute kernels and the integrity of its execution schedule. This co-simulation framework also allows us to exercise complete inference flows in a controllable environment and to stress boundary conditions that may arise in deployment.
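The encoding isomorphism underlying this equivalence check can be made explicit in a short sketch (helper names are illustrative): the map bit → 2·bit − 1 is a bijection between the hardware {1, 0} and software {+1, −1} conventions.

```python
# Sketch of the encoding translation used in the layer-wise equivalence
# check: hardware stores {1, 0}, software computes in {+1, -1}.

def hw_to_sw(bit: int) -> int:
    """Map a hardware bit {1, 0} to the software value {+1, -1}."""
    return 2 * bit - 1

def sw_to_hw(val: int) -> int:
    """Inverse map: {+1, -1} -> {1, 0}."""
    return (val + 1) // 2

def layers_match(hw_layer_out, sw_layer_out) -> bool:
    """Compare one layer's hardware output bits against the software
    reference after translating encodings."""
    if len(hw_layer_out) != len(sw_layer_out):
        return False
    return all(hw_to_sw(h) == s for h, s in zip(hw_layer_out, sw_layer_out))
```

Because the maps are exact inverses, any disagreement flagged by such a compare localizes a hardware bug rather than a numerical tolerance issue.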
The binary convolution compute cycle can be summarized as write weights, write inputs, start compute, and readout results. In our design, all reads/writes and execution triggers are issued over an AHB bus, which controls parameter loading, result retrieval, computation launch, and completion polling. Because modeling AHB transactions directly in a testbench is verbose and error-prone, we encapsulate bus behavior using Verilog tasks to enhance modularity and readability. Typical tasks include AHB_write (addr, data), AHB_read (addr, data), AHB_burst_write (base, buf, len), AHB_burst_read (base, buf, len), start_compute, and wait_done. These abstract low-level handshake signals (HREADY, HRESP, etc.), insert appropriate wait states, and standardize timing of request/response sequences. With this harness, we run directed and randomized tests that cover single-layer, multi-layer, and residual convolutions, various tiling configurations, and corner cases in padding/stride/dilation, thereby ensuring functional correctness before FPGA prototyping.

3.2. FPGA Prototyping and On-Board Verification

Our FPGA-based validation serves as a crucial proof-of-concept for the proposed architecture; however, a full-custom, transistor-level implementation is envisioned for a production-level deployment to achieve optimal power, performance, and area (PPA). Such an ASIC would overcome the inherent inefficiencies of the FPGA’s reconfigurable logic and routing fabric, enabling significantly higher clock frequencies and lower static power. Furthermore, a custom silicon approach would allow for the leveraging of advanced process nodes and transistor architectures, such as FinFETs, to further enhance energy efficiency at the circuit level [23]. The current FPGA prototype, therefore, validates the functional correctness and provides a solid foundation for this future silicon realization. As the accelerator requires external control to stream data and issue commands, we prototype it on an FPGA to validate functionality in the absence of a silicon tapeout. Directly exposing the accelerator’s AHB interface to off-chip connectors on an FPGA board is prone to severe timing problems due to the physical pinout and signal integrity of parallel buses. To avoid this, we adopted the PYNQ platform built on the Xilinx Zynq SoC, where the Processing System (PS) and the programmable logic (PL) reside on the same die and communicate via on-chip AXI interconnects. We instantiate the accelerator in the PL and drive it from the PS over AXI, eliminating external bus wiring. Because the PS runs Linux with a Python 3.7 environment, we also implement pre- and post-processing on the PS side, enabling end-to-end inference with minimal glue logic.
The implementation flow in Vivado proceeds as follows. First, the accelerator is packaged as a reusable IP. During packaging, we constrain interface signals, cast generics/parameters to numeric types, and reserve a sufficiently large internal register map for AHB access. Next, we create a block design: the PS, an AXI-AHB-Lite bridge, AXI SmartConnect, a Processor System Reset block, and the proposed accelerator are added and connected. Prior to wiring, all IP cores are configured to match the board’s clocks and reset scheme. Using Vivado’s Address Editor, we assign the AHB-accessible regions into the PS-visible AXI address space. Timing and power reports are then inspected. After closure, we generate the bitstream. A representative post-implementation power report is shown in Figure 6. In our prototype, on-chip power is dominated by PS activity, whereas PL devoted to the accelerator contributes a small fraction.
For on-board evaluation, we first copy the generated bitstream (.bit) and hardware handoff (.hwh) files to the PYNQ board. Using a Jupyter notebook, we load the design via pynq.Overlay to program the PL, then instantiate pynq.MMIO with the base address and range assigned in Vivado to guarantee correct register mapping. After a quick sanity check of read/write access to control/status registers and a small memory window, we execute an end-to-end inference flow: reuse the PyTorch preprocessing to produce bit-packed 1-bit input tensors, burst-write them into SRAMa/SRAMb, program the instruction words and per-channel thresholds, trigger execution, poll the done flag (or service the interrupt), and read back the bit-packed outputs for post-processing into logits and labels. We begin with a single-image smoke test to validate the data path and scheduling, then run dataset-level evaluation while logging per-image latency (from start trigger to completion) and overall accuracy. Because the PS and PL communicate over on-chip AXI, these measurements reflect realistic clocking and memory behavior, and the Python-based harness provides a reproducible, software-controlled environment for repeatable experiments.
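The bit-packing step preceding the burst writes can be sketched as follows (helper names are ours, not the PYNQ API; the actual bit ordering in the hardware buffers may differ from this little-endian-within-word convention).

```python
# Sketch of packing a {1, 0} activation stream into 64-bit words for burst
# writes, and unpacking read-back words; the 64-bit word size matches the
# datapath granularity described in Section 2.

def pack_bits(bits):
    """Pack a list of {0, 1} values into 64-bit integers, LSB first."""
    words = []
    for base in range(0, len(bits), 64):
        word = 0
        for i, bit in enumerate(bits[base:base + 64]):
            word |= (bit & 1) << i
        words.append(word)
    return words

def unpack_bits(words, n_bits):
    """Inverse of pack_bits, recovering the first n_bits bits."""
    bits = []
    for word in words:
        for i in range(64):
            bits.append((word >> i) & 1)
    return bits[:n_bits]
```

A round-trip check (pack, then unpack) is a useful smoke test before wiring the packed words into MMIO writes.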

4. Experimental Results and Conclusions

4.1. Performance

On-board measurements were carried out on a PYNQ-Z2 running the programmable logic at 50 MHz. For a single 32 × 32 image, the end-to-end latency observed from the Python harness (including MMIO register programming, input packing, AXI transfers, and result readback) shows that wall-clock time is dominated by software and transfer overheads on the PS side rather than by the accelerator datapath itself. Isolating pure compute by excluding all data-loading and host-control overheads yields a per-image kernel latency of 2.93 ms, which corresponds to a compute-limited throughput of 341.4 frames per second. In practice, using burst transfers/DMA, preloading weights, chaining layers via output reuse, and batching images substantially reduces the host overhead and brings the measured throughput closer to the kernel bound.
Accuracy on CIFAR-10 (10,000 test images) reaches 87.6% with the proposed binary network and accelerator, matching the PyTorch reference (a binarized ResNet-14 model [24]) within expected numerical tolerances. Because the hardware and software differ only in binary encoding ({+1, −1} versus {1, 0}), no precision loss is introduced by the implementation. The post-implementation power report indicates a total on-chip power of 1.52 W.
As presented in Table 1, a comparative analysis reveals that our proposed accelerator achieves a superior balance across key hardware metrics when benchmarked against contemporary solutions. To ensure a fair comparison independent of clock frequency, we introduced normalized throughput (FPS/MHz) and energy efficiency (GOPS/W). Although our prototype operates at a conservative 50 MHz, its architectural efficiency is highlighted by an energy efficiency of approximately 66.9 GOPS/W, demonstrating an excellent trade-off between computational performance and power consumption while occupying a significantly smaller hardware footprint (25,186 LUTs). To provide further context, we clarify that our design natively supports essential features for modern deep networks, including residual convolutions and fused BatchNorm stages. This architectural completeness, which is not explicitly supported by all compared BNN accelerators, enables the direct deployment of complex topologies like ResNet and positions our work as a highly efficient and functionally complete solution for edge platforms.
Throughput was also characterized analytically for the target compute-in-memory array when scaled to a 250 MHz operating point and an SRAM array size of 64 × 256 × 64. In this configuration, the compute fabric instantiates 64 parallel ComputeCores, each consuming a 64-bit input stripe per cycle. Using the conventional BNN accounting in which one bitwise XNOR counts as one operation, the peak throughput is 1024 GOPS. Under realistic workloads (tiling, edge handling, residual addition, and occasional stalls in post-processing), the measured sustained rate averages 552.17 GOPS, corresponding to an array utilization of approximately 54%.
Extrapolating the 2.93 ms kernel latency at 50 MHz to 250 MHz yields roughly 0.586 ms per 32 × 32 image (≈1706 FPS) when data are fully resident on-chip and layer chaining is enabled. The gap between this bound and the end-to-end measurement at 50 MHz is attributable to software-driven MMIO transactions and non-streaming I/O. Adopting DMA based bursts, minimizing per image reconfiguration, and amortizing weight loads across multiple images closes this gap without modifying the accelerator microarchitecture.
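The throughput and latency accounting above can be verified with a few lines of arithmetic (constants are taken from the text; this is bookkeeping, not a new measurement).

```python
# Worked check of the quoted numbers: peak GOPS at 250 MHz, array
# utilization, and the latency/FPS extrapolation from the 50 MHz kernel.

CORES = 64           # parallel ComputeCore instances
BITS_PER_CYCLE = 64  # one 64-bit input stripe per core per cycle
FREQ_HZ = 250e6      # scaled operating point

# One bitwise XNOR counted as one operation (conventional BNN accounting).
peak_gops = CORES * BITS_PER_CYCLE * FREQ_HZ / 1e9   # -> 1024 GOPS
utilization = 552.17 / peak_gops                     # sustained / peak, ~54%

# Extrapolate the 2.93 ms kernel latency measured at 50 MHz to 250 MHz,
# assuming latency scales inversely with clock frequency.
latency_250_s = 2.93e-3 * (50 / 250)                 # ~0.586 ms per image
fps_250 = 1.0 / latency_250_s                        # ~1706 FPS
```

The frequency-scaling step assumes the datapath remains compute-bound, which matches the fully on-chip, layer-chained condition stated above.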
In summary, the prototype demonstrates that the proposed purely binary CIM accelerator achieves competitive accuracy on CIFAR-10 while delivering high compute efficiency. Kernel-only performance reaches hundreds of GOPS at sub-watt power, with energy efficiency of up to 727 GOPS/W projected at 250 MHz. System-level throughput on the FPGA is presently limited by host-side overheads rather than by the accelerator itself, suggesting clear paths to further speedups through standard PS/PL integration optimizations.

4.2. Conclusions

This work presents a compute-in-memory binary convolution accelerator that addresses the memory-wall and power-wall limitations inherent to the von Neumann paradigm. By restricting all stored and transported operands to 1-bit representations throughout the pipeline, the design effectively expands usable on-chip buffering capacity and alleviates bandwidth pressure. We further fuse BatchNorm, HardTanh, and binarization into a single affine-and-threshold unit and map MAC operations to bitwise logic. This reduces circuit area and dynamic power and improves per-operation energy efficiency. For residual connections, dataflow control enables in-accumulator summation of the two convolution branches, avoiding additional adders and scratchpads and shortening the critical path.
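As an illustration, the XNOR–popcount mapping and the fused affine-and-threshold step can be sketched in a few lines of Python. This is a bit-accurate sketch under the common ±1 encoding (bit 1 ↦ +1, bit 0 ↦ −1); the function names are ours, and the hardware operates on bit-packed SRAM stripes rather than Python integers:

```python
def xnor_popcount_dot(w: int, x: int, n_bits: int) -> int:
    """Binary dot product of two n_bits-wide bit-packed +/-1 vectors."""
    matches = bin(~(w ^ x) & ((1 << n_bits) - 1)).count("1")  # popcount of XNOR
    return 2 * matches - n_bits  # matches - mismatches = +/-1 dot product

def fused_threshold(acc: int, gamma: float, beta: float,
                    mu: float, sigma: float) -> int:
    """BatchNorm + HardTanh + binarization collapsed into one compare.

    sign(gamma*(acc - mu)/sigma + beta) changes at acc == mu - beta*sigma/gamma,
    and HardTanh never alters the sign, so the whole post-processing chain
    reduces to a per-channel threshold compare (direction set by sign(gamma)).
    """
    threshold = mu - beta * sigma / gamma
    return 1 if (acc >= threshold) == (gamma > 0) else 0

# Toy example on 8-bit stripes: 6 matching bit positions, 2 mismatches -> dot = 4.
acc = xnor_popcount_dot(0b10110010, 0b10100110, 8)
bit = fused_threshold(acc, gamma=0.8, beta=0.1, mu=1.0, sigma=2.0)
```

The key point the sketch captures is that the per-channel threshold `mu - beta*sigma/gamma` can be precomputed offline, so at inference time the entire BatchNorm–HardTanh–binarize chain costs a single comparison against the popcount accumulator.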
Taken together, these techniques yield a compact, energy-efficient accelerator well matched to the stringent resource and power constraints of edge deployments, even as model sizes grow. The proposed approach provides a practical path for deploying compute-in-memory binary convolution in real-world settings, including power-grid environments, thereby supporting low-power, on-device intelligence and advancing post-Moore-era architectural exploration.

Author Contributions

W.C.: Conceptualization, Methodology, Software, Writing—Original Draft; Z.Z.: Formal analysis, Visualization; M.L.: Validation, Investigation, Data Curation; P.L.: Supervision, Project administration, Funding acquisition, Writing—Review and Editing; Y.L.: Resources, Software; Y.C.: Investigation, Writing—Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project “Development of Edge-Side Intelligent Data Computing Chips”, grant number 546856230023. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Data Availability Statement

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

All authors are employees of Beijing Smart-Chip Microelectronics Technology Company, which funded this study. The funder was not involved in the study design; the collection, analysis, or interpretation of data; the writing of this article; or the decision to submit it for publication. The authors declare no other commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chang, Z.; Liu, S.; Xiong, X.; Cai, Z.; Tu, G. A survey of recent advances in edge-computing-powered artificial intelligence of things. IEEE Internet Things J. 2021, 8, 13849–13875. [Google Scholar] [CrossRef]
  2. Garg, T.; Khullar, S. Big data analytics: Applications, challenges & future directions. In Proceedings of the 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 4–5 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 923–928. [Google Scholar]
  3. Ali, M.; Roy, S.; Saxena, U.; Sharma, T.; Raghunathan, A.; Roy, K. Compute-in-memory technologies and architectures for deep learning workloads. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 1615–1630. [Google Scholar] [CrossRef]
  4. Liu, S.; Radway, R.M.; Wang, X.; Kwon, J.; Trippel, C.; Levis, P.; Mitra, S.; Wong, H.-S.P. Future of Memory: Massive, Diverse, Tightly Integrated with Compute-from Device to Software. In Proceedings of the 2024 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 7–12 November 2024; IEEE: New York, NY, USA, 2024; pp. 1–4. [Google Scholar]
  5. Zou, X.; Xu, S.; Chen, X.; Yan, L.; Han, Y. Breaking the von Neumann bottleneck: Architecture-level processing-in-memory technology. Sci. China Inf. Sci. 2021, 64, 160404. [Google Scholar] [CrossRef]
  6. Sun, W.; Yue, J.; He, Y.; Huang, Z.; Wang, J.; Jia, W.; Li, Y.; Lei, L.; Jia, H.; Liu, Y. A survey of computing-in-memory processor: From circuit to application. IEEE Open J. Solid-State Circuits Soc. 2023, 4, 25–42. [Google Scholar] [CrossRef]
  7. Jhang, C.J.; Xue, C.X.; Hung, J.M.; Chang, F.C.; Chang, M.F. Challenges and trends of SRAM-based computing-in-memory for AI edge devices. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 1773–1786. [Google Scholar] [CrossRef]
  8. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403. [Google Scholar] [CrossRef]
  9. Qin, H.; Gong, R.; Liu, X.; Bai, X.; Song, J.; Sebe, N. Binary neural networks: A survey. Pattern Recognit. 2020, 105, 107281. [Google Scholar] [CrossRef]
  10. Zou, Q.; Wang, Y.; Wang, Q.; Zhao, Y.; Li, Q. Deep learning-based gait recognition using smartphones in the wild. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3197–3212. [Google Scholar] [CrossRef]
  11. Tanigawa, T.; Noda, M.; Ishiura, N. Efficient FPGA implementation of binarized neural networks based on generalized parallel counter tree. In Proceedings of the Workshop on Synthesis and System Integration of Mixed Information Technologies (SASIMI), Taipei, China, 11–12 March 2024; pp. 32–37. [Google Scholar]
  12. Vatsavai, S.S.; Karempudi, V.S.P.; Thakkar, I. An optical xnor-bitcount based accelerator for efficient inference of binary neural networks. In Proceedings of the 2023 24th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 5–7 April 2023; IEEE: New York, NY, USA, 2023; pp. 1–8. [Google Scholar]
  13. Sun, J.; Houshmand, P.; Verhelst, M. Analog or digital in-memory computing? benchmarking through quantitative modeling. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 29 October–2 November 2023; IEEE: New York, NY, USA, 2023; pp. 1–9. [Google Scholar]
  14. Perri, S.; Zambelli, C.; Ielmini, D.; Silvano, C. Digital In-Memory Computing to Accelerate Deep Learning Inference on the Edge. In Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 27–31 May 2024; IEEE: New York, NY, USA, 2024; pp. 130–133. [Google Scholar]
  15. Houshmand, P.; Cosemans, S.; Mei, L.; Papistas, I.; Bhattacharjee, D.; Debacker, P.; Mallik, A.; Verkest, D.; Verhelst, M. Opportunities and limitations of emerging analog in-memory compute DNN architectures. In Proceedings of the 2020 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 12–18 December 2020; IEEE: New York, NY, USA, 2020; pp. 29.1.1–29.1.4. [Google Scholar]
  16. Channamadhavuni, S.; Thijssen, S.; Jha, S.K.; Ewetz, R. Accelerating AI applications using analog in-memory computing: Challenges and opportunities. In Proceedings of the 2021 Great Lakes Symposium on VLSI, Virtual, 22–25 June 2021; pp. 379–384. [Google Scholar]
  17. Qin, R.; Yan, Z.; Zeng, D.; Jia, Z.; Liu, D.; Liu, J.; Abbasi, A.; Zheng, Z.; Cao, N.; Ni, K.; et al. Robust implementation of retrieval-augmented generation on edge-based computing-in-memory architectures. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, New York, NY, USA, 27–31 October 2024; pp. 1–9. [Google Scholar]
  18. Lin, Z.; Tong, Z.; Zhang, J.; Wang, F.; Xu, T.; Zhao, Y.; Wu, X.; Peng, C.; Lu, W.; Zhao, Q.; et al. A review on SRAM-based computing in-memory: Circuits, functions, and applications. J. Semicond. 2022, 43, 031401. [Google Scholar] [CrossRef]
  19. Sun, X.; Liu, R.; Peng, X.; Yu, S. Computing-in-memory with SRAM and RRAM for binary neural networks. In Proceedings of the 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Qingdao, China, 31 October–3 November 2018; IEEE: New York, NY, USA, 2018; pp. 1–4. [Google Scholar]
  20. Mourya, M.V.; Bansal, H.; Verma, D.; Suri, M. RRAM IMC based Efficient Analog Carry Propagation and Multi-bit MVM. In Proceedings of the 2024 8th IEEE Electron Devices Technology & Manufacturing Conference (EDTM), Bangalore, India, 3–6 March 2024; IEEE: New York, NY, USA, 2024; pp. 1–3. [Google Scholar]
  21. Antolini, A.; Lico, A.; Zavalloni, F.; Scarselli, E.F.; Gnudi, A.; Torres, M.L.; Canegallo, R.; Pasotti, M. A readout scheme for PCM-based analog in-memory computing with drift compensation through reference conductance tracking. IEEE Open J. Solid-State Circuits Soc. 2024, 4, 69–82. [Google Scholar] [CrossRef]
  22. Huang, L.; Qin, J.; Zhou, Y.; Zhu, F.; Liu, L.; Shao, L. Normalization techniques in training dnns: Methodology, analysis and application. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10173–10196. [Google Scholar] [CrossRef] [PubMed]
  23. Jooq, M.K.Q.; Behbahani, F.; Moaiyeri, M.H. Ultra-efficient fully programmable membership function generator based on independent double-gate FinFET technology. Int. J. Circuit Theory Appl. 2023, 51, 4485–4502. [Google Scholar] [CrossRef]
  24. Qin, H.; Gong, R.; Liu, X.; Shen, M.; Wei, Z.; Yu, F.; Song, J. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2250–2259. [Google Scholar]
  25. Liu, Y.; Zhang, H.; Sun, Z.; Duan, F.; Ma, Y.; Lu, W.; Caiafa, C.F.; Solé-Casals, J. HSBNN: A High-Scalable Bayesian Neural Networks Accelerator Based on Field Programmable Gate Arrays (FPGA). Cogn. Comput. 2025, 17, 100. [Google Scholar] [CrossRef]
  26. Cen, R.; Zhang, D.; Kang, Y.; Wang, D. HA-BNN: Hardware-Aware Binary Neural Networks for Efficient Inference. In Proceedings of the 2024 IEEE 17th International Conference on Signal Processing (ICSP), Suzhou, China, 28–31 October 2024; IEEE: New York, NY, USA, 2024; pp. 176–180. [Google Scholar]
  27. Li, Z.; Bilavarn, S. Partial Reconfiguration for Energy-Efficient Inference on FPGA: A Case Study with ResNet-18. In Proceedings of the 2024 27th Euromicro Conference on Digital System Design (DSD), Paris, France, 28–30 August 2024; IEEE: New York, NY, USA, 2024; pp. 291–297. [Google Scholar]
  28. Yang, A. Research of FPGA-based neural network accelerators. In Proceedings of the IET Conference Proceedings CP895, Stevenage, UK, 11 October 2024; The Institution of Engineering and Technology: Stevenage, UK, 2024; Volume 2024, pp. 117–122. [Google Scholar]
Figure 1. Comparison of von Neumann and CIM architectures.
Figure 2. Our binarized convolution CIM accelerator architecture.
Figure 3. Computational flow of residual convolution.
Figure 4. Flow diagram of a convolution computation.
Figure 5. Instruction path diagram.
Figure 6. Block design of the binarized convolution accelerator.
Table 1. Comparison with Prior Works.

| Paper           | [25]       | [26]        | [27]      | [28]           | This Paper |
|-----------------|------------|-------------|-----------|----------------|------------|
| Model           | BNN        | BNN         | ResNet-18 | Zynq Net       | ResNet-14  |
| FPGA            | ZYBO Z7-20 | PYNQ-Z1     | PYNQ-Z2   | Virtex-7 VC709 | PYNQ-Z2    |
| LUTs            | 49,387     | 49,600      | 74,482    | 338,922        | 25,186     |
| BRAM (Mb)       | 2.46       | 5.04        | 4.92      | 49.66          | 8          |
| Power (W)       | 1.8        | 2.15        | 2.85      | 11.5           | 1.65       |
| Frequency (MHz) | 100        | 100         | 100       | 125            | 50         |
| FPS/MHz         | 118.34     | 2.0         | 0.04      | 0.125          | 6.83       |
| GOPS/W          | 16         | 670         | 7         | N/A            | 66.9       |
| Bit width       | 32         | 1           | 8         | 16             | 1          |
| Residual/BN     | No/No      | Partial/Yes | Yes/Yes   | N/A            | Yes/Yes    |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cui, W.; Zheng, Z.; Li, P.; Li, M.; Liu, Y.; Chi, Y. A Binary Convolution Accelerator Based on Compute-in-Memory. Electronics 2026, 15, 117. https://doi.org/10.3390/electronics15010117

