Abstract
As AI workloads move to edge devices, the von Neumann architecture is hindered by memory- and power-wall limitations. We present an SRAM-based compute-in-memory binary convolution accelerator that stores and transports only 1-bit weights and activations, maps MACs to bitwise XNOR–popcount, and fuses BatchNorm, HardTanh, and binarization into a single affine-and-threshold unit. Residual paths are handled by in-accumulator summation to minimize data movement. FPGA validation shows 87.6% CIFAR-10 accuracy, consistent with a bit-accurate software reference, and a compute-only latency of 2.93 ms per 32 × 32 image at 50 MHz, at a sustained power of only 1.52 W. These results demonstrate an efficient and practical path to deploying edge models under tight power and memory budgets.
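
The two arithmetic ideas named above, the XNOR–popcount MAC and the fused affine-and-threshold step, can be illustrated with a minimal software sketch. It is not the accelerator's implementation: the function names (`pack`, `xnor_popcount_dot`, `fused_bn_binarize`), the bit encoding (bit 1 for +1, bit 0 for -1), and the convention that an input of exactly zero binarizes to +1 are all assumptions made for illustration.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer (assumed encoding: +1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two length-n vectors over {-1, +1}, computed with XNOR and popcount.

    Each position where the bits agree contributes +1 and each disagreement
    contributes -1, so dot = agreements - disagreements = 2 * popcount(xnor) - n.
    """
    mask = (1 << n) - 1                # keep only the n valid bit positions
    agree = ~(a_bits ^ w_bits) & mask  # XNOR restricted to n bits
    return 2 * bin(agree).count("1") - n


def fused_bn_binarize(acc, gamma, beta, mean, std):
    """Fuse BatchNorm, HardTanh, and binarization into one affine-and-threshold step.

    HardTanh clamps to [-1, 1] but never changes the sign of its input, so
    sign(hardtanh(z)) == sign(z) and only the sign of the affine result matters.
    Assumed convention: z == 0 maps to bit 1 (+1).
    """
    z = gamma * (acc - mean) / std + beta
    return 1 if z >= 0 else 0


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
acc = xnor_popcount_dot(pack(a), pack(w), len(a))
out_bit = fused_bn_binarize(acc, gamma=1.0, beta=0.5, mean=0.0, std=1.0)
```

In hardware the affine comparison would typically be folded into a single precomputed per-channel threshold on the integer accumulator, with the comparison direction flipped when `gamma` is negative; the sketch keeps the floating-point form for readability.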