MDPI - Publisher of Open Access Journals

23 pages, 2239 KB

Open AccessArticle

SparseDroop: Hardware–Software Co-Design for Mitigating Voltage Droop in DNN Accelerators

by Arnab Raha, Shamik Kundu, Arghadip Das, Soumendu Kumar Ghosh and Deepak A. Mathaikutty

J. Low Power Electron. Appl. 2026, 16(1), 2; https://doi.org/10.3390/jlpea16010002 - 23 Dec 2025

Viewed by 363

Modern deep neural network (DNN) accelerators must sustain high throughput while avoiding performance degradation from supply voltage (VDD) droop, which occurs when large arrays of multiply–accumulate (MAC) units switch concurrently and induce high peak current (

I C C_{m a x}

) [...] Read more.

Modern deep neural network (DNN) accelerators must sustain high throughput while avoiding performance degradation from supply voltage (VDD) droop, which occurs when large arrays of multiply–accumulate (MAC) units switch concurrently and induce high peak current (

I C C_{m a x}

) transients on the power delivery network (PDN). In this work, we focus on ASIC-class DNN accelerators with tightly synchronized MAC arrays rather than FPGA-based implementations, where such cycle-aligned switching is most pronounced. Conventional guardbanding and reactive countermeasures (e.g., throttling, clock stretching, or emergency DVFS) either waste energy or incur non-trivial throughput penalties. We propose SparseDroop, a unified hardware-conscious framework that proactively shapes instantaneous current demand to mitigate droop without reducing sustained computing rate. SparseDroop comprises two complementary techniques. (1) SparseStagger, a lightweight hardware-friendly droop scheduler that exploits the inherent unstructured sparsity already present in the weights and activations—it does not introduce any additional sparsification. SparseStagger dynamically inspects the zero patterns mapped to each processing element (PE) column and staggers MAC start times within a column so that high-activity bursts are temporally interleaved. This fine-grain reordering smooths

I C C

trajectories, lowers the probability and depth of transient VDD dips, and preserves cycle-level alignment at tile/row boundaries—thereby maintaining no throughput loss and negligible control overhead. (2) SparseBlock, an architecture-aware, block-wise-structured sparsity induction method that intentionally introduces additional sparsity aligned with the accelerator’s dataflow. By co-designing block layout with the dataflow, SparseBlock reduces the likelihood that all PEs in a column become simultaneously active, directly constraining

I C C_{m a x}

and peak dynamic power on the PDN. Together, SparseStagger’s opportunistic staggering (from existing unstructured weight zeros) and SparseBlock’s structured, layout-aware sparsity induction (added to prevent peak-power excursions) deliver a scalable, low-overhead solution that improves voltage stability, energy efficiency, and robustness, integrates cleanly with the accelerator dataflow, and preserves model accuracy with modest retraining or fine-tuning. Full article

► Show Figures

Figure 1

21 pages, 1565 KB

Open AccessArticle

A KWS System for Edge-Computing Applications with Analog-Based Feature Extraction and Learned Step Size Quantized Classifier

by Yukai Shen, Binyi Wu, Dietmar Straeussnigg and Eric Gutierrez

Sensors 2025, 25(8), 2550; https://doi.org/10.3390/s25082550 - 17 Apr 2025

Viewed by 1697

Abstract

Edge-computing applications demand ultra-low-power architectures for both feature extraction and classification tasks. In this manuscript, a Keyword Spotting (KWS) system tailored for energy-constrained portable environments is proposed. A 16-channel analog filter bank is employed for audio feature extraction, followed by a digital Gated [...] Read more.

Edge-computing applications demand ultra-low-power architectures for both feature extraction and classification tasks. In this manuscript, a Keyword Spotting (KWS) system tailored for energy-constrained portable environments is proposed. A 16-channel analog filter bank is employed for audio feature extraction, followed by a digital Gated Recurrent Unit (GRU) classifier. The filter bank is behaviorally modeled, making use of second-order band-pass transfer functions, simulating the analog front-end (AFE) processing. To enable efficient deployment, the GRU classifier is trained using a Learned Step Size (LSQ) and Look-Up Table (LUT)-aware quantization method. The resulting quantized model, with 4-bit weights and 8-bit activation functions (W4A8), achieves 91.35% accuracy across 12 classes, including 10 keywords from the Google Speech Command Dataset v2 (GSCDv2), with less than 1% degradation compared to its full-precision counterpart. The model is estimated to require only 34.8 kB of memory and 62,400 multiply–accumulate (MAC) operations per inference in real-time settings. Furthermore, the robustness of the AFE against noise and analog impairments is evaluated by injecting Gaussian noise and perturbing the filter parameters (center frequency and quality factor) in the test data, respectively. The obtained results confirm a strong classification performance even under degraded circuit-level conditions, supporting the suitability of the proposed system for ultra-low-power, noise-resilient edge applications. Full article

(This article belongs to the Section Intelligent Sensors)

► Show Figures

Figure 1

10 pages, 638 KB

Open AccessArticle

Efficient Quantization and Data Access for Accelerating Homomorphic Encrypted CNNs

by Kai Chen, Xinyu Wang, Yuxiang Fu and Li Li

Electronics 2025, 14(3), 464; https://doi.org/10.3390/electronics14030464 - 23 Jan 2025

Cited by 2 | Viewed by 1265

Abstract

Due to the ability to perform computations directly on encrypted data, homomorphic encryption (HE) has recently become an important branch of privacy-preserving machine learning (PPML) implementation. Nevertheless, existing implementations of HE-based convolutional neural network (HCNN) applications are not satisfactory in inference latency and [...] Read more.

Due to the ability to perform computations directly on encrypted data, homomorphic encryption (HE) has recently become an important branch of privacy-preserving machine learning (PPML) implementation. Nevertheless, existing implementations of HE-based convolutional neural network (HCNN) applications are not satisfactory in inference latency and area efficiency compared to the unencrypted version. In this work, we first improve the additive powers-of-two (APoT) quantization method for HCNN to achieve a better tradeoff between the complexity of modular multiplication and the network accuracy. An efficient multiplicationless modular multiplier–accumulator (M-MAC) unit is accordingly designed. Furthermore, a batch-processing HCNN accelerator with M-MACs is implemented, in which we propose an advanced data partition scheme to avoid multiple moves of the large-size ciphertext polynomials. Compared to the latest FPGA design, our accelerator can achieve

11 \times

resource reduction of an M-MAC and

2.36 \times

speedup in inference latency for a widely used CNN-11 network to process 8K images. The speedup of our design is also significant compared to the latest CPU and GPU implementations of the batch-processing HCNN models. Full article

► Show Figures

Figure 1

15 pages, 2671 KB

Open AccessArticle

Reconfigurable Frequency Response Masking Multi-MAC Filters for Software Defined Radio Channelization

by Subahar Arivalagan, Britto Pari James and Man-Fai Leung

Electronics 2024, 13(21), 4211; https://doi.org/10.3390/electronics13214211 - 27 Oct 2024

Cited by 3 | Viewed by 1385

Abstract

Mobile technology is currently trending toward supporting multiple communication standards on a single device. This means that some reconfigurable techniques must be the foundation of their design. The two essential requirements of channel filters are minimized complexity and reconfigurability. In this research, a [...] Read more.

Mobile technology is currently trending toward supporting multiple communication standards on a single device. This means that some reconfigurable techniques must be the foundation of their design. The two essential requirements of channel filters are minimized complexity and reconfigurability. In this research, a novel extension of Frequency Response Masking (FRM) was investigated by employing Time Division Multiplexing (TDM)-based single Multiply and Accumulate (MAC) architecture using the principle of resource sharing to realize multiple sharp filter responses from a single prototype constant group delay low pass filter. This paper uses a single multiply and add units regardless of the quantity of channels and taps. The suggested reconfigurable filter was synthesized on technology based on 0.18-µm CMOS and put into practice. Further trials were carried out on Virtex-II 2v3000ff1152-4 FPGA device. The outcomes revealed that the suggested channel filter, which was synthesized using FPGA, provides 21.36% of the area curtail and 14.88% of power scaling down on average and put into practice using ASIC provides 5.18% of the area reduction and 9.08% of power scaling down on average. Full article

► Show Figures

Figure 1

23 pages, 2718 KB

Open AccessArticle

Voltage Scaled Low Power DNN Accelerator Design on Reconfigurable Platform

by Rourab Paul, Sreetama Sarkar, Suman Sau, Sanghamitra Roy, Koushik Chakraborty and Amlan Chakrabarti

Electronics 2024, 13(8), 1431; https://doi.org/10.3390/electronics13081431 - 10 Apr 2024

Cited by 1 | Viewed by 2668

Abstract

The exponential emergence of Field-Programmable Gate Arrays (FPGAs) has accelerated research on hardware implementation of Deep Neural Networks (DNNs). Among all DNN processors, domain-specific architectures such as Google’s Tensor Processor Unit (TPU) have outperformed conventional GPUs (Graphics Processing Units) and CPUs (Central Processing [...] Read more.

The exponential emergence of Field-Programmable Gate Arrays (FPGAs) has accelerated research on hardware implementation of Deep Neural Networks (DNNs). Among all DNN processors, domain-specific architectures such as Google’s Tensor Processor Unit (TPU) have outperformed conventional GPUs (Graphics Processing Units) and CPUs (Central Processing Units). However, implementing low-power TPUs in reconfigurable hardware remains a challenge in this field. Voltage scaling, a popular approach for energy savings, can be challenging in FPGAs, as it may lead to timing failures if not implemented appropriately. This work presents an ultra-low-power FPGA implementation of a TPU for edge applications. We divide the systolic array of a TPU into different FPGA partitions based on the minimum slack value of different design paths of Multiplier Accumulators (MACs). Each partition uses different near-threshold (NTC) biasing voltages to run its FPGA cores. The biasing voltage for each partition is roughly calculated by the proposed static schemes. However, further calibration of biasing voltage is performed by the proposed runtime scheme. To overcome the timing failure caused by NTC, the MACs with higher minimum slack are placed in lower-voltage partitions, while the MACs with lower minimum slack paths are placed in higher-voltage partitions. The proposed architecture is implemented in a commercial platform, namely

V i v a d o

with Xilinx

A r t i x - 7

FPGA and academic platform

V T R

with 22 nm, 45 nm and 130 nm FPGAs. Any timing error caused by NTC can be caught by the Razor flipflop used in each MAC. The proposed voltage-scaled, partitioned systolic array can save 3.1% to 11.6% of dynamic power in

V i v a d o

and

V T R

tools, respectively, depending on the FPGA technology, partition size, number of partitions and biasing voltages. The normalized performance and accuracy of benchmark models running on our low-power TPU are very competitive compared to existing literature. Full article

(This article belongs to the Special Issue Embedded Systems for Neural Network Applications)

► Show Figures

Figure 1

23 pages, 4213 KB

Open AccessArticle

Leveraging Bit-Serial Architectures for Hardware-Oriented Deep Learning Accelerators with Column-Buffering Dataflow

by Xiaoshu Cheng, Yiwen Wang, Weiran Ding, Hongfei Lou and Ping Li

Electronics 2024, 13(7), 1217; https://doi.org/10.3390/electronics13071217 - 26 Mar 2024

Cited by 4 | Viewed by 3590

Abstract

Bit-serial neural network accelerators address the growing need for compact and energy-efficient deep learning tools. Traditional neural network accelerators, while effective, often grapple with issues of size, power consumption, and versatility in handling a variety of computational tasks. To counter these challenges, this [...] Read more.

Bit-serial neural network accelerators address the growing need for compact and energy-efficient deep learning tools. Traditional neural network accelerators, while effective, often grapple with issues of size, power consumption, and versatility in handling a variety of computational tasks. To counter these challenges, this paper introduces an approach that hinges on the integration of bit-serial processing with advanced dataflow techniques and architectural optimizations. Central to this approach is a column-buffering (CB) dataflow, which significantly reduces access and movement requirements for the input feature map (IFM), thereby enhancing efficiency. Moreover, a simplified quantization process effectively eliminates biases, streamlining the overall computation process. Furthermore, this paper presents a meticulously designed LeNet-5 accelerator leveraging a convolutional layer processing element array (CL PEA) architecture incorporating an improved bit-serial multiply–accumulate unit (MAC). Empirically, our work demonstrates superior performance in terms of frequency, chip area, and power consumption compared to current state-of-the-art ASIC designs. Specifically, our design utilizes fewer hardware resources to implement a complete accelerator, achieving a high performance of 7.87 GOPS on a Xilinx Kintex-7 FPGA with a brief processing time of 284.13 μs. The results affirm that our design is exceptionally suited for applications requiring compact, low-power, and real-time solutions. Full article

(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))

► Show Figures

Figure 1

15 pages, 2950 KB

Open AccessArticle

Memristor–CMOS Hybrid Circuits Implementing Event-Driven Neural Networks for Dynamic Vision Sensor Camera

by Rina Yoon, Seokjin Oh, Seungmyeong Cho and Kyeong-Sik Min

Micromachines 2024, 15(4), 426; https://doi.org/10.3390/mi15040426 - 22 Mar 2024

Cited by 5 | Viewed by 3414

Abstract

For processing streaming events from a Dynamic Vision Sensor camera, two types of neural networks can be considered. One are spiking neural networks, where simple spike-based computation is suitable for low-power consumption, but the discontinuity in spikes can make the training complicated in [...] Read more.

For processing streaming events from a Dynamic Vision Sensor camera, two types of neural networks can be considered. One are spiking neural networks, where simple spike-based computation is suitable for low-power consumption, but the discontinuity in spikes can make the training complicated in terms of hardware. The other one are digital Complementary Metal Oxide Semiconductor (CMOS)-based neural networks that can be trained directly using the normal backpropagation algorithm. However, the hardware and energy overhead can be significantly large, because all streaming events must be accumulated and converted into histogram data, which requires a large amount of memory such as SRAM. In this paper, to combine the spike-based operation with the normal backpropagation algorithm, memristor–CMOS hybrid circuits are proposed for implementing event-driven neural networks in hardware. The proposed hybrid circuits are composed of input neurons, synaptic crossbars, hidden/output neurons, and a neural network’s controller. Firstly, the input neurons perform preprocessing for the DVS camera’s events. The events are converted to histogram data using very simple memristor-based latches in the input neurons. After preprocessing the events, the converted histogram data are delivered to an ANN implemented using synaptic memristor crossbars. The memristor crossbars can perform low-power Multiply–Accumulate (MAC) calculations according to the memristor’s current–voltage relationship. The hidden and output neurons can convert the crossbar’s column currents to the output voltages according to the Rectified Linear Unit (ReLU) activation function. The neural network’s controller adjusts the MAC calculation frequency according to the workload of the event computation. Moreover, the controller can disable the MAC calculation clock automatically to minimize unnecessary power consumption. The proposed hybrid circuits have been verified by circuit simulation for several event-based datasets such as POKER-DVS and MNIST-DVS. The circuit simulation results indicate that the neural network’s performance proposed in this paper is degraded by as low as 0.5% while saving as much as 79% in power consumption for POKER-DVS. The recognition rate of the proposed scheme is lower by 0.75% compared to the conventional one, for the MNIST-DVS dataset. In spite of this little loss, the power consumption can be reduced by as much as 75% for the proposed scheme. Full article

(This article belongs to the Section D1: Semiconductor Devices)

► Show Figures

Figure 1

17 pages, 6522 KB

Open AccessEditor’s ChoiceArticle

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering

by Yang Liu, Yiheng Zhang, Xiaoran Hao, Lan Chen, Mao Ni, Ming Chen and Rong Chen

Electronics 2024, 13(5), 975; https://doi.org/10.3390/electronics13050975 - 4 Mar 2024

Cited by 3 | Viewed by 9318

Abstract

Convolutional neural networks have been widely applied in the field of computer vision. In convolutional neural networks, convolution operations account for more than 90% of the total computational workload. The current mainstream approach to achieving high energy-efficient convolution operations is through dedicated hardware [...] Read more.

Convolutional neural networks have been widely applied in the field of computer vision. In convolutional neural networks, convolution operations account for more than 90% of the total computational workload. The current mainstream approach to achieving high energy-efficient convolution operations is through dedicated hardware accelerators. Convolution operations involve a significant amount of weights and input feature data. Due to limited on-chip cache space in accelerators, there is a significant amount of off-chip DRAM memory access involved in the computation process. The latency of DRAM access is 20 times higher than that of SRAM, and the energy consumption of DRAM access is 100 times higher than that of multiply–accumulate (MAC) units. It is evident that the “memory wall” and “power wall” issues in neural network computation remain challenging. This paper presents the design of a hardware accelerator for convolutional neural networks. It employs a dataflow optimization strategy based on on-chip data reordering. This strategy improves on-chip data utilization and reduces the frequency of data exchanges between on-chip cache and off-chip DRAM. The experimental results indicate that compared to the accelerator without this strategy, it can reduce data exchange frequency by up to 82.9%. Full article

(This article belongs to the Special Issue Artificial Intelligence and Signal Processing: Circuits and Systems)

► Show Figures

Figure 1

18 pages, 10254 KB

Open AccessArticle

Design and Performance Analysis of a [8/8/8] Charge Domain Mixed-Signal Multiply-Accumulator

by Akira Matsuzawa, Abdel Martinez Alonso and Masaya Miyahara

Electronics 2024, 13(1), 50; https://doi.org/10.3390/electronics13010050 - 21 Dec 2023

Viewed by 2205

Abstract

This article describes the design and performance analysis of a charge domain mixed-signal multiply-accumulator (MAC) using RDAC, CDAC, and SAR-ADC with an 8-bit resolution for input, weight, and output. The arithmetic accuracy is mainly determined by the ADC, and the gain error has [...] Read more.

This article describes the design and performance analysis of a charge domain mixed-signal multiply-accumulator (MAC) using RDAC, CDAC, and SAR-ADC with an 8-bit resolution for input, weight, and output. The arithmetic accuracy is mainly determined by the ADC, and the gain error has a significant impact. The mismatches and thermal noises of the RDAC and the CDAC are averaged by the number of multiply-accumulate units m connected to one ADC. As a result, if m is large enough, mismatches and thermal noises have a limited impact on the computation accuracy. Most of the computational energy is determined by the energy consumed by the SAR-ADC, and the computational energy per operation can be reduced by increasing m. This last metric is mainly determined by the charge and discharge energy of the CDAC for sufficiently large m values. Furthermore, since RDAC consumes energy unnecessarily, the turn-off timing of RDAC should be optimized. These MAC units have been designed and prototyped using 28 nm CMOS technology, integrating 12,288 arithmetic units while operating at 180 MHz, resulting in an arithmetic speed of 4.4 TOPS. The r-MVM accuracy is about 1% and a high energy efficiency of 240 TOPS/W as a MAC macro and 64.4 TOPS/W as a system has been achieved. Full article

(This article belongs to the Special Issue Ultra-Low-Voltage and Ultra-Low-Power Integrated Circuits and Systems Evolution)

► Show Figures

Figure 1

17 pages, 5099 KB

Open AccessArticle

Analog Convolutional Operator Circuit for Low-Power Mixed-Signal CNN Processing Chip

by Malik Summair Asghar, Saad Arslan and HyungWon Kim

Sensors 2023, 23(23), 9612; https://doi.org/10.3390/s23239612 - 4 Dec 2023

Cited by 5 | Viewed by 4353

Abstract

In this paper, we propose a compact and low-power mixed-signal approach to implementing convolutional operators that are often responsible for most of the chip area and power consumption of Convolutional Neural Network (CNN) processing chips. The convolutional operators consist of several multiply-and-accumulate (MAC) [...] Read more.

In this paper, we propose a compact and low-power mixed-signal approach to implementing convolutional operators that are often responsible for most of the chip area and power consumption of Convolutional Neural Network (CNN) processing chips. The convolutional operators consist of several multiply-and-accumulate (MAC) units. MAC units are the primary components that process convolutional layers and fully connected layers of CNN models. Analog implementation of MAC units opens a new paradigm for realizing low-power CNN processing chips, benefiting from less power and area consumption. The proposed mixed-signal convolutional operator comprises low-power binary-weighted current steering digital-to-analog conversion (DAC) circuits and accumulation capacitors. Compared with a conventional binary-weighted DAC, the proposed circuit benefits from optimum accuracy, smaller area, and lower power consumption due to its symmetric design. The proposed convolutional operator takes as input a set of 9-bit digital input feature data and weight parameters of the convolutional filter. It then calculates the convolutional filter’s result and accumulates the resulting voltage on capacitors. In addition, the convolutional operator employs a novel charge-sharing technique to process negative MAC results. We propose an analog max-pooling circuit that instantly selects the maximum input voltage. To demonstrate the performance of the proposed mixed-signal convolutional operator, we implemented a CNN processing chip consisting of 3 analog convolutional operators, with each operator processing a 3 × 3 kernel. This chip contains 27 MAC circuits, an analog max-pooling, and an analog-to-digital conversion (ADC) circuit. The mixed-signal CNN processing chip is implemented using a CMOS 55 nm process, which occupies a silicon area of 0.0559 mm² and consumes an average power of 540.6 μW. The proposed mixed-signal CNN processing chip offers an area reduction of 84.21% and an energy reduction of 91.85% compared with a conventional digital CNN processing chip. Moreover, another CNN processing chip is implemented with more analog convolutional operators to demonstrate the operation and structure of an example convolutional layer of a CNN model. Therefore, the proposed analog convolutional operator can be adapted in various CNN models as an alternative to digital counterparts. Full article

(This article belongs to the Special Issue Advanced CMOS Integrated Circuit Design and Application II)

► Show Figures

Figure 1

17 pages, 4076 KB

Open AccessArticle

Booth Encoded Bit-Serial Multiply-Accumulate Units with Improved Area and Energy Efficiencies

by Xiaoshu Cheng, Yiwen Wang, Jiazhi Liu, Weiran Ding, Hongfei Lou and Ping Li

Electronics 2023, 12(10), 2177; https://doi.org/10.3390/electronics12102177 - 10 May 2023

Cited by 7 | Viewed by 3604

Abstract

Bit-serial multiply-accumulate units (MACs) play a crucial role in various hardware accelerator applications, including deep learning, image processing, and signal processing. Despite the advantages of bit-serial MACs, such as a small footprint, full hardware utilization, and high frequency, their serial nature can lead [...] Read more.

Bit-serial multiply-accumulate units (MACs) play a crucial role in various hardware accelerator applications, including deep learning, image processing, and signal processing. Despite the advantages of bit-serial MACs, such as a small footprint, full hardware utilization, and high frequency, their serial nature can lead to high latency and potentially compromised performance. This study investigates the potential of bit-serial solutions by applying Booth encoding to bit-serial multipliers within MACs to enhance area and power efficiencies. We present two types of bit-serial MACs based on radix-2 and radix-4 Booth encoding multipliers, respectively. Their performance is assessed through simulations and synthesis results, demonstrating the benefits of the proposed approach. The radix-4 Booth bit-serial MAC improves power and area efficiencies compared to the original bit-serial MAC. Operating at TSMC 90 nm and 150 MHz, our design exhibits a remarkable 96.39% reduction in area-power-product (APP). Moreover, the prototype verification on a Xilinx Kintex-7 FPGA proved successful. The proposed solution offers significant advantages in energy efficiency, area reduction, and APP, making it a promising candidate for next-generation hardware accelerators in offline inference, low-power devices, and other applications. Full article

(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))

► Show Figures

Figure 1

21 pages, 1719 KB

Open AccessArticle

A Bottom-Up Methodology for the Fast Assessment of CNN Mappings on Energy-Efficient Accelerators

by Guillaume Devic, Gilles Sassatelli and Abdoulaye Gamatié

J. Low Power Electron. Appl. 2023, 13(1), 5; https://doi.org/10.3390/jlpea13010005 - 5 Jan 2023

Viewed by 3581

Abstract

The execution of machine learning (ML) algorithms on resource-constrained embedded systems is very challenging in edge computing. To address this issue, ML accelerators are among the most efficient solutions. They are the result of aggressive architecture customization. Finding energy-efficient mappings of ML workloads [...] Read more.

The execution of machine learning (ML) algorithms on resource-constrained embedded systems is very challenging in edge computing. To address this issue, ML accelerators are among the most efficient solutions. They are the result of aggressive architecture customization. Finding energy-efficient mappings of ML workloads on accelerators, however, is a very challenging task. In this paper, we propose a design methodology by combining different abstraction levels to quickly address the mapping of convolutional neural networks on ML accelerators. Starting from an open-source core adopting the RISC-V instruction set architecture, we define in RTL a more flexible and powerful multiply-and-accumulate (MAC) unit, compared to the native MAC unit. Our proposal contributes to improving the energy efficiency of the RISC-V cores of PULPino. To effectively evaluate its benefits at system level, while considering CNN execution, we build a corresponding analytical model in the Timeloop/Accelergy simulation and evaluation environment. This enables us to quickly explore CNN mappings on a typical RISC-V system-on-chip model, manufactured under the name of GAP8. The modeling flexibility offered by Timeloop makes it possible to easily evaluate our novel MAC unit in further CNN accelerator architectures such as Eyeriss and DianNao. Overall, the resulting bottom-up methodology assists designers in the efficient implementation of CNNs on ML accelerators by leveraging the accuracy and speed of the combined abstraction levels. Full article

(This article belongs to the Special Issue RISC-V Architectures and Systems: Hardware and Software Perspectives)

► Show Figures

Figure 1

12 pages, 9479 KB

Open AccessArticle

Power-Efficient Deep Neural Network Accelerator Minimizing Global Buffer Access without Data Transfer between Neighboring Multiplier—Accumulator Units

by Jeonghyeok Lee, Sangwook Han, Seungwon Choi and Jungwook Choi

Electronics 2022, 11(13), 1996; https://doi.org/10.3390/electronics11131996 - 25 Jun 2022

Viewed by 1860

Abstract

This paper presents a novel method for minimizing the power consumption of weight data movements required by a convolutional operation performed on a two-dimensional multiplier–accumulator (MAC) array of a deep neural-network accelerator. The proposed technique employs a local register file (LRF) at each [...] Read more.

This paper presents a novel method for minimizing the power consumption of weight data movements required by a convolutional operation performed on a two-dimensional multiplier–accumulator (MAC) array of a deep neural-network accelerator. The proposed technique employs a local register file (LRF) at each MAC unit in a manner such that once weight pixels are read from the global buffer into the LRF, they are reused from the LRF as many times as desired instead of being repeatedly fetched from the global buffer in each convolutional operation. One of the most evident merits of the proposed method is that the procedure is completely free from the burden of data transfer between neighboring MAC units. It was found from our simulations that the proposed method provides a power saving of approximately 83.33% and 97.62% compared with the power savings recorded by the conventional methods, respectively, when the dimensions of the input data matrix and weight matrix are 128 × 128 and 5 × 5, respectively. The power savings increase as the dimensions of the input data matrix or weight matrix increase. Full article

(This article belongs to the Special Issue Reconfigurable Computing and Real-Time Embedded Systems)

► Show Figures

Figure 1

12 pages, 8513 KB

Open AccessArticle

Silicon-Based Metastructure Optical Scattering Multiply–Accumulate Computation Chip

by Xu Liu, Xudong Zhu, Chunqing Wang, Yifan Cao, Baihang Wang, Hanwen Ou, Yizheng Wu, Qixun Mei, Jialong Zhang, Zhe Cong and Rentao Liu

Nanomaterials 2022, 12(13), 2136; https://doi.org/10.3390/nano12132136 - 21 Jun 2022

Cited by 3 | Viewed by 2889

Abstract

Optical neural networks (ONN) have become the most promising solution to replacing electronic neural networks, which have the advantages of large bandwidth, low energy consumption, strong parallel processing ability, and super high speed. Silicon-based micro-nano integrated photonic platforms have demonstrated good compatibility with [...] Read more.

Optical neural networks (ONN) have become the most promising solution to replacing electronic neural networks, which have the advantages of large bandwidth, low energy consumption, strong parallel processing ability, and super high speed. Silicon-based micro-nano integrated photonic platforms have demonstrated good compatibility with complementary metal oxide semiconductor (CMOS) processing. Therefore, without completely changing the existing silicon-based fabrication technology, optoelectronic hybrid devices or all-optical devices of better performance can be achieved on such platforms. To meet the requirements of smaller size and higher integration for silicon photonic computing, the topology of a four-channel coarse wavelength division multiplexer (CWDM) and an optical scattering unit (OSU) are inversely designed and optimized by Lumerical software. Due to the random optical power splitting ratio and incoherency, the intensities of different input signals from CWDM can be weighted and summed directly by the subsequent OSU to accomplish arbitrary multiply–accumulate (MAC) operations, therefore supplying the core foundation for scattering ONN architecture. Full article

(This article belongs to the Special Issue Nanophotonics and Integrated Optics Devices)

► Show Figures

Figure 1

17 pages, 6762 KB

Open AccessArticle

Implementing a Timing Error-Resilient and Energy-Efficient Near-Threshold Hardware Accelerator for Deep Neural Network Inference

by Noel Daniel Gundi, Pramesh Pandey, Sanghamitra Roy and Koushik Chakraborty

J. Low Power Electron. Appl. 2022, 12(2), 32; https://doi.org/10.3390/jlpea12020032 - 6 Jun 2022

Cited by 6 | Viewed by 3957

Abstract

Increasing processing requirements in the Artificial Intelligence (AI) realm has led to the emergence of domain-specific architectures for Deep Neural Network (DNN) applications. Tensor Processing Unit (TPU), a DNN accelerator by Google, has emerged as a front runner outclassing its contemporaries, CPUs and [...] Read more.

Increasing processing requirements in the Artificial Intelligence (AI) realm has led to the emergence of domain-specific architectures for Deep Neural Network (DNN) applications. Tensor Processing Unit (TPU), a DNN accelerator by Google, has emerged as a front runner outclassing its contemporaries, CPUs and GPUs, in performance by 15×–30×. TPUs have been deployed in Google data centers to cater to the performance demands. However, a TPU’s performance enhancement is accompanied by a mammoth power consumption. In the pursuit of lowering the energy utilization, this paper proposes PREDITOR—a low-power TPU operating in the Near-Threshold Computing (NTC) realm. PREDITOR uses mathematical analysis to mitigate the undetectable timing errors by boosting the voltage of the selective multiplier-and-accumulator units at specific intervals to enhance the performance of the NTC TPU, thereby ensuring a high inference accuracy at low voltage. PREDITOR offers up to 3×–5× improved performance in comparison to the leading-edge error mitigation schemes with a minor loss in accuracy. Full article

(This article belongs to the Special Issue Hardware for Machine Learning)

► Show Figures

Figure 1

Search Results (18)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Saved Queries

Search Filter Reset All

Years

Feature Papers

Subjects

Journals

Article Types

Countries / Regions

Search Results (18)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI