Search Results (30)

Search Parameters:
Keywords = systolic hardware accelerator

17 pages, 3011 KB  
Article
Architecture-Level Risk-Guided Fault-Injection Prioritization for Systolic AI Accelerators: A Fixed Candidate-Pool Evaluation
by Larisa Goffman-Vinopal
Electronics 2026, 15(13), 2792; https://doi.org/10.3390/electronics15132792 (registering DOI) - 25 Jun 2026
Abstract
Fault-injection campaigns are widely used to evaluate silent data corruption (SDC) in AI hardware, but exhaustive campaigns over workloads, dataflows, processing elements, and datapath roles are expensive. This paper presents an architecture-level risk-guided fault-injection prioritization method for systolic AI accelerators. The method ranks candidate transient functional perturbations before downstream validation, with the goal of enriching the discovery of candidates that produce a thresholded relative-output-error outcome under a limited validation budget. The evaluation uses a fixed candidate fault pool: all ranking policies score the same 21,000 candidate faults across 30 workload/dataflow/array configurations, corresponding to five GEMM-derived workloads, three array sizes, and two dataflows. Fault magnitudes are sampled once per candidate and are independent of all ranking scores. Candidate faults are modeled as transient architecture-level perturbations in MAC, accumulator, or forwarding paths. The proposed full-risk score combines activity, composite spatial stress, tensor sensitivity, and a path-class weight. In the proposed architecture-level simulation environment and under the fixed-pool protocol, the proposed method achieves the highest mean top-10% SDC-proxy lift, AUPRC, NDCG@10%, and rank correlation with relative output error among the evaluated principle-based ranking policies. At the calibrated threshold, it achieves a mean top-10% lift of 5.65× [4.91, 6.38], compared with 4.61× for AVF-like exposure and 4.33× for output sensitivity. Paired configuration-level tests, threshold sensitivity, and outcome-model sensitivity analyses characterize the result while showing that the proposed score is not universally dominant under every synthetic outcome assumption. The method is intended as a front-end architecture-level screening tool for validation prioritization, not as a replacement for RTL, gate-level, FPGA, or silicon reliability signoff.
(This article belongs to the Section Computer Science & Engineering)

13 pages, 3658 KB  
Article
TR-ABFT: Tile-Resilient Fault Detection for Neural Processing Units
by Yang Hua, Yunhong Bai, Bo Wang, Wei Zhuang and Yuanfu Zhao
Electronics 2026, 15(12), 2715; https://doi.org/10.3390/electronics15122715 - 19 Jun 2026
Viewed by 184
Abstract
Spaceborne neural processing units (NPUs) increasingly support real-time deep-learning inference, but their dense multiply-accumulate arrays are vulnerable to radiation-induced soft errors. Conventional radiation-hardening methods improve reliability through hardware redundancy, but they incur substantial area, performance and compiler-mapping overheads. This paper proposes tile-resilient algorithm-based fault tolerance (TR-ABFT), a software-scheduled, detection-oriented scheme for quantized NPU inference. TR-ABFT generates checksum information at tile granularity and maps checking tasks onto the original processing element (PE) array without changing the hardware topology. To make ABFT compatible with INT8 datapaths, we design two checksum-coding strategies: checksum decomposition and modulo-239 checksum coding. The modulo-239 scheme removes structural missed detections for two-bit flips with bit-position spacings in (1, 31), while preserving compatibility with signed INT8 inputs. Evaluations on ResNet, YOLOv8, and RT-DETR show that, on a 16×16 array, TR-ABFT introduces only 6.37% to 24.61% additional computational overhead. By converting spatial redundancy into schedulable temporal redundancy, TR-ABFT preserves systolic-array regularity and provides a low-overhead reliability-enhancement mechanism for space-grade neural-network accelerators.
(This article belongs to the Special Issue Artificial Intelligence and Microsystems)

22 pages, 1556 KB  
Article
Hardware Accelerator Design for MUSIC-DOA Estimation with Bilateral Jacobi Optimization
by Yafan Gao, Weijiang Wang, Chengbo Xue, Shiwei Ren, Kuanhao Liu and Xiangnan Li
Electronics 2026, 15(10), 1982; https://doi.org/10.3390/electronics15101982 - 7 May 2026
Viewed by 347
Abstract
Real-time Direction of Arrival (DOA) estimation demands high computational throughput and numerical precision. Consequently, dedicated hardware accelerators are essential. This paper presents an architecture to accelerate the MUSIC algorithm using an improved complex bilateral Jacobi eigenvalue decomposition (EVD). First, we design a triangular systolic array for Hermitian matrices. It employs an output-stationary dataflow to enable efficient parallel covariance computation. Second, we propose an enhanced EVD algorithm. It replaces CORDIC approximations with direct analytical rotations. This significantly improves numerical stability and accuracy. Third, we introduce hardware optimizations. These include unit reuse, integrated termination conditions, and pre-stored steering vectors. These measures reduce resource consumption while maintaining full functionality. Experiments on a Xilinx Virtex-6 platform validate the design. The architecture achieves a root mean square error (RMSE) below 0.24° with 300 snapshots. Processing latency is only 76.17 µs. The design utilizes 10,775 LUTs and 73 DSP slices. This work balances accuracy, speed, and efficiency. It offers a practical solution for real-time, high-precision DOA systems.
(This article belongs to the Special Issue New Advances of FPGAs in Signal Processing)

16 pages, 998 KB  
Article
Architecture Design of a Convolutional Neural Network Accelerator for Heterogeneous Computing Based on a Fused Systolic Array
by Yang Zong, Zhenhao Ma, Jian Ren, Yu Cao, Meng Li and Bin Liu
Sensors 2026, 26(2), 628; https://doi.org/10.3390/s26020628 - 16 Jan 2026
Viewed by 990
Abstract
Convolutional Neural Networks (CNNs) generally suffer from excessive computational overhead, high resource consumption, and complex network structures, which severely restrict their deployment on microprocessor chips. Existing related accelerators only have an energy efficiency ratio of 2.32–6.5925 GOPs/W, making it difficult to meet the low-power requirements of embedded application scenarios. To address these issues, this paper proposes a low-power and high-energy-efficiency CNN accelerator architecture based on a central processing unit (CPU) and an Application-Specific Integrated Circuit (ASIC) heterogeneous computing architecture, adopting an operator-fused systolic array algorithm with the YOLOv5n target detection network as the application benchmark. It integrates a 2D systolic array with Conv-BN fusion technology to achieve deep operator fusion of convolution, batch normalization and activation functions; optimizes the RISC-V core to reduce resource usage; and adopts a locking mechanism and a prefetching strategy for the asynchronous platform to ensure operational stability. Experiments on the Nexys Video development board show that the architecture achieves 20.6 GFLOPs of computational performance, 1.96 W of power consumption, and an energy efficiency ratio of 10.46 GOPs/W, which is 58–350% higher than existing mainstream accelerators, thus demonstrating excellent potential for embedded deployment.
(This article belongs to the Section Intelligent Sensors)

28 pages, 1828 KB  
Article
Edge Detection on a 2D-Mesh NoC with Systolic Arrays: From FPGA Validation to GDSII Proof-of-Concept
by Emma Mascorro-Guardado, Susana Ortega-Cisneros, Francisco Javier Ibarra-Villegas, Jorge Rivera, Héctor Emmanuel Muñoz-Zapata and Emilio Isaac Baungarten-Leon
Appl. Sci. 2026, 16(2), 702; https://doi.org/10.3390/app16020702 - 9 Jan 2026
Cited by 1 | Viewed by 937
Abstract
Edge detection is a key building block in real-time image-processing applications such as drone-based infrastructure inspection, autonomous navigation, and remote sensing. However, its computational cost remains a challenge for resource-constrained embedded systems. This work presents a hardware-accelerated edge detection architecture based on a homogeneous 2D-mesh Network-on-Chip (NoC) integrating systolic arrays to efficiently perform the convolution operations required by the Sobel filter. The proposed architecture was first developed and validated as a 3 × 3 mesh prototype on FPGA (Xilinx Zynq-7000, Zynq-7010, XC7Z010-CLG400A, Zybo board, utilizing 26,112 LUTs, 24,851 flip-flops, and 162 DSP blocks), achieving a throughput of 8.8 Gb/s with a power consumption of 0.79 W at 100 MHz. Building upon this validated prototype, a reduced 2 × 2 node cluster with 14-bit word width was subsequently synthesized at the physical level as a proof-of-concept using the OpenLane RTL-to-GDSII open-source flow targeting the SkyWater 130 nm PDK (sky130A). Post-layout analysis confirms the manufacturability of the design, with a total power consumption of 378 mW and compliance with timing constraints, demonstrating the feasibility of mapping the proposed architecture to silicon and its suitability for drone-based infrastructure monitoring applications.
(This article belongs to the Special Issue Advanced Integrated Circuit Design and Applications)

22 pages, 1158 KB  
Article
High-Speed Architecture for Hybrid Arithmetic–Huffman Data Compression
by Yair Wiseman
Technologies 2025, 13(12), 585; https://doi.org/10.3390/technologies13120585 - 12 Dec 2025
Cited by 2 | Viewed by 1454
Abstract
This paper proposes a hardware–software co-design for adaptive lossless compression based on Hybrid Arithmetic–Huffman Coding, a table-driven approximation of arithmetic coding that preserves near-optimal compression efficiency while eliminating the multiplicative precision and sequential bottlenecks that have traditionally prevented arithmetic coding deployment in resource-constrained embedded systems. The compression pipeline is partitioned as follows: flexible software on the processor core dynamically builds and adapts the prefix coding (usually Huffman Coding) frontend for accurate probability estimation and binarization; the resulting binary stream is fed to a deeply pipelined systolic hardware accelerator that performs binary arithmetic coding using pre-calibrated finite state transition tables, dedicated renormalization logic, and carry propagation mitigation circuitry instantiated in on-chip memory. The resulting implementation achieves compression ratios consistently within 0.4% of the theoretical entropy limit, multi-gigabit per second throughput in 28 nm/FinFET nodes, and approximately 68% lower energy per compressed byte than optimized software arithmetic coding, making it ideally suited for real-time embedded vision, IoT sensor networks, and edge multimedia applications.
(This article belongs to the Special Issue Optimization Technologies for Digital Signal Processing)

21 pages, 2394 KB  
Article
AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA
by Yongchang Wang, Hongzhi Zhao and Jinyao Zhao
Electronics 2025, 14(1), 168; https://doi.org/10.3390/electronics14010168 - 3 Jan 2025
Viewed by 1886
Abstract
FPGA-based convolutional accelerators have been widely used in image recognition scenarios. Many convolutional accelerators utilize the systolic array structure to enhance parallelism. A method that efficiently estimates the FPGA hardware resources used by such a structure would speed up the search for the systolic array configuration with the best performance on a given FPGA device. Currently, most estimation work has either focused on the evaluation of hardware resources for general structures or has not adequately assessed hardware resources specifically for systolic arrays. To reduce estimation latency, this paper proposes an Accurate and Fast Hardware Resources Estimation method (AFHRE) that addresses these shortcomings by analyzing the structure of systolic arrays and utilizing mathematical formulas to describe their characteristics. Experiments show that the DSP resource occupancy estimated by AFHRE is fully consistent with that reported by Vivado HLS. The error rates for the other three types of hardware resources (BRAM, LUT, and FF) are within 11%. In addition, resource estimation using this method is 40× to 610× faster than with Vivado HLS. AFHRE can serve as a preprocessing step for Vivado HLS, finding optimal or sub-optimal systolic array parameters much faster than the original simulation-based Vivado HLS flow.
(This article belongs to the Special Issue FPGA-Based Reconfigurable Embedded Systems)

18 pages, 3376 KB  
Article
Heterogeneous Edge Computing for Molecular Property Prediction with Graph Convolutional Networks
by Mahdieh Grailoo and Jose Nunez-Yanez
Electronics 2025, 14(1), 101; https://doi.org/10.3390/electronics14010101 - 30 Dec 2024
Cited by 3 | Viewed by 2151
Abstract
Graph-based neural networks have proven to be useful in molecular property prediction, a critical component of computer-aided drug discovery. In this application, in response to the growing demand for improved computational efficiency and localized edge processing, this paper introduces a novel approach that leverages specialized accelerators on a heterogeneous edge computing platform. Our focus is on graph convolutional networks, a leading graph-based neural network variant that integrates graph convolution layers with multi-layer perceptrons. Molecular graphs are typically characterized by a low number of nodes, leading to low-dimensional dense matrix multiplications within multi-layer perceptrons—conditions that are particularly well-suited for Edge TPUs. These TPUs feature a systolic array of multiply–accumulate units optimized for dense matrix operations. Furthermore, the inherent sparsity in molecular graph adjacency matrices offers additional opportunities for computational optimization. To capitalize on this, we developed an FPGA GFADES accelerator, using high-level synthesis, specifically tailored to efficiently manage the sparsity in both the graph structure and node features. Our hardware/software co-designed GCN+MLP architecture delivers performance improvements, achieving up to 58× increased speed compared to conventional software implementations. This architecture is implemented using the Pynq framework and TensorFlow Lite Runtime, running on a multi-core ARM CPU within an AMD/Xilinx Zynq Ultrascale+ device, in combination with the Edge TPU and programmable logic.

15 pages, 2101 KB  
Article
Scalable Transformer Accelerator with Variable Systolic Array for Multiple Models in Voice Assistant Applications
by Seok-Woo Chang and Dong-Sun Kim
Electronics 2024, 13(23), 4683; https://doi.org/10.3390/electronics13234683 - 27 Nov 2024
Cited by 4 | Viewed by 4808
Abstract
The Transformer is a type of deep learning model that has quickly become fundamental in natural language processing (NLP) and other machine learning tasks. Transformer hardware accelerators are usually designed for specific models, such as Bidirectional Encoder Representations from Transformers (BERT) and vision Transformer models like ViT. In this study, we propose a Scalable Transformer Accelerator Unit (STAU) for multiple models, enabling efficient handling of various Transformer models used in voice assistant applications. A design centered on a Variable Systolic Array (VSA), along with control and data preprocessing in embedded processors, enables matrix operations of varying sizes. In addition, we propose an efficient variable structure and a row-wise data input method for natural language processing where the word count changes. The proposed scalable Transformer accelerator accelerates text summarization, audio processing, image search, and generative AI used in voice assistance.
(This article belongs to the Topic Theory and Applications of High Performance Computing)

13 pages, 3319 KB  
Article
Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing
by Alejandra Sanchez-Flores, Jordi Fornt, Lluc Alvarez and Bartomeu Alorda-Ladaria
Electronics 2024, 13(14), 2822; https://doi.org/10.3390/electronics13142822 - 18 Jul 2024
Cited by 3 | Viewed by 2958
Abstract
This paper focuses on the implementation of a neural network accelerator optimized for speed and energy efficiency, for use in embedded machine learning. Specifically, we explore power reduction at the hardware level through systolic array and low-precision data systems, including quantized approaches. We present a comprehensive analysis comparing a full precision (FP16) accelerator with a quantized (INT16) version on an FPGA. We upgraded the FP16 modules to handle INT16 values, employing data shifts to enhance value density while maintaining accuracy. Through single convolution experiments, we assess the energy consumption and error minimization. The paper’s structure includes a detailed description of the FP16 accelerator, the transition to quantization, mathematical and implementation insights, instrumentation for power measurement, and a comparative analysis of power consumption and convolution error. Our results attempt to identify a pattern in 16-bit quantization to achieve significant power savings with minimal loss of accuracy.
(This article belongs to the Special Issue Recent Advances and Challenges in IoT, Cloud and Edge Coexistence)

10 pages, 10910 KB  
Article
Enhancing Computation-Efficiency of Deep Neural Network Processing on Edge Devices through Serial/Parallel Systolic Computing
by Iraj Moghaddasi and Byeong-Gyu Nam
Mach. Learn. Knowl. Extr. 2024, 6(3), 1484-1493; https://doi.org/10.3390/make6030070 - 1 Jul 2024
Cited by 5 | Viewed by 2854
Abstract
In recent years, deep neural networks (DNNs) have addressed new applications with intelligent autonomy, often achieving higher accuracy than human experts. This capability comes at the expense of the ever-increasing complexity of emerging DNNs, causing enormous challenges while deploying on resource-limited edge devices. Improving the efficiency of DNN hardware accelerators by compression has been explored previously. Existing state-of-the-art studies applied approximate computing to enhance energy efficiency even at the expense of a little accuracy loss. In contrast, bit-serial processing has been used for improving the computational efficiency of neural processing without accuracy loss, exploiting a simple design, dynamic precision adjustment, and computation pruning. This research presents Serial/Parallel Systolic Array (SPSA) and Octet Serial/Parallel Systolic Array (OSPSA) processing elements for edge DNN acceleration, which exploit bit-serial processing on systolic array architecture for improving computational efficiency. For evaluation, all designs were described at the RTL level and synthesized in 28 nm technology. Post-synthesis cycle-accurate simulations of image classification over DNNs illustrated that, on average, a sample 16 × 16 systolic array indicated remarkable improvements of 17.6% and 50.6% in energy efficiency compared to the baseline, with no loss of accuracy.
(This article belongs to the Section Network)

18 pages, 6734 KB  
Article
High-Speed CNN Accelerator SoC Design Based on a Flexible Diagonal Cyclic Array
by Dong-Yeong Lee, Hayotjon Aliev, Muhammad Junaid, Sang-Bo Park, Hyung-Won Kim, Keon-Myung Lee and Sang-Hoon Sim
Electronics 2024, 13(8), 1564; https://doi.org/10.3390/electronics13081564 - 19 Apr 2024
Cited by 6 | Viewed by 5916
Abstract
The latest convolutional neural network (CNN) models for object detection include complex layered connections to process inference data. Each layer utilizes different types of kernel modes, so the hardware needs to support all kernel modes at an optimized speed. In this paper, we propose a high-speed and optimized CNN accelerator with flexible diagonal cyclic arrays (FDCA) that supports the acceleration of CNN networks with various kernel sizes and significantly reduces the time required for inference processing. The accelerator uses four FDCAs to simultaneously calculate 16 input channels and 8 output channels. Each FDCA features a 4 × 8 systolic array that contains a 3 × 3 processing element (PE) array and is designed to handle the most commonly used kernel sizes. To evaluate the proposed CNN accelerator, we mapped the widely used YOLOv5 CNN model and evaluated the performance of its implementation on the Zynq UltraScale+ MPSoC ZCU102 FPGA. The design consumes 249,357 logic cells, 2304 DSP blocks, and only 567 KB BRAM. In our evaluation, the YOLOv5n model achieves an accuracy of 43.1% (mAP@0.5). A prototype accelerator has been implemented using Samsung’s 14 nm CMOS technology. It achieves a peak performance of 1.075 TOPS at a 400 MHz clock frequency.
(This article belongs to the Special Issue CMOS Integrated Circuits Design)

14 pages, 2743 KB  
Article
VerSA: Versatile Systolic Array Architecture for Sparse and Dense Matrix Multiplications
by Juwon Seo and Joonho Kong
Electronics 2024, 13(8), 1500; https://doi.org/10.3390/electronics13081500 - 15 Apr 2024
Cited by 7 | Viewed by 6287
Abstract
A key part of modern deep neural network (DNN) applications is matrix multiplication. As DNN applications are becoming more diverse, there is a need for both dense and sparse matrix multiplications to be accelerated by hardware. However, most hardware accelerators are designed to accelerate either dense or sparse matrix multiplication. In this paper, we propose VerSA, a versatile systolic array architecture for both dense and sparse matrix multiplications. VerSA employs intermediate paths and SRAM buffers between the rows of the systolic array (SA), thereby enabling an early termination in sparse matrix multiplication with a negligible performance overhead when running dense matrix multiplication. When running sparse matrix multiplication, 256 × 256 VerSA brings performance (i.e., an inverse of execution time) improvement and energy saving by 1.21×–1.60× and 7.5–30.2%, respectively, when compared to the conventional SA. When running dense matrix multiplication, VerSA results in only a 0.52% performance overhead compared to the conventional SA.
(This article belongs to the Special Issue Heterogeneous and Energy-Efficient Computing Systems)

23 pages, 2718 KB  
Article
Voltage Scaled Low Power DNN Accelerator Design on Reconfigurable Platform
by Rourab Paul, Sreetama Sarkar, Suman Sau, Sanghamitra Roy, Koushik Chakraborty and Amlan Chakrabarti
Electronics 2024, 13(8), 1431; https://doi.org/10.3390/electronics13081431 - 10 Apr 2024
Cited by 2 | Viewed by 2933
Abstract
The exponential emergence of Field-Programmable Gate Arrays (FPGAs) has accelerated research on hardware implementation of Deep Neural Networks (DNNs). Among all DNN processors, domain-specific architectures such as Google’s Tensor Processing Unit (TPU) have outperformed conventional GPUs (Graphics Processing Units) and CPUs (Central Processing Units). However, implementing low-power TPUs in reconfigurable hardware remains a challenge in this field. Voltage scaling, a popular approach for energy savings, can be challenging in FPGAs, as it may lead to timing failures if not implemented appropriately. This work presents an ultra-low-power FPGA implementation of a TPU for edge applications. We divide the systolic array of a TPU into different FPGA partitions based on the minimum slack value of different design paths of Multiplier Accumulators (MACs). Each partition uses different near-threshold (NTC) biasing voltages to run its FPGA cores. The biasing voltage for each partition is roughly calculated by the proposed static schemes, and further calibration of the biasing voltage is performed by the proposed runtime scheme. To overcome the timing failure caused by NTC, the MACs with higher minimum slack are placed in lower-voltage partitions, while the MACs with lower minimum slack paths are placed in higher-voltage partitions. The proposed architecture is implemented on a commercial platform, namely Vivado with a Xilinx Artix-7 FPGA, and on the academic platform VTR with 22 nm, 45 nm and 130 nm FPGAs. Any timing error caused by NTC can be caught by the Razor flip-flop used in each MAC. The proposed voltage-scaled, partitioned systolic array can save 3.1% to 11.6% of dynamic power in Vivado and VTR tools, respectively, depending on the FPGA technology, partition size, number of partitions and biasing voltages. The normalized performance and accuracy of benchmark models running on our low-power TPU are very competitive compared to existing literature.
(This article belongs to the Special Issue Embedded Systems for Neural Network Applications)

23 pages, 1650 KB  
Article
A Heterogeneous Inference Framework for a Deep Neural Network
by Rafael Gadea-Gironés, José Luís Rocabado-Rocha, Jorge Fe and Jose M. Monzo
Electronics 2024, 13(2), 348; https://doi.org/10.3390/electronics13020348 - 14 Jan 2024
Cited by 4 | Viewed by 3183
Abstract
Artificial intelligence (AI) is one of the most promising technologies based on machine learning algorithms. In this paper, we propose a workflow for the implementation of deep neural networks. This workflow attempts to combine the flexibility of high-level compilers (HLS)-based networks with the architectural control features of hardware description languages (HDL)-based flows. The architecture consists of a convolutional neural network, SqueezeNet v1.1, and a hard processor system (HPS) that coexists with acceleration hardware to be designed. This methodology allows us to compare solutions based solely on software (PyTorch 1.13.1) and propose heterogeneous inference solutions, taking advantage of the best options within the software and hardware flow. The proposed workflow is implemented on a low-cost field programmable gate array system-on-chip (FPGA SOC) platform, specifically the DE10-Nano development board. We have provided systolic architectural solutions written in OpenCL that are highly flexible and easily tunable to take full advantage of the resources of programmable devices and achieve superior energy efficiencies working with 32-bit floating point. From a verification point of view, the proposed method is effective, since the reference models in all tests, both for the individual layers and the complete network, have been readily available using packages well known in the development, training, and inference of deep networks.