FPGA-Based Deep Neural Network Accelerators Using Emerging Technologies

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: 20 October 2024 | Viewed by 4704

Special Issue Editor


Dr. Chen Yang
Guest Editor
School of Microelectronics, Xi’an Jiaotong University, Xi’an 710049, China
Interests: neural network accelerator; reconfigurable computing; VLSI SoC design

Special Issue Information

Dear Colleagues,

Electronics invites manuscript submissions in the area of deep neural network (DNN) acceleration, covering fast convolution algorithms, sparsification, low-bit quantization, approximate computing, and other emerging technologies or models applied to FPGA-based deep learning accelerators. In recent years, DNNs have achieved excellent performance in many rising fields, such as computer vision, object segmentation, and autonomous driving. Thanks to their low power consumption and reconfigurability, FPGAs are among the most common platforms for implementing DNN accelerators. However, the enormous parameter counts and computational workloads, as well as the rapid evolution of DNN models, make DNNs difficult to deploy in scenarios with tight FPGA resource budgets and high performance requirements. Several emerging technologies have the potential to reduce both computation and runtime latency. Yet many open issues remain when these promising technologies are applied to hardware acceleration: how to extend fast convolution algorithms to different convolution types with reduced hardware resources, how to efficiently eliminate the invalid zero elements in sparse DNNs, and how to support different degrees of computational parallelism for various low-bit schemes while keeping processing element utilization stable.

To tackle the above issues and challenges, this Special Issue of Electronics seeks innovative solutions and novel advances in DNN acceleration, sparsification, low-bit quantization, approximate computing, and related techniques. We also welcome brand-new ideas (e.g., other fast convolution algorithms and novel pruning strategies) and disruptive FPGA-based accelerators for other emerging networks (e.g., graph neural networks, spiking neural networks) that help solve the aforementioned problems and challenges.

In this Special Issue, original research articles and reviews are welcome. Research areas may include (but are not limited to) the following:

  • Novel architecture for FPGA-based DNN accelerators;
  • Hardware/software co-design for FPGA-based DNN accelerators;
  • Resource/bandwidth optimizations for FPGA-based DNN accelerators;
  • FPGA-based DNN accelerators using fast convolution algorithms;
  • FPGA-based accelerators for sparse DNNs;
  • FPGA-based DNN accelerators performing low-bit/mixed-bit quantization;
  • FPGA-based DNN accelerators using approximate computing;
  • Dynamically reconfigurable DNN accelerators;
  • FPGA-based graph convolutional neural network acceleration;
  • FPGA-based spiking neural network acceleration;
  • FPGA-based transformer network acceleration;
  • FPGA-based acceleration of other atypical convolutions/networks;
  • Emerging applications of FPGA-based DNN accelerators.

Dr. Chen Yang
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • convolutional neural network
  • fast convolution algorithm
  • sparsification
  • low-bit quantization
  • approximate computing
  • reconfigurable computing
  • FPGA

Published Papers (5 papers)


Research

16 pages, 926 KiB  
Article
Flexible Quantization for Efficient Convolutional Neural Networks
by Federico Giordano Zacchigna, Sergio Lew and Ariel Lutenberg
Electronics 2024, 13(10), 1923; https://doi.org/10.3390/electronics13101923 - 14 May 2024
Viewed by 556
Abstract
This work focuses on the efficient quantization of convolutional neural networks (CNNs). Specifically, we introduce a method called non-uniform uniform quantization (NUUQ), a novel quantization methodology that combines the benefits of non-uniform quantization, such as high compression levels, with the advantages of uniform quantization, which enables an efficient implementation in fixed-point hardware. NUUQ is based on decoupling the quantization levels from the number of bits. This decoupling allows for a trade-off between the spatial and temporal complexity of the implementation, which can be leveraged to further reduce the spatial complexity of the CNN, without a significant performance loss. Additionally, we explore different quantization configurations and address typical use cases. The NUUQ algorithm demonstrates the capability to achieve compression levels equivalent to 2 bits without an accuracy loss and even levels equivalent to ∼1.58 bits, but with a loss in performance of only ∼0.6%.
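The paper's exact NUUQ algorithm is not reproduced here; as a hedged sketch of the core idea (decoupling the number of quantization levels from the bit width), the following learns a 3-level codebook with a simple 1-D k-means, so each weight costs roughly log2(3) ≈ 1.58 bits of index storage. The function name and the k-means choice are our own assumptions, not the authors' method:

```python
import numpy as np

def quantize_levels(w, num_levels, iters=20):
    """Quantize weights to `num_levels` scalar levels via 1-D Lloyd/k-means.
    The level count is decoupled from the bit width: 3 levels need only
    ~1.58 bits of index storage per weight."""
    levels = np.linspace(w.min(), w.max(), num_levels)  # initial codebook
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
        for k in range(num_levels):
            if np.any(idx == k):
                levels[k] = w[idx == k].mean()  # centroid update
    idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
    return levels, idx  # codebook + per-weight indices

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
levels, idx = quantize_levels(w, num_levels=3)  # ~1.58 bits/weight
w_hat = levels[idx]                             # dequantized weights
print(levels)
print(np.mean((w - w_hat) ** 2))                # reconstruction error
```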

25 pages, 581 KiB  
Article
Quantization-Based Optimization Algorithm for Hardware Implementation of Convolution Neural Networks
by Bassam J. Mohd, Khalil M. Ahmad Yousef, Anas AlMajali and Thaier Hayajneh
Electronics 2024, 13(9), 1727; https://doi.org/10.3390/electronics13091727 - 30 Apr 2024
Viewed by 698
Abstract
Convolutional neural networks (CNNs) have demonstrated remarkable performance in many areas but require significant computation and storage resources. Quantization is an effective method to reduce CNN complexity and simplify implementation. The main research objective is to develop a scalable quantization algorithm for CNN hardware design and to model the performance metrics for implementing CNNs in resource-constrained devices (RCDs) and optimizing layers in deep neural networks (DNNs). The algorithm's novelty lies in blending two quantization techniques to perform full-model quantization with optimum accuracy and without additional neurons. The algorithm is applied to a selected CNN model and implemented on an FPGA. Implementing the CNN with wide data formats is not possible due to capacity issues. With the proposed quantization algorithm, we succeeded in implementing the model on the FPGA using 16-, 12-, and 8-bit quantization. Compared to the 16-bit design, the 8-bit design offers a 44% decrease in resource utilization and achieves power and energy reductions of 41% and 42%, respectively. Models show that trading off one quantization bit yields savings of approximately 5.4K LUTs, 4% logic utilization, 46.9 mW power, and 147 μJ energy. The models were also used to estimate performance metrics for a sample DNN design.
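To make the 16-/12-/8-bit comparison concrete, here is a generic symmetric fixed-point (Q-format) quantizer. This is a minimal sketch, not the paper's blended two-technique algorithm, and the split of three integer bits is an arbitrary assumption:

```python
import numpy as np

def fixed_point_quantize(x, bits, frac_bits):
    """Symmetric fixed-point quantization to `bits` total bits with
    `frac_bits` fractional bits (Q-format), saturating at the range limits."""
    scale = 2 ** frac_bits
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale  # dequantized value the hardware effectively computes with

x = np.array([0.1, -1.3, 0.7, 2.5])
for b in (16, 12, 8):
    xq = fixed_point_quantize(x, bits=b, frac_bits=b - 3)  # 3 integer bits assumed
    print(b, np.max(np.abs(x - xq)))  # worst-case quantization error shrinks with b
```

Each extra fractional bit halves the worst-case rounding error, which is the lever behind the LUT/power/energy trade-offs the paper models.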

24 pages, 10357 KiB  
Article
Design of a Generic Dynamically Reconfigurable Convolutional Neural Network Accelerator with Optimal Balance
by Haoran Tong, Ke Han, Si Han and Yingqi Luo
Electronics 2024, 13(4), 761; https://doi.org/10.3390/electronics13040761 - 14 Feb 2024
Viewed by 805
Abstract
In many scenarios, edge devices perform computations for applications such as target detection and tracking, multimodal sensor fusion, low-light image enhancement, and image segmentation. There is an increasing trend of deploying and running multiple different network models on one hardware platform, but there is a lack of generic acceleration architectures that support standard convolution (CONV), depthwise separable CONV, and deconvolution (DeCONV) layers in such complex scenarios. In response, this paper proposes a more versatile dynamically reconfigurable CNN accelerator with a highly unified computing scheme. The proposed design, which is compatible with standard CNNs, lightweight CNNs, and CNNs with DeCONV layers, further improves the resource utilization and reduces the efficiency gap when deploying different models. Thus, the hardware balance during the alternating execution of multiple models is enhanced. Compared to a state-of-the-art CNN accelerator, Xilinx DPU B4096, our optimized architecture achieves resource utilization improvements of 1.08× for VGG16 and 1.77× for MobileNetV1 in inference tasks on the Xilinx ZCU102 platform. The resource utilization and efficiency degradation between these two models are reduced to 59.6% and 63.7%, respectively. Furthermore, the proposed architecture can properly run DeCONV layers and demonstrates good performance.
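A back-of-the-envelope comparison shows why a unified engine for standard and depthwise separable convolutions is hard to keep efficient: the two layer types differ by nearly an order of magnitude in multiply-accumulate (MAC) count. The formulas below are the standard textbook counts, not the paper's architecture model:

```python
def conv_macs(h, w, cin, cout, k):
    """MACs for a standard k x k convolution on an h x w feature map."""
    return h * w * cin * cout * k * k

def dwsep_macs(h, w, cin, cout, k):
    """MACs for a depthwise-separable convolution: k x k depthwise
    pass followed by a 1 x 1 pointwise layer."""
    return h * w * cin * k * k + h * w * cin * cout

# Typical mid-network layer shape (illustrative values).
h = w = 56; cin = cout = 128; k = 3
std = conv_macs(h, w, cin, cout, k)
dws = dwsep_macs(h, w, cin, cout, k)
print(std / dws)  # reduction factor = cout*k^2 / (k^2 + cout) ≈ 8.41 here
```

A PE array sized for the standard layer sits mostly idle on the depthwise pass, which is the utilization gap the accelerator above targets.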

17 pages, 3032 KiB  
Article
WRA-MF: A Bit-Level Convolutional-Weight-Decomposition Approach to Improve Parallel Computing Efficiency for Winograd-Based CNN Acceleration
by Siwei Xiang, Xianxian Lv, Yishuo Meng, Jianfei Wang, Cimang Lu and Chen Yang
Electronics 2023, 12(24), 4943; https://doi.org/10.3390/electronics12244943 - 8 Dec 2023
Viewed by 836
Abstract
FPGA-based convolutional neural network (CNN) accelerators have been extensively studied recently. To exploit the parallelism of multiplier–accumulator computation in convolution, most FPGA-based CNN accelerators heavily depend on the number of on-chip DSP blocks in the FPGA. Consequently, the performance of the accelerators is restricted by the limitation of the DSPs, leading to an imbalance in the utilization of other FPGA resources. This work proposes a multiplication-free convolutional acceleration scheme (named WRA-MF) to relax the pressure on the required DSP resources. Firstly, the proposed WRA-MF employs the Winograd algorithm to reduce the computational density, and it then performs bit-level convolutional weight decomposition to eliminate the multiplication operations. Furthermore, by extracting common factors, the complexity of the addition operations is reduced. Experimental results on the Xilinx XCVU9P platform show that the WRA-MF can achieve 7559 GOP/s throughput at a 509 MHz clock frequency for VGG16. Compared with state-of-the-art works, the WRA-MF achieves up to a 3.47×–27.55× area efficiency improvement. The results indicate that the proposed architecture achieves a high area efficiency while ameliorating the imbalance in the resource utilization.
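The essence of bit-level weight decomposition is replacing a constant multiplication with shifts and adds over the weight's set bits, so LUT-based adders can stand in for DSP multipliers. A minimal sketch of that idea (non-negative weights only; the paper's common-factor extraction across weights is not shown):

```python
def shift_add_multiply(x, w):
    """Multiply integer x by a non-negative integer weight w using only
    shifts and adds: decompose w into its set bits, and for each set bit
    at position b, accumulate x << b (one adder per set bit, no multiplier)."""
    acc = 0
    bit = 0
    while w >> bit:
        if (w >> bit) & 1:
            acc += x << bit  # add x shifted by the set-bit position
        bit += 1
    return acc

print(shift_add_multiply(7, 21))  # 21 = 0b10101 -> (7<<0)+(7<<2)+(7<<4) = 147
```

In hardware, weights are constants after training, so the adder tree for each weight can be wired statically; extracting factors shared between weights then prunes redundant adders.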

12 pages, 516 KiB  
Article
Timing-Driven Simulated Annealing for FPGA Placement in Neural Network Realization
by Le Yu and Baojin Guo
Electronics 2023, 12(17), 3562; https://doi.org/10.3390/electronics12173562 - 23 Aug 2023
Cited by 1 | Viewed by 1149
Abstract
The simulated annealing algorithm is an extensively utilized heuristic method for heterogeneous FPGA placement. As the application of neural network models on FPGAs proliferates, new challenges emerge for the traditional simulated annealing algorithm in terms of timing. These challenges stem from large circuit sizes and high heterogeneity in the block proportions typical in neural networks. To address these challenges, this study introduces a timing-driven simulated annealing placement algorithm. This algorithm integrates cluster criticality identification during the cluster selection phase, which enhances the probability of high-criticality cluster selection. In the cluster movement phase, the proposed method employs an improved weighted center movement for high-criticality clusters and a random movement strategy for other clusters. Experimental evidence demonstrates that the proposed placement algorithm decreases the average wire length by 1.52% and the average critical path delay by 5.03%. This improvement in performance is achieved with a marginal increase of 5.01% in runtime, as compared to VTR8.0.
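For readers unfamiliar with the baseline, a generic simulated-annealing skeleton of the kind placement tools build on looks like the following. The criticality-aware selection and movement strategies proposed in the paper are not reproduced here, and the toy cost function is purely illustrative:

```python
import math
import random

def simulated_anneal(cost, neighbor, state, t0=1.0, steps=2000):
    """Generic simulated-annealing loop of the kind FPGA placers build on:
    propose a move, always accept improvements, and accept worsening moves
    with probability exp(-delta / T) under a geometric cooling schedule."""
    t = t0
    cur, cur_cost = state, cost(state)
    best, best_cost = cur, cur_cost
    for _ in range(steps):
        cand = neighbor(cur)
        delta = cost(cand) - cur_cost
        if delta < 0 or random.random() < math.exp(-delta / t):
            cur, cur_cost = cand, cur_cost + delta  # accept the move
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        t *= 0.9995  # slow geometric cooling
    return best, best_cost

# Toy usage: minimize (x - 3)^2 over the integers by +/-1 moves.
random.seed(1)
sol, val = simulated_anneal(lambda s: (s - 3) ** 2,
                            lambda s: s + random.choice((-1, 1)),
                            state=0)
print(sol, val)
```

In a real placer, `state` is a block-to-site assignment, `neighbor` swaps or moves blocks, and `cost` blends wirelength with timing; the paper's contribution is biasing both the selection and the move toward timing-critical clusters.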
