Review

Convolutional Neural Network Acceleration Techniques Based on FPGA Platforms: Principles, Methods, and Challenges

1 School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
2 Intelligent Perception and Control Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644000, China
* Author to whom correspondence should be addressed.
Information 2025, 16(10), 914; https://doi.org/10.3390/info16100914
Submission received: 4 September 2025 / Revised: 8 October 2025 / Accepted: 15 October 2025 / Published: 18 October 2025

Abstract

As the complexity of convolutional neural networks (CNNs) continues to increase, efficient deployment on computationally constrained hardware platforms has become a significant challenge. Against this backdrop, field-programmable gate arrays (FPGAs) have emerged as a promising CNN acceleration platform due to their inherent energy efficiency, reconfigurability, and parallel processing capabilities. This paper establishes a systematic analytical framework to explore CNN optimization strategies on FPGAs from both algorithmic and hardware perspectives. It emphasizes co-design methodologies between algorithms and hardware, extending these concepts to other embedded system applications. Furthermore, the paper summarizes current performance evaluation frameworks to assess the effectiveness of acceleration schemes comprehensively. Finally, building upon existing work, it identifies key challenges in this field and outlines future research directions.


1. Introduction

With the vigorous development of the big data industry and the advent of the Internet of Things era, global data volumes have experienced explosive growth in recent years, laying a solid foundation and providing abundant resources for the advancement of artificial intelligence (AI) [1]. As a core technology for achieving AI, deep learning models based on neural networks have garnered widespread attention and achieved rapid development due to their powerful feature extraction capabilities and superior performance in processing complex data and recognizing patterns [2].
Among numerous deep learning models, convolutional neural networks (CNNs) have achieved significant accomplishments in multiple domains such as image recognition [3], speech recognition [4], and natural language processing [5]. Their core advantage lies in the hierarchical combination of convolution operations, pooling operations, and nonlinear activation functions, enabling automated multi-level feature extraction from high-dimensional data. However, to address increasingly complex real-world problems and enhance model accuracy, the structural complexity and parameter count of CNNs continue to grow [6]. While this increased complexity optimizes model performance, it also leads to a dramatic surge in computational load and storage requirements, posing severe challenges for many traditional hardware platforms. Despite continuous advancements in hardware technology and significant improvements in data processing capabilities, the growth of computational power still lags behind the evolution of deep learning model complexity.
For large-scale convolutional neural networks, the core challenge in deployment lies in balancing the trade-off between model scaling and resource consumption. On the one hand, continuously increasing model size and complexity is necessary to achieve superior performance. On the other hand, constraints on computational resources, energy consumption, and deployment costs severely limit model scalability. This contradiction is particularly pronounced in edge computing scenarios, where edge devices impose extremely stringent requirements on real-time performance, energy efficiency, and hardware platform capabilities.
Against this backdrop, field-programmable gate arrays (FPGAs) have emerged as one of the preferred platforms for accelerating CNN inference tasks, owing to their high parallelism, flexibility, low latency, and high energy efficiency. FPGAs support hardware architecture customization for specific algorithms, achieving a balance between computational efficiency and flexibility through hardware-software co-optimization. This provides a viable pathway for high-performance inference in resource-constrained scenarios.
In fact, as a powerful computing platform, FPGAs have demonstrated tremendous potential across diverse computationally intensive tasks. For instance, in the field of neural networks, researchers designed and implemented a spiking neural network based on the Izhikevich model on an Altera DE2 FPGA to accomplish the task of recognizing specific printed English letters [7]. In biometric recognition, studies have achieved efficient standalone face recognition systems based on deep learning and FPGAs [8]. Even in digital signal processing, researchers have designed an efficient low-pass FIR filter using dual-port RAM and a single multiply-accumulate unit on an FPGA [9]. These successful cases collectively validate the advantages of FPGAs for handling specific complex computational tasks. They lay a practical foundation for applying FPGAs to accelerate CNNs with more complex computation-memory patterns, highlighting the necessity for systematic review and synthesis in this field.
However, deploying complex CNN models onto FPGAs is no simple task, as this process faces multiple inherent limitations and technical challenges. First is the resource constraint: the limited number of logic units, DSP modules, and on-chip memory resources within the FPGA struggles to accommodate the millions of parameters and vast intermediate feature maps typical of modern CNN models. Second is the memory bandwidth bottleneck: the bandwidth of off-chip DDR memory falls far short of the data consumption rate of on-chip computational units, leading to frequent idle periods in the compute units and severely impacting parallel computing efficiency. Third is design complexity: designing an optimal data flow and memory hierarchy that maximizes data reuse while matching the computational patterns of CNNs presents a highly complex and critical design challenge. Moreover, the quantization process from floating-point to fixed-point models may cause accuracy loss. At the same time, lengthy hardware development cycles substantially increase the difficulty and time cost of exploring the design space. These challenges drive researchers to continuously explore new algorithms, hardware, and co-design methods, highlighting the necessity for a systematic review of these approaches.
Although existing research has explored challenges in accelerating CNN models based on FPGAs, most studies focus on single-level acceleration optimizations and lack a holistic analysis of the coupled “algorithm–hardware” system. This paper compares optimization methods from existing reviews in Table 1.
The efficient deployment of CNNs on FPGAs is not a simple “translation” process, but rather a cross-layer collaborative redesign: the algorithm layer must consider techniques like quantization and pruning to reduce computational and memory access overhead; the hardware layer requires restructuring the memory hierarchy, parallel architecture, and data flow to maximize the potential of FPGAs. While closely intertwined, these layers exhibit distinct design spaces and methodologies. Therefore, a deep understanding of the synergistic relationship between the algorithm and hardware design is essential. Establishing a systematic collaborative optimization framework to find the design equilibrium between algorithmic refinement and hardware implementation is crucial. This approach breaks through the bottlenecks of real-time performance, energy efficiency, and flexibility in edge computing scenarios, which is vital for driving the widespread deployment of convolutional neural networks in edge computing and real-time applications.
This paper systematically explores the latest advancements in FPGA-based CNN acceleration technology, focusing on acceleration methods, architectural innovations, hardware optimization techniques, and hardware–software co-design frameworks, while summarizing performance evaluation metrics. Compared to prior work, the main contributions of this paper include the following:
(1) A quantitative comparison of existing CNN acceleration methods, guiding the selection of solutions tailored to different system requirements.
(2) A classification and elaboration of algorithm-level and hardware-level optimizations, systematically outlining optimization pathways and a three-stage collaborative design methodology.
(3) Multi-dimensional evaluation metrics for measuring the comprehensive performance of network models and hardware acceleration.
(4) An identification of future research directions for FPGA-based CNN acceleration and an outlook on the development of FPGA-accelerated neural networks.
It should be noted that although this paper focuses on CNN acceleration as its primary application, the collaborative design philosophy and optimization methods described herein possess high versatility and can be extended to other embedded system designs with stringent requirements for power consumption, latency, and computational efficiency.
The remainder of this paper is organized as follows: Section 2 provides an overview of CNN and FPGA fundamentals, including the evolution of CNNs and a comparison of FPGA characteristics versus other hardware platforms. Section 3 systematically elaborates on acceleration methods at the algorithmic and hardware levels. Section 4 focuses on co-design methodologies, covering algorithm mapping, design space exploration, and performance evaluation and modeling. Section 5 extends the discussion to other embedded application scenarios, analyzing the generalizability of the proposed methods. Section 6 summarizes mainstream performance evaluation metrics in existing research. Based on the preceding analysis, Section 7 dissects core challenges facing current FPGA-based CNN acceleration techniques and outlines future research directions. Finally, Section 8 concludes the paper. The article’s structure is illustrated in Figure 1.

2. Background Information

2.1. CNN

2.1.1. The Evolution of CNN

The development of convolutional neural networks can be broadly summarized into five key phases: theoretical inception, modern foundation, stagnation period, deep learning revolution, and diversified evolution. The specific evolutionary trajectory is outlined in Table 2.
As the cornerstone of modern computer vision, the evolution of convolutional neural networks embodies the deep integration of theoretical exploration and engineering practice. Its conceptual origins trace back to Hubel and Wiesel’s pioneering discovery of local receptive fields in biological visual systems [22]. This biological principle was materialized into a computational model—the neocognitron—in 1980 [23]. Although this model possessed the rudiments of hierarchical convolutions, its practical recognition capabilities remained limited due to the absence of an effective global optimization mechanism [24].
The true birth of modern CNNs is marked by LeNet-5, designed by LeCun et al. [25]. This network seamlessly integrated convolution, pooling, and backpropagation algorithms, establishing the first deep learning model capable of end-to-end training. It achieved remarkable success in tasks such as handwritten digit recognition.
Despite the significance of LeNet-5, the development of CNNs remained relatively stagnant over the following decade due to limitations in data scale and computational power at the time.
Entering the 21st century, with advances in big data and computational power, CNN development reached a turning point. AlexNet, designed by Krizhevsky et al. [26], leveraged the massive ImageNet dataset and the powerful parallel computing capability of GPUs, and employed innovations such as the ReLU activation function and Dropout regularization to triumph in the ImageNet competition, thereby ushering in the golden age of deep learning [27].
Subsequently, CNN architectures entered a phase of rapid iteration and diversified development. VGGNet demonstrated the critical role of network depth in performance by stacking small-sized convolutional kernels [28]. GoogLeNet enhanced the network’s multi-scale feature capture capability through its parallel Inception structure without significantly increasing computational burden [29]. In 2015, to address the challenges of vanishing gradients and performance degradation in ultra-deep networks, He et al. [30] introduced the landmark ResNet. Its core residual learning mechanism, enabled by shortcut connections, dramatically simplified the optimization process for deep networks, making the construction of models with hundreds of layers a routine practice. Following ResNet, CNN research converged on two major trends: First, application-specific architectural innovations, such as Redmon et al.’s [31] 2016 proposal of YOLO, which unified object detection into a single-stage regression task, achieving breakthroughs in real-time performance. Second, continuous optimization of general architectures, exemplified by the MobileNet series, which pursue extreme efficiency, and various works introducing attention mechanisms to enhance model representation capabilities. In recent years, a more disruptive transformation has emerged: the Vision Transformer (ViT) proposed by Dosovitskiy et al. [32] completely abandons traditional convolutional structures. It demonstrates that the Transformer architecture, based on self-attention mechanisms, possesses formidable competitiveness in computer vision, opening entirely new avenues of exploration for the field.

2.1.2. CNN Calculation

CNNs have achieved outstanding performance in visual tasks through hierarchical feature extraction mechanisms. The core architecture consists of convolutional, pooling, and fully connected layers, with components such as activation functions, batch normalization, and Dropout further enhancing the model’s nonlinear expressive power and generalization capabilities. Since the overall performance of CNNs heavily relies on the computational efficiency of these fundamental modules, optimizing their underlying operations has become crucial for accelerating CNN deployment.
The feature extraction capability of CNNs primarily stems from its core convolutional (CONV) operation. This operation employs learnable convolutional kernels to perform sliding-window computations on input feature maps, effectively capturing local spatial correlations. Subsequently, activation functions introduce nonlinearity into the network, enabling the model to learn complex distributions; pooling layers reduce feature dimensions through downsampling, endowing the model with a degree of translation invariance [33]. Ultimately, features abstracted through multiple layers complete classification decisions via fully connected layers. This end-to-end learning capability—transforming raw pixels into high-level semantic representations—enables CNNs to demonstrate accuracy and robustness far surpassing traditional machine learning methods across diverse visual tasks, including image classification [34], object detection [35], and semantic segmentation [36].
The convolutional layer is the core unit of a CNN. Its core operation involves performing a multiply-accumulate (MAC) operation on the input feature map and the convolutional kernel to generate the output feature map, as shown in Figure 2.
Here, $K_x \times K_y$ denotes the size of the convolution kernel; $M$ and $N$ represent the input and output channel counts, respectively; $W$ and $H$ correspond to the width and height of the input feature map, while $C$ and $R$ correspond to the width and height of the output feature map.
Each kernel of size $M \times K_x \times K_y$ is convolved with the corresponding set of parameters from the input feature map to generate a single pixel at the corresponding position in the output feature map. The kernel then slides across the input feature map at a preset stride, traversing all spatial positions to compute the complete output feature map. Its pseudocode is shown in Table 3.
If the stride of the convolution operation is $S = 1$, then the number of parameters $P$ in each convolutional layer (counting one bias per output channel) is as follows:
$P = N \times (M \times K_x \times K_y + 1)$
The number of multiplication operations in the convolutional layer is as follows:
$F = K_x \times K_y \times M \times C \times R \times N$
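To make the computation pattern concrete, the following minimal Python sketch (an illustrative reference implementation, not an FPGA design) expresses the sliding-window MAC loops for the case of stride S = 1 and no padding; the six nested loops correspond directly to the multiplication count F in Equation (2).

import numpy as np

def conv_layer(ifm, weights, bias):
    # ifm:     input feature map, shape (M, H, W)
    # weights: convolution kernels, shape (N, M, Ky, Kx)
    # bias:    one bias per output channel, shape (N,)
    M, H, W = ifm.shape
    N, _, Ky, Kx = weights.shape
    R, C = H - Ky + 1, W - Kx + 1           # output height and width (stride 1)
    ofm = np.zeros((N, R, C), dtype=np.float32)
    for n in range(N):                      # output channels
        for r in range(R):                  # output rows
            for c in range(C):              # output columns
                acc = bias[n]
                for m in range(M):          # input channels
                    for ky in range(Ky):    # kernel rows
                        for kx in range(Kx):  # kernel columns
                            acc += weights[n, m, ky, kx] * ifm[m, r + ky, c + kx]
                ofm[n, r, c] = acc
    return ofm

In total, the innermost statement executes exactly the number of multiply-accumulate operations given by Equation (2), which is the quantity that hardware acceleration seeks to parallelize.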

2.2. FPGA

2.2.1. Fundamental Principles of FPGA Technology

A field-programmable gate array (FPGA) is a large-scale integrated circuit that enables customization of hardware functionality through dynamic configuration. Its hardware framework, as shown in Figure 3, primarily consists of three programmable circuit elements—Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), and Programmable Interconnect Resources (PIRs)—alongside Static Random Access Memory (SRAM). As the fundamental unit for implementing logic functions, the CLB integrates components such as flip-flops, lookup tables (LUTs), and multiplexers, enabling the programming of complex logic and storage capabilities. IOBs serve as the interface between internal logic and external pins, handling signal input and output; PIRs provide abundant interconnect resources, precisely linking vast numbers of CLBs according to design requirements to construct customized digital circuits. The programming data for these three types of programmable resources are stored and controlled by configuration data within the SRAM, collectively realizing the complete system logic.
The operation of an FPGA is divided into two phases: configuration and execution. A bitstream is loaded into the chip’s internal SRAM during the configuration phase to define the hardware circuit structure. In the execution phase, the FPGA transforms from a general-purpose programmable device into dedicated hardware circuitry, efficiently processing input data streams according to predefined logic. This ability to execute complex algorithms and manage high-speed data directly at the hardware level highlights the FPGA’s unique advantages in high-performance and parallel computing tasks. More importantly, users can load new bitstreams at any time to redefine the FPGA’s functionality, enabling rapid adaptation to new application requirements without requiring hardware replacement.
The unique operating principles of FPGAs confer triple advantages in performance, flexibility, and development efficiency. First, their inherent spatial computing architecture enables large-scale, fine-grained parallel processing, delivering performance far surpassing that of general-purpose processors in compute-intensive tasks such as signal processing and data analysis. Second, their flexibility stems from the reconfigurable nature of SRAM, allowing systems to undergo functional upgrades or error correction even after deployment. Finally, FPGAs dramatically accelerate hardware prototyping by providing a rapid iteration platform for algorithm–hardware co-design, significantly shortening the time-to-market from proof-of-concept to final product realization. Consequently, FPGAs have become an indispensable technological cornerstone in critical domains including communications, high-performance computing, AI acceleration, and embedded systems.

2.2.2. Comparison of FPGA with Other Platforms

As semiconductor processes approach their physical limits, both the economic benefits of Moore’s Law and the energy efficiency gains from Dennard scaling have slowed, resulting in a bottleneck in performance growth for general-purpose processors. Meanwhile, the demand for computing power in emerging fields, such as deep learning, has exploded. This structural contradiction between supply and demand has driven the evolution of computing architectures from traditional general-purpose computing toward more efficient domain-specific architectures. Against this backdrop, academia and industry have explored various hardware acceleration solutions, including graphics processing units (GPUs), application-specific integrated circuits (ASICs), and FPGAs. Together with traditional CPUs, these form today’s diverse computing platform ecosystem. A comparison of the characteristics of these four platform types is shown in Table 4.
As the core component of modern computing systems, the CPU adheres to the von Neumann architecture. Its design focuses on handling complex logic and sequential tasks. In deep learning applications, CPUs are well-suited for control-intensive workloads such as model initialization, data preprocessing, and multi-component communication. However, their limited parallel processing capabilities prove inefficient when confronting the large-scale, regular matrix operations required by CNNs. Constrained by memory bandwidth, CPUs struggle to handle high-intensity computational tasks [37].
To address bottlenecks in parallel computing, GPUs emerged with their design featuring numerous parallel processing units for handling massive datasets. Capable of executing vast quantities of similar operations simultaneously, they excel during the training phase of deep learning models, significantly reducing the number of algorithm iteration cycles [38]. However, achieving peak GPU performance relies on high power consumption and complex cooling systems. This high energy demand limits their application in high-performance computing clusters. Deploying them on edge devices would result in significant performance degradation.
In terms of specialization, ASICs are designed to meet the requirements of specific products. The primary advantage of ASICs in deep learning applications lies in their customizability. By “hardwiring” computational logic onto silicon wafers, ASICs eliminate redundant components, significantly reducing chip area and power consumption compared to CPUs and GPUs [39]. However, this extreme optimization comes with trade-offs: first, high non-recurring engineering costs and lengthy development cycles; second, a complete loss of flexibility. Once manufactured, an ASIC’s functionality is permanently fixed, making it unable to adapt to the rapid evolution of algorithms. This poses significant commercial and technical risks in the dynamically evolving field of AI.
To address this core trade-off between performance and flexibility, FPGAs offer a unique solution based on hardware reconfigurability. In convolutional neural networks, FPGAs serve as a low-latency hardware platform that enables custom data path designs and optimized memory hierarchies, making them a valuable complement to CPUs and GPUs. Unlike GPUs, which rely on high peak parallel throughput, FPGAs can be loaded with configuration files to build data pathways and deep pipelines tailored to specific algorithms at the hardware level. This enables FPGAs to execute computations with extremely low latency and outstanding energy efficiency, particularly excelling in real-time-critical AI inference and resource-constrained edge computing scenarios [40]. Compared to ASICs, the reconfigurable nature of FPGAs allows products to iterate algorithms and fix defects via remote updates even after deployment. This avoids the inflexibility risk inherent in ASICs, thereby reducing development cycles and costs.
Therefore, in the post-Moore’s Law era, no single computing platform can perfectly meet all demands when facing the challenges of specialized computing in fields like deep learning. CPUs remain indispensable as control cores, while GPUs dominate model training with their powerful parallel computing capabilities. ASICs, on the other hand, represent the ultimate pursuit of performance and efficiency for specific tasks. Among these platforms, FPGAs achieve a remarkable balance between performance, power consumption, flexibility, and development costs through their unique hardware reconfigurability. Not only do they deliver acceleration performance and energy efficiency approaching that of specialized hardware, but they also rapidly adapt to evolving technological demands. This makes FPGAs a highly competitive, efficient, and flexible computing platform for AI inference, edge computing, and numerous emerging fields.

3. Optimization Technology

In the field of deep learning, high-precision CNN models are typically large and computationally intensive. Higher precision also translates to greater memory and storage demands, longer inference times, and increased energy consumption, making CNNs challenging to deploy on resource-constrained edge devices.
To address this issue, algorithmic optimization becomes critical. Its core objective is to significantly enhance model execution efficiency while preserving accuracy to the greatest extent possible. Algorithmic optimization focuses on systematically adjusting CNN models and their execution algorithms, encompassing strategies such as parameter quantization, model compression, and network architecture optimization. These approaches aim to improve computational and storage efficiency when processing large-scale data and complex network structures.
Furthermore, optimization methods extend beyond the model’s design to encompass deep adaptation and utilization of underlying hardware characteristics. This includes strategies such as optimizing data flow, improving storage schemes, fully leveraging hardware acceleration instruction sets, and adjusting parallel computing, all tailored to specific hardware platforms like FPGAs.
As shown in Figure 4, FPGA-based CNN acceleration techniques can be categorized into two main approaches: algorithm-level optimization and hardware-level optimization, based on different design philosophies and requirements. These measures work synergistically to enable models to tap into and leverage the potential of hardware resources more effectively, thereby enhancing overall execution efficiency.

3.1. Algorithm-Level Optimization

This section outlines mainstream algorithmic optimization methods for CNN acceleration, which are categorized into the following groups: pruning and quantization, model architecture optimization, computational reduction, and low-rank approximation.

3.1.1. Model Quantization

Model quantization is a critical algorithm optimization technique in deep learning research and applications. By converting model parameters and activation values from standard 32-bit floating-point format to low-bit-width fixed-point integer formats (e.g., 8-bit), quantization significantly reduces memory bandwidth pressure and accelerates inference processes. Its fundamental objective is to enable efficient model deployment on resource-constrained platforms, such as FPGAs, by establishing efficient mappings between floating-point and fixed-point numbers, all within acceptable accuracy loss margins. Fixed-point quantization, due to its logically straightforward nature, offers advantages over floating-point operations in terms of power consumption and resource costs [41,42,43], making it widely adopted in hardware acceleration designs.
Based on their core mapping methods, quantization techniques can be systematically categorized into two major types: linear quantization and nonlinear quantization. Table 5 visually illustrates the trade-offs between the two approaches regarding implementation and performance. Linear quantization offers high hardware compatibility and ease of deployment, while nonlinear quantization provides greater compression potential but requires custom hardware support.
Linear quantization, the most widely adopted quantization scheme today, has matured in industrial applications. Its implementation mechanism is straightforward, directly leveraging standard integer arithmetic logic units in hardware, resulting in exceptionally high computational efficiency. In particular, 8-bit linear quantization is highly regarded for its outstanding balance between model compression, performance acceleration, and precision preservation, and has gained native support from mainstream deep learning frameworks and AI chips. Its effectiveness in FPGA-based CNN acceleration designs has also been thoroughly validated: both [44,45] significantly reduced hardware resource requirements by converting 32-bit floating-point operations into compact 16-bit or 8-bit integer models, achieving enhanced computational speed and energy efficiency. At the hardware implementation level, ref. [43] emphasizes the importance of INT8 quantization for CNN acceleration on FPGAs. By replacing 32-bit floating-point operations with 8-bit fixed-point operations, it achieves reduced memory bandwidth and power consumption while maintaining an acceptable level of accuracy loss. However, traditional post-training quantization methods suffer severe performance degradation when facing more extreme compression demands (e.g., below 8-bit). To address this, ref. [46] proposes a quantization-aware training method for large language models. This approach employs a knowledge distillation framework that requires no original training data, successfully quantizing model weights and activations to 4-bit precision. Ultimately, it significantly reduces model size and memory consumption while delivering performance markedly superior to existing post-training quantization methods.
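As a minimal illustration of linear quantization, the sketch below applies symmetric per-tensor INT8 quantization, with the scale derived from the maximum absolute value; this is one common convention, whereas production toolchains typically add per-channel scales, zero points, and calibration data.

import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: x ≈ scale * q, with q in [-128, 127].
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Example: quantize a weight tensor and inspect the worst-case rounding error.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_q, s = quantize_int8(w)
max_err = float(np.abs(w - dequantize(w_q, s)).max())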
In contrast, nonlinear quantization has emerged as a research hotspot due to its superior ability to capture weight distribution characteristics. Typical methods include power-of-two quantization, K-means quantization, and extreme quantization. Among these, extreme quantization stands out as a significant branch, garnering considerable attention for its unique advantages. Binary quantization, within this branch, constrains parameters to just two values (e.g., ±1), which, while potentially causing noticeable accuracy loss, delivers extremely high speed and energy efficiency gains on reconfigurable platforms such as FPGAs [47]. Ternary quantization further introduces zero values, enabling implicit network pruning while reducing inference latency and offering potential for model accuracy restoration and enhancement [48]. A comparative analysis of various quantization methods is shown in Table 6.
In summary, model quantization demonstrates distinct advantages and trade-offs across two major technical approaches: linear and nonlinear. Linear quantization offers straightforward implementation, enabling efficient computation on standard hardware. At 8-bit precision, it strikes a good balance between model performance and compression ratio. In contrast, though more complex to implement, nonlinear quantization better captures intricate weight distributions, yielding exceptionally high energy efficiency on specific hardware. Consequently, selecting a quantization approach in practical applications is not a simple matter of superiority or inferiority but a decision-making process requiring careful balancing between implementation complexity, hardware compatibility, model accuracy, and compression potential.

3.1.2. Pruning

As a core technique for achieving sparse neural network model structures, pruning significantly reduces the number of parameters and the computational complexity by eliminating redundant parameters or structures. A typical pruning workflow is illustrated in Figure 5a: starting from a dense model, unimportant weights, neurons, or channels are removed according to their importance, and model performance is then restored through fine-tuning. This “train-evaluate-prune-fine-tune” cycle is often iterated to achieve higher compression rates.
Based on their operational targets, pruning methods can be categorized into two major types: structured pruning and unstructured pruning, also referred to as coarse-grained and fine-grained pruning, respectively. Figure 5b illustrates different sparse structures of the 4D weight tensor in a convolutional neural network, where regular sparse structures facilitate hardware acceleration.
Structured pruning involves removing entire structured modules from neural network architectures—such as convolutional kernels, channels, or even entire layers—rather than individual parameters. Early research [56] proposed structured sparse learning methods, demonstrating the basic feasibility of structured pruning. However, these approaches suffer from two critical shortcomings: first, they rely on manually designed pruning rules, lacking adaptive capabilities; second, they fail to account for the actual performance of pruned models across different hardware architectures. Building upon this, Reference [57] introduced learnable scaling factors for specific structures and applied sparsity regularization, achieving a 43% reduction in computational load with only a 4.3% increase in Top-1 error. Reference [58] frames structured pruning as a generative-adversarial game, proposing an unlabeled generative adversarial learning pruning method that achieves a 2.7-fold computational reduction with only a 3.8% increase in Top-5 error. Reference [59] focuses on meeting specific hardware budgets. Its proposed budget-aware regularization method achieves a 0.7% accuracy improvement after compressing the model volume by a factor of 16.
Research indicates that structured pruning generates smaller yet well-structured dense networks, offering outstanding hardware compatibility and making it ideal for resource-constrained embedded systems and edge devices. However, this coarse-grained removal approach has limitations: it may significantly impact model accuracy and cannot achieve the extreme sparsity levels attainable through unstructured pruning.
Unstructured pruning treats individual weight parameters in the network as independent pruning targets, eliminating (setting to zero) weights below a certain threshold based on importance scores (e.g., absolute magnitude). Reference [60] introduced constrained Bayesian optimization and cooling strategies, reducing the number of parameters in AlexNet by 95%. However, the Bayesian optimization approach incurs substantial computational overhead, rendering it unsuitable for resource-constrained environments. Reference [61] extended the classic “Optimal Brain Surgeon” pruning algorithm, compressing VGG-16 to 7.5% of its original size while increasing Top-1 accuracy by 0.3%. Reference [62] proposed a frequency-domain dynamic adaptive pruning scheme achieving 8.4x compression and 9.2x theoretical acceleration on ResNet-110 while improving accuracy by 0.12%. Reference [63] performed one-time pruning based on connection sensitivity, reducing Top-1 accuracy by only 0.47% after pruning 80% of parameters. Although fine-grained pruning at the weight parameter level offers high flexibility and accuracy, it requires specialized software and hardware for sparse matrix operations to achieve practical results. To address this bottleneck, Reference [65] proposed sparse convolutional units that store only quantized non-zero parameters and their coordinates, reducing sparse matrix storage requirements by over 60%.
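The basic operation shared by these magnitude-based schemes is sketched below: weights whose absolute value falls below a sparsity-determined threshold are zeroed, and a binary mask is retained so that fine-tuning does not revive them. This is a simplified illustration rather than a reproduction of any of the cited methods.

import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    # Unstructured pruning: zero out the smallest-magnitude weights.
    # sparsity is the fraction of weights to remove (assumed < 1.0).
    flat = np.sort(np.abs(weights).ravel())
    k = int(sparsity * flat.size)
    threshold = flat[k] if k > 0 else -np.inf
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.random.randn(128, 64, 3, 3).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.8)
# During fine-tuning, gradients of the pruned positions are multiplied by
# the mask so the zeroed weights stay zero.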
Table 7 summarizes and compares several pruning techniques. FLOPs (the number of floating-point operations) measure computational complexity, while Top-1/Top-5 accuracy represents the proportion of samples whose true label appears as the first or among the top five predicted categories in image classification tasks. CR denotes the ratio of the original model’s parameters to those remaining after pruning.
Although research on pruning techniques has long focused on metrics such as compression ratio and accuracy, these evaluations often overlook a critical issue: the adaptability of pruning strategies to hardware architectures. In fact, the core value difference between structured and unstructured pruning lies precisely in their respective adaptability to different hardware platforms and application scenarios.
Structured pruning, with its outstanding “hardware-friendliness,” is an ideal choice for general-purpose computing platforms. It enables efficient inference directly on existing general-purpose hardware (such as CPUs and GPUs) and standard computational libraries (such as BLAS), without requiring specialized software or hardware support. This characteristic makes it ideally suited for resource-constrained edge devices. For instance, in autonomous driving perception modules, smartphone image processing pipelines, and real-time diagnostic systems for industrial IoT, structured pruning delivers direct, effective performance gains without requiring hardware modifications [64].
In contrast, unstructured pruning pursues extreme compression ratios, but its full performance potential relies on coordinated optimization across both software and hardware layers. The resulting sparse weight matrix structure is irregular, often triggering inefficient memory access patterns on traditional hardware. This makes translating the reduction in theoretical computational load into tangible acceleration gains difficult. To overcome this bottleneck, Reference [65] proposed the “sparse convolution unit.” This unit significantly enhances sparse computation efficiency by storing and processing only non-zero parameters and their coordinates at the hardware level. Consequently, unstructured pruning is better suited for environments with customized computational capabilities, such as dedicated accelerator cards with FPGAs or ASICs, next-generation GPUs supporting sparse tensor operations, and high-performance computing clusters. It demonstrates significant potential in scenarios demanding extreme energy efficiency and inference speed, such as cloud-based inference and large-scale semantic understanding [66].
In summary, from the perspective of algorithm–hardware co-design, structured pruning is more suitable for deployment on highly versatile, non-modifiable edge devices. In contrast, unstructured pruning should pursue peak performance on platforms supporting sparse specialized computing. This distinction also provides clear selection guidance for pruning techniques across broader embedded and high-performance computing applications.

3.1.3. Model Architecture Optimization

The complexity of CNN architectures is primarily determined by three dimensions: depth, width, and branching structure. Together, these factors dictate the model’s expressive power and resource consumption. Network depth refers to the number of layers in the model. Increasing the number of layers enhances feature abstraction capabilities, but excessive depth may lead to vanishing or exploding gradients, complicating training. Network width denotes the number of neurons or channels per layer. Expanding the number of channels per layer enhances feature representation capabilities, but overly wide networks may result in redundant parameters and higher memory consumption. Branch structure refers to the design incorporating multiple parallel paths at different stages of the model. This architecture enhances model performance by fusing multi-scale information through parallel paths, but simultaneously increases computational complexity and storage requirements.
General principles indicate that increased architectural complexity correlates positively with enhanced model performance, yet this improvement is accompanied by a sharp rise in computational overhead and memory requirements. The core of model structure optimization lies in systematically balancing the network’s expressive power with resource efficiency, avoiding the overfitting and computational burdens caused by excessive complexity, while preventing overly simplistic structures that fail to adequately capture data features. Thus, seeking the optimal equilibrium between performance and efficiency remains critical in model structure optimization.
(1) Lightweight Network
Lightweight neural network models represent a significant research direction in computer vision and deep learning. Their core objective is to design innovative, efficient convolutional modules that replace computationally intensive traditional convolutions, thereby constructing neural networks with fewer parameters and reduced computational demands while maintaining high accuracy. This is crucial for deploying AI applications on resource-constrained platforms such as mobile phones, embedded devices, and autonomous driving systems.
To overcome the problem of computational cost surges caused by simply increasing network depth or width, Szegedy et al. [29] proposed GoogLeNet. Its core Inception module employs 1 × 1 convolutions for dimensionality reduction while utilizing multi-sized convolutional kernels in parallel. This approach pioneered the capture of multi-scale features within the same layer while maintaining low computational cost, enhancing both network depth and width without significantly increasing computational load.
The MobileNet series has garnered significant attention for its lightweight design. MobileNetV1, proposed by Howard et al. [67], introduced the landmark depthwise separable convolution, significantly reducing parameter count and computational complexity. However, its early versions exhibited limitations in accuracy retention; subsequent iterations like MobileNetV2 [68] further optimized performance through inverted residual connections and linear bottleneck structures. MobileNetV3 [69] then integrated neural architecture search (NAS), efficient activation functions, and attention mechanisms, achieving even higher accuracy while maintaining exceptional efficiency.
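The core building block of this family can be sketched as follows (a PyTorch illustration with arbitrarily chosen channel counts): a depthwise convolution applies one 3 × 3 filter per input channel, and a 1 × 1 pointwise convolution then mixes channels, replacing a single dense 3 × 3 convolution.

import torch.nn as nn

def standard_conv(c_in, c_out, k=3):
    return nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)

def depthwise_separable_conv(c_in, c_out, k=3):
    return nn.Sequential(
        # Depthwise: groups=c_in gives one k x k filter per input channel.
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        # Pointwise: 1 x 1 convolution mixes information across channels.
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
    )

# Weight count for 64 -> 128 channels with 3 x 3 kernels:
#   standard:  64 * 128 * 3 * 3        = 73,728
#   separable: 64 * 3 * 3 + 64 * 128   =  8,768   (roughly 8.4x fewer)
n_std = sum(p.numel() for p in standard_conv(64, 128).parameters())
n_sep = sum(p.numel() for p in depthwise_separable_conv(64, 128).parameters())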
Inspired by this, Megvii Technology’s ShuffleNet series further addressed the computational overhead of 1 × 1 convolutions by introducing a novel channel shuffle mechanism to optimize information flow [70]. Meanwhile, Han et al. [71] proposed GhostNet, which leveraged feature map redundancy to generate additional feature maps through simple linear operations. This approach significantly reduced computational complexity and parameter count while maintaining accuracy, offering an alternative strategy for lowering computational costs.
Additionally, there exist approaches that pursue extreme compression or entirely novel perspectives. SqueezeNet significantly reduces parameters by extensively replacing 3 × 3 convolutional kernels with 1 × 1 kernels through its Fire module, achieving remarkable model compression rates [72]. Meanwhile, Google’s EfficientNet series addresses the limitations of previous model scaling approaches from a systemic level by proposing a composite scaling theory. This unifies optimization across depth, width, and resolution, delivering substantial accuracy improvements under equivalent computational constraints [73].
In research on FPGA-based CNN acceleration technology, the design philosophy of lightweight models holds direct practical value. Given FPGAs’ limited computational and storage resources, lightweight techniques effectively reduce a model’s resource consumption. Consequently, applying lightweight models to FPGA platforms provides a viable technical pathway for accelerating neural networks and offers efficient deployment solutions for resource-constrained embedded scenarios.
(2) Knowledge Distillation
Knowledge distillation, as a model compression technique, aims to transfer knowledge from large, complex models (teacher models) to smaller, streamlined models (student models). Its objective is to enable student models to approximate the performance of teacher models while significantly reducing the number of parameters. Leveraging this advantage, knowledge distillation has been widely adopted in resource-sensitive scenarios such as mobile deployment, edge computing, and real-time inference.
Knowledge distillation was proposed by Hinton et al. in 2015. Its core idea involves training the student model using soft targets alongside hard targets. The soft targets are the prediction outputs of the teacher network, while the hard targets correspond to the original sample labels, i.e., the ground-truth labels. In his paper [74], Hinton recommends an empirical weighting ratio of 9:1 for soft and hard targets.
The classic framework of knowledge distillation, as illustrated in Figure 6, comprises three core components: a pre-trained teacher network with fixed parameters, a student network optimized with a more streamlined structure, and a composite guidance loss function. Knowledge distillation typically employs the mean squared error loss function as the guiding loss function for both the teacher and student networks.
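A minimal sketch of one common formulation is given below (in PyTorch): a temperature-scaled soft-target term is combined with a hard-label cross-entropy term, with the 9:1 weighting taken from the empirical recommendation above. Both the temperature and the weighting are tunable hyperparameters, and feature-based variants replace the soft-target term with a mean squared error between intermediate feature maps, as noted above.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)          # rescales gradients to match the hard term
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard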
In existing CNN acceleration research, addressing the limitations of knowledge distillation, researchers have proposed innovative solutions from diverse perspectives. To address the low knowledge transfer efficiency caused by excessive capacity gaps between teacher and student models, Reference [75] introduced a series of “teacher assistants” as intermediate bridges, supplemented by randomly dropping knowledge connections to suppress overfitting. In contrast, ref. [76] employs a counter-clockwise block-wise knowledge distillation framework to implement differential compression on the teacher network, enabling efficient knowledge transfer. To optimize knowledge absorption efficiency in the student network, ref. [77] designs two novel distillation strategies—TGKD and SAKD—which simplify the training process and enable the student network to autonomously learn inter-class relationships. However, these methods still face challenges in resource-constrained embedded scenarios. To address this, Reference [78] employs feature map cosine similarity as the distillation objective, applying an ultra-lightweight CNN to wearable medical devices. Similarly, ref. [79] utilizes feature-based knowledge distillation to successfully deploy lightweight student networks on FPGAs in industrial settings, demonstrating superior practical value.
(3) Layer Fusion
Layer fusion is a key optimization technique during the model inference stage. Its core principle involves merging consecutive layers within the computational graph that can be mathematically combined into a single layer. This approach reduces the number of computational steps, lowers memory access overhead, and enhances overall inference speed.
Among these, the most common batch normalization (BN) layer fusion is particularly crucial for hardware acceleration. For instance, both [80,81] incorporate BN layer parameters into preceding convolutional layers, eliminating the need to design dedicated FPGA hardware circuits for BN layers. This approach significantly conserves hardware resources and reduces data transmission. Beyond BN layer fusion, researchers have also explored other forms of hierarchical optimization strategies. Reference [82] optimizes the data path by placing a shortcut connection layer before the convolutional layer, enabling reuse of features in dynamic random-access memory (DRAM) and effectively reducing redundant data access. Reference [83] designed unique mixed-precision schemes and streaming convolution architectures for each layer, achieving parallel computation while reducing DRAM accesses. Furthermore, Reference [84] applied layer fusion concepts to the Fast Fourier Transform (FFT), eliminating redundant operations by merging pooling and convolution layers. These layer-specific optimizations are crucial for efficiently executing CNN models in hardware.
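For concreteness, the standard BN-folding arithmetic that such designs rely on is sketched below: the per-channel BN scale and shift are absorbed into the weights and bias of the preceding convolution, so the fused layer is mathematically equivalent to the original convolution-plus-BN pair at inference time (when the BN statistics are fixed).

import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # w: conv weights, shape (N, M, Ky, Kx); b: conv bias, shape (N,)
    # gamma, beta, mean, var: per-output-channel BN parameters, shape (N,)
    scale = gamma / np.sqrt(var + eps)         # per output channel
    w_fused = w * scale[:, None, None, None]   # scale each output filter
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused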
In summary, lightweight networks, knowledge distillation, and layer fusion are core strategies for optimizing model architecture, each addressing different challenges CNN models face in embedded deployments. Lightweight networks reduce inherent model complexity at the source; knowledge distillation enhances the performance ceiling of small models without altering the network structure; layer fusion further squeezes out hardware efficiency during deployment. In practical applications, these three techniques are often combined to achieve the optimal accuracy-efficiency trade-off under the stringent constraints of embedded systems. These optimization approaches exhibit high versatility, with their core principles extendable beyond computer vision to other embedded intelligent tasks such as speech recognition and natural language processing. This provides crucial methodological support for building efficient edge computing systems.

3.1.4. Reduction in Computational Load

In CNNs, convolutional layers have become the core focus for optimization due to their computationally intensive and resource-demanding nature, particularly when deploying to edge platforms like FPGAs. While traditional spatial convolution algorithms are intuitive, they are implemented through multiple nested loops, resulting in low computational efficiency. To fundamentally reduce the computational complexity of convolution and enhance throughput performance, convolution based on the Fast Fourier Transform (FFT) and the Winograd algorithm has emerged as a key optimization direction.
(1) FFT
The FFT method leverages the convolution theorem—that “time-domain convolution equals frequency-domain multiplication”—to shift computations to the frequency domain. It first transforms the input feature map and convolution kernel into the frequency domain via FFT, performs element-wise multiplication, and then converts the result back to the spatial domain using inverse FFT (IFFT). The effectiveness of this approach has been validated across multiple domains: In fault diagnosis, Reference [85] employs FFT for signal preprocessing to efficiently extract critical frequency-domain features, thereby enhancing the classification accuracy of CNN models. In signal processing, ref. [86] employs overlapping FFT convolution techniques, segmenting input data into blocks for separate FFT convolution. By shifting convolution computations to the frequency domain, this approach significantly reduces the computational complexity of CNNs. Within deep learning, ref. [87] utilizes a hybrid-basis FFT to effectively minimize redundant elements generated by feature map expansion, thereby reducing unnecessary computational and transformation overhead.
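A minimal single-channel sketch of this procedure is shown below (using NumPy’s real-valued FFT): both operands are zero-padded to the full linear-convolution size, multiplied element-wise in the frequency domain, and transformed back, after which the valid region is cropped. Note that the convolution theorem yields true convolution, whereas CNN layers compute cross-correlation, so the kernel would be flipped beforehand to match framework semantics.

import numpy as np

def fft_conv2d(x, k):
    # x: input feature map (H, W); k: convolution kernel (Kh, Kw).
    H, W = x.shape
    Kh, Kw = k.shape
    out_h, out_w = H + Kh - 1, W + Kw - 1          # full convolution size
    X = np.fft.rfft2(x, s=(out_h, out_w))
    K = np.fft.rfft2(k, s=(out_h, out_w))
    y_full = np.fft.irfft2(X * K, s=(out_h, out_w))
    # Crop to the "valid" region, matching a sliding-window convolution.
    return y_full[Kh - 1:H, Kw - 1:W]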
Although FFT has mature IP cores on FPGAs, it involves complex number operations and multiple data transformations, requiring substantial on-chip cache. Consequently, it is most suitable for specific scenarios involving large-sized convolution kernels. For small-size convolution kernels (such as 3 × 3 ) that dominate modern CNN architectures, FFT methods often prove inefficient. This has prompted researchers to explore more efficient alternative algorithms.
(2) Winograd
The Winograd fast convolution algorithm is currently one of the most efficient algorithms for small convolution kernels (especially the 3 × 3 size most commonly used in CNNs). By trading additional additions for fewer multiplications, it lowers FPGA resource consumption and improves computational speed. Although the underlying minimal filtering algorithms were proposed by Shmuel Winograd in 1980, the technique was first applied to neural network acceleration by Lavin et al. [88]. By implementing block-wise processing of input data, they demonstrated its ability to substantially reduce computational load in small-kernel scenarios. Building upon this foundation, subsequent research has proliferated: for instance, addressing the lack of efficient, configurable Winograd IP on FPGA platforms, ref. [89] proposed the Structured Direct Winograd (SDW) reconstruction algorithm for automated generation of highly compatible IP cores. Ref. [90] designed a dedicated Winograd acceleration architecture implemented on an FPGA, achieving a 38% reduction in resource consumption compared to traditional GEMM methods.
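As an illustration of the underlying arithmetic, the sketch below implements the one-dimensional transform F(2, 3) with the transform matrices used in Lavin and Gray’s formulation: two outputs are produced from a four-sample input tile and a three-tap kernel using four multiplications instead of six, and the two-dimensional case nests the same transforms.

import numpy as np

# Transform matrices for Winograd F(2, 3).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float32)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_f23(d, g):
    # d: input tile of 4 samples, g: 3-tap kernel -> 2 outputs.
    U = G @ g            # kernel transform (can be precomputed offline)
    V = B_T @ d          # input transform (additions only)
    M = U * V            # 4 element-wise multiplications
    return A_T @ M       # output transform -> 2 results

# Sanity check against direct sliding-window computation.
d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
g = np.array([0.5, 1.0, -1.0], dtype=np.float32)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)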
Although the Winograd transformation process increases logic resource consumption and introduces numerical precision challenges, the customizability of FPGAs makes them highly suitable for building specialized, efficient Winograd processing pipelines.
In contemporary deep learning practice, FFT convolution and the Winograd algorithm are not in direct competition but rather complementary. Given the design trend of modern convolutional neural networks to stack numerous small 3 × 3 convolution kernels, the Winograd algorithm is clearly preferred and has become the de facto standard acceleration technique. Meanwhile, FFT convolution retains its unique advantages for large convolutional kernels, remaining an indispensable and efficient solution for specific network architectures or tasks.

3.1.5. Low-Rank Approximation

In FPGA-based CNN acceleration, low-rank approximation serves as an algorithmic model compression and computational optimization technique. Rather than directly leveraging hardware parallelism for raw computations, it employs matrix decomposition methods, such as SVD, to decompose computationally intensive large convolutional kernels into sequences of smaller operations at the algorithmic level. This algorithmic transformation directly reduces the model’s parameter count and computational operations while generating computational patterns that are more readily pipelined and parallelized efficiently on FPGAs.
Take the classic decomposition of 2D convolutions as an example:
Suppose there is a standard $K \times K$ convolution kernel.
In the original operation, the parameter count is $K^2$, and the computational complexity per sliding window is $K^2$ multiplications.
In low-rank approximation, it can be decomposed into two consecutive, simpler convolutional operations: a vertical convolution of $K \times 1$ and a horizontal convolution of $1 \times K$, as shown in Figure 7. The combined parameter count and per-window computational complexity are $K \times 1 + 1 \times K = 2K$.
Compared to the original operation, the number of parameters and the computational complexity are thus reduced from $K^2$ to $2K$. For a $3 \times 3$ convolution kernel, this lowers the count from 9 to 6, a decrease of approximately 33%; for a $5 \times 5$ kernel, from 25 to 10, a 60% decrease.
This decomposition directly transforms a computationally intensive 2D convolution task into two sequences of significantly less computationally demanding 1D convolutions. The resulting computational pattern is highly favorable for FPGA pipeline designs.
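In practice, a simple way to obtain such a separable pair is a rank-1 truncation of the kernel’s singular value decomposition, as sketched below; the discarded singular values quantify the approximation error that subsequent fine-tuning must recover.

import numpy as np

def separable_approximation(kernel):
    # Rank-1 SVD approximation of a K x K kernel as the outer product of
    # a K x 1 vertical kernel and a 1 x K horizontal kernel.
    U, S, Vt = np.linalg.svd(kernel)
    v_kernel = (U[:, 0] * np.sqrt(S[0])).reshape(-1, 1)   # K x 1
    h_kernel = (Vt[0, :] * np.sqrt(S[0])).reshape(1, -1)  # 1 x K
    return v_kernel, h_kernel

k = np.random.randn(5, 5)
v, h = separable_approximation(k)
# Two 1D convolutions use 2K = 10 weights instead of K^2 = 25; the residual
# norm below indicates the accuracy cost of the rank-1 approximation.
error = np.linalg.norm(k - v @ h)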
Early studies, such as [91], employed linear combinations to replace standard convolution kernels, achieving a 2.5-fold increase in model inference speed without precision loss. While this approach demonstrated fundamental feasibility, it exhibited limitations, including a narrow range of decomposable patterns and a failure to account for hardware characteristics. Researchers have recently combined low-rank approximation with advanced techniques to propose more sophisticated optimization frameworks. For instance, ref. [92] addresses the limited compression effectiveness of single methods by introducing a two-stage hybrid approach that integrates low-rank approximation with channel pruning. By synergistically handling intra-filter and inter-filter redundancies, this method achieves significantly superior compression and acceleration performance compared to standalone approaches. In contrast, ref. [93] addresses the limitations of manually designed decomposition schemes by innovatively transforming the low-rank approximation problem into a neural architecture search (NAS) task. This approach automatically identifies optimal decomposition schemes for each layer, substantially improving model accuracy in data-constrained scenarios. Furthermore, ref. [94] employs co-design of algorithms and hardware, leveraging hardware performance feedback to guide the model’s low-rank decomposition, thereby achieving the optimal accuracy-throughput trade-off on FPGA. Additionally, some works focus on designing dedicated FPGA accelerators for decomposed low-rank structures, such as the approach proposed in [95].

3.1.6. Summary

This section reviews several mainstream algorithmic acceleration techniques and summarizes their key research contributions in Table 8.
Analysis indicates that while various methods aim to enhance model efficiency, each possesses distinct advantages and limitations. Notably, these techniques are not mutually exclusive but exhibit significant complementarity and synergistic potential. For instance, combining quantization with pruning achieves higher compression rates while maintaining accuracy; integrating quantization with knowledge distillation further elevates the performance ceiling of small models.
In practical deployments, the optimal strategy typically involves combining multiple techniques. For instance, building upon a lightweight network architecture, one can perform quantization-aware training and structured pruning, and finally leverage fast algorithms such as Winograd for end-to-end acceleration during hardware deployment. This multi-level collaborative optimization maximizes the synergistic benefits of the different techniques, achieving the best energy-efficiency balance in edge computing scenarios.
However, the effectiveness of algorithmic optimizations ultimately depends on robust support from underlying hardware. To translate theoretical performance gains into practical systems, algorithmic refinements must be tightly integrated with hardware architecture. Moving forward, we will shift our focus from the algorithmic to the hardware level, exploring how to further unlock the inference potential of CNN models on reconfigurable platforms like FPGAs through techniques such as memory optimization, data flow scheduling, and parallel architecture design.

3.2. Hardware-Level Optimization

FPGAs have emerged as one of the mainstream hardware platforms for accelerating CNN inference, owing to their inherent parallelism, high energy efficiency, and reconfigurable flexibility. Unlike designs for general-purpose processors, FPGA-based CNN acceleration fundamentally involves constructing customized hardware architectures tailored to specific CNN models. To fully unlock the potential of FPGAs, deep hardware-level optimization is crucial. These optimization strategies can be systematically categorized into three core directions: computational optimization, memory and data flow optimization, and hardware architecture optimization.
Computational optimizations such as quantization, pruning, and efficient convolution algorithms—due to their tight coupling with model algorithms—have already been discussed in previous sections of this paper. Therefore, this section focuses exclusively on pure hardware implementation strategies. First, memory and data flow optimization aims to overcome the “memory wall” bottleneck. This is achieved through micro-level techniques, such as designing efficient data flow architectures, optimizing loop mapping, and implementing ping-pong operations, to ensure data is supplied to the computational units efficiently. Second, hardware architecture optimization determines the overall form and performance ceiling of the accelerator, encompassing macro-level designs such as pipelining, parallel computing, and the overall layout of the accelerator. Table 9 visually illustrates the interrelationships and focal points of these three optimization strategies. The following sections discuss the latter two in detail.

3.2.1. Data Flow Optimization

In FPGA-based CNN acceleration designs, the significant performance gap between compute units and memory has become a core bottleneck constraining overall system performance and energy efficiency. As noted in [68], in resource-constrained edge devices, the time and energy overhead from frequent off-chip data exchanges may even exceed that of computation itself. To overcome this bottleneck, constructing an efficient data flow is the core task of hardware-level optimization. This section will delve into a series of key cooperative techniques for achieving this goal, including loop tiling, loop unrolling, and double buffering. These technologies enhance data reuse efficiency across multiple dimensions, aiming to achieve seamless coordination between computation and storage, thereby fully unleashing the parallel computing potential of FPGAs.
(1) Loop Tiling
Loop tiling, also known as loop blocking, is a key optimization technique for hardware acceleration, particularly suited to CNN models processing large-scale data [96,97]. Large-scale CNN acceleration designs rely on loading feature maps and weights from off-chip memory (typically DRAM), but frequent high-volume DRAM accesses significantly increase network-level latency [98].
To alleviate this bottleneck, loop tiling is introduced. Its core objective is to minimize memory access overhead and improve hardware resource utilization by making better use of on-chip buffers. Figure 8 illustrates the loop tiling scheme, which adds outer loops that divide the data into smaller tiles. Each tile is processed independently from on-chip storage, which not only reduces memory access overhead but also boosts overall execution speed by minimizing data transfers between main memory and the processing units.
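The sketch below shows the basic structure of a tiled convolution loop nest: outer loops iterate over output tiles, each tile's input data is first copied into small on-chip buffers, the tile is computed entirely from those buffers, and the result is written back. The layer shape, tile sizes, and buffer names are illustrative assumptions; in an HLS flow the buffers would typically map to BRAM and the copy loops to burst transfers.

```cpp
// Tiled 2D convolution loop nest (single layer, stride 1, valid padding).
// Only a TR x TC output tile and the corresponding input tile are held in
// "on-chip" buffers at any time.  All sizes and names are illustrative.
constexpr int N = 16, M = 16;      // output / input channels
constexpr int H = 32, W = 32;      // output feature-map size
constexpr int K = 3;               // kernel size
constexpr int TR = 8, TC = 8;      // spatial tile sizes

static float ifm[M][H + K - 1][W + K - 1];   // conceptually off-chip (DRAM)
static float wts[N][M][K][K];                // weight tile assumed cached on-chip
static float ofm[N][H][W];                   // conceptually off-chip (DRAM)

void conv_tiled() {
    static float in_buf[M][TR + K - 1][TC + K - 1];  // on-chip input tile
    static float out_buf[N][TR][TC];                 // on-chip output tile

    for (int r0 = 0; r0 < H; r0 += TR)               // loop over output tiles
        for (int c0 = 0; c0 < W; c0 += TC) {
            // 1) Load the input region needed by this output tile.
            for (int m = 0; m < M; ++m)
                for (int r = 0; r < TR + K - 1; ++r)
                    for (int c = 0; c < TC + K - 1; ++c)
                        in_buf[m][r][c] = ifm[m][r0 + r][c0 + c];
            // 2) Compute the whole tile from on-chip data only.
            for (int n = 0; n < N; ++n)
                for (int r = 0; r < TR; ++r)
                    for (int c = 0; c < TC; ++c) {
                        float acc = 0.0f;
                        for (int m = 0; m < M; ++m)
                            for (int i = 0; i < K; ++i)
                                for (int j = 0; j < K; ++j)
                                    acc += wts[n][m][i][j] * in_buf[m][r + i][c + j];
                        out_buf[n][r][c] = acc;
                    }
            // 3) Write the finished tile back to off-chip memory.
            for (int n = 0; n < N; ++n)
                for (int r = 0; r < TR; ++r)
                    for (int c = 0; c < TC; ++c)
                        ofm[n][r0 + r][c0 + c] = out_buf[n][r][c];
        }
}
```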
Selecting optimal tiling parameters is a core challenge in applying this technique, and researchers have proposed various optimization strategies to address it. For example, ref. [22] explores an optimal tiling strategy by establishing a latency analysis model to balance on-chip storage overhead against external memory accesses. Reference [99] proposes a more flexible “dynamic partitioning” approach that allows different tiling factors to be configured for each layer of the network; by better adapting to each layer’s computational and memory access demands, it achieves a 1.7-fold improvement in latency. In practice, loop tiling is often combined with techniques such as double buffering and pipelining, as demonstrated in [81,100], to hide data transfer latency and further enhance overall system throughput and efficiency.
Although loop tiling yields significant benefits, it also introduces more complex on-chip data paths and control logic, imposing higher demands on design and verification. In accelerator design, the performance gains and the additional resource overhead introduced by a tiling strategy must therefore be weighed quantitatively to meet the overall design objectives.
(2) Loop Unrolling
Loop unrolling is a widely used optimization technique in CNN hardware acceleration design, particularly in FPGA implementations. Its essence lies in merging multiple iterations of a loop into a larger operational unit, thereby reducing the total number of iterations and increasing computational parallelism, ultimately accelerating the execution of the loop body.
Convolution computations in CNNs typically take the form of deeply nested loops, resulting in high iteration counts when processing large-scale input data. Loop unrolling addresses this by transforming a single iteration into multiple parallel operations. In the six-level loop nest of the convolution operation shown in Table 3, unrolling can be applied to the two dimensions that expose parallelism across convolution kernels: Loop1 over the output feature map channels and Loop4 over the input feature map channels, as illustrated in Figure 9. This not only increases the computation performed per iteration but also reduces reliance on control logic such as loop counters and branch conditions, thereby improving execution efficiency and maximizing hardware parallelism.
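As a minimal illustration, the sketch below unrolls the output-channel and input-channel loops of a single output-pixel computation by factors PN and PM, exposing PN × PM independent multiply-accumulates per iteration. The factors, sizes, and names are illustrative assumptions; in Vitis-HLS-style code the same effect is usually obtained with an UNROLL directive plus array partitioning of the buffers.

```cpp
// Inner portion of the convolution loop nest with the output-channel and
// input-channel loops unrolled by factors PN and PM.  The PN x PM MACs in
// the body are independent and can map to PN*PM parallel DSP slices; in an
// HLS flow this is typically expressed with "#pragma HLS UNROLL" on the two
// channel loops.  Sizes and names are illustrative assumptions.
constexpr int N = 16, M = 16, K = 3;   // output channels, input channels, kernel
constexpr int PN = 4, PM = 4;          // unrolling (parallelism) factors

void conv_point_unrolled(const float in[M][K][K],   // receptive field of one output pixel
                         const float w[N][M][K][K],
                         float out[N]) {            // assumed zero-initialized by caller
    for (int n0 = 0; n0 < N; n0 += PN)              // Loop1: output channels, step PN
        for (int i = 0; i < K; ++i)
            for (int j = 0; j < K; ++j)
                for (int m0 = 0; m0 < M; m0 += PM) {  // Loop4: input channels, step PM
                    // Fully unrolled PN x PM block: independent MACs that can
                    // execute in the same clock cycle on parallel multipliers.
                    for (int pn = 0; pn < PN; ++pn)
                        for (int pm = 0; pm < PM; ++pm)
                            out[n0 + pn] += w[n0 + pn][m0 + pm][i][j] *
                                            in[m0 + pm][i][j];
                }
}
```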
In CNN acceleration design, designers can strategically select which dimensions to unroll based on their optimization objectives. For instance, addressing the limitations of existing unrolling strategies—such as single-dimensional approaches and the difficulty of balancing parallelism across different temporal levels—[101] proposes a combined strategy that unrolls the input- and output-channel loops spatially while unrolling the spatial dimensions of the output feature maps temporally. It achieves a throughput of up to 721.48 GOPS for the VGG-16 network on a Xilinx ZCU111 FPGA (San Jose, CA, USA). Reference [102] addresses the reliance of unrolling factors on manual experience and the lack of systematic search methods by proposing channel-level loop unrolling with an incremental spatial search that determines the factors rapidly; experiments show that the performance gap between parameters obtained by this automatic search and manually optimized results is only 11.58%. To enhance hardware design security without additional overhead, ref. [103] combines loop unrolling with register-level watermarking in high-level synthesis, achieving an anti-tampering strength of $3.27 \times 10^{150}$ with negligible area and latency overhead.
It is worth noting that the extent of loop unrolling must align with available hardware resources, such as logic units and memory bandwidth. Reference [104] suggests that excessive loop unrolling can result in wasteful FPGA resource utilization, thereby degrading system performance. Therefore, in the practice of CNN hardware acceleration, loop unrolling must be employed strategically to strike a balance between performance gains and efficient hardware utilization.
(3) Double Buffering
In FPGA acceleration designs, on-chip memory must not only handle data transfers with off-chip memory but also efficiently supply operands to processing elements (PEs) while storing the results of multiply-accumulate (MAC) operations. A single-buffered architecture causes PEs to idle while waiting for off-chip data loading, which severely impacts computational efficiency and increases latency.
Double buffering resolves this issue by configuring two buffers that operate in parallel, forming a “ping-pong” mechanism, as shown in Figure 10: while one buffer supplies data to the current computational layer, the other simultaneously fetches data for the next layer from off-chip memory. Upon completion of each computational layer, the two buffers switch roles, effectively masking memory access latency and keeping the PEs continuously busy.
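The following sketch captures the ping-pong control structure, using a host thread to stand in for the DMA engine that prefetches the next tile while the current one is being processed; the tile sizes, buffer names, and simulated load/compute bodies are illustrative assumptions only.

```cpp
#include <thread>
#include <vector>

// Ping-pong (double) buffering sketch: while the PE array computes on one
// on-chip buffer, the next tile is fetched from "off-chip" memory into the
// other buffer; the two buffers swap roles every iteration.  Tile counts,
// sizes, and the load/compute bodies are illustrative assumptions.
constexpr int TILE = 1024;
constexpr int NUM_TILES = 8;

static std::vector<float> dram(NUM_TILES * TILE, 1.0f);  // stands in for DRAM

void load_tile(int t, float* buf) {                // DMA transfer (simulated)
    for (int i = 0; i < TILE; ++i) buf[i] = dram[t * TILE + i];
}

float compute_tile(const float* buf) {             // PE-array work (simulated)
    float acc = 0.0f;
    for (int i = 0; i < TILE; ++i) acc += buf[i] * 0.5f;
    return acc;
}

float run_double_buffered() {
    static float buf[2][TILE];                     // the two on-chip buffers
    float result = 0.0f;
    load_tile(0, buf[0]);                          // prologue: fill buffer 0
    for (int t = 0; t < NUM_TILES; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        std::thread prefetch;                      // overlap load of tile t+1 ...
        if (t + 1 < NUM_TILES)
            prefetch = std::thread(load_tile, t + 1, buf[nxt]);
        result += compute_tile(buf[cur]);          // ... with compute on tile t
        if (prefetch.joinable()) prefetch.join();  // buffers swap roles next loop
    }
    return result;
}
```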
Numerous studies have successfully applied this mechanism to improve acceleration performance. For instance, ref. [105] employs a double-buffering structure both within computational units and between modules, ultimately achieving a 1.5x speedup with detection speeds reaching 15 FPS. Reference [106] applied the ping-pong cache mechanism to the feature map and weight caches, significantly enhancing the system’s overall throughput and efficiency. Addressing the challenge of frequent off-chip memory accesses for large-scale feature maps, Reference [107] proposed a block convolution method that divides feature maps into multiple small blocks and combines them with double buffering for iterative loading, minimizing off-chip memory access. Furthermore, ref. [108] incorporates double buffering into network architecture search to mask memory access latency, integrating latency and energy consumption into the loss function for optimization. This approach achieves 1150 FPS inference performance on the same platform, effectively addressing the lack of hardware performance metrics in traditional architecture search.
However, this mechanism also has its shortcomings. Compared to single buffering, double buffering results in a more complex hardware structure, increases the difficulty of designing memory controllers, and typically consumes more on-chip memory resources.

3.2.2. Hardware Architecture Optimization

(1) Parallel Computing Architecture
Parallel computing is the core technology for unlocking the potential of FPGAs in accelerating CNNs, particularly for computationally intensive convolutional layers. Simultaneously executing the same operation on multiple data items accelerates the computational process, significantly boosting the speed of data processing tasks. Unlike fixed-architecture processors, the flexibility and reconfigurability of FPGAs enable designers to deeply customize parallel computing architectures according to specific algorithmic requirements, thereby achieving optimal computational performance and resource utilization.
The inherently multi-layered nested loop structure of CNN convolutional layers provides a vast design space for parallelization strategies. By unrolling different loop dimensions, four fundamental parallelization patterns can be derived, as shown in Figure 11. These patterns emphasize different aspects such as data reuse, memory access behavior, and resource consumption, and are defined as follows:
Filter Parallelism ($P_f$): This strategy parallelizes the output channel ($P_n$) dimension. In CNNs, each input feature map is convolved with multiple filters to generate the corresponding output feature maps. Traditionally, this process is executed sequentially and fails to fully exploit hardware resources. Under filter parallelism, multiple computational units are deployed so that different filters process the same input feature map simultaneously, generating several output channels in parallel. Its primary advantage lies in efficiently reusing input feature map data, significantly boosting computational throughput.
Channel Parallelism ($P_c$): This strategy parallelizes the input channel ($P_m$) dimension. CNNs typically process multi-channel input feature maps. Traditionally, a convolution kernel computes the result for every channel independently and then sums them to produce the final output. Channel parallelism accelerates this process by performing the convolution and summation across different input channels concurrently, effectively reducing serial dependencies in the computation.
Pixel Parallelism ($P_v$): This strategy parallelizes the spatial dimensions ($P_w$, $P_h$) of the output feature map. During convolution, the kernel slides across the input feature map to compute results at different spatial positions. Since the computations at these positions are largely independent, they can be processed in parallel: multiple processing units use the same filter to compute pixels at different spatial locations of the output map simultaneously. Its advantage lies in efficient reuse of filter weights; however, owing to the sliding-window nature of convolution, it requires a more sophisticated on-chip buffer design to sustain the data supply.
Kernel Parallelism ($P_k$): This strategy parallelizes the spatial dimensions ($P_x$, $P_k$) within the convolution kernel. When computing a single output pixel, all multiply-accumulate operations within the kernel are processed in parallel. This fine-grained parallelism maximizes the reuse of local input data; however, the achievable parallelism is constrained by the kernel size, and it imposes higher demands on data storage.
In advanced acceleration designs, the four fundamental parallelization strategies above are rarely employed in isolation; instead, they are combined in various ways to construct hybrid parallel architectures. This paper systematically compares these strategies in terms of computational efficiency, resource consumption, and related trade-offs, helping researchers understand the compromises between approaches and design acceleration solutions suited to specific application scenarios. Table 10 compares the performance of the different parallel strategies. The analysis indicates that a single strategy struggles to meet the demands of complex networks, making hybrid parallel architectures the current mainstream research direction.
Explorations of hybrid parallel architectures exhibit diverse trends. For instance, Reference [109] implemented a combination of kernel, pixel, and filter parallelism ($P_k$, $P_v$, $P_f$), achieving a throughput of 107 GOPS on the Xilinx Zynq ZC706 platform, although its resource efficiency still leaves room for improvement. To better balance resource utilization and parallelism, the combination of channel and filter parallelism ($P_c$ and $P_f$) gained popularity owing to its flexibility, and some studies further integrated pixel parallelism ($P_v$) on this basis to achieve higher throughput. A few cutting-edge designs have even attempted to fuse all four strategies to maximize the exploited parallelism dimensions.
However, most hybrid architectures still employ fixed parallel configurations, which makes it challenging to achieve balanced resource utilization and computational efficiency across different network layers. To address this, dynamic parallelism approaches have been proposed to enhance architectural adaptability. Reference [121] introduces a method that dynamically adjusts the degree of parallelism based on the sizes of the input feature maps and filters, thereby maintaining stable, high computational throughput across layers.
Although dynamic parallelism mechanisms effectively mitigate load imbalance, their implementation introduces additional control complexity and resource overhead. To further evaluate the overall performance of different architectures, Figure 12 provides a systematic comparison of multiple CNN accelerator designs from an energy efficiency perspective.
Two primary trends can be observed in the figure: First, only a handful of designs achieve high energy efficiency at low power consumption, indicating that efficient computing within the low-power range remains a significant challenge. Second, across the overall distribution, most designs exhibit a clear positive correlation between energy efficiency and power consumption. That is, as power consumption increases, system throughput typically rises as well. This reflects the strong coupling between performance and energy efficiency in current designs.
Based on the above analysis, future research should focus on enhancing computational throughput while placing greater emphasis on energy efficiency optimization under low-power conditions. Efforts should be directed toward exploring co-design strategies that balance low power consumption with high throughput, thereby advancing high-performance inference acceleration in resource-constrained scenarios such as edge computing.
(2) Deep Pipeline Architecture
Deep pipeline architecture enhances system clock frequency and throughput by decomposing complex computations into multiple stages, thereby reducing logical complexity. During operation, data sequentially traverses each stage within the FPGA in a pipelined fashion, enabling overlapping execution over time. This creates highly efficient pipeline parallelism, significantly boosting overall throughput. This stage-based pipelining approach is particularly well-suited for computationally intensive tasks and has become a core technology in modern high-performance processor and specialized accelerator designs.
Based on the granularity or abstraction level at which pipelining is applied, these architectures can be categorized into the following common types: computational (arithmetic) pipelines, inter-layer pipelines, instruction-level pipelines, and hybrid pipelines. Table 11 compares their respective approaches.
Computational pipelines specifically refer to fine-grained pipelines implemented within a single, complex computational unit. For example, a 64-bit floating-point multiplier can be decomposed into stages such as exponent and mantissa multiplication, followed by normalization. By pipelining these sub-operations, the logic within each stage is simplified, which significantly increases the maximum operating clock frequency of the computational unit and thereby enhances its computational throughput.
Inter-layer pipelines specifically refer to coarse-grained pipelines formed by connecting multiple independent functional modules or processors. In CNN acceleration, a typical implementation involves creating independent hardware modules for the convolution layer, activation layer, pooling layer, and other components, and then connecting them in series. When the convolution layer finishes processing one batch of data and passes it to the next layer, it can immediately begin processing the subsequent batch. This approach achieves task-level parallelism, thereby maximizing the overall throughput of the accelerator and making it particularly suitable for processing streaming data, such as video.
The instruction-level pipeline is a fundamental concept in general-purpose processor design, referring to the hardware technique of decomposing the execution of a single instruction into multiple stages. In FPGAs, dedicated hardware execution units can be designed to perform specific arithmetic or logical operations and pipeline them for efficient execution. Through this approach, FPGAs can simultaneously process multiple instructions at different execution stages, thereby enhancing instruction-level parallelism.
Hybrid pipeline strategies enable designers to achieve optimal acceleration by flexibly combining multiple pipeline techniques according to specific network architectures and performance objectives. In practice, researchers have proposed a series of innovative solutions targeting different bottlenecks. To address the low resource utilization caused by the independent execution of computational layers in traditional designs, ref. [122] merged convolution, batch normalization, and activation layers into a unified computational unit, combined with the Winograd algorithm and block-based techniques; this approach significantly improved resource efficiency. To mitigate storage overhead caused by multi-layer data buffering, ref. [123] eliminated dedicated output buffers. Instead, it employed FIFO channels to feed each layer’s output directly into the next layer’s input buffer, effectively reducing on-chip memory consumption. Other designs focus on maximizing performance and versatility through multi-level pipelining. Reference [124] inserts fine-grained pipeline registers within modules such as convolution, pooling, and SoftMax, while leveraging inter-layer pipelining to overlap multi-channel feature map accumulation with off-chip memory access operations, effectively balancing computational speed and power consumption. Addressing the challenges of multimodal computing demands, ref. [125] designed a universal computing array to map convolutions and attention mechanisms uniformly. It also developed specialized acceleration units for various nonlinear and normalization operations, ultimately achieving efficient inference acceleration for ResNet-50 on the Xilinx XCVU37P platform.
Typical CNN accelerator designs often employ a hybrid pipeline architecture: at the top level, inter-layer pipelines chain the different network layers, enabling macro-level task parallelism, while at the bottom level, computational pipelines within the convolutional layers optimize each multiply-accumulate operation. This multi-level pipeline design jointly optimizes latency and throughput, which is key to achieving high-performance hardware acceleration.
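As a structural illustration of an inter-layer pipeline, the sketch below connects convolution, activation, and pooling stages through FIFO channels; std::queue stands in for the hardware FIFOs, and the placeholder stage bodies, window sizes, and names are assumptions. In an HLS flow, each stage would typically be a separate function communicating through stream objects and executing concurrently under a dataflow directive.

```cpp
#include <queue>
#include <vector>

// Structural sketch of an inter-layer (coarse-grained) pipeline: convolution,
// activation, and pooling are separate stages connected by FIFO channels, so
// each stage can start on the next element as soon as its predecessor emits
// one.  In hardware the three stages overlap in time; here they are invoked
// sequentially for clarity, and all stage bodies are placeholders.
using Fifo = std::queue<float>;

void conv_stage(const std::vector<float>& in, Fifo& out) {
    for (float x : in) out.push(0.9f * x + 0.1f);       // placeholder convolution
}

void relu_stage(Fifo& in, Fifo& out) {
    while (!in.empty()) {                               // consume as data arrives
        float x = in.front(); in.pop();
        out.push(x > 0.0f ? x : 0.0f);
    }
}

void pool_stage(Fifo& in, std::vector<float>& out) {
    while (in.size() >= 2) {                            // 1D max pooling, window 2
        float a = in.front(); in.pop();
        float b = in.front(); in.pop();
        out.push_back(a > b ? a : b);
    }
}

std::vector<float> run_pipeline(const std::vector<float>& ifm) {
    Fifo f1, f2;
    std::vector<float> ofm;
    conv_stage(ifm, f1);     // in hardware these stages process different data
    relu_stage(f1, f2);      // elements concurrently, forming the inter-layer
    pool_stage(f2, ofm);     // pipeline described above
    return ofm;
}
```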

3.2.3. Hardware Accelerator

In FPGA-based convolutional neural network acceleration designs, hardware optimization plays a crucial role. FPGA hardware accelerators effectively offload computationally intensive tasks from general-purpose processors through customized computing architectures. In a typical architecture, shown in Figure 13, the input buffer and weight buffer store the feature maps of the current network layer and the convolution kernel weights, respectively. The processing elements (PEs) in the compute array load portions of the data as needed for computation, with the generated output feature maps stored in the output buffer.
FPGA accelerators exhibit diversity in hardware architecture design. To adapt to various application scenarios and performance requirements, they are deeply customized for specific tasks. Hardware accelerator architectures are typically categorized into two main types: single-engine architecture and streaming architecture. Table 12 compares the advantages and disadvantages of these two architectures.
Single-engine architectures primarily target application scenarios demanding high flexibility and resource reuse rates. Their core concept involves employing a unified, powerful parallel computing engine to sequentially execute computational tasks across all CNN layers through time-division multiplexing. However, while early single-engine architectures achieved basic computational function reuse, they struggled to accommodate the distinct computational demands of training and inference. To address this, ref. [126] proposed a convolutional single-engine architecture that supports both CNN inference and training. This architecture efficiently reuses hardware by incorporating self-adding units and configurable addition trees, accommodating the computational demands of forward and backward propagation. To further enhance the architecture’s adaptability to diverse convolutional operations, Reference [127] designed a single-engine architecture supporting standard convolutions, depthwise separable convolutions, and pointwise convolutions. Reference [128] implemented a functionally similar single-engine using an overlay architecture on FPGAs, configuring a single processing pipeline via control words sent from the host to flexibly process different CNN layers sequentially.
Streaming architectures are primarily designed for data-intensive tasks. Their core concept involves mapping different network layers of a CNN onto a series of specialized hardware processing units arranged in series. Data is pipelined within these units, thereby minimizing global data movement and accelerating overall computation.
The performance of this architecture heavily relies on precise delay balancing between pipeline stages, thus limiting its flexibility and adaptability when dealing with different network structures. To enhance the architecture’s scalability, reference [129] proposed a streaming architecture based on dataflow functional decomposition, achieving deep pipelining and scalability across multiple FPGAs. However, this approach suffers from suboptimal resource utilization versus throughput trade-offs. To address this, reference [130] adopted a fully pipelined architecture, mapping each CNN layer to a dedicated hardware processing stage. Through parallelism optimization search algorithms, it achieved high throughput and resource utilization on the Xilinx KCU1500 platform. Building upon this, ref. [131] proposes a hybrid computation granularity scheme. This approach allows each computational node within the pipeline to possess independently optimized computation granularity, enabling flexible balancing between on-chip memory and off-chip bandwidth. Consequently, the architecture’s efficiency and adaptability are significantly enhanced.
Single-engine architecture and streaming architecture represent two distinct core design trade-offs. The former excels in versatility and resource reuse, easily adapting to diverse neural network models; the latter performs better for specific networks, delivering specialized data pathways optimized for ultra-low latency and high throughput. In practical accelerator design, the choice of architecture depends on the comprehensive requirements of the target application across multiple dimensions, including performance, flexibility, power consumption, and cost.

3.2.4. Impact of Different FPGA Platforms on CNN Acceleration Performance

In the design of FPGA-based CNN accelerators, the selection of the hardware platform is the cornerstone determining the success or failure of the project. Significant differences exist in the underlying architectures of products from different manufacturers (primarily Xilinx and Altera) and across various product series. These differences directly determine the accelerator’s performance ceiling, resource overhead, and power consumption. Therefore, when selecting an FPGA, it is essential to evaluate its internal critical resources based on application requirements comprehensively. The impact of different FPGA platforms on CNN acceleration performance is primarily reflected in the following aspects:
(1) Core Computing Resources: DSPs are the primary units for executing MAC operations. Xilinx’s DSP Slices (e.g., DSP48) excel in fixed-point operations, making them well-suited for CNNs’ dense multiply-accumulate operations. The Versal series integrates the AI Engine (AIE) as a dedicated AI processor, delivering performance and energy efficiency far surpassing traditional DSPs. In contrast, Altera’s DSP blocks offer superior support for floating-point operations, making them ideal for mixed-precision inference scenarios [132].
(2) On-chip Storage Architecture: Xilinx’s BRAM/URAM and Altera’s M20K each exhibit distinct characteristics in capacity and bandwidth, directly influencing block segmentation strategies and external memory access frequency during data flow optimization. These factors are pivotal in determining data reuse efficiency [133].
(3) Heterogeneous Integration: SoC FPGAs integrating ARM cores support collaborative processing of software and hardware tasks, making them suitable for embedded scenarios where control and computation tasks are separated. Meanwhile, the high-bandwidth memory (HBM) integrated into high-end devices (such as the Virtex and Stratix 10 series) provides bandwidth far exceeding traditional DDRs for large CNN models, effectively alleviating memory bottlenecks [134].
To illustrate these differences more clearly, we have summarized them in Table 13.
These architectural differences directly translate into real-world CNN application performance. For instance, the Xilinx platform excels in CNN inference tasks dominated by fixed-point operations, leveraging its highly efficient DSP architecture. HARFLOW3D [135] achieved 99.61% DSP utilization on the Xilinx ZCU102 platform, while FMM-X3D [136] reached a high throughput of 119.83 GOPS on the same platform, both demonstrating the advantages of Xilinx devices in building specialized computing engines. Conversely, Altera’s DSP modules leverage their floating-point capabilities and HyperFlex architecture to compete effectively in scenarios requiring mixed-precision computation and high clock frequencies. SpAtten [137] achieved a 13.8x throughput improvement on the Stratix 10 platform by harnessing its massive parallel computing power.
Based on the above analysis, the following platform selection recommendations are provided for different application scenarios:
(1) For edge computing and low-power scenarios, prioritize Xilinx Zynq or Altera Cyclone series SoC FPGA. These heterogeneous platforms, integrated with ARM processors, support efficient hardware/software task partitioning and are ideal for achieving high-energy-efficient embedded systems.
(2) For high-performance and cloud computing scenarios dominated by fixed-point, computation-intensive tasks, the Xilinx platform is recommended for the efficiency advantages of its DSPs and AIEs in fixed-point arithmetic. However, the Altera platform may be more suitable for mixed-precision large-model inference and cloud applications demanding extremely high peak performance.
(3) For large-model inference scenarios, select the Virtex or Stratix series with integrated HBM to meet the memory bandwidth demands of high-resolution, high-volume tasks.
In summary, there is no universal optimal FPGA solution. A successful design must be deeply customized to fit the specific characteristics of the target FPGA platform. As Xilinx’s Versal series continues to enhance its AIE capabilities and Intel’s Agilex series innovates in computational density and energy efficiency, FPGA platforms will deliver increasingly robust hardware-software co-development environments for CNN acceleration.

3.2.5. Summary

This section systematically reviews hardware-level optimization techniques for FPGA-based CNNs, presenting qualitative and quantitative analyses of these methods in Table 14. The acceleration achieved by each method is estimated from results reported in the literature. FPGA platform selection does not in itself constitute a design strategy and is therefore excluded from this comparison.
In summary, a top-tier FPGA-based CNN accelerator is not the result of a single optimization, but rather the synergistic effect of multiple techniques. The most efficient designs typically adopt a hardware architecture combining hybrid parallelism with deep pipelining. This core architecture is underpinned by sophisticated dataflow optimization strategies such as loop tiling and double buffering, ensuring a continuous and efficient data supply to the massive array of parallel processing units. Ultimately, the selection and tuning of these strategies depend entirely on the specific application’s comprehensive requirements for throughput, latency, power consumption, and flexibility.

4. Algorithm–Hardware Co-Optimization

To meet the growing demands for complexity and accuracy in neural architectures, the number of parameters and computational complexity of CNN models have increased significantly. While enhancing algorithm performance, this also imposes stringent requirements on computational and storage resources, posing challenges for deployment on edge devices constrained by power consumption, storage capacity, and computational power.
A fundamental contradiction exists between algorithm design and hardware implementation: algorithms, designed to solve complex problems, tend to adopt flexible, unrestricted computation and memory access patterns; however, hardware prioritizes efficiency by favoring algorithms with fewer computations and memory accesses, more balanced ratios, and more regular patterns. Historically, rapid advancements in hardware technology have masked this conflict. However, as Moore’s Law slows, general-purpose processors increasingly face challenges when handling irregular computations and memory accesses, manifesting as computational load imbalances and memory access contention.

4.1. Algorithm–Hardware Co-Optimization Method

Implementing CNN acceleration efficiently on FPGAs relies on hardware–software co-design. This collaborative design strategy involves two complementary aspects: hardware-oriented algorithm optimization and algorithm-oriented hardware customization.
(1) Hardware-Oriented Algorithm Optimization: At the algorithmic level, strategies such as network pruning and model quantization are employed to reduce computational complexity and storage requirements, making CNN models more suitable for efficient execution on resource-constrained FPGAs. For instance, quantization converts weights and activations into 8-bit fixed-point numbers (see the sketch after this list), while hardware-assisted pruning methods are designed to balance model accuracy and power consumption. Concurrently, network architecture optimization is crucial for maximizing the utilization of the FPGA’s inherent parallel processing capabilities.
(2) Algorithm-Oriented Hardware Customization: At the hardware level, the programmability of FPGAs is leveraged to perform optimizations tailored to specific CNN models or algorithmic features. This involves designing dedicated accelerators and optimizing memory architectures by utilizing on-chip memory and deep pipelining techniques to minimize external memory access latency and enhance system throughput.
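As a minimal illustration of the quantization step mentioned in item (1), the sketch below applies symmetric per-tensor 8-bit quantization and performs the dot product in integer arithmetic with a single rescale at the end; the rounding scheme, per-tensor granularity, and names are illustrative assumptions rather than the method of any cited work.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal symmetric per-tensor 8-bit quantization sketch: float values are
// mapped to int8 with a single scale factor so that MACs can run on narrow
// fixed-point DSP hardware.  Rounding and granularity are assumptions.
struct QTensor {
    std::vector<int8_t> data;
    float scale;                       // real_value ~= scale * quantized_value
};

QTensor quantize_int8(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    QTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.data.reserve(x.size());
    for (float v : x) {
        int r = static_cast<int>(std::lround(v / q.scale));
        q.data.push_back(static_cast<int8_t>(std::min(127, std::max(-127, r))));
    }
    return q;
}

// Integer dot product of two equally sized tensors with one final rescale,
// as an accelerator would do: the accumulator stays in 32-bit integers.
float dot_int8(const QTensor& a, const QTensor& b) {
    int32_t acc = 0;
    for (size_t i = 0; i < a.data.size(); ++i)
        acc += static_cast<int32_t>(a.data[i]) * static_cast<int32_t>(b.data[i]);
    return acc * a.scale * b.scale;
}
```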
In the algorithm–hardware co-design process, balancing the design objectives of algorithm effectiveness and hardware efficiency is crucial. The success of an algorithm is measured by upper-layer application metrics, such as accuracy and recall rate; conversely, the quality of hardware depends on underlying efficiency indicators, including latency, throughput, and power consumption. Section 6 will provide a detailed introduction to these evaluation metrics.

4.2. Algorithm–Hardware Co-Optimization Framework

In FPGA-based CNN acceleration research, the algorithm–hardware co-design framework can be divided into three interrelated layers: algorithm mapping and automated toolflow, design space exploration (DSE), and performance evaluation and modeling. Its architecture is illustrated in Figure 14. These three aspects collectively form a full-process optimization loop, spanning from high-level algorithm descriptions to low-level hardware implementations, ensuring design efficiency and practicality.

4.2.1. Algorithm Mapping and Automated Toolflow

Algorithm mapping serves as the cornerstone of hardware–software co-design, with its core task being the efficient conversion of high-level CNN algorithms into concrete hardware structures executable on FPGAs. The primary objective of this step is to ensure the algorithm runs efficiently on the hardware.
Depending on the optimization objective, mapping strategies primarily fall into two categories: one is customized mapping that pursues ultimate performance for specific models by designing highly optimized dedicated hardware logic to achieve maximum efficiency; the other is generalized mapping that emphasizes flexibility and development efficiency by constructing reusable, configurable hardware frameworks to accommodate multiple CNN models without requiring hardware recompilation.
In current practice, various highly automated CNN acceleration toolflows have achieved this complex mapping process. Table 15 summarizes mainstream FPGA-based CNN toolflows. These toolflows encapsulate the intricate technologies spanning from deep learning frameworks to hardware acceleration, enabling users to generate application-specific optimized accelerators without requiring deep hardware expertise.
As application demands and hardware capabilities evolve, CNN toolflows continue to drive synergistic innovation between algorithms and hardware. Addressing the lengthy development cycles and optimization challenges of traditional manual 3D-CNN accelerator design, HARFLOW3D—an open-source FPGA toolflow for 3D-CNNs proposed in [135]—enables automatic deployment of new models like X3D-M onto diverse FPGAs, outperforming manual designs in latency, energy efficiency, and model coverage. Despite the efficiency gains from automated toolflows, resource imbalance and performance bottlenecks persist when mapping complex networks to heterogeneous FPGA platforms. Addressing this, ref. [138] employs SDF-graph-driven partitioning, reconfiguration, and automatic DSE through fpgaHART to map 3D-CNNs across diverse FPGAs. Furthermore, addressing the high cost of cross-platform deployment, the FPG-AI framework proposed in [139] leverages end-to-end automation to achieve high-precision, low-resource, low-latency CNN deployment across multiple FPGA platforms, demonstrating industry-leading cross-platform portability and scalability.
Table 16 further summarizes these innovative toolflows. The data in Table 16 reveal that current automated toolflows exhibit significant performance stratification.
In terms of throughput, significant performance variations were observed across different toolflows. Among them, AWARE-CNN achieved the highest throughput of 271 GOPS on AlexNet, while FPG-AI reached only 7.30 GOPS on MobileNet-V1, highlighting disparities in network adaptability and optimization strategies among toolflows. Notably, FMM-X3D, designed for 3D-CNN, achieved 119.83 GOPS throughput on the X3D-M model, demonstrating the advantage of specialized tools on specific architectures. Regarding resource utilization, HARFLOW3D achieved 99.61% DSP utilization, approaching the theoretical upper bound, while mNet2FPGA reached only 31%. This highlights significant differences in the maturity of resource scheduling and allocation algorithms across different toolflows.
Newer toolflows, represented by HARFLOW3D and FMM-X3D, maintain high throughput while achieving resource utilization rates exceeding 80%, indicating that automated toolflows are gradually shifting from a single-minded focus on performance to a path that emphasizes both performance and efficiency.
In summary, the emergence of these innovative toolflows has significantly lowered the barrier to FPGA development, enabling users to generate customized CNN hardware implementations without deep hardware expertise, and has substantially advanced the integration of FPGAs into the broader deep learning ecosystem.

4.2.2. Design Space Exploration

In FPGA-based CNN acceleration designs, Design Space Exploration (DSE) plays a critical role as an automated optimization process. Its core objective is to identify the optimal accelerator configuration for specific CNN models and application requirements from a vast array of hardware parameter combinations, while strictly adhering to the target FPGA’s constraints on resources, power consumption, and bandwidth.
The flexibility and reconfigurability of FPGAs provide a vast design space, but also make it exceptionally difficult to find optimal solutions manually. Automated search algorithms, however, can systematically and efficiently explore this vast design space to find optimal or near-optimal solutions. These methods can be primarily categorized as follows:
(1) Exhaustive Search: Exhaustive search ensures the discovery of the global optimum by traversing all possible parameter combinations within the design space. However, due to the exponential growth of parameter combinations with increasing dimensionality, it is only applicable to highly simplified or constrained design spaces.
(2) Heuristic Search: As the most mainstream DSE strategy, it aims to find high-quality near-optimal solutions within vast search spaces efficiently. Standard heuristic algorithms include:
Simulated Annealing: Simulated annealing is a probabilistic algorithm for global optimization. It avoids getting stuck in local optima by selectively accepting “inferior” solutions early in the search, gradually converging as iterations progress.
Genetic Algorithms: Genetic algorithms are heuristic search methods that mimic biological evolution. Through repeated cycles of evaluation, selection, crossover, and mutation, they iteratively evolve toward optimal or near-optimal configurations.
(3) Deterministic Approach: This method directly derives the required hardware parameters from given high-level constraints through a series of mathematical equations and analytical models. It eliminates the need for exhaustive search, offering high computational efficiency. However, its accuracy heavily depends on the precision of the underlying models.
(4) Hybrid Strategy: Hybrid strategies combine the above approaches, typically starting with low-cost methods to rapidly narrow the search space before conducting detailed evaluations on a small number of high-quality candidate solutions. This approach strikes a balance between search efficiency and solution quality.
This paper systematically summarizes the aforementioned methods in Table 17.
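To make the search structure concrete, the sketch below performs a minimal exhaustive DSE over candidate unrolling factors: each point is checked against DSP and BRAM budgets and scored with a simple analytical cycle estimate, and the best feasible point is kept. The layer shape, budgets, candidate sets, and cost formulas are illustrative assumptions; practical flows replace them with calibrated models such as those described in Section 4.2.3.

```cpp
#include <cstdio>
#include <vector>

// Minimal exhaustive DSE sketch: enumerate candidate unrolling factors
// (Pn, Pm), discard points that violate DSP/BRAM budgets, and keep the
// configuration with the lowest estimated cycle count.  The layer shape,
// resource budgets, and cost formulas are illustrative assumptions.
struct Config { int Pn, Pm; double cycles; int dsp, bram; };

int main() {
    const int N = 64, M = 64, H = 56, W = 56, K = 3;   // layer shape (assumed)
    const int DSP_BUDGET = 512, BRAM_BUDGET = 256;     // platform budgets (assumed)
    const std::vector<int> factors = {1, 2, 4, 8, 16, 32};

    Config best{1, 1, 1e18, 0, 0};
    for (int Pn : factors)
        for (int Pm : factors) {
            int dsp  = Pn * Pm;                        // one MAC per DSP (assumed)
            int bram = 2 * Pm + 2 * Pn;                // double-buffered tiles (assumed)
            if (dsp > DSP_BUDGET || bram > BRAM_BUDGET) continue;  // infeasible point
            // Compute-bound cycle estimate: total MACs / parallel MACs per cycle.
            double macs   = 1.0 * N * M * H * W * K * K;
            double cycles = macs / (Pn * Pm);
            if (cycles < best.cycles) best = {Pn, Pm, cycles, dsp, bram};
        }
    std::printf("best Pn=%d Pm=%d cycles=%.0f DSPs=%d BRAMs=%d\n",
                best.Pn, best.Pm, best.cycles, best.dsp, best.bram);
    return 0;
}
```

Heuristic methods such as simulated annealing or genetic algorithms replace the two nested enumeration loops with guided sampling of the same design space when exhaustive traversal becomes infeasible.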

4.2.3. Performance Evaluation and Modeling

Within the algorithm–hardware co-design framework, performance evaluation serves as a critical validation step. Its core objective is to verify whether the hardware implementation meets predefined performance metrics and resource constraints. This is not a single test but rather a systematic engineering process comprising three core steps: establishing standards, constructing models, and applying optimizations. It provides reliable evaluation criteria for optimizing the entire co-design workflow.
Step 1: Establish evaluation criteria. This forms the cornerstone of the entire process—defining the multidimensional metrics used to measure design quality. These metrics typically span both software and hardware dimensions: encompassing application-level indicators such as accuracy and recall that gauge algorithm effectiveness, as well as physical-layer metrics like latency, throughput, power consumption, and resource utilization that assess hardware efficiency.
Step 2: Construct evaluation models. Resource and performance models are typically employed to evaluate designs rapidly during the DSE process. Resource models estimate hardware requirements for the target FPGA architecture, while performance models predict system performance based on hardware configuration and algorithmic characteristics. In practical applications, researchers construct objective functions using these models to holistically consider factors such as performance, resources, and power consumption. This enables systematic exploration of the design space to identify optimal configurations.
Performance Modeling: Performance models predict key metrics of CNN accelerators under specific design parameter combinations, such as throughput, latency, and power consumption. Addressing the challenge of accurately estimating overall latency when heterogeneous computing units collaborate—a difficulty for traditional evaluation methods—[150] models the accelerator’s total latency from perspectives including computational latency and DRAM transfer latency. Different computational approaches are adopted based on whether data can be fully cached on-chip. To further address the challenge of inaccurate latency prediction for convolutional computations in cryptographic scenarios, ref. [151] constructs a performance model tailored for homomorphic encrypted convolutions. This model quantifies computational latency through cryptographic parameters, convolutional parameters, and hardware specifications, guiding layer-wise algorithm selection. Ultimately, it enables the design of a low-latency, unified hardware architecture for HE-CNN inference acceleration.
Resource Modeling: Resource models predict the consumption of FPGA hardware resources—such as DSP, BRAM, and LUT—under specific design parameter combinations. Addressing the inaccuracies and post-physical synthesis dependency of traditional resource estimation methods, reference [152] employs parametric expressions to map structural parameters of convolutional and fully-connected layers to FPGA resource usage (BRAM, DSP, LUT, FF, etc.), enabling hardware parameter optimization and design space exploration. To enhance the accuracy and practicality of resource evaluation, reference [153] constructs a high-level resource model based on actual physical synthesis results, enabling rapid assessment of CNN accelerator power consumption and area, along with design space exploration.
Additionally, to address the issue of limited optimization effectiveness caused by the isolation between performance and resource assessments, references [109,148,154] integrate performance and resource modeling to establish a unified optimization framework. These models predict DSP and BRAM utilization, execution cycles, and throughput through mathematical modeling, thereby maximizing resource utilization and computational efficiency.
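The sketch below outlines the typical shape of such a unified model: a latency estimate taken as the maximum of a compute-bound term and a memory-bound term, plus coarse DSP and BRAM estimates that a DSE loop can query. All coefficients and formulas are illustrative assumptions, not a calibrated model for any specific device or design.

```cpp
#include <algorithm>

// Skeleton of a combined performance/resource model of the kind used inside
// DSE loops: compute-bound latency from the MAC count and parallelism,
// memory-bound latency from the data volume and DRAM bandwidth, and coarse
// DSP/BRAM estimates.  All coefficients are illustrative assumptions.
struct Layer  { int N, M, H, W, K; };                 // output ch, input ch, size, kernel
struct Design { int Pn, Pm; double clock_hz, dram_bytes_per_s; };

double latency_s(const Layer& l, const Design& d) {
    double macs = 1.0 * l.N * l.M * l.H * l.W * l.K * l.K;
    double compute_cycles = macs / (d.Pn * d.Pm);      // compute-bound term
    double bytes = 1.0 * (l.M * l.H * l.W              // input feature map
                        + l.N * l.H * l.W              // output feature map
                        + l.N * l.M * l.K * l.K) * 2;  // 16-bit data assumed
    double t_compute = compute_cycles / d.clock_hz;
    double t_memory  = bytes / d.dram_bytes_per_s;     // memory-bound term
    return std::max(t_compute, t_memory);  // perfectly overlapped transfers assumed
}

int estimate_dsp(const Design& d)  { return d.Pn * d.Pm; }   // one MAC per DSP (assumed)

int estimate_bram(const Layer& l, const Design& d) {
    // Double-buffered tile storage, very coarse: proportional to parallelism.
    return 2 * (d.Pn + d.Pm) + l.K * l.K;
}
```

An objective function for DSE can then weight the predicted latency against the estimated resource usage, as done in the unified frameworks cited above.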
Step 3: Apply Improvement Strategies. Based on the precise evaluation results from the model, designers can identify performance bottlenecks and implement targeted improvement strategies to enhance overall performance. Through continuous iterative optimization of the design, it ultimately meets all predefined evaluation criteria.
Throughout the co-optimization process, these three elements are not isolated but form a complete closed-loop system: design space exploration guides algorithmic mapping, performance evaluation and modeling provide feedback for design space exploration, and the effective implementation of the entire process relies on robust automated toolchain support. Decomposing the algorithm–hardware co-optimization framework into these three core components helps designers systematically tackle complex design challenges, ultimately achieving efficient CNN hardware accelerators.

4.2.4. Case Studies of Collaborative Design

Although the methodology is sound in principle, the practical value of the collaborative design framework must be demonstrated through real-world deployments across diverse platforms, networks, and constraints, verifying that it can consistently produce energy-efficient, low-latency, high-accuracy accelerators.
Several representative studies in recent years have collectively established a cross-scale chain of evidence demonstrating the framework’s capabilities. These cases span sparse 3D video networks, ultra-low-bit-width CNNs, and even lightweight Transformers for visual tasks. Collectively, they show how collaborative design can boost energy efficiency and throughput by 2 to 13 times while keeping accuracy loss below 1%, providing robust empirical support for the framework’s applicability across diverse scenarios, from cloud to edge computing and from ASICs to automotive-grade chips. Specifically:
Addressing the computational intensity and complex memory access patterns of traditional 3D-CNN, HARFLOW3D-X3D maps 3D-CNN onto the ZCU102 platform as time-multiplexed subgraphs, reducing theoretical computational load by 31%. Subsequently, a genetic algorithm explored 270,000 configuration combinations within 6 h, ultimately identifying a design achieving 99.6% DSP utilization and 20.0 GOPS/W energy efficiency. This solution reduces latency by 42% compared to the manually designed baseline [135].
To address the computational overhead of Transformer attention mechanisms, SpAtten found during the algorithm mapping phase that the computational load could be reduced eight-fold when attention-head sparsity exceeded 75%. By jointly optimizing the sparsity thresholds and on-chip cache depth with simulated annealing during the DSE phase, it achieved a 13.8-fold throughput improvement on Stratix-10 hardware while the F1 score declined by only 0.3% [137].
Eyeriss v2 achieves 16x structured pruning on a 65 nm process through budget-aware regularization for deployment scenarios with stringent resource constraints on edge devices. Its DSE phase employs a two-step “power filtering + PE enumeration” strategy, ultimately delivering 2.4 TOPS/W energy efficiency, a 38% area reduction, and a 0.7% improvement in Top-1 accuracy [155].
To explore the balance between performance and precision under extreme quantization, FINN-R quantizes weights to 1 bit on the Zynq-7020 platform. Through exhaustive search combined with Pareto frontier analysis, it evaluated 21,000 configurations within 30 min. At 12.3 GOPS performance and 0.8 W power consumption, MNIST accuracy remains at 99.2%, with throughput increased by 2.7 times compared to the 8-bit design [156].
To meet the stringent low-latency and high-reliability demands of mobile visual tasks, MobileViT-v3 introduces Adaptive Token Sampling (ATS) and a linear attention mechanism to significantly reduce computational complexity. Within its DSE, it performs Pareto optimization between latency and accuracy using heuristic search strategies based on deep performance profiling of the target platforms. Experiments on the ImageNet dataset demonstrate that this design achieves up to 2.3 times faster inference than standard MobileViT-v2 with only a 0.4% loss in Top-1 accuracy, while exhibiting outstanding energy efficiency [157].
The specific analysis of the five case studies is shown in Table 18. The data reveals that while these studies exhibit distinct characteristics in target platforms, algorithm mapping strategies, and DSE methods, they achieve significant performance improvements through the collaborative design framework. Notably, all cases substantially enhance energy efficiency and throughput while maintaining accuracy loss at an extremely low level (≤1%), fully demonstrating the effectiveness of the collaborative design framework in balancing performance and accuracy.
The five empirical cases spanning diverse network architectures and hardware platforms—from 3D-CNNs to Transformers, and from cloud-based FPGAs to embedded SoCs—collectively provide robust cross-platform and cross-model support for the algorithm–hardware co-design framework. These quantitative results validate the framework’s universality and effectiveness and establish a reusable engineering paradigm for future accelerator innovations targeting large models, multimodal data, and heterogeneous chiplets. In other words, the era of algorithm–hardware co-optimization has arrived, and “joint optimization” centered on three-stage collaboration is becoming the inevitable choice for continuously unlocking the potential of AI hardware.

5. Extended Analysis for Diverse Embedded Applications

Although this paper focuses on FPGA-based CNN hardware acceleration, the design principles and optimization methods discussed here possess broad applicability and hold significant reference value for other computationally intensive embedded system designs. To illustrate the generality of these optimization techniques, this section systematically analyzes their application value and optimization strategies in critical embedded AI scenarios such as medical diagnostics, industrial maintenance, and radar sensing.
Medical Diagnostics Domain: To address the demand for low-power, scalable, and high-precision auxiliary diagnostic solutions in the early diagnosis of respiratory diseases, Reference [158] proposes a multimodal CNN architecture named RespiratorNet. At the algorithmic level, this solution employs low-bit quantization and hyperparameter optimization; at the hardware level, it achieves ultra-low power consumption of just 245 mW on a Xilinx Artix-7 100T FPGA through an expandable parallel processing engine, delivering an energy efficiency of 7.3 GOPS/W—4.3 times better than the then state-of-the-art solution. Furthermore, when deployed on the NVIDIA Jetson TX2 SoC, the model demonstrates excellent cross-platform scalability.
Industrial Predictive Maintenance Domain: Addressing the challenges of low maintenance efficiency for traditional bearings, unreliable data acquisition, and insufficient real-time analysis capabilities in Industry 4.0, Reference [159] developed an FPGA-based industrial IoT edge device. Algorithmically, it employs an optimized CNN for high-precision fault prediction. Hardware-wise, it integrates an MFCC preprocessing IP core onto the Pynq-Z2 FPGA board to convert vibration signals into two-dimensional features. Leveraging the Intel NCS2 accelerator for CNN inference, it achieves collaborative diagnosis through software-hardware synergy. Ultimately, the system achieves end-to-end latency below 2 milliseconds, meeting the stringent real-time demands of industrial scenarios.
Radar Perception Domain: To address computational and energy efficiency challenges in radar signal recognition under low-power, real-time constraints, Reference [160] constructs a hybrid neural network model combining a one-dimensional convolutional neural network (1D-CNN) with a long short-term memory (LSTM) network. At the algorithmic level, data and parameters use 16-bit fixed-point representation, with lookup tables implementing the nonlinear activation functions. At the hardware level, a 32 × 64 systolic array reuses the same computational units for parallel processing of convolutions and matrix multiplications, and a dedicated 9-instruction instruction set enables flexible model deployment. Implemented on a Xilinx XCKU040 FPGA, the system achieves 7.34 GOPS throughput at just 5.022 W, delivering an energy efficiency of 1.46 GOPS/W, 73 times that of a CPU and 9 times that of a GPU, making it suitable for demanding scenarios such as satellite-borne and portable applications.
An analysis of the aforementioned application scenarios can derive a universal co-design methodology for embedded FPGA acceleration. As shown in Table 19, different application domains prioritize distinct algorithm selections and optimization strategies, yet all adhere to the core principle of unified algorithm–hardware co-optimization.
Most CNN acceleration applications face similar challenges: achieving real-time large-scale data processing under strict power and latency constraints. This section’s analysis demonstrates that algorithm–hardware co-design possesses strong universality and scalability. Its core philosophy—optimizing algorithms based on hardware characteristics and customizing hardware according to algorithm features—effectively adapts to various embedded AI application scenarios. This approach provides an effective solution to CNN deployment challenges and offers a systematic design methodology for achieving high performance in other cutting-edge embedded applications.

6. Evaluation Framework

In the accelerated design of CNN, a comprehensive set of evaluation metrics is central to verifying their ultimate effectiveness. For FPGA-based CNN acceleration solutions, the evaluation framework typically encompasses two dimensions: network model performance assessment and hardware acceleration performance assessment.

6.1. Network Model Performance Evaluation

(1) Accuracy
Accuracy is the core performance metric for evaluating the effectiveness of an optimized model. When deploying CNN models onto FPGA platforms, optimization techniques such as model compression and data quantization are typically employed to reduce computational complexity and the number of parameters. While these techniques enhance hardware efficiency, they often come at the cost of sacrificing some model accuracy, potentially diminishing predictive capabilities. Therefore, precisely evaluating the accuracy of optimized models to ensure they meet application requirements is of paramount importance.
By definition, accuracy refers to the proportion of correctly predicted samples in a CNN model relative to the total number of samples. Its general calculation formula is as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$
Here, $TP$ and $TN$ denote the positive and negative samples correctly predicted by the model, while $FP$ and $FN$ denote the samples incorrectly predicted as positive and as negative, respectively.
For multi-classification tasks, accuracy evaluation typically employs two more granular metrics to comprehensively assess the degree of alignment between model predictions and true labels:
Top-1 Accuracy: The proportion of instances where the model’s top-ranked predicted category exactly matches the true label. This is the most stringent and most commonly used accuracy metric.
Top-5 Accuracy: The proportion of samples for which the true label appears among the five categories with the highest predicted probabilities. This metric is more inclusive in scenarios where categories share semantic similarity or where prediction results allow a certain degree of ambiguity.
This paper compares the two in Table 20.
Although accuracy is a highly intuitive metric, its application has limitations. When the class distribution of the dataset is imbalanced, relying solely on accuracy may fail to fully reflect the model’s actual performance and can even be misleading. Therefore, when evaluating model performance, it is often necessary to incorporate additional metrics, such as precision and recall, for a comprehensive assessment.
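To make the Top-1/Top-5 definitions above concrete, the following minimal Python sketch (using NumPy, with purely hypothetical scores and labels) computes top-k accuracy from a matrix of prediction scores:

import numpy as np

def top_k_accuracy(logits, labels, k=1):
    # logits: (num_samples, num_classes) prediction scores; labels: (num_samples,) true class indices
    top_k = np.argsort(logits, axis=1)[:, -k:]          # indices of the k highest-scoring classes
    hits = np.any(top_k == labels[:, None], axis=1)     # True where the true label is among them
    return float(np.mean(hits)) * 100.0                 # percentage, matching the formula above

scores = np.array([[0.1, 0.2, 0.4, 0.2, 0.1],
                   [0.5, 0.1, 0.1, 0.2, 0.1],
                   [0.1, 0.1, 0.1, 0.1, 0.6]])
truth = np.array([2, 3, 4])
print(top_k_accuracy(scores, truth, k=1))   # Top-1 accuracy (about 66.7 for this toy example)
print(top_k_accuracy(scores, truth, k=5))   # Top-5 accuracy (100.0, since only five classes exist)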
(2) Precision and Recall
To address the limitations of accuracy in scenarios such as data imbalance, precision and recall must be introduced to provide a more comprehensive evaluation of model performance.
Precision is a key metric for evaluating the accuracy of a model’s predictions. It is defined as the proportion of samples that are true positives among all samples predicted as positive by the model:
$$\text{Precision} = \frac{TP}{TP + FP}$$
However, precision focuses solely on samples predicted as positive, ignoring positive samples incorrectly predicted as negative ($FN$). This limitation prevents it from measuring whether the model misses positive examples that should have been identified. In critical tasks where missed detections carry extremely high costs, such as early lesion diagnosis in medicine or severe defect detection in industrial inspection, high precision alone is far from sufficient to guarantee system reliability.
To address this issue, recall is used to measure a model’s ability to identify all true positive samples. It is defined as the proportion of all true positive examples that are successfully predicted by the model, calculated as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
In summary, precision and recall evaluate model performance from different perspectives, with each constraining the other. Pursuing high precision may lead to a decline in recall, and vice versa. Therefore, in practical applications, optimization goals should be set based on specific business requirements and risk preferences, or comprehensive metrics such as the F1 score should be used to seek the optimal balance.
(3) F1 Score
There exists an inherent trade-off between precision and recall, where improving one often comes at the expense of the other. From the perspective of the model’s decision threshold, relaxing the criteria to capture more positive examples (increasing recall) almost inevitably introduces more false positives, thereby harming precision. Conversely, tightening the criteria to ensure that each positive prediction is as accurate as possible (increasing precision) inevitably causes more hard-to-classify genuine positives to be missed, reducing recall.
To scientifically balance this trade-off and evaluate a model’s overall performance with a single metric, the F1 score [166] was introduced. The F1 score is defined as the harmonic mean of precision and recall. Compared with the simple arithmetic mean, the harmonic mean penalizes extreme cases more strictly: when either precision or recall performs poorly, the F1 score becomes correspondingly low.
The specific calculation formula is as follows:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
A high F1 score indicates that the model has achieved a good balance between precision and recall, serving as strong evidence of its overall performance and robustness.
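For reference, the four classification metrics above can be computed directly from confusion-matrix counts. The following minimal Python sketch uses hypothetical counts chosen to mimic an imbalanced test set and illustrates how high accuracy can coexist with modest precision, recall, and F1:

def classification_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1 computed from confusion-matrix counts
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced example: 1000 samples, only 15 true positives in the ground truth
print(classification_metrics(tp=10, tn=980, fp=5, fn=5))
# accuracy = 0.99, but precision = recall = f1 ≈ 0.67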

6.2. Hardware Acceleration Performance Evaluation

After ensuring the model’s accuracy meets application requirements, the focus of evaluation shifts to the execution efficiency and resource consumption of the hardware acceleration itself. To this end, a series of hardware performance metrics—such as throughput, latency, power consumption, energy efficiency, and resource utilization—are widely employed to quantify and compare FPGA-based CNN acceleration designs.
(1) Latency
Latency is the core metric for measuring the responsiveness of CNN hardware accelerators. It is defined as the total time elapsed from data input to the output of the result. In terms of composition, the total latency is the sum of the delays across all layers in the network model:
$$\text{Latency}_{\text{total}} = \sum_{i=1}^{L} \text{Latency}_{\text{layer},i}$$
Here, $L$ represents the total number of layers in the neural network, and $\text{Latency}_{\text{layer},i}$ denotes the time required to process the i-th layer.
For single-layer computation, its latency can be further decomposed into:
$$\text{Latency}_{\text{layer}} = \text{Latency}_{\text{off-chip}} + \text{Latency}_{\text{on-chip}} + \text{Latency}_{\text{compute}}$$
Here, $\text{Latency}_{\text{off-chip}}$ represents the off-chip transfer delay, i.e., the time required for model parameters and input/output feature maps to move between off-chip memory and the FPGA chip; it is jointly determined by the volume of model data and the memory interface bandwidth. $\text{Latency}_{\text{on-chip}}$ denotes the on-chip transfer delay, i.e., the data transfer time between different functional modules within the chip, which is influenced by the efficiency of the hardware interconnect architecture and the data routing strategy. $\text{Latency}_{\text{compute}}$ is the computation latency, i.e., the time required to execute the core operations of the layer, such as convolution or pooling; it depends on the number of operations demanded by the model and the degree of parallelism achievable by the hardware architecture.
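As a simple illustration of this decomposition, the short Python sketch below sums hypothetical per-layer off-chip, on-chip, and compute delays into a total latency; the numbers are assumptions, not measurements of any surveyed design:

# Hypothetical per-layer timings in seconds; real values would come from profiling the FPGA design
layers = [
    {"name": "conv1", "t_off_chip": 1.2e-3, "t_on_chip": 0.1e-3, "t_compute": 0.8e-3},
    {"name": "conv2", "t_off_chip": 0.9e-3, "t_on_chip": 0.1e-3, "t_compute": 1.1e-3},
    {"name": "fc",    "t_off_chip": 2.0e-3, "t_on_chip": 0.1e-3, "t_compute": 0.3e-3},
]
per_layer = [l["t_off_chip"] + l["t_on_chip"] + l["t_compute"] for l in layers]   # Latency_layer
total_latency = sum(per_layer)                                                    # Latency_total
print(f"total latency: {total_latency * 1e3:.2f} ms")                             # 6.60 ms in this example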
(2) Throughput
Throughput is a core metric for measuring the processing capability of hardware accelerators, which defines the number of complete inference tasks that can be completed within a specified time unit. This metric, together with latency, forms the basis for evaluating system performance. Its upper limit is constrained by multiple physical factors, including the device’s operating frequency, memory bandwidth, and internal computational resources. Typically, throughput can be categorized into theoretical peak throughput, actual computational throughput, and application throughput. Together, these three elements form a comprehensive evaluation framework that spans from the hardware’s theoretical potential to design efficiency, and ultimately to real-world application performance.
① Theoretical Peak Throughput (Unit: TOPS)
This metric serves as both the baseline and upper limit of the evaluation system, measuring the theoretical peak computational capability of an FPGA accelerator design. It represents the maximum number of operations the hardware can execute per unit time without any bottlenecks. This figure is entirely determined by the FPGA hardware design itself.
The calculation formula is as follows:
$$\text{Peak\_TOPS} = \frac{N_{\text{DSP}} \times F_{\text{op}} \times f_{\text{clk}}}{10^{12}}$$
Here, $N_{\text{DSP}}$ represents the total number of DSP units used for core computation in the FPGA design; $F_{\text{op}}$ denotes the number of equivalent operations each DSP can complete within one clock cycle; and $f_{\text{clk}}$ represents the actual operating clock frequency of the FPGA design. If the denominator is changed from $10^{12}$ to $10^{9}$, the unit changes from TOPS (trillion operations per second) to GOPS (billion operations per second).
② Actual Computed Throughput (Unit: TOPS)
The actual throughput calculation serves as an intermediate step in the evaluation system, measuring the average computational speed achieved by the hardware when running a specific neural network model. Due to practical bottlenecks such as memory bandwidth and control logic overhead, this value typically falls below the theoretical peak.
The calculation formula is as follows:
$$\text{Actual\_TOPS} = \frac{\text{Total Operations}}{\text{Latency}}$$
Here, Total Operations refers to the total computational workload (number of operations) required to complete a single model inference, determined by the model itself, while $\text{Latency}$ denotes the actual measured time taken to complete this inference.
③ Application Throughput (Unit: FPS)
Application throughput is the ultimate output of the evaluation system and the metric most closely aligned with the end-user experience. It measures the number of complete inference tasks an FPGA-accelerated design can complete per second—for example, the number of images it can process per second.
The single-frame processing formula is as follows:
$$\text{Application Throughput}_{\text{FPS}} = \frac{1}{\text{Latency}_{\text{frame}}}$$
The batch processing formula is as follows:
$$\text{Application Throughput}_{\text{FPS}} = \frac{\text{Batch Size}}{\text{Latency}_{\text{batch}}}$$
Here, $\text{Latency}_{\text{frame}}$ represents the single-frame latency, i.e., the total time required to process a single image from input to output; Batch Size denotes the number of images packaged and sent to the accelerator for processing at once; and $\text{Latency}_{\text{batch}}$ is the batch latency, i.e., the total time needed to process an entire batch of images. While batch processing typically exploits hardware parallelism more effectively and can significantly boost FPS, it may come at the cost of increased latency for each individual image.
When comprehensively evaluating FPGA CNN accelerators, these three throughput metrics are complementary and indispensable for accurate assessment. Theoretical peak throughput reveals the theoretical upper limit and design potential of the hardware architecture. Actual computational throughput quantifies how well this potential is realized under real-world workloads. Finally, application throughput directly reflects its real-world operational efficiency in target applications. Together, they form a comprehensive performance view that spans from theory to practice.
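The following Python sketch ties the three throughput metrics together; the DSP count, operations per cycle, clock frequency, workload, and latency figures are all illustrative assumptions rather than measurements of any surveyed design:

# Assumed design parameters: 1024 DSPs, 2 operations per DSP per cycle, 200 MHz clock
n_dsp, ops_per_cycle, f_clk = 1024, 2, 200e6
peak_tops = n_dsp * ops_per_cycle * f_clk / 1e12            # theoretical peak throughput

total_ops = 3.9e9       # assumed operations per inference for a hypothetical model
latency = 25e-3         # assumed measured single-frame latency in seconds
actual_tops = total_ops / latency / 1e12                    # actual computed throughput

fps_single = 1 / latency                                    # application throughput, single frame
batch_size, latency_batch = 8, 150e-3                       # assumed batch figures
fps_batch = batch_size / latency_batch

print(f"peak = {peak_tops:.3f} TOPS, actual = {actual_tops:.3f} TOPS")
print(f"FPS: {fps_single:.1f} (single frame), {fps_batch:.1f} (batched)")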
(3) Computational Efficiency
Theoretical peak throughput represents the ideal performance ceiling of hardware accelerators, yet this ceiling often remains unattainable in practical CNN inference tasks. Two primary reasons contribute to this gap: First, algorithm–hardware mismatch occurs when the parallelism of network layer parameters fails to align perfectly with the parallelism of processing units in the design, leading to uneven computational resource loading and underutilization. Second, system-level bottlenecks—particularly memory bandwidth constraints—prevent the timely delivery of data to computational units, resulting in significant idle cycles.
To precisely quantify the actual performance of hardware designs under such real-world constraints, computational efficiency [167] is introduced. This metric is defined as the ratio of actual computational throughput to the theoretical peak throughput of the design. Its core principle lies in measuring the proportion of cycles genuinely spent on practical computations across all computational cycles, serving as a crucial benchmark for evaluating the actual effectiveness of hardware accelerators.
The calculation formula is as follows:
$$\text{Efficiency}_{\text{compute}} = \frac{\text{Actual\_TOPS}}{\text{Peak\_TOPS}} \times 100\%$$
Computational efficiency, as an evaluation metric, has a crucial characteristic: it largely factors out the scale of the specific FPGA device and instead reflects the intrinsic quality of the accelerator architecture itself, including the effectiveness of its dataflow scheduling, parallel processing, and memory access strategies. This makes it an invaluable metric for fair and reliable performance comparisons across different hardware architectures.
(4) Resource Utilization Rate
Resource utilization is a fundamental and critical evaluation metric that quantifies the proportion of various hardware resources consumed by FPGA-accelerated designs on the chip.
The calculation formula applies to all types of FPGA resources:
$$\text{Utilization} = \frac{\text{Used}}{\text{Available}} \times 100\%$$
This formula is used to evaluate the three key resources in an FPGA: logic resources (LUT and FF), memory resources (BRAM/URAM), and digital signal processing resources (DSP).
Under the premise of meeting performance requirements, the ultimate goal of resource utilization optimization is to minimize overall resource consumption. A design with low resource consumption not only enables deployment on smaller, lower-cost FPGA chips but also directly reduces power consumption and thermal management costs. More importantly, it reserves valuable hardware space for future functional upgrades or performance expansion.
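A minimal sketch of how per-resource utilization is typically reported is shown below; the used and available totals are assumed values for a hypothetical mid-range device, not the figures of any specific chip:

# Assumed resource counts for a hypothetical mid-range FPGA (not a specific device)
used      = {"LUT": 98_304,  "FF": 120_000, "BRAM": 210, "DSP": 512}
available = {"LUT": 203_800, "FF": 407_600, "BRAM": 445, "DSP": 840}

for res in used:
    pct = used[res] / available[res] * 100    # Utilization = Used / Available x 100%
    print(f"{res:>4}: {pct:5.1f}%")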
(5) Energy Efficiency Ratio
The energy efficiency ratio (EER) is the core metric for evaluating the relationship between the computational performance and power consumption of CNN hardware accelerators. It is defined as the amount of effective computation completed per unit of power consumed.
The calculation formula is as follows:
$$\text{Efficiency}_{\text{Energy}}\ (\text{TOPS/W}) = \frac{\text{Throughput (TOPS)}}{\text{Power (W)}}$$
Here, Throughput (TOPS) represents the actual computational throughput, while Power (W) denotes the power consumption during accelerator operation; this value can be estimated using FPGA development tools or measured precisely with physical instruments.
With the explosive growth of computing demands, power management has become a critical challenge in both large-scale data centers and resource-constrained edge devices. Particularly in edge computing and embedded applications, where devices are highly sensitive to power consumption and thermal constraints, the energy efficiency ratio often becomes one of the most decisive evaluation metrics. Optimizing energy efficiency not only directly reduces electricity expenses and operational costs but also alleviates thermal stress, thereby safeguarding overall system stability and long-term reliability. On this metric, FPGAs demonstrate exceptional competitiveness due to their relatively low static power consumption and high degree of customizability.
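As a quick numerical check of this definition, the figures reported for the radar accelerator of Reference [160] can be reproduced directly (the calculation below simply divides the reported throughput by the reported power):

throughput_gops = 7.34   # actual throughput reported in Reference [160]
power_w = 5.022          # measured power reported in Reference [160]
print(f"{throughput_gops / power_w:.2f} GOPS/W")   # ~1.46 GOPS/W, matching the reported value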
(6) Clock Efficiency
Clock efficiency is a critical timing performance metric that quantifies the ratio between the actual operating frequency achievable by an FPGA design and the theoretical maximum frequency of the chip it utilizes.
The calculation formula is as follows:
$$\text{Efficiency}_{\text{clock}} = \frac{f_{\text{operating}}}{f_{\text{max}}} \times 100\%$$
Here, $f_{\text{operating}}$ represents the actual operating frequency, i.e., the highest stable frequency determined by the FPGA development tool from the timing analysis report after placement and routing, and $f_{\text{max}}$ denotes the theoretical maximum frequency, i.e., the upper speed limit specified in the FPGA chip’s official data sheet.
A high clock efficiency indicates healthy design timing, short critical paths, and full utilization of the hardware’s speed advantages. Conversely, low efficiency clearly points to timing bottlenecks within the design. Therefore, enhancing clock efficiency through code optimization, improved placement and routing strategies, or strengthened power and thermal management is a crucial step in maximizing hardware computational potential and achieving superior performance in demanding applications, such as real-time processing.
(7) Overall Efficiency
To comprehensively evaluate the overall performance of FPGA-based CNN acceleration designs and avoid the limitations of single metrics, the concept of Overall Efficiency is proposed. The core value of this metric lies in providing a normalized single value that comprehensively quantifies the effectiveness of accelerator designs in leveraging the theoretical computational potential of target FPGA chips, thereby enabling a macro-level assessment of design quality.
Overall efficiency is defined as the percentage of sustained performance achieved by the accelerated design relative to the theoretical peak performance of the deployed FPGA device. The most ingenious aspect of overall efficiency is that it can be decomposed into the product of the aforementioned core efficiency metrics, thereby clearly revealing the composition and bottlenecks of performance.
The calculation formula is as follows:
$$\text{Efficiency}_{\text{overall}} = \text{Efficiency}_{\text{compute}} \times \text{Utilization} \times \text{Efficiency}_{\text{clock}}$$
Computational efficiency reflects the alignment between architecture and algorithms, resource utilization measures the proportion of the chip’s core computational resources occupied by the design, and clock efficiency indicates the extent to which the actual operating frequency exploits the chip’s speed potential.
These three components reveal a common trade-off in high-performance FPGA design: excessively high resource utilization may cause routing congestion and thereby reduce clock efficiency; conversely, pursuing higher clock frequencies may require sacrificing some resource parallelism. It is worth noting that logic resources (LUTs) are typically excluded when calculating theoretical peak performance, because implementing computations with LUTs significantly complicates the performance evaluation model, and the substantial logic resources used for auxiliary functions themselves negatively impact the design’s maximum operating frequency.
In summary, overall efficiency provides a unified view of hardware performance, effectively helping designers identify imbalances among architecture, resources, and timing. Through this comprehensive evaluation framework, researchers can conduct fairer quantitative comparisons of existing designs and gain insights into emerging trends and optimization directions for CNN hardware accelerator design.
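To make the decomposition concrete, the short Python sketch below combines assumed component efficiencies into an overall efficiency figure; the values are illustrative only and do not correspond to any design discussed above:

# Assumed component efficiencies for illustration only
eff_compute = 0.60    # actual / peak throughput
utilization = 0.70    # fraction of the chip's DSP resources used by the design
eff_clock   = 0.80    # achieved frequency / device maximum frequency
eff_overall = eff_compute * utilization * eff_clock
print(f"overall efficiency = {eff_overall:.1%}")    # 33.6% of the device's theoretical peak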

7. Challenges and Future Research Directions

Recent advancements in CNN acceleration and optimization on FPGA platforms mark significant milestones in this field. Despite notable achievements, numerous challenges persist as AI applications continue to evolve. Future development will increasingly focus on system-level co-optimization of algorithms, hardware, and toolchains. This section first outlines existing challenges before discussing future research trends.

7.1. Challenges

(1) Trade-offs between power consumption, computation, and resources
When deploying CNN accelerators on FPGAs, the primary challenge lies in balancing power consumption, computational performance, and the limited hardware resources available. For instance, high-throughput designs typically demand more hardware resources, which can become a bottleneck in resource-constrained devices. Particularly in portable devices, achieving efficient resource utilization without compromising model performance remains a critical challenge.
(2) Memory and Bandwidth Bottlenecks
Large CNN models possess an enormous number of parameters, far exceeding the limited on-chip memory (BRAM) capacity of FPGAs. This necessitates frequent data loading from off-chip memory (DRAM), resulting in high latency, power consumption, and bandwidth constraints. Although techniques like double buffering and data tiling can mitigate these issues, these approaches also increase design complexity and introduce additional power consumption.
(3) Hardware Utilization and Design Complexity
CNN models exhibit significant variations in layer dimensions and computational patterns. A fixed PE architecture optimized for a particular layer may result in suboptimal resource utilization or even PE idling when processing other layers. Furthermore, large-scale parallelization introduces timing challenges, and designing efficient systolic arrays is no simple task. While high-level synthesis (HLS) tools can streamline the design process, the resulting hardware efficiency may fall short of that achieved with manually written RTL code.
(4) The Complexity of Design Space Exploration
The high flexibility of FPGAs results in an enormous number of hardware parameter combinations (such as parallelism, buffer size, and data precision), while CNN model architectures also vary widely. This makes finding the optimal configuration for a specific model exceptionally challenging. Although automated design space exploration tools and analytical models (such as the Roofline model) can assist, the exploration process remains complex and time-consuming.
(5) High development barriers
Traditional FPGA development relies on hardware description languages like Verilog and VHDL, presenting a relatively high programming barrier. Although development environments such as HLS and SDSoC have reduced complexity to some extent, compared to mature frameworks like CUDA on GPUs, the FPGA development toolchain still lags in terms of ease of use and ecosystem maturity. Achieving high-performance optimization still requires deep hardware expertise.
(6) Load Balancing in Multi-FPGA Systems
For ultra-large-scale CNNs, a single FPGA struggles to meet both computational and storage demands, necessitating multi-FPGA systems. However, this introduces new challenges such as load balancing, task partitioning, and inter-chip interconnects. If not properly managed, these issues may leave some FPGAs idle and reduce overall computational efficiency.

7.2. Future Research Directions

Based on a systematic analysis of existing technologies, this paper identifies several key areas where FPGA-based CNN acceleration techniques are expected to encounter significant development opportunities in the future. Recommendations are provided for researchers and practitioners in this field:
(1) Toward Dynamic Reconfigurable and Hybrid Architectures
To address the trade-offs between power consumption, computation, and resources, future research could focus on developing universal accelerators capable of running multiple CNN models on a single FPGA platform, for example, hybrid mapping mechanisms that dynamically switch dataflow strategies based on layer characteristics. At the same time, dynamic partial reconfiguration techniques hold promise for customizing dataflow and tiling strategies according to the requirements of each layer, thereby enhancing efficiency.
(2) Building a Deeply Optimized Memory Hierarchy
To address memory and bandwidth bottlenecks, future efforts should focus on optimizing memory systems and improving their performance. For instance, implementing a hierarchical cache architecture can significantly reduce off-chip accesses and enhance on-chip storage utilization. Additionally, effectively leveraging multi-level memory resources such as HBM, URAM, and BRAM, alongside developing multi-level data tiling strategies, will represent key research directions.
(3) Deepening the Co-design of Algorithms and Hardware
To enhance hardware utilization and reduce design complexity, deep integration between algorithms and hardware must be promoted, for instance, by developing intelligent algorithms that automatically adapt to hardware constraints or designing specialized hardware units for specific network operators. Such co-design enables unified code generation, loop transformation, and memory mapping across different networks while minimizing user intervention.
(4) Develop Fully Automated and Intelligent Design Tools
To address the complexity of design space exploration, future efforts can focus on developing fully integrated toolchains capable of automatically mapping, optimizing, and deploying CNN models to FPGA bitstreams. These toolchains should incorporate intelligent design space exploration engines—such as machine learning-based methods—to efficiently identify near-optimal hardware configurations, thereby significantly boosting development efficiency.
(5) Advancing the Collaborative Optimization of Advanced Synthesis and Hardware-Friendly Algorithms
To lower development barriers, future efforts should focus on refining HLS technology to enable the direct generation of high-performance hardware from high-level languages such as C++. Concurrently, hardware-friendly neural network design should be promoted by optimizing computational and data layouts through techniques such as model compression, sparsification, and quantization.
(6) Integrated Heterogeneous Computing Platform
To address the challenge of load balancing across multiple FPGAs, FPGAs can be integrated with processors such as CPUs and GPUs to form a heterogeneous computing platform. FPGAs can serve as coprocessors dedicated to computationally intensive tasks, while CPUs handle control logic and task scheduling. Through advanced interconnect protocols and a unified programming model, automatic task allocation and data communication can be achieved, simplifying the multi-chip development process.
Despite facing numerous challenges in resource constraints, design complexity, and development barriers, FPGA-based CNN acceleration offers irreplaceable advantages in AI acceleration due to its high energy efficiency, high parallelism, flexibility, and customizability. Looking ahead, with the refinement of automated toolchains, advancements in heterogeneous computing architectures, progress in dynamic reconfigurable technologies, and innovations in hardware-friendly algorithms, FPGAs are poised to play an increasingly central role in both cloud and edge computing. They will deliver more powerful and energy-efficient computational support for AI applications.

7.3. Limitations of This Paper

Beyond its systematic review and outlook, this study also has several limitations, which are crucial for objectively understanding the review’s conclusions and guiding future work. First, as a review article, this paper’s contribution lies in analyzing and synthesizing existing work rather than presenting original experimental validation. The discussion primarily focuses on CNNs; while some design concepts have been extended to other embedded applications, acceleration techniques for other rapidly evolving neural network models on FPGAs, such as Transformers, were not explored with equal depth. Second, the quantitative analysis relies entirely on data reported in the published literature. Heterogeneity in testing platforms and evaluation methods across studies makes rigorous, comparable analysis challenging, and missing data in some publications adds to the incompleteness of the analysis. Finally, given the rapid technological iteration in both the AI and FPGA fields, this review reflects only the state of the art at the time of writing. Readers are encouraged to fully consider these limitations when interpreting the conclusions presented herein.

8. Summary

FPGA-based neural network design plays a crucial role in deep learning model application and deployment. Although FPGAs offer outstanding resource utilization efficiency and high flexibility, the inherent differences between model and hardware design still pose significant challenges for developing FPGA-based neural network acceleration systems. This paper systematically reviews CNN model acceleration techniques based on FPGA, analyzing optimization strategies from both algorithmic and hardware dimensions. Building upon this foundation, it presents an integrated optimization framework for co-designing software and hardware. Furthermore, the relevant methodological approaches are extended to broader embedded system application scenarios. Key evaluation metrics for verifying system effectiveness are summarized, providing a clear technical roadmap for research and practice in this field. Despite ongoing challenges, the continuous evolution of hardware architectures and compilation tools, coupled with deepening interdisciplinary collaboration, holds promise for further unlocking the potential of FPGA in deep learning acceleration. This advancement will drive the application and development of efficient intelligent computing across more practical scenarios.

Author Contributions

Conceptualization: Z.L.; Methodology: L.G.; Investigation: L.W.; Writing—original draft preparation: L.G.; Writing—review and editing: Z.L.; Supervision: Z.L.; Project administration: Z.L.; Funding acquisition: Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Exploration and Practice of the Path to Improve the Quality of Master’s Degree Cultivation of Electronic Information Students Empowered by Numerical Intelligence JG202405, in part by the National Natural Science Foundation of China under Grant 61801319, in part by the Sichuan Science and Technology Program under Grant 2020JDJQ0061 and 2021YFG0099, in part by the Scientific Research and Innovation Team Program of Sichuan University of Science and Engineering under Grant SUSE652A011, in part by the Postgraduate Innovation Fund Project of Sichuan University of Science and Engineering under Grant Y2025081.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hu, Y.; Liu, Y.; Liu, Z. A Survey on Convolutional Neural Network Accelerators: GPU, FPGA and ASIC. In Proceedings of the 2022 14th International Conference on Computer Research and Development (ICCRD), Shenzhen, China, 7–9 January 2022; pp. 100–107. [Google Scholar]
  2. Li, Z.; Li, H.; Meng, L. Model compression for deep neural networks: A survey. Computers 2023, 12, 60. [Google Scholar] [CrossRef]
  3. Kok, C.L.; Zhao, B.; Heng, J.; Teo, T.H. Dynamic Quantization and Pruning for Efficient CNN-Based Road Sign Recognition on FPGA. In Proceedings of the 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, UK, 25–28 May 2025; pp. 1–5. [Google Scholar]
  4. Mishra, J.; Sharma, R. Optimized FPGA Architecture for CNN-Driven Voice Disorder Detection. Circuits Syst. Signal Process. 2025, 44, 4455–4467. [Google Scholar] [CrossRef]
  5. Balasubramanian, K.; Baragur, A.G.; Donadel, D.; Sahabandu, D.; Brighente, A.; Ramasubramanian, B.; Conti, M.; Poovendran, R. CANLP: Intrusion Detection for Controller Area Networks using Natural Language Processing and Embedded Machine Learning. IEEE Trans. Dependable Secur. Comput. 2025; early access. [Google Scholar]
  6. Wu, Y. Review on FPGA-Based Accelerators in Deep learning. In Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 24–26 February 2023; Volume 6, pp. 452–456. [Google Scholar]
  7. Humaidi, A.J.; Kadhim, T.M.; Hasan, S.; Ibraheem, I.K.; Azar, A.T. A generic izhikevich-modelled FPGA-realized architecture: A case study of printed english letter recognition. In Proceedings of the 2020 24th International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 8–10 October 2020; pp. 825–830. [Google Scholar]
  8. Gołka, Ł.; Poczekajło, P.; Suszyński, R. Implementation of a universal replicable DSP core in FPGA devices for cascade audio signal processing applications. Procedia Comput. Sci. 2024, 246, 2429–2438. [Google Scholar] [CrossRef]
  9. Zayed, N.; Tawfik, N.; Mahmoud, M.M.; Fawzy, A.; Cho, Y.I.; Abdallah, M.S. Accelerating Deep Learning-Based Morphological Biometric Recognition with Field-Programmable Gate Arrays. AI 2025, 6, 8. [Google Scholar] [CrossRef]
  10. Guo, K.; Zeng, S.; Yu, J.; Wang, Y.; Yang, H. [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. 2019, 12, 1–26. [Google Scholar] [CrossRef]
  11. Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2019, 7, 7823–7859. [Google Scholar] [CrossRef]
  12. Mittal, S. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput. Appl. 2020, 32, 1109–1139. [Google Scholar] [CrossRef]
  13. Capra, M.; Bussolino, B.; Marchisio, A.; Masera, G.; Martina, M.; Shafique, M. Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead. IEEE Access 2020, 8, 225134–225180. [Google Scholar] [CrossRef]
  14. Wu, R.; Guo, X.; Du, J.; Li, J. Accelerating neural network inference on FPGA-based platforms—A survey. Electronics 2021, 10, 1025. [Google Scholar] [CrossRef]
  15. Liu, T.D.; Zhu, J.W.; Zhang, Y.W. A Survey on FPGA-Based Acceleration of Deep Learning. Comput. Sci. Explor. 2021, 15, 2093–2104. [Google Scholar]
  16. Wang, C.; Luo, Z. A review of the optimal design of neural networks based on FPGA. Appl. Sci. 2022, 12, 10771. [Google Scholar] [CrossRef]
  17. Dhilleswararao, P.; Boppu, S.; Manikandan, M.S.; Cenkeramaddi, L.R. Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 2022, 10, 131788–131828. [Google Scholar] [CrossRef]
  18. Amin, R.A.; Obermaisser, R. Towards Resource Efficient and Low Latency CNN Accelerator for FPGAs: Review and Evaluation. In Proceedings of the 2024 3rd International Conference on Embedded Systems and Artificial Intelligence (ESAI), Fez, Morocco, 19–20 December 2024; pp. 1–10. [Google Scholar]
  19. Hong, H.; Choi, D.; Kim, N.; Lee, H.; Kang, B.; Kang, H.; Kim, H. Survey of convolutional neural network accelerators on field-programmable gate array platforms: Architectures and optimization techniques. J. Real-Time Image Process. 2024, 21, 64. [Google Scholar] [CrossRef]
  20. Jiang, J.; Zhou, Y.; Gong, Y.; Yuan, H.; Liu, S. FPGA-based Acceleration for Convolutional Neural Networks: A Comprehensive Review. arXiv 2025, arXiv:2505.13461. [Google Scholar]
  21. Li, R. Dataflow & Tiling Strategies in Edge-AI FPGA Accelerators: A Comprehensive Literature Review. arXiv 2025, arXiv:2505.08992. [Google Scholar]
  22. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.s. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54. [Google Scholar]
  23. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef]
  24. Arshad, M.A.; Shahriar, S.; Sagahyroon, A. On the use of FPGAs to implement CNNs: A brief review. In Proceedings of the 2020 International Conference on Computing, Electronics & Communications Engineering (iCCECE), Southend, UK, 17–18 August 2020; pp. 230–236. [Google Scholar]
  25. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2017, 60, 84–90. [Google Scholar] [CrossRef]
  27. Peng, X.; Yu, J.; Yao, B.; Liu, L.; Peng, Y. A review of FPGA-based Custom computing architecture for convolutional neural network inference. Chin. J. Electron. 2021, 30, 1–17. [Google Scholar]
  28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  33. Lee, C.Y.; Gallagher, P.W.; Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 464–472. [Google Scholar]
  34. Chakraborty, S.; Paul, S.; Hasan, K.A. A transfer learning-based approach with deep cnn for covid-19-and pneumonia-affected chest x-ray image classification. SN Comput. Sci. 2022, 3, 17. [Google Scholar] [CrossRef] [PubMed]
  35. Han, G.; Huang, S.; Ma, J.; He, Y.; Chang, S.F. Meta faster r-cnn: Towards accurate few-shot object detection with attentive feature alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 780–789. [Google Scholar]
  36. Valdez-Rodriguez, J.E.; Calvo, H.; Felipe-Riveron, E.; Moreno-Armendáriz, M.A. Improving depth estimation by embedding semantic segmentation: A hybrid CNN model. Sensors 2022, 22, 1669. [Google Scholar] [CrossRef]
  37. Gyawali, D. Comparative analysis of cpu and gpu profiling for deep learning models. arXiv 2023, arXiv:2309.02521. [Google Scholar] [CrossRef]
  38. Zhang, Q.; Zhang, M.; Chen, T.; Sun, Z.; Ma, Y.; Yu, B. Recent advances in convolutional neural network acceleration. Neurocomputing 2019, 323, 37–51. [Google Scholar] [CrossRef]
  39. Beldianu, S.F.; Ziavras, S.G. ASIC design of shared vector accelerators for multicore processors. In Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, Paris, France, 22–24 October 2014; pp. 182–189. [Google Scholar]
  40. Jose, A.; Alense, K.; Gijo, L.; Jacob, J. FPGA Implementation of CNN Accelerator with Pruning for ADAS Applications. In Proceedings of the 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 5–7 April 2024; pp. 1–6. [Google Scholar]
  41. Wang, J.; He, Z.; Zhao, H.; Liu, R. Low-Bit Mixed-Precision Quantization and Acceleration of CNN for FPGA Deployment. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 9, 2597–2617. [Google Scholar] [CrossRef]
  42. Lo, C.Y. A Comprehensive Study of FPGA Accelerators for Machine Learning Applications. Ph.D. Thesis, The University of Auckland, Auckland, New Zealand, 2025. [Google Scholar]
  43. Fata, J.; Elmannai, W.; Elleithy, K. Balancing Performance and Cost—FPGA-Based CNN Accelerators for Edge Computing: Status Quo, Key Challenges, and Prospective Innovations. IEEE Access 2025, 13, 142852–142877. [Google Scholar] [CrossRef]
  44. Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N.; et al. Dadiannao: A machine-learning supercomputer. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 609–622. [Google Scholar]
  45. Shao, W.; Chen, M.; Zhang, Z.; Xu, P.; Zhao, L.; Li, Z.; Zhang, K.; Gao, P.; Qiao, Y.; Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv 2023, arXiv:2308.13137. [Google Scholar]
  46. Liu, Z.; Oguz, B.; Zhao, C.; Chang, E.; Stock, P.; Mehdad, Y.; Shi, Y.; Krishnamoorthi, R.; Chandra, V. Llm-qat: Data-free quantization aware training for large language models. arXiv 2023, arXiv:2305.17888. [Google Scholar]
  47. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. arXiv 2016, arXiv:1612.01064. [Google Scholar]
  48. Kosuge, A.; Hsu, Y.C.; Hamada, M.; Kuroda, T. A 0.61-μJ/frame pipelined wired-logic DNN processor in 16-nm FPGA using convolutional non-linear neural network. IEEE Open J. Circuits Syst. 2021, 3, 4–14. [Google Scholar] [CrossRef]
  49. Liang, S.; Yin, S.; Liu, L.; Luk, W.; Wei, S. FP-BNN: Binarized neural network on FPGA. Neurocomputing 2018, 275, 1072–1086. [Google Scholar] [CrossRef]
  50. Xu, W.; Li, F.; Jiang, Y.; Yong, A.; He, X.; Wang, P.; Cheng, J. Improving extreme low-bit quantization with soft threshold. Neurocomputing 2022, 33, 1549–1563. [Google Scholar] [CrossRef]
  51. Chen, P.; Zhuang, B.; Shen, C. FATNN: Fast and accurate ternary neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5219–5228. [Google Scholar]
  52. Zhong, K.; Ning, X.; Dai, G.; Zhu, Z.; Zhao, T.; Zeng, S.; Wang, Y.; Yang, H. Exploring the potential of low-bit training of convolutional neural networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5421–5434. [Google Scholar] [CrossRef]
  53. Liu, F.; Zhao, W.; Wang, Z.; Chen, Y.; He, Z.; Jing, N.; Liang, X.; Jiang, L. Ebsp: Evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks. In Proceedings of the 59th ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 10–14 July 2022; pp. 259–264. [Google Scholar]
  54. Yuan, Y.; Chen, C.; Hu, X.; Peng, S. CNQ: Compressor-Based Non-uniform Quantization of Deep Neural Networks. Chin. J. Electron. 2020, 29, 1126–1133. [Google Scholar] [CrossRef]
  55. Qiu, S.; Zaheer, Q.; Hassan Shah, S.M.A.; Ai, C.; Wang, J.; Zhan, Y. Vector-Quantized Variational Teacher and Multimodal Collaborative Student Based Knowledge Distillation Paradigm for Cracks Segmentation. 2024. Available online: https://ascelibrary.org/doi/10.1061/JCCEE5.CPENG-6339 (accessed on 3 September 2025).
  56. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  57. Huang, Z.; Wang, N. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 304–320. [Google Scholar]
  58. Lin, S.; Ji, R.; Yan, C.; Zhang, B.; Cao, L.; Ye, Q.; Huang, F.; Doermann, D. Towards optimal structured cnn pruning via generative adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2790–2799. [Google Scholar]
  59. Lemaire, C.; Achkar, A.; Jodoin, P.M. Structured pruning of neural networks with budget-aware regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9108–9116. [Google Scholar]
  60. Chen, C.; Tung, F.; Vedula, N.; Mori, G. Constraint-aware deep neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 400–415. [Google Scholar]
  61. Dong, X.; Chen, S.; Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  62. Liu, Z.; Xu, J.; Peng, X.; Xiong, R. Frequency-domain dynamic pruning for convolutional neural networks. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://www.semanticscholar.org/paper/Frequency-Domain-Dynamic-Pruning-for-Convolutional-Liu-Xu/c1cbfe55a04916bf65bb7134b2ee02c3c099fd56 (accessed on 14 October 2025).
  63. Lin, T.; Stich, S.U.; Barba, L.; Dmitriev, D.; Jaggi, M. Dynamic model pruning with feedback. arXiv 2020, arXiv:2006.07253. [Google Scholar] [CrossRef]
  64. Kim, K.; Kakani, V.; Kim, H. Automatic Pruning and Quality Assurance of Object Detection Datasets for Autonomous Driving. Electronics 2025, 14, 1882. [Google Scholar] [CrossRef]
  65. Wang, Z.; Xu, K.; Wu, S.; Liu, L.; Liu, L.; Wang, D. Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2. IEEE Access 2020, 8, 116569–116585. [Google Scholar] [CrossRef]
  66. Ma, X.; Lin, S.; Ye, S.; He, Z.; Zhang, L.; Yuan, G.; Tan, S.H.; Li, Z.; Fan, D.; Qian, X.; et al. Non-structured DNN weight pruning—Is it beneficial in any platform? IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4930–4944. [Google Scholar] [CrossRef]
  67. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  68. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  69. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  70. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  71. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  72. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  73. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  74. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  75. Son, W.; Na, J.; Choi, J.; Hwang, W. Densely guided knowledge distillation using multiple teacher assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9395–9404. [Google Scholar]
  76. Lan, X.; Zeng, Y.; Wei, X.; Zhang, T.; Wang, Y.; Huang, C.; He, W. Counterclockwise block-by-block knowledge distillation for neural network compression. Sci. Rep. 2025, 15, 11369. [Google Scholar] [CrossRef]
  77. Chen, L. Knowledge Distillation Research on CNN Imageclassification Task for Resource-Constrained Devices. Master’s Thesis, Sichuan University, Chengdu, China, 2023. [Google Scholar]
  78. Luo, D.Y.; Guo, Q.X.; Zhang, H.C. An FPGA implement of ECG classifier using quantized CNN based on knowledge distillation. Appl. Electron. Tech. 2024, 50, 97–101. [Google Scholar] [CrossRef]
  79. Zhu, K.; Yang, H.; Wang, Y. A lightweight fault diagnosis network based on FPGA and feature knowledge distillation. J. Phys. Conf. Ser. 2025, 3073, 012002. [Google Scholar] [CrossRef]
  80. Qian, W.; Zhu, Z.; Zhu, C.; Zhu, Y. FPGA-based accelerator for YOLOv5 object detection with optimized computation and data access for edge deployment. Parallel Comput. 2025, 124, 103138. [Google Scholar] [CrossRef]
  81. Zha, Y.; Cai, X. FPGA-Accelerated Lightweight CNN in Forest Fire Recognition. Forests 2025, 16, 698. [Google Scholar] [CrossRef]
  82. Wu, C.B.; Wu, R.F.; Chan, T.W. Hetero layer fusion based architecture design and implementation for of deep learning accelerator. In Proceedings of the 2022 IEEE International Conference on Consumer Electronics-Taiwan, Taipei, Taiwan, 6–8 July 2022; pp. 63–64. [Google Scholar]
  83. Nguyen, D.T.; Kim, H.; Lee, H.J. Layer-specific optimization for mixed data flow with mixed precision in FPGA design for CNN-based object detectors. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2450–2464. [Google Scholar] [CrossRef]
  84. Liu, S.; Fan, H.; Luk, W. Accelerating fully spectral CNNs with adaptive activation functions on FPGA. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 1530–1535. [Google Scholar]
  85. Xie, Y.; Chen, H.; Zhuang, Y.; Xie, Y. Fault classification and diagnosis approach using FFT-CNN for FPGA-based CORDIC processor. Electronics 2023, 13, 72. [Google Scholar] [CrossRef]
  86. Malathi, L.; Bharathi, A.; Jayanthi, A. FPGA design of FFT based intelligent accelerator with optimized Wallace tree multiplier for image super resolution and quality enhancement. Biomed. Signal Process. Control 2024, 88, 105599. [Google Scholar] [CrossRef]
  87. Meng, Y.; Wu, J.; Xiang, S.; Wang, J.; Hou, J.; Lin, Z.; Yang, C. A high-throughput and flexible CNN accelerator based on mixed-radix FFT method. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 72, 816–829. [Google Scholar] [CrossRef]
  88. Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021. [Google Scholar]
  89. Li, M.; Li, P.; Yin, S.; Chen, S.; Li, B.; Tong, C.; Yang, J.; Chen, T.; Yu, B. WinoGen: A Highly Configurable Winograd Convolution IP Generator for Efficient CNN Acceleration on FPGA. In Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 23–27 June 2024; pp. 1–6. [Google Scholar]
  90. Vardhana, M.; Pinto, R. High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network. Comput. Archit. Lett. 2025, 24, 21–24. [Google Scholar]
  91. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up convolutional neural networks with low rank expansions. arXiv 2014, arXiv:1405.3866. [Google Scholar] [CrossRef]
  92. Chen, Z.; Chen, Z.; Lin, J.; Liu, S.; Li, W. Deep neural network acceleration based on low-rank approximated channel pruning. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 1232–1244. [Google Scholar] [CrossRef]
  93. Yu, Z.; Bouganis, C.S. Svd-nas: Coupling low-rank approximation and neural architecture search. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1503–1512. [Google Scholar]
  94. Yu, Z.; Bouganis, C.S. Streamsvd: Low-rank approximation and streaming accelerator co-design. In Proceedings of the 2021 International Conference on Field-Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–9. [Google Scholar]
  95. Zhou, S.; Kannan, R.; Prasanna, V.K. Accelerating low rank matrix completion on FPGA. In Proceedings of the 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 4–6 December 2017; pp. 1–7. [Google Scholar]
  96. Yang, M.; Cao, S.; Zhang, W.; Li, Y.; Jiang, Z. Loop-tiling based compiling optimization for cnn accelerators. In Proceedings of the 2023 IEEE 15th International Conference on ASIC (ASICON), Nanjing, China, 24–27 October 2023; pp. 1–4. [Google Scholar]
  97. Huang, H.; Hu, X.; Li, X.; Xiong, X. An efficient loop tiling framework for convolutional neural network inference accelerators. IET Circuits, Devices Syst. 2022, 16, 116–123. [Google Scholar] [CrossRef]
  98. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.J. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1861–1873. [Google Scholar] [CrossRef]
  99. Basalama, S.; Sohrabizadeh, A.; Wang, J.; Guo, L.; Cong, J. FlexCNN: An end-to-end framework for composing CNN accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 1–32. [Google Scholar] [CrossRef]
  100. Deng, Y. Research on FPGA-Based Accelerator for Convolutional Neural Networks. Master’s Thesis, Changchun University of Technology, Changchun, China, 2025. [Google Scholar]
  101. Liu, Y.; Ma, Y.; Zhang, B.; Liu, L.; Wang, J.; Tang, S. Improving the computational efficiency and flexibility of FPGA-based CNN accelerator through loop optimization. Microelectron. J. 2024, 147, 106197. [Google Scholar] [CrossRef]
  102. Zhao, H.; Wang, Y.c.; Zhao, J. Issm: An Incremental Space Search Method for Loop Unrolling Parameters in Fpga-Based Cnn Accelerators. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4994345 (accessed on 3 September 2025).
  103. Sengupta, A.; Chourasia, V.; Anshul, A.; Kumar, N. Robust Watermarking of Loop Unrolled Convolution Layer IP Design for CNN using 4-variable Encoded Register Allocation. In Proceedings of the 2024 International Conference on Consumer Electronics-Taiwan (ICCE-Taiwan), Taichung, Taiwan, 9–11 July 2024; pp. 589–590. [Google Scholar]
  104. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar]
  105. Wang, H.; Zhao, Y.; Gao, F. A convolutional neural network accelerator based on FPGA for buffer optimization. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; Volume 5, pp. 2362–2367. [Google Scholar]
  106. Fan, R. Design of Mobilenet Neural Network Accelerator Based on FPGA. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2025. [Google Scholar]
  107. Li, G.; Liu, Z.; Li, F.; Cheng, J. Block convolution: Toward memory-efficient inference of large-scale CNNs on FPGA. IEEE Trans.-Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 1436–1447. [Google Scholar] [CrossRef]
  108. Fan, H.; Ferianc, M.; Que, Z.; Li, H.; Liu, S.; Niu, X.; Luk, W. Algorithm and hardware co-design for reconfigurable cnn accelerator. In Proceedings of the 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), Taipei, Taiwan, 17–20 January 2022; pp. 250–255. [Google Scholar]
  109. Liu, S.; Fan, H.; Niu, X.; Ng, H.c.; Chu, Y.; Luk, W. Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA. ACM Trans. Reconfigurable Technol. Syst. 2018, 11, 1–22. [Google Scholar] [CrossRef]
  110. Mao, N.; Yang, H.; Huang, Z. A parameterized parallel design approach to efficient mapping of cnns onto fpga. Electronics 2023, 12, 1106. [Google Scholar] [CrossRef]
  111. Lai, Y.K.; Lin, C.H. An Efficient Reconfigurable Parameterized Convolutional Neural Network Accelerator on FPGA Platform. In Proceedings of the 2025 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 11–14 January 2025; pp. 1–2. [Google Scholar]
  112. Liu, S.; Fan, H.; Ferianc, M.; Niu, X.; Shi, H.; Luk, W. Toward full-stack acceleration of deep convolutional neural networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 3974–3987. [Google Scholar] [CrossRef] [PubMed]
  113. Zhou, Z.; Duan, X.; Han, J. A design framework for generating energy-efficient accelerator on fpga toward low-level vision. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 1485–1497. [Google Scholar] [CrossRef]
  114. Yu, Y.; Wu, C.; Zhao, T.; Wang, K.; He, L. OPU: An FPGA-based overlay processor for convolutional neural networks. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 28, 35–47. [Google Scholar] [CrossRef]
  115. Wu, T.H.; Shu, C.; Liu, T.T. An efficient FPGA-based dilated and transposed convolutional neural network accelerator. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 5178–5186. [Google Scholar] [CrossRef]
  116. Khan, F.H.; Pasha, M.A.; Masud, S. Towards designing a hardware accelerator for 3D convolutional neural networks. Comput. Electr. Eng. 2023, 105, 108489. [Google Scholar] [CrossRef]
  117. Li, Z.; Zhang, Z.; Hu, J.; Meng, Q.; Shi, X.; Luo, J.; Wang, H.; Huang, Q.; Chang, S. A high-performance pixel-level fully pipelined hardware accelerator for neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 7970–7983. [Google Scholar] [CrossRef]
  118. Dai, K.; Xie, Z.; Liu, S. DCP-CNN: Efficient acceleration of CNNs with dynamic computing parallelism on FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 44, 540–553. [Google Scholar] [CrossRef]
  119. Feng, L. FPGA-Based Design and Implementation for Human Pose Estimation Algorithms. 2025. Available online: https://link.cnki.net/doi/10.27005/d.cnki.gdzku.2025.002354 (accessed on 14 October 2025).
  120. Fan, H.; Liu, S.; Que, Z.; Niu, X.; Luk, W. High-performance acceleration of 2-D and 3-D CNNs on FPGAs using static block floating point. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 4473–4487. [Google Scholar] [CrossRef]
  121. Shah, N.; Chaudhari, P.; Varghese, K. Runtime programmable and memory bandwidth optimized FPGA-based coprocessor for deep convolutional neural network. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5922–5934. [Google Scholar] [CrossRef]
  122. Zhao, R.; Ng, H.C.; Luk, W.; Niu, X. Towards efficient convolutional neural network for domain-specific applications on FPGA. In Proceedings of the 2018 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 27–31 August 2018; pp. 147–1477. [Google Scholar] [CrossRef]
  123. Seto, K.; Nejatollahi, H.; An, J.; Kang, S.; Dutt, N. Small memory footprint neural network accelerators. In Proceedings of the 20th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, USA, 6–7 March 2019; pp. 253–258. [Google Scholar] [CrossRef]
  124. Cui, J.; Zhou, Y.; Zhang, F. Field programmable gate array implementation of a convolutional neural network based on a pipeline architecture. J. Beijing Univ. Chem. Technol. (Nat. Sci. Ed.) 2021, 48, 111–118. [Google Scholar] [CrossRef]
  125. Li, T.; Zhang, F.; Wang, S.; Cao, W.; Chen, L. FPGA-Based Unified Accelerator for Convolutional Neural Network and Vision Transformer. J. Electron. Inf. Technol. 2024, 46, 2663–2672. [Google Scholar]
  126. Meng, H.; Liu, W. A FPGA-based convolutional neural network training accelerator. J. Nanjing Univ. (Nat. Sci.) 2021, 57, 1075–1082. [Google Scholar] [CrossRef]
  127. Choudhury, Z.; Shrivastava, S.; Ramapantulu, L.; Purini, S. An FPGA overlay for CNN inference with fine-grained flexible parallelism. ACM Trans. Archit. Code Optim. 2022, 19, 1–26. [Google Scholar] [CrossRef]
  128. Liu, Z.; Liu, Q.; Yan, S.; Cheung, R.C. An efficient FPGA-based depthwise separable convolutional neural network accelerator with hardware pruning. ACM Trans. Reconfig. Technol. Syst. 2024, 17, 1–20. [Google Scholar] [CrossRef]
  129. Baskin, C.; Liss, N.; Zheltonozhskii, E.; Bronstein, A.M.; Mendelson, A. Streaming architecture for large-scale quantized neural networks on an FPGA-based dataflow platform. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018; pp. 162–169. [Google Scholar]
  130. Qu, X.; Xu, Y.; Huang, Z.; Cai, G.; Fang, Z. J. Electron. Inf. Technol. 2022, 44, 1503–1512. [Google Scholar]
  131. Huang, W.; Luo, C.; Zhao, B.; Jiao, H.; Huang, Y. HCG: Streaming DCNN Accelerator With a Hybrid Computational Granularity Scheme on FPGA. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 18681–18695. [Google Scholar] [CrossRef]
  132. Jia, X.; Zhang, Y.; Liu, G.; Yang, X.; Zhang, T.; Zheng, J.; Xu, D.; Liu, Z.; Liu, M.; Yan, X.; et al. XVDPU: A high-performance cnn accelerator on the Versal platform powered by the AI engine. ACM Trans. Reconfig. Technol. Syst. 2024, 17, 1–24. [Google Scholar] [CrossRef]
  133. Rigoni, A.; Manduchi, G.; Luchetta, A.; Taliercio, C.; Schröder, T. A framework for the integration of the development process of Linux FPGA System on Chip devices. Fusion Eng. Des. 2018, 128, 122–125. [Google Scholar] [CrossRef]
  134. Vaithianathan, M. The Future of Heterogeneous Computing: Integrating CPUs GPUs and FPGAs for High-Performance Applications. Int. J. Emerg. Trends Comput. Sci. Inf. Technol. 2025, 1, 12–23. [Google Scholar] [CrossRef]
  135. Toupas, P.; Montgomerie-Corcoran, A.; Bouganis, C.S.; Tzovaras, D. HARFLOW3D: A latency-oriented 3D-CNN accelerator toolflow for HAR on FPGA devices. In Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Marina Del Rey, CA, USA, 8–11 May 2023; pp. 144–154. [Google Scholar] [CrossRef]
  136. Toupas, P.; Bouganis, C.S.; Tzovaras, D. FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition. In Proceedings of the 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Porto, Portugal, 19–21 July 2023; pp. 119–126. [Google Scholar] [CrossRef]
  137. Wang, H.; Zhang, Z.; Han, S. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Republic of Korea, 27 February–3 March 2021; pp. 97–110. [Google Scholar] [CrossRef]
  138. Toupas, P.; Bouganis, C.S.; Tzovaras, D. fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs. In Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 4–8 September 2023; pp. 86–92. [Google Scholar] [CrossRef]
  139. Pacini, T.; Rapuano, E.; Fanucci, L. FPG-AI: A technology-independent framework for the automation of CNN deployment on FPGAs. IEEE Access 2023, 11, 32759–32775. [Google Scholar] [CrossRef]
  140. Venieris, S.I.; Bouganis, C.S. f-CNNx: A toolflow for mapping multiple convolutional neural networks on FPGAs. In Proceedings of the 2018 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 27–31 August 2018; pp. 381–3817. [Google Scholar] [CrossRef]
  141. Sanchez, J.; Sawant, A.; Neff, C.; Tabkhi, H. AWARE-CNN: Automated workflow for application-aware real-time edge acceleration of CNNs. IEEE Internet Things J. 2020, 7, 9318–9329. [Google Scholar] [CrossRef]
  142. Sledevič, T.; Serackis, A. mNet2FPGA: A design flow for mapping a fixed-point CNN to Zynq SoC FPGA. Electronics 2020, 9, 1823. [Google Scholar] [CrossRef]
  143. Mousouliotis, P.G.; Petrou, L.P. CNN-Grinder: From algorithmic to high-level synthesis descriptions of CNNs for low-end low-cost FPGA SoCs. Microprocess. Microsyst. 2020, 73, 102990. [Google Scholar] [CrossRef]
  144. Wan, Y.; Xie, X.; Yi, L.; Jiang, B.; Chen, J.; Jiang, Y. Pflow: An end-to-end heterogeneous acceleration framework for CNN inference on FPGAs. J. Syst. Archit. 2024, 150, 103113. [Google Scholar] [CrossRef]
  145. Korol, G.; Jordan, M.G.; Rutzig, M.B.; Castrillon, J.; Beck, A.C.S. Design space exploration for CNN offloading to FPGAs at the edge. In Proceedings of the 2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Foz do Iguacu, Brazil, 20–23 June 2023; pp. 1–6. [Google Scholar] [CrossRef]
  146. Lu, W.; Hu, Y.; Ye, J.; Li, X. Throughput-oriented Automatic Design of FPGA Accelerator for Convolutional Neural Networks. J. Comput.-Aided Des. Comput. Graph. 2018, 30, 2164–2173. [Google Scholar]
  147. Lu, L.; Zheng, S.; Xiao, Q.; Liang, Y. FPGA Design for Convolutional Neural Networks. Sci. Sin. 2019, 49, 277–294. [Google Scholar] [CrossRef]
  148. Wu, R.; Liu, B.; Fu, P.; Ji, X.; Lu, W. Convolutional Neural Network Accelerator Architecture Design for Ultimate Edge Computing Scenario. J. Electron. Inf. Technol. 2023, 45, 1933–1943. [Google Scholar]
  149. Xu, Y.; Luo, J.; Sun, W. Flare: An FPGA-Based Full Precision Low Power CNN Accelerator with Reconfigurable Structure. Sensors 2024, 24, 2239. [Google Scholar] [CrossRef]
  150. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Performance modeling for CNN inference accelerators on FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 39, 843–856. [Google Scholar] [CrossRef]
  151. Ye, T.; Kuppannagari, S.R.; Kannan, R.; Prasanna, V.K. Performance modeling and FPGA acceleration of homomorphic encrypted convolution. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 115–121. [Google Scholar] [CrossRef]
  152. Zhao, R.; Niu, X.; Wu, Y.; Luk, W.; Liu, Q. Optimizing CNN-based object detection algorithms on embedded FPGA platforms. In Proceedings of the International Symposium on Applied Reconfigurable Computing, Delft, The Netherlands, 3–7 April 2017; Springer: Cham, Switzerland, 2017; pp. 255–267. [Google Scholar]
  153. Juracy, L.R.; Moreira, M.T.; de Morais Amory, A.; Hampel, A.F.; Moraes, F.G. A high-level modeling framework for estimating hardware metrics of CNN accelerators. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 4783–4795. [Google Scholar] [CrossRef]
  154. Wu, R.; Liu, B.; Fu, P.; Chen, H. A Resource Efficient CNN Accelerator for Sensor Signal Processing Based on FPGA. J. Circuits, Syst. Comput. 2023, 32, 2350075. [Google Scholar] [CrossRef]
  155. Chen, Y.H.; Yang, T.J.; Emer, J.; Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 292–308. [Google Scholar] [CrossRef]
  156. Blott, M.; Preußer, T.B.; Fraser, N.J.; Gambardella, G.; O’brien, K.; Umuroglu, Y.; Leeser, M.; Vissers, K. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Trans. Reconfig. Technol. Syst. 2018, 11, 1–23. [Google Scholar] [CrossRef]
  157. Wadekar, S.N.; Chaurasia, A. MobileViTv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. arXiv 2022, arXiv:2209.15159. [Google Scholar]
  158. Adibi, S.; Rajabifard, A.; Islam, S.M.S.; Ahmadvand, A. The Science Behind the COVID Pandemic and Healthcare Technology Solutions; Springer: Cham, Switzerland, 2022. [Google Scholar]
  159. Bengherbia, B.; Tobbal, A.; Chadli, S.; Elmohri, M.A.; Toubal, A.; Rebiai, M.; Toumi, Y. Design and hardware implementation of an intelligent industrial iot edge device for bearing monitoring and fault diagnosis. Arab. J. Sci. Eng. 2024, 49, 6343–6359. [Google Scholar] [CrossRef]
  160. Wu, B.; Wu, X.; Li, P.; Gao, Y.; Si, J.; Al-Dhahir, N. Efficient FPGA implementation of convolutional neural networks and long short-term memory for radar emitter signal recognition. Sensors 2024, 24, 889. [Google Scholar] [CrossRef]
  161. Majidinia, H.; Khatib, F.; Seyyed Mahdavi Chabok, S.J.; Kobravi, H.R.; Rezaeitalab, F. Diagnosis of Parkinson’s Disease Using Convolutional Neural Network-Based Audio Signal Processing on FPGA. Circuits Syst. Signal Process. 2024, 43, 4221–4238. [Google Scholar] [CrossRef]
  162. Liu, Z.; Ling, X.; Zhu, Y.; Wang, N. FPGA-based 1D-CNN accelerator for real-time arrhythmia classification. J. Real-Time Image Process. 2025, 22, 66. [Google Scholar] [CrossRef]
  163. Vitolo, P.; De Vita, A.; Di Benedetto, L.; Pau, D.; Licciardo, G.D. Low-power detection and classification for in-sensor predictive maintenance based on vibration monitoring. IEEE Sens. J. 2022, 22, 6942–6951. [Google Scholar] [CrossRef]
  164. Liu, S.; Li, K.; Luo, J.; Li, X. Research on Low-Power FPGA Accelerator for Radar Detection. In Proceedings of the 2024 IEEE 7th International Conference on Electronic Information and Communication Technology (ICEICT), Xi’an, China, 31 July–2 August 2024; pp. 196–199. [Google Scholar] [CrossRef]
  165. Ahmed, T.S.; Ahmed, F.M.; Elbahnasawy, M.; Youssef, A. Optimized Multi-Radar Hand Gesture Recognition: Robust MIMO-CNN Framework with FPGA Deployment. In Proceedings of the 2025 15th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 12–15 May 2025; pp. 1–6. [Google Scholar] [CrossRef]
  166. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91. [Google Scholar] [CrossRef]
  167. Wu, E.; Zhang, X.; Berman, D.; Cho, I.; Thendean, J. Compute-efficient neural-network acceleration. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 191–200. [Google Scholar]
Figure 1. Paper structure flowchart.
Figure 2. Conceptual diagram of convolution operations.
Figure 3. FPGA core architecture.
Figure 4. FPGA-based CNN acceleration technology.
Figure 5. Pruning: (a) pruning process, (b) different sparse structures.
Figure 6. Knowledge distillation framework diagram.
Figure 7. Low-rank approximation of 2D convolutional layers.
Figure 8. Loop tiling.
Figure 9. Loop unrolling.
Figure 10. Double buffering.
Figure 11. Parallel computing architecture diagram.
Figure 12. Energy efficiency comparison of CNN accelerator architectures.
Figure 13. Hardware accelerator architecture.
Figure 14. Algorithm–hardware co-design architecture.
Table 1. Comparison of existing review optimization methods.
Literature | Year | Platform Comparison | Quantization, Pruning | Model Architecture Optimization (Lightweight; Knowledge Distillation; Layer Integration) | Reduced Computation | Low-Rank Approximation | Data Flow Optimization | Double Buffering | Hardware Architecture Optimization (Parallelism; Pipelining; Hardware Accelerator) | Algorithm–Hardware Co-Optimization | Evaluation
[10] | 2019
[11] | 2019
[12] | 2020
[13] | 2020
[14] | 2021
[15] | 2021
[16] | 2022
[17] | 2022
[2] | 2023
[18] | 2024
[19] | 2024
[20] | 2025
[21] | 2025
ours | 2025
√: indicates that the paper has made corresponding contributions.
Table 2. Development stages of convolutional neural networks.
Stage | Time | Primary Contributor | Content | Significance
Theoretical Emergence | 1950s–1980 | David Hubel, Torsten Wiesel | Local receptive field | Established the theoretical foundation
Theoretical Emergence | 1980 | Kunihiko Fukushima | Neocognitron | 
Modern Foundation | 1998 | Yann LeCun | LeNet-5 | The first modern CNN trained end to end
Stagnation Period | 2000s–2011 | — | Deep learning was in a trough, yet related research continued, laying the theoretical groundwork for later breakthroughs | 
Deep Revolution | 2012 | Alex Krizhevsky | AlexNet | Demonstrated the dominance of deep learning, ushering in a new era
Diversified Development | 2014–present | VGGNet team | VGGNet | Architectural maturation
Diversified Development | 2014–present | GoogLeNet team | GoogLeNet | 
Diversified Development | 2014–present | Kaiming He et al. | ResNet | 
Diversified Development | 2014–present | Joseph Redmon | YOLO | Application deepening
Diversified Development | 2014–present | Major institutions | MobileNet | 
Diversified Development | 2014–present | Alexey Dosovitskiy | ViT | Paradigm extension
Table 3. Pseudo-code for convolution operations.
s = stride    # stride length
# Nested loops implementing the convolution computation
for n in range(N):                      # Loop 1: output feature map channels
    for c in range(C):                  # Loop 2: output feature map column dimension
        for r in range(R):              # Loop 3: output feature map row dimension
            for m in range(M):          # Loop 4: input feature map channels
                for x in range(Kx):     # Loop 5: convolution kernel row dimension
                    for y in range(Ky): # Loop 6: convolution kernel column dimension
                        Output_Feature_Map[n][r][c] +=
                            Input_Feature_Map[m][r*s + x][c*s + y] * Filter[n][m][x][y]
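For readers who prefer an executable form, the loop nest in Table 3 can be written directly in NumPy. The sketch below is a minimal illustration only: the array shapes, the absence of padding and bias, and the function name conv2d_loops are assumptions made for this example, not part of any cited design.

import numpy as np

def conv2d_loops(ifmap, weights, stride=1):
    # Direct nested-loop convolution following the loop order of Table 3.
    N, M, Kx, Ky = weights.shape      # output channels, input channels, kernel rows/cols
    _, H, W = ifmap.shape             # input feature map: (M, H, W)
    R = (H - Kx) // stride + 1        # output rows
    C = (W - Ky) // stride + 1        # output columns
    ofmap = np.zeros((N, R, C), dtype=np.float32)
    for n in range(N):                            # Loop 1: output channels
        for c in range(C):                        # Loop 2: output columns
            for r in range(R):                    # Loop 3: output rows
                for m in range(M):                # Loop 4: input channels
                    for x in range(Kx):           # Loop 5: kernel rows
                        for y in range(Ky):       # Loop 6: kernel columns
                            ofmap[n, r, c] += ifmap[m, r*stride + x, c*stride + y] * weights[n, m, x, y]
    return ofmap

ifmap = np.random.randn(3, 8, 8).astype(np.float32)       # 3-channel 8x8 input
weights = np.random.randn(4, 3, 3, 3).astype(np.float32)  # 4 output channels, 3x3 kernels
print(conv2d_loops(ifmap, weights).shape)                  # -> (4, 6, 6)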
Table 4. Comparison of computing platform features.
Platform | Energy Efficiency | Parallel Capability | Flexibility | Development Cycle | Development Cost
CPU | Low | Low | Highest | Shortest | High
GPU | Medium | Extremely high | Medium | Short | Highest
ASIC | Highest | Highest | None | Long | Lower after mass production
FPGA | High | High | High | Medium | Relatively low
Table 5. Complexity trade-offs between linear and nonlinear quantization.
Characteristic | Linear Quantization | Nonlinear Quantization
Implementation complexity | Low (simple mapping rules) | High (requires more complex algorithms)
Hardware friendliness | High (can directly reuse integer ALUs) | Moderate (often requires custom logic units)
Extreme-compression capability | Moderate (significant loss of precision below 4 bits) | Extremely high (supports binary/ternary and other extreme compression)
Accuracy retention | Better (8-bit has been widely validated) | Method-dependent; significant accuracy loss may occur in extreme cases
Application maturity | Widespread, technologically mature | Emerging, still in the research and exploration phase
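To make the "simple mapping rules" of linear quantization concrete, the sketch below applies symmetric per-tensor INT8 quantization with a single scale factor, which is why the quantized values can be fed directly to integer ALUs or DSP slices. The max-absolute-value scale rule and the function names are illustrative assumptions rather than the scheme of any specific work cited above.

import numpy as np

def linear_quantize_int8(w):
    # Symmetric uniform quantization: one scale factor for the whole tensor.
    max_abs = np.max(np.abs(w))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original floating-point values.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # example convolution weights
q, s = linear_quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, s)).max())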
Table 6. Comparison of quantization methods.
Literature | Quantization Method | Model | Achievement
[44] | 16-bit | CNN | Converts the 32-bit floating-point model to a 16-bit fixed-point model
[43] | 8-bit | CNN | Converts the 32-bit floating-point model to an 8-bit fixed-point model
[46] | 4-bit | CNN | Data-agnostic knowledge distillation
[47] | Binary quantization | DNN | XNOR and shift operations eliminate the multiplication bottleneck
[49] | Binary quantization | CNN | Weights are binarized and implemented with hard-wired logic
[50] | Ternary quantization | CNN | Soft-threshold training framework
[51] | Ternary quantization | CNN | Implemented through a specific 2-bit encoding combined with bit-level instructions
[52] | Quadratic quantization | CNN | Designs a multi-level scaling tensor data format
[53] | Quadratic quantization | CNN | Trains hardware-friendly bit-level sparse patterns
[54] | K-means quantization | CNN | Nonlinear quantization using compressor functions
[55] | K-means quantization | CNN | A teacher model learns to generate discrete codebooks
Table 7. Comparison of selected pruning techniques.
Literature | Category | Model | Dataset | Computational Complexity (FLOPs) / Compression Ratio (CR) | Accuracy | Method
[56] | SSL | AlexNet | ImageNet | — | Top-1 ↓0.9% | Structured regularization using Group Lasso
[56] | SSL | ResNet-20 | CIFAR-10 | — | ↓1.42% | 
[57] | SSS | ResNet-50 | ImageNet | FLOPs ↓43% | Top-1 ↑4.3% | Introduces a learnable scaling factor λ with L1 regularization
[57] | SSS | ResNet-164 | CIFAR-10 | FLOPs ↓60% | Top-1 ↑1% | 
[58] | GAL | ResNet-50 | ImageNet | FLOPs ↓72% | Top-5 ↑3.75% | Soft mask and label-free GAN pruned simultaneously
[58] | GAL | ResNet-56 | CIFAR-10 | FLOPs ↓60% | ↑1.7% | 
[58] | GAL | VGGNet | CIFAR-10 | FLOPs ↓45% | ↑0.5% | 
[59] | BAR | Wide-ResNet | CIFAR-10 | FLOPs ↓16× | Top-1 ↑0.7% | Applies explicit constraints using modified functions
[59] | BAR | ResNet-50 | Mio-TCD | FLOPs ↓200× | | 
[60] | ECCV | AlexNet | ImageNet | CR 20× | Top-1 ↑3.1% | Corrects pruning errors through feedback mechanisms
[61] | L-BS | LeNet-300-100 | MNIST | CR 14× | ↓0.3% | Pruning via the second derivative of the layer-wise error
[61] | L-BS | LeNet-5 | MNIST | CR 14× | ↓0.3% | 
[61] | L-BS | VGG-16 | ImageNet | CR 12.5× | Top-1 ↑0.3% | 
[62] | BA-FDNP | LeNet-5 | MNIST | CR 150× | | Frequency-domain transformation, dynamic pruning, band-adaptive pruning
[62] | BA-FDNP | AlexNet | ImageNet | CR 22.6× | | 
[62] | BA-FDNP | ResNet-110 | CIFAR-10 | CR 8.4× | ↑12% | 
[62] | BA-FDNP | ResNet-20 | CIFAR-10 | CR 6.5× | | 
[63] | DPF | LeNet-50 | ImageNet | 82.6% sparse | Top-1 ↓47% | Corrects pruning errors through feedback mechanisms
↑: increase; ↓: decrease; FLOPs ↓x%: computational load reduced by x%; CR X×: compressed to 1/X of the original size.
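As a generic illustration of structured (channel) pruning, rather than a reproduction of any specific method in Table 7, the sketch below ranks the output channels of a convolutional layer by their L1 norm and discards the weakest fraction; the pruning ratio, the L1 criterion, and the function name are assumptions made for this example.

import numpy as np

def prune_channels_l1(weights, prune_ratio=0.5):
    # weights: (N_out, M_in, Kx, Ky); remove the output channels with the smallest L1 norm.
    n_out = weights.shape[0]
    l1 = np.abs(weights).reshape(n_out, -1).sum(axis=1)          # one score per output channel
    n_keep = max(1, int(round(n_out * (1.0 - prune_ratio))))
    keep = np.sort(np.argsort(l1)[-n_keep:])                     # indices of surviving channels
    return weights[keep], keep

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_pruned, kept = prune_channels_l1(w, prune_ratio=0.5)
print(w.shape, "->", w_pruned.shape)                             # (64, 32, 3, 3) -> (32, 32, 3, 3)

In practice, removing a layer's output channels also requires removing the corresponding input channels of the following layer and fine-tuning to recover accuracy, which is where the methods compared in Table 7 differ most.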
Table 8. Comparison of algorithmic approaches.
Type | Literature | Method | Result
Quantization | [43] | Quantization, compression | Memory usage reduced: 66%; DSP usage reduced: 78%
Quantization | [46] | Knowledge distillation, quantization | Average accuracy increased: 64.5%
Quantization | [50] | Soft-threshold quantization network | Performance improvement: 3.96×; LUT usage reduction: 30%
Pruning | [57] | Sparse regularization, pruning | ResNet-164 acceleration: 2.5×; ResNeXt-164 FLOPs reduction: 60%
Pruning | [59] | Knowledge distillation, regularized pruning | CIFAR-10 FLOPs reduction: 64×; CIFAR-100 FLOPs reduction: 60×
Pruning | [62] | Dynamic pruning, frequency-domain transformation | LeNet-5 parameter reduction: 150×; AlexNet parameter reduction: 22.6×
Model Architecture Optimization | [76] | Partitioned knowledge distillation, compression strategies | Accuracy: 88.6%
Model Architecture Optimization | [78] | Quantization, knowledge distillation | Accuracy: 98.5%; LUT usage: 21.52%
Model Architecture Optimization | [80] | Winograd, PE optimization | Inference latency: 21.4 ms
Reduction in Computational Load | [86] | FFT, multiplier optimization | Accuracy: 98%
Reduction in Computational Load | [90] | Winograd, PE optimization | Resource usage reduction: 38%
Low-Rank Approximation | [92] | Low-rank approximation, channel pruning | Parameter compression: 34.89×
Low-Rank Approximation | [93] | Low-rank approximation, channel pruning | FLOPs reduction: 59.17%; parameter reduction: 66.77%
×: multiplication factor.
Table 9. Comparison of optimization strategies.
Type | Design Methodology | Key Trade-Offs | Specificity | Research Areas
Computational Optimization | Quantization, pruning, convolution algorithms | Accuracy vs. compression rate | Algorithm level | Model compression, NAS
Storage and Data Flow Optimization | Loop optimization, double buffering | Throughput vs. on-chip resources | Microarchitecture level | Storage subsystem, compilation scheduling
Hardware Architecture Optimization | Deep pipelining, computing unit arrays | Performance vs. logic resources | Macro-architecture level | FPGA/ASIC architecture, EDA tools
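To make the "loop optimization" entry concrete, the sketch below tiles the output feature map of the convolution from Table 3 so that only one Tr × Tc output block (and its matching input window and weights) must be resident in on-chip buffers at a time; the tile sizes, shapes, and function name are illustrative assumptions, not parameters from any cited design.

import numpy as np

def conv2d_tiled(ifmap, weights, stride=1, Tr=8, Tc=8):
    # Loop tiling over the output rows/columns of the Table 3 loop nest.
    N, M, Kx, Ky = weights.shape
    _, H, W = ifmap.shape
    R = (H - Kx) // stride + 1
    C = (W - Ky) // stride + 1
    ofmap = np.zeros((N, R, C), dtype=np.float32)
    for r0 in range(0, R, Tr):                 # tile loops: walk over output blocks
        for c0 in range(0, C, Tc):
            r1, c1 = min(r0 + Tr, R), min(c0 + Tc, C)
            # On an FPGA, the input window and weights for this tile would be
            # loaded into on-chip buffers (BRAM) at this point.
            for n in range(N):
                for r in range(r0, r1):
                    for c in range(c0, c1):
                        for m in range(M):
                            for x in range(Kx):
                                for y in range(Ky):
                                    ofmap[n, r, c] += ifmap[m, r*stride + x, c*stride + y] * weights[n, m, x, y]
    return ofmap

ifmap = np.random.randn(3, 20, 20).astype(np.float32)
weights = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(conv2d_tiled(ifmap, weights).shape)      # -> (4, 18, 18)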
Table 10. Performance comparison of existing CNN acceleration designs at different parallelism levels.
Parallelism | Literature | Model | Delay (ms) | Throughput (GOPS) | Computational Efficiency | DSP Resource Utilization | Clock Frequency (MHz) | Energy Efficiency (GOPS/W)
Pk, Pv, Pf[109]U-Net58.410771%20011.1
Pc, Pf[110]LeNet-50.253.3554.55%100
[111]Image Fusion19217.7<40%100
[112]ResNet-505.07151992.7%97%20033.8
[113]FSRCNN-light458167174.9
[114]VGG-1639797.79%97.26%20016.5
Pc, Pf, Pv[115]VGG-16276688.9%71%19072.7
[116]I3D178.368499%96%20026.3
[117]MobileNetV10.64787.1582%211121.3
[100]YOLOv4-tiny42712.0199%1104.558
[118]VGG160.4880798.5%88.9%200128.1
Pc, Pf, Pv, Pk[119]OpenPose117.07288.323.14%25070.3
[120]ResNet-508.36133092.4%88.6%220118.1
Pf, Pv[121]CIFAR-100.3162.737%1507.2
Table 11. Comparison of pipelining methods.
Type | Description | Granularity | Advantages | Applicable Scenarios
Arithmetic pipeline | Pipelines complex computations between processing units | Fine-grained | Accelerates the computational process between units | Image classification, object detection
Inter-layer pipeline | Enables different layers of the model to work in parallel | Coarse-grained | Reduces end-to-end latency across the entire network | Autonomous driving, radar monitoring
Instruction pipeline | Executes instructions in stages | Medium-grained | High flexibility and versatility | Real-time cascaded network applications (such as license plate recognition)
Hybrid pipeline | Combines different pipeline strategies | Mixed granularity | Balances performance, resources, and flexibility | Robotic grasping, remote sensing image processing
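The coarse-grained, inter-layer idea can be mimicked in software with a two-stage producer–consumer pipeline, which is also the software analogue of the double buffering discussed earlier: one stage fetches the next tile while the previous tile is being processed. The queue depth of 2, the random "tiles", and the ReLU-plus-sum stand-in for the compute stage are assumptions made purely for illustration.

import threading
import queue
import numpy as np

def producer(q, num_tiles):
    # Stand-in for a DMA engine streaming tiles from external memory.
    for _ in range(num_tiles):
        q.put(np.random.randn(16, 16).astype(np.float32))
    q.put(None)                                      # end-of-stream marker

def consumer(q, results):
    # Stand-in for the compute stage working on the previously loaded tile.
    while True:
        tile = q.get()
        if tile is None:
            break
        results.append(float(np.maximum(tile, 0).sum()))

q = queue.Queue(maxsize=2)                           # depth-2 queue ~ ping-pong buffer
results = []
t1 = threading.Thread(target=producer, args=(q, 8))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(results), "tiles processed")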
Table 12. Hardware accelerator architecture comparison.
Type | Advantages | Disadvantages | Applicable Scenarios
Single-engine architecture | High flexibility and versatility, high hardware reuse rate | Complex control logic and potential performance bottlenecks | Cloud data centers
Streaming architecture | Extremely high throughput, relatively straightforward design, stable and predictable performance | Poor flexibility and versatility, uneven resource utilization, and potentially higher resource overhead | Video surveillance, industrial inspection
Table 13. Differences in CNN acceleration designs across mainstream FPGA platforms.
Characteristic Dimension | Xilinx Platform | Altera Platform | Impact on CNN Acceleration Design
Core computing resources | DSP58 module | DSP Block | Designers should optimize quantization strategies based on the native precision supported by the DSP
On-chip storage architecture | BRAM | M20K Block | Directly affects the caching strategy for weights and feature maps
Heterogeneous system integration | Zynq/Versal series | Agilex SoC series | Hardened ARM processors are well suited to efficient hardware–software co-design
High-level design tools | Vitis AI | OpenVINO for FPGA | Directly impacts development efficiency and the ability to rapidly iterate algorithms
Table 14. Comparison of hardware strategies.
Type | Compute Parallelism | Memory Access Efficiency | Energy Efficiency Potential | Resource Consumption
Data Flow Optimization √ √ √ √ √
Hardware Architecture Optimization √ √ √ √ √ √ √ √ √
Hardware Accelerator √ √ √ √ √ √ √ √
√: Qualitatively indicates that the corresponding hardware strategy holds an advantage in this metric.
Table 15. Summary of mainstream toolflows.
Name | Organization | Description | Advantages | Disadvantages | Applicable Scenarios
Vitis AI | AMD-Xilinx | Provides a complete model optimization, compilation, and deployment workflow | High development efficiency, mature ecosystem | Less flexible, dependent on the DPU architecture | Industrial vision, ADAS, medical imaging
Intel oneAPI | Intel | Cross-platform inference engine | High platform portability with a unified toolchain | Performance may be inferior to purpose-built solutions | Edge servers, industrial PCs
Deep Learning HDL Toolbox | MathWorks | Provides model-to-FPGA prototyping and HDL code generation | Deeply integrated with MATLAB (R2020b and later) hardware/software co-simulation | Relies on the commercial MATLAB ecosystem | Algorithm–hardware co-verification
TVM + VTA | Apache | An open-source end-to-end deep learning compilation stack | Fully open source and highly customizable | Toolchain configuration is complex | Projects requiring deep customization of accelerators
hls4ml | Open source | Automatic generation of HLS code from a model | Highly automated, focused on ultra-low latency | Less versatile; optimization may fall short of manual design | Particle physics trigger systems, low-latency edge inference
FINN | Xilinx Research Labs | Open source, specializing in deeply quantized models | Exceptionally high resource efficiency and performance | Supports only heavily quantized models, with a high barrier to entry | Embedded FPGA, micro UAVs
DNNBuilder | Tsinghua University, MSRA | Automation tool providing an end-to-end workflow for high-performance FPGA implementation | Highly automated, no RTL programming required | Lacks ongoing commercial-grade support and comprehensive documentation | CNN inference acceleration services
FP-DNN | Peking University, UCLA | Automatically generates hardware implementations based on RTL–HLS hybrid templates | End-to-end automation; supports multiple network types with good versatility | Performance optimization may lag architectures designed specifically for CNNs | Large-scale research models, data-center FPGA clusters
Caffeine | Peking University, UCLA | A hardware/software co-design library | Focuses on optimizing memory bandwidth | Not user-friendly for users of other frameworks | Cloud-side inference acceleration
DeepBurning | Tsinghua University | Automated neural network accelerator generation framework | Highly automated, supporting multiple networks such as CNNs and RNNs | Lacks robustness and support for the latest networks | Verification projects requiring rapid migration across FPGAs
Table 16. Summary of selected innovation toolflows.
Literature | Name | Development Board | Network Model | Throughput (GOPS) | DSP Resource Utilization
[140] | f-CNNx | Zynq XC7Z045 SoC | VGG16 | 79.63 | —
[141] | AWARE-CNN | Xilinx ZCU102 | AlexNet | 271 | —
[142] | mNet2FPGA | Zynq Z-7020 | VGG16 | 8.431%
[143] | CNN-Grinder | Xilinx XC7Z020 | SqueezeNet v1.1 | 14.18 | 78.18%
[135] | HARFLOW3D | Xilinx ZCU102 | X3D-M | 56.149.61%
[136] | FMM-X3D | Xilinx ZCU102 | X3D-M | 119.83 | 85%
[138] | fpgaHART | Xilinx ZCU102 | X3D | 85.96 | 84.43%
[144] | Pflow | Ethinker A8000 | SSD-MobileNetV1-300 | 196.08 | 49%
[139] | FPG-AI | Xilinx XC7Z045 | MobileNet-V1 | 7.3 | —
Table 17. Summary of DSE strategies.
Method | Core Idea | Advantages | Disadvantages | Literature
Exhaustive Search | Exhaustively enumerates all discrete parameter combinations | Guarantees the global optimum | Long search time; only suitable for extremely small spaces | [125,144,145]
Heuristic Search | Simulated annealing/genetic algorithms gradually approach the optimal solution | Moderate complexity yields near-optimal results | May get stuck in local optima; poor interpretability | [135,136,138,146]
Deterministic Approach | Rapid estimation using analytical models and mathematical equations | Fast execution | Less flexible | [141,147,148]
Mixed Methods | Combines the strengths of multiple strategies to compensate for the shortcomings of any single approach | Fast execution | High implementation complexity | [130,149]
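A minimal sketch of the exhaustive-search strategy is shown below: it enumerates every combination of candidate unrolling factors, discards points that exceed a resource budget, and keeps the best according to a simple analytical latency model. The parameter space, DSP budget, and cost model are placeholder assumptions, not figures from any cited design; heuristic strategies differ mainly in replacing the full enumeration with sampling or evolutionary updates.

import itertools

# Candidate unroll factors for output channels (Pn) and output pixels (Pr, Pc) -- illustrative values.
SPACE = {"Pn": [1, 2, 4, 8, 16], "Pr": [1, 2, 4], "Pc": [1, 2, 4]}
DSP_BUDGET = 220                                # assumed DSP budget of a small FPGA
TOTAL_OPS = 2 * 64 * 32 * 3 * 3 * 56 * 56       # MAC ops x2 for one example layer

def dsp_cost(pn, pr, pc):
    return pn * pr * pc                         # one DSP per parallel MAC (simplified)

def latency_cycles(pn, pr, pc):
    return TOTAL_OPS / (2 * pn * pr * pc)       # ideal pipelining assumed

best = None
for pn, pr, pc in itertools.product(SPACE["Pn"], SPACE["Pr"], SPACE["Pc"]):
    if dsp_cost(pn, pr, pc) > DSP_BUDGET:
        continue                                # prune infeasible design points
    candidate = (latency_cycles(pn, pr, pc), pn, pr, pc)
    if best is None or candidate < best:
        best = candidate
print("best (cycles, Pn, Pr, Pc):", best)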
Table 18. Summary of collaborative design cases.
Case Study | Target Platform | Critical Algorithm-Mapping Decisions | DSE Decision | Performance Evaluation Results | Literature
HARFLOW3D-X3D | ZCU102 | 3D-CNN decomposed into 3D-Conv + Pool + PW | Genetic algorithm | Energy efficiency 1.9×; DSP utilization 99.6%; frame latency 42% | [135]
SpAtten | Stratix-10 GX | Per-head sparsity threshold for Transformers | Simulated annealing | Throughput 13.8×; energy efficiency 2.1×; BERT-Large accuracy 0.3% | [137]
Eyeriss v2 | 65 nm ASIC | 16× structured channel pruning | Mixed strategy | Energy efficiency 3.2×; area 38%; ResNet-50 only 0.7% | [155]
FINN-R | Zynq-7020 | Extreme 1-bit weights + 2-bit activations | Exhaustive search | Energy efficiency 2.7×; power consumption 0.8 W; MNIST accuracy 99.2% | [156]
MobileViT-v3 | A14 Bionic | ATS + linear attention | Heuristic search | Inference speed 2.3×; Top-1 accuracy 0.4% | [157]
↑: increase; ↓: decrease; ×: multiplication factor.
Table 19. Analysis of embedded applications across different fields.
Application Area | Research Topic | Hardware Platform | Algorithm Optimization | Hardware Optimization | Key Performance Indicators | Literature
Medical Testing | Classification of respiratory symptoms | Xilinx Artix-7 100T | Low-rank quantization, hyperparameter optimization | Parallel computing, logic unit optimization | Accuracy | [158]
Medical Testing | Diagnosis of Parkinson’s disease | Xilinx Zynq-7020 | Model compression | Parallel processing engine | Accuracy, latency | [161]
Medical Testing | Classification of arrhythmias | Xilinx Zynq-7020 | Network architecture optimization | Multiplier-free convolution unit | Accuracy, real-time capability | [162]
Industrial Maintenance | Bearing monitoring | PYNQ-Z2 FPGA | Partial binarization | Asynchronous processing pipeline | Response time, system integration | [159]
Industrial Maintenance | Sensor-based predictive maintenance | Xilinx Artix-7 | 8-bit quantization, layer fusion | Double buffering | Power consumption, area | [163]
Radar Perception | Radar emitter signal identification | Xilinx XCKU040 | 16-bit fixed-point quantization | Systolic array, double buffering | Recognition rate, energy efficiency, throughput | [160]
Radar Perception | Radar target detection | PYNQ-Z1 | 16-bit fixed-point quantization | Loop optimization | Power consumption, accuracy | [164]
Radar Perception | Multi-radar gesture recognition | Xilinx Kria KR260 | INT8 quantization | DPU deployment | Accuracy, speed | [165]
Table 20. Comparison of accuracy evaluation metrics.
Indicator | Calculation Logic | Applicable Scenarios | Relationship with Standard Accuracy
Top-1 Accuracy | The model’s predicted category with the highest probability must exactly match the actual label to be considered correct | The model must provide a single correct answer | Equivalent to the traditional definition of accuracy
Top-5 Accuracy | A prediction is considered correct if the true label appears among the five categories with the highest predicted probabilities | More tolerant of ambiguous categories | Extended form of traditional accuracy
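Both metrics reduce to a top-k membership test, as the sketch below shows; the random logits and labels are placeholders used only to demonstrate the computation.

import numpy as np

def topk_accuracy(logits, labels, k=5):
    # logits: (num_samples, num_classes); labels: (num_samples,)
    topk = np.argsort(logits, axis=1)[:, -k:]            # indices of the k highest scores
    hits = np.any(topk == labels[:, None], axis=1)       # does the true label appear among them?
    return hits.mean()

logits = np.random.randn(1000, 100)                      # dummy predictions for 100 classes
labels = np.random.randint(0, 100, size=1000)
print("Top-1:", topk_accuracy(logits, labels, k=1))
print("Top-5:", topk_accuracy(logits, labels, k=5))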