Article

Layer-Pipelined CNN Accelerator Design on 2.5D FPGAs

School of Microelectronics, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4587; https://doi.org/10.3390/electronics14234587
Submission received: 22 October 2025 / Revised: 14 November 2025 / Accepted: 20 November 2025 / Published: 23 November 2025
(This article belongs to the Special Issue Advances in High-Performance and Parallel Computing)

Abstract

With the rapid advancement of 2.5D FPGA technology, the integration of multiple FPGA dies enables larger design capacity and higher computing power. This progress provides a high-speed hardware platform well-suited for neural network acceleration. In this paper, we present a high-performance accelerator design for large-scale neural networks on 2.5D FPGAs. First, we propose a layer pipeline architecture that utilizes multiple accelerator cores, each equipped with individual high-bandwidth DDR memory. To address inter-die data dependencies, we introduce a block convolution mechanism that enables independent and efficient computation across dies. Furthermore, we propose a design space exploration scheme to optimize computational efficiency under resource constraints. Experimental results demonstrate that our proposed accelerator achieves 4860.87 GOPS when running VGG-16 on the Alveo U250 board, significantly outperforming existing layer pipeline designs on the same platform.

1. Introduction

Neural networks are increasingly supplanting numerous traditional algorithms owing to their superior performance. For instance, the YOLO series of object detection networks introduced by Redmon et al. [1,2,3] has become a representative approach in real-time object detection, with the latest YOLOv10 [4] achieving a mean average precision (mAP) of 55.2 at IoU thresholds ranging from 0.5 to 0.95 on the COCO benchmark. Similarly, the Wasserstein Generative Adversarial Network (WGAN) proposed by Miuccio et al. [5] demonstrates a lower Symbol Error Rate (SER) under low signal-to-noise ratio (SNR) conditions, outperforming conventional cascaded modulation-demodulation and channel coding schemes. Furthermore, Transformer-based models such as DeepSeek [6], Qwen [7,8], and LLaMA [9,10,11] have exhibited superhuman efficiency in tasks including machine translation, code generation, and complex planning.
As applications continue to advance, neural network architectures are trending toward greater complexity and increased depth. Recent convolutional neural networks (CNNs) such as those presented in [12,13,14,15,16] employ convolutional kernels larger than 7 × 7 to enhance image recognition capabilities and achieve higher accuracy. However, this architectural evolution entails substantial computational and memory overhead. Consequently, there is growing impetus for the development of dedicated neural network accelerators, among which FPGA-based accelerators have emerged as a research hotspot owing to their flexible reconfigurability and high energy efficiency.
Deploying neural networks on FPGAs is fundamentally constrained by the finite on-chip compute (DSP) and on-chip storage (BRAM) resources. The limited counts of both DSP slices and BRAM blocks in conventional FPGA devices impose a tight upper bound on the attainable throughput, preventing accelerators from achieving their theoretical peak performance.
To achieve higher performance, many researchers have turned to more advanced 2.5D FPGAs with substantially greater resource budgets. These 2.5D FPGAs leverage 2.5D stacked silicon interconnect (SSI) technology to integrate multiple super logic regions (SLRs) within a single package and incorporate multiple DDR memory controllers. Compared to traditional monolithic-die FPGAs, 2.5D FPGAs provide approximately 3 to 4 times more DSP resources and significantly higher memory bandwidth. This architectural enhancement effectively mitigates the computational and memory bottlenecks inherent in conventional single-die FPGAs, thereby facilitating the rapid implementation and deployment of high-performance neural network accelerators.
Nevertheless, the abundance of compute and storage resources in 2.5D FPGAs also introduces new challenges for accelerator design. Although silicon interposer or advanced interconnect technologies significantly increase resource density, inter-SLR (Super Logic Region) communication latency and limited inter-SLR bandwidth have emerged as critical bottlenecks that constrain overall accelerator performance. Existing FPGA-based CNN accelerators are predominantly optimized for traditional monolithic-die FPGAs and have not been specifically tailored to exploit the architectural characteristics of 2.5D FPGAs. Consequently, when deployed on 2.5D FPGAs, these designs often suffer from suboptimal resource utilization and performance inefficiencies, leaving substantial room for improvement.
For example, Blevec et al. [17] proposed a scalable heterogeneous pipelined architecture employing multiple layer-specific hardware-accelerated cores. Although their approach aimed to harness the multi-resource capabilities of 2.5D FPGAs, the unstructured data transfers across different dies introduced significant pipeline startup latency, ultimately degrading overall throughput. Similarly, Kim et al. [18] implemented multiple compute cores distributed across distinct SLRs, with each core capable of executing an entire convolutional computation task. These SLRs were connected to a shared external DRAM controller via a common bus. Their VGG-16 implementation on the VCU118 development platform achieved a throughput of 402 GOPS. While mapping each core entirely within a single SLR helped mitigate timing closure issues associated with cross-SLR placement, their reliance on a shared bus for both inter-SLR data synchronization and DRAM access failed to address the system-level latency arising from weight caching across multiple SLRs.
Moreover, other studies deploying neural networks on 2.5D FPGAs such as those by [19,20,21,22,23] have largely overlooked the clock frequency degradation induced by cross-SLR data transfer architectures. This oversight has led to severely underutilized DSP resources; for instance, Park et al. [21] reported DSP utilization as low as 7%, highlighting a significant opportunity for architectural co-design and optimization specific to 2.5D FPGA platforms.
To address this issue, in this paper, we propose a layer-pipelined neural network accelerator design for 2.5D FPGAs. The primary contributions of this paper are as follows:
  • We propose a hierarchical pipeline accelerator tailored for 2.5D FPGA characteristics, implementing one acceleration core per SLR. The silicon interposer between SLRs enables cross-layer transmission of output feature maps, reducing the impact of inter-die communication latency on accelerator performance.
  • We design a partitioned convolution scheme, leveraging a block-based convolutional computing dataflow to enhance DDR bandwidth utilization in the accelerator. Furthermore, we establish a cross-SLR data transfer temporal model to quantitatively analyze the impact of the convolution partitioning strategy on system performance.
  • We propose a design space exploration algorithm for the 2.5D layer pipeline. By defining mathematical models for configuration parameters, resource consumption, and computation time, we optimize the accelerator with the objectives of minimizing overall inference time and ensuring consistent pipeline initiation intervals. This algorithm searches for the optimal configuration and block instruction parameters for each acceleration core, maximizing pipeline efficiency.
The rest of this paper is organized as follows. Section 2 presents the proposed neural network accelerator and its design details. Section 3 analyzes the experimental results. Section 4 summarizes this paper.

2. The Multi-Core Accelerator Architecture for 2.5D FPGA

In this section, we first present the computing architecture of the proposed CNN accelerator. Then, we introduce a block convolution scheme for the computing dataflow. Finally, we present the design space exploration to set the optimal configuration in detail.

2.1. Computing Architecture

This section introduces the architecture of a pipeline-based CNN accelerator. As a case study, we consider a 2.5D FPGA comprising four dies to illustrate the design of a layer-pipelined convolutional neural network accelerator. Each of the four SLRs in the 2.5D FPGA is equipped with a dedicated DDR4 controller. Adjacent SLRs are interconnected via thousands of Super Logic Links (SLLs), whose latency is approximately 4 to 8 times higher than that of intra-SLR communication.
As shown in Figure 1a, a configurable accelerator core is deployed on each SLR. For each core, illustrated in Figure 1b, the input feature maps and weights are stored in the local DDR of the current SLR, while the output feature maps are written to the DDR of the adjacent SLR. In particular, the output feature maps generated by the accelerator core on the final SLR (SLR3) are stored in the DDR0 of SLR0. As a result, the input and output of the accelerator cores across the four dies are connected in an end-to-end manner, forming a ring structure. This ring architecture serves as the hardware foundation for implementing the layer pipeline. The rationale behind adopting this ring structure is that the output bandwidth requirement of each accelerator core is relatively low. By confining each core to its respective SLR, we can maximize resource utilization within each die while minimizing the performance impact caused by the higher latency of inter-SLR communication through SLLs.
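The ring-connected core/DDR mapping described above can be sketched as a small Python model (an illustrative sketch of our own; the function name and dictionary layout are not from the paper): each core s reads its inputs from the DDR of its own SLR and writes its outputs to the DDR of the next SLR, with the final core wrapping back to DDR0.

```python
NUM_SLRS = 4  # the case-study 2.5D FPGA comprises four dies

def ring_mapping(num_slrs=NUM_SLRS):
    """Return, per accelerator core, which DDR it reads from and writes to.
    The output of the core on the last SLR wraps around to DDR0, forming
    the ring that underpins the layer pipeline."""
    return {s: {"input_ddr": s, "output_ddr": (s + 1) % num_slrs}
            for s in range(num_slrs)}

mapping = ring_mapping()
# Core on SLR3 writes its output feature maps to DDR0 of SLR0, closing the ring.
```

Because each core only ever writes to one neighboring SLR, every cross-SLR transfer crosses exactly one SLL boundary, which is what keeps the higher inter-SLR latency off the critical compute path.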
Data in any DDR can be accessed via the PCIe bus. During CNN model deployment, convolutional layers are assigned sequentially to each die. Each accelerator core processes one convolutional layer at a time, and data for the subsequent layer is loaded only after the computation of the current layer is completed.
Each accelerator core comprises a set of processing units (PUs), an input feature map (IFM) buffer, multiple weight buffers, an output feature map (OFM) buffer, and a control module. The IFM and OFM buffers are shared among all PUs and are managed by an instruction controller. Each PU is equipped with its own dedicated weight buffer. Inside each PU, a three-stage processing pipeline is implemented, consisting of convolution, activation, and pooling units, as illustrated in Figure 2.
In our design, the number of PEs within each accelerator core, the size of each buffer, and the number of DSPs within each PE are all adjusted based on the characteristics of the convolutional layers running on each accelerator core. This customization helps balance the diverse computational demands across layers and effectively mitigates DSP idle time issues.

2.2. Computational Dataflow for Multi-Stage Pipelines

In our design, we propose the use of block convolution to address data dependencies across dies. The core idea of this block convolution optimization is to partition the feature maps and weights of convolutional layers into independent blocks. Block convolution adopts a split-and-concatenate computing mechanism, where the convolution results of adjacent spatial blocks are independent and free of data dependencies.
We divide each convolutional layer into multiple convolution blocks following the procedure outlined in Algorithm 1, and then perform computations block by block. Specifically, the IFMs are pre-partitioned into data blocks of size b_c × b_h × b_w, and the weights are divided into blocks of size b_n × b_c × k_y × k_x, based on the layer parameters H, W, C, N, k_x, and k_y. Using these data blocks as basic computation units, the entire convolutional layer is partitioned into several independent convolution blocks. At runtime, one convolution block is computed at a time, and the next block is loaded only after the current one is completed. Since it is not feasible to load all parameters of an entire convolutional layer into on-chip buffers simultaneously, block-wise partitioning significantly reduces buffer space requirements. Moreover, this block-based execution enables each accelerator core to begin computation with only a subset of the input feature maps, thereby facilitating efficient pipelined processing.
Algorithm 1 Partition the convolution layer into several convolution blocks
for h = 0; h < H; h = h + b_h do
    for w = 0; w < W; w = w + b_w do
        for n = 0; n < N; n = n + b_n do
            for c = 0; c < C; c = c + b_c do
                ▷ Partition an IFM data block and load it into the IFM buffer;
                ifm[b_c^ifm][b_h^ifm][b_w^ifm] ← IFM[C][H][W];
                ▷ Partition weight data blocks and load them into the weight buffers; each weight buffer loads one data block;
                weight[b_n^weight][b_c^weight][k_x][k_y] ← Weight[N][C][k_x][k_y];
                ▷ We set b_n equal to the number of PEs; all PE units share the IFM block, and partial sums accumulate in the OFM buffer;
                ofm[b_n^ofm][b_h^ofm][b_w^ofm] += conv(ifm, weight);
                if c + b_c ≥ C then
                    ▷ When the C-dimension computation is complete, the OFM is stored in the adjacent SLR's DDR;
                    OFM[N][H_ofm][W_ofm] ← ofm[b_n^ofm][b_h^ofm][b_w^ofm];
                else
                    ▷ Otherwise, it is retained as a partial sum in the OFM buffer;
                end if
            end for
        end for
    end for
end for
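The loop nest of Algorithm 1 can be sketched as a Python generator that emits the block schedule (a hedged illustration of our own; the actual buffer loads and the convolution itself are omitted, and the flush flag marks the iteration whose result is written to the adjacent SLR's DDR):

```python
def block_schedule(H, W, C, N, b_h, b_w, b_c, b_n):
    """Enumerate convolution blocks in the H -> W -> N -> C priority order
    of Algorithm 1. `flush` is True on the iteration that completes the C
    dimension, i.e., when accumulated partial sums leave the OFM buffer
    for the adjacent SLR's DDR instead of staying on chip."""
    for h in range(0, H, b_h):
        for w in range(0, W, b_w):
            for n in range(0, N, b_n):
                for c in range(0, C, b_c):
                    yield (h, w, n, c, c + b_c >= C)

# e.g., a 224x224 layer with C = 64, N = 64, tiled 28x28x16 across 16 PEs
blocks = list(block_schedule(224, 224, 64, 64, 28, 28, 16, 16))
```

In this example the layer decomposes into 8 × 8 × 4 × 4 = 1024 block computations, of which one in four triggers a write-out along the C dimension.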
We adopt a computation priority order of H → W → N → C. This ordering is chosen because the partial results along the C dimension must be accumulated to generate a complete output along the N dimension, which enables the subsequent layer to initiate its computation. In practice, after loading a data block of the input feature map (IFM), we repeatedly load the corresponding weights required for that block. By leveraging the high memory bandwidth provided by the 2.5D FPGA, we are able to sustain continuous pipeline execution without stalling.
At the beginning of computation, the IFM buffer loads a data block of size b_c × b_h × b_w, while the weight buffer loads the corresponding weights of size b_n × b_c × k_y × k_x. Computation proceeds by keeping the b_h × b_w spatial window fixed and iterating over new b_c slices until the entire C dimension is processed. Once this is complete, a new b_n weight block is loaded, the IFM block is reloaded, and the resulting output feature map is written from the output buffer to DDR memory. After completing computation along the N dimension, the input feature map window shifts, first along the W axis and then along the H axis, and the same computation process is repeated. At this point, the convolution core on the adjacent SLR can begin processing the next convolutional layer. Through this iterative mechanism, a block-based layer pipeline is established across SLRs.

2.3. Design Space Exploration

Since each acceleration core processes distinct convolutional layers, the variance in computational workload among convolutional layers leads to divergent startup times across acceleration cores, thereby impacting the overall efficiency of the accelerator. To minimize the overall inference cycle of the accelerator, we develop a resource-constrained model and an inference cycle model, and propose a dedicated design space exploration algorithm to identify the optimal set of configurable parameters. Our optimization objective is to minimize the total number of computation cycles required for inference. To this end, we take a holistic approach that considers the computation latency of individual convolutional layers, the synchronization overhead incurred by inter-die data communication, and the impact of resource utilization. Accordingly, we construct a global computation cycle model for the entire accelerator, as well as a per-die resource consumption model for each accelerator core.

2.3.1. Pipeline Stage Time Modeling

Assuming each acceleration core corresponds to a pipeline stage, its computation cycle count is expressed in Equation (1), where pe_s and pu_s denote the numbers of PEs and PUs allocated to the s-th core, and Cycle_comp represents the function mapping resource allocation to the computation cycles of each layer.
Cycle_comp(k) = K_x × K_y × ⌈H/b_h⌉ × ⌈W/b_w⌉ × ⌈C/pe_s⌉ × ⌈N/pu_s⌉,  s = (k − 1) % 4 ∈ {0, 1, 2, 3}    (1)
Given that adjacent dies exhibit data dependency between convolutional layers, we define the data preparation time between neighboring dies as Cycle_comm(k), which corresponds to the initiation interval between consecutive pipeline stages. The block-based convolution scheme adopted in our design requires that the computation of the subsequent layer can only commence once the current input feature map block ifm[C][b_h^ifm][b_w^ifm] is no longer involved in any computation. Consequently, Cycle_comm(k) is quantified by dividing the total number of multiplications required for the ifm[C][b_h^ifm][b_w^ifm] block by the parallel computing resources of the acceleration core. As formulated in Equation (2), C(k) and N(k) denote the input-channel and output-channel parameters of the k-th convolutional layer, respectively, while K_x and K_y represent the convolutional kernel dimensions.
Cycle_comm(k) = (K_x × K_y × b_h × b_w × C(k) × N(k)) / R_s(dsp)    (2)
The total computation cycle is jointly determined by the bottleneck stage of the pipeline and the initiation intervals between different layers, as formulated in Equation (3), where N denotes the total number of layers in the convolutional neural network.
Cycle_total = max_{1 ≤ k ≤ N} ( Cycle_comp(k) + Σ_{i=1}^{k−1} Cycle_comm(i) )    (3)
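Equations (1)–(3) can be transcribed into a small timing model (our own Python sketch of the formulas above; variable names follow the text, and the example numbers are illustrative only):

```python
import math

def cycle_comp(Kx, Ky, H, W, C, N, b_h, b_w, pe, pu):
    """Eq. (1): computation cycles of one pipeline stage."""
    return (Kx * Ky * math.ceil(H / b_h) * math.ceil(W / b_w)
            * math.ceil(C / pe) * math.ceil(N / pu))

def cycle_comm(Kx, Ky, b_h, b_w, C, N, dsp):
    """Eq. (2): initiation interval between adjacent stages, obtained by
    dividing the multiplications of one IFM block by the core's DSPs."""
    return math.ceil(Kx * Ky * b_h * b_w * C * N / dsp)

def cycle_total(layers):
    """Eq. (3): bottleneck stage plus its accumulated initiation intervals.
    `layers` is a list of (comp_cycles, comm_cycles) tuples, one per layer."""
    return max(comp + sum(c for _, c in layers[:k])
               for k, (comp, _) in enumerate(layers))
```

For instance, with per-layer pairs [(100, 10), (200, 20), (50, 5)], the second stage dominates: 200 cycles of computation plus the 10-cycle interval inherited from the first stage gives a total of 210 cycles.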

2.3.2. Resource Constraint

In FPGA-based convolutional neural network acceleration designs, the resource usage of each die must strictly adhere to the physical limitations of its BRAM and DSP resources. For DSP utilization, the total consumption R_s(dsp) in a single die is governed by the product of the parallelism in the channel (C) and output-channel (N) dimensions, as expressed by Equation (4). This product determines the number of DSP units required to sustain concurrent multiply–accumulate operations, ensuring R_s(dsp) ≤ R_total^DSP, where R_total^DSP represents the die's maximum DSP capacity.
R_s(dsp) = pu_s × pe_s    (4)
Similarly, the BRAM resources per die must satisfy the constraint R_s(bram) ≤ R_total^BRAM. The BRAM consumption of each core, R_s(bram), corresponds to the cumulative allocation for the input feature map (IFM), output feature map (OFM), and weight buffers, as calculated in Equation (5).
R_s(bram) = R_s^ifm(bram) + R_s^ofm(bram) + R_s^weight(bram)    (5)
The individual components R_s^ifm(bram), R_s^ofm(bram), and R_s^weight(bram) are further defined in Equations (6), (7), and (8), respectively.
R_s^ifm(bram) = ⌈(b_c × b_h × b_w × 8 bits) / 18 kb⌉    (6)
R_s^ofm(bram) = ⌈(pu_s × b_h × b_w × (16 + log2(pe_s)) bits) / 18 kb⌉    (7)
R_s^weight(bram) = pu_s × ⌈(b_c × K_x × K_y × 8 bits) / 18 kb⌉    (8)
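The resource model of Equations (4)–(8) can be expressed as a feasibility check (a Python sketch of our own; the 18 kb block size and 8-bit data width follow the text, while the example parameter values are illustrative):

```python
import math

BRAM_BITS = 18 * 1024  # capacity of one 18 kb block RAM

def bram_ifm(b_c, b_h, b_w):
    """Eq. (6): BRAM blocks for one 8-bit b_c x b_h x b_w IFM tile."""
    return math.ceil(b_c * b_h * b_w * 8 / BRAM_BITS)

def bram_ofm(pu, b_h, b_w, pe):
    """Eq. (7): OFM buffer blocks; accumulator width grows with log2(pe)."""
    return math.ceil(pu * b_h * b_w * (16 + math.log2(pe)) / BRAM_BITS)

def bram_weight(pu, b_c, Kx, Ky):
    """Eq. (8): one dedicated weight buffer per PU."""
    return pu * math.ceil(b_c * Kx * Ky * 8 / BRAM_BITS)

def fits(pe, pu, b_c, b_h, b_w, Kx, Ky, dsp_cap, bram_cap):
    """Check the per-die constraints R_s(dsp) <= cap and R_s(bram) <= cap
    from Eqs. (4) and (5)."""
    dsp = pu * pe
    bram = (bram_ifm(b_c, b_h, b_w) + bram_ofm(pu, b_h, b_w, pe)
            + bram_weight(pu, b_c, Kx, Ky))
    return dsp <= dsp_cap and bram <= bram_cap
```

For a 28 × 28 × 16 tile with 3 × 3 kernels, 16 PEs, and 4 PUs, the model yields 6 + 4 + 4 = 14 BRAM blocks and 64 DSPs per core.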

2.3.3. Objective Optimization Function

Both resource types (BRAM and DSP) should avoid over-subscription during hardware mapping while maximizing computational throughput. Since resource allocation and computation cycles are subject to coupled constraints, we formulate the revised objective optimization function as shown in Equation (9).
min Cycle_total = max_{1 ≤ k ≤ N} Cycle_comp(k) + (N − 1) × Cycle_comm
s.t.  R_s(dsp) ≤ R_total^DSP,  R_s(bram) ≤ R_total^BRAM    (9)
The optimized objective function promotes uniform initiation intervals across network layers, which helps minimize variations in critical path delays. With the accelerator core architecture held fixed, the initiation intervals can be adjusted solely by tuning the per-layer tiling parameters. To keep the search computationally tractable, we constrain pe and pu to powers of two, with b_c = C/2^n, and employ a simulated annealing algorithm to obtain the global optimum of the objective function. The solution yields the optimal pe/pu counts for each accelerator core and specifies the input feature map tiling parameters (b_c × b_h × b_w) for each convolutional layer.
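A minimal simulated-annealing loop for this search can be sketched as follows (a generic sketch under stated assumptions: in the real DSE, `evaluate` would return Cycle_total from Equations (3) and (9) for configurations satisfying the resource constraints, and `neighbor` would perturb one pe/pu or tiling parameter; the temperature schedule values are illustrative):

```python
import math
import random

def anneal(evaluate, neighbor, init, t0=1e4, cooling=0.95, steps=2000, seed=0):
    """Generic simulated annealing: minimize evaluate() over configurations.
    Worse moves are accepted with probability exp(-delta / t), so the search
    can escape local minima early and turns greedy as t cools."""
    rng = random.Random(seed)
    cur, cur_cost = init, evaluate(init)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        cost = evaluate(cand)
        if cost < cur_cost or rng.random() < math.exp((cur_cost - cost) / t):
            cur, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        t *= cooling
    return best, best_cost
```

As a self-contained check, minimizing the toy objective (x − 5)² with a ±1 neighbor move converges to x = 5; plugging in the cycle and resource models of Sections 2.3.1 and 2.3.2 turns the same loop into the configuration search described above.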

3. Evaluation Results

3.1. Experimental Setup

Our accelerator design is generated by the DSE algorithm and implemented with Vivado 2023.2 on an AMD Alveo U250 data center acceleration card. All implemented models adopt 8-bit quantization with single-image testing. The U250 accelerator card communicates with the host via a PCIe x16 interface, while the data preprocessing and computational flow control software runs on an Intel Core i7-7700K CPU @ 4.20 GHz.

3.2. FPGA Implementation Results

Figure 3 illustrates the DSP efficiency, bandwidth utilization, and launch interval latency cycles between accelerator cores for each convolutional layer of VGG-16 deployed on the U250 platform, along with an analysis of the relationship between inter-pipeline launch intervals, DSP efficiency, and bandwidth utilization. The DSP utilization curve in the figure demonstrates that, despite the varying bandwidth and computational demands across layers of the VGG-16 model, the DSE algorithm effectively optimizes the convolutional dataflow to maintain DSP utilization above 80% for each accelerator core. Furthermore, by tailoring the dataflow schedule, the DSE algorithm confines the initiation interval deviation among different accelerator cores to within 18K clock cycles, thereby enhancing pipeline efficiency and consequently boosting DSP utilization. As the network depth increases, the frequency of weight switching rises, leading to higher DDR bandwidth utilization per accelerator core. The experimental results confirm that the proposed DSE algorithm significantly improves both DSP efficiency and DDR bandwidth utilization, underscoring its effectiveness in co-optimizing computation and memory access for deep CNNs on FPGA platforms.
Table 1 presents the key accelerator configuration parameters generated by DSE when deploying CNN architectures, including YOLOv3, VGG-16, ResNet-18, and MobileNetV2. It demonstrates that, based on the layer-wise characteristics of different CNN models, DSE adaptively adjusts the pe/pu allocation ratio and BRAM utilization for the convolutional layers processed on each accelerator core to achieve efficient inference. Experimental results demonstrate that the DSE algorithm is capable of automatically generating an optimal pipelined architecture tailored to diverse CNN models. In the baseline configuration, which uses default tiling parameters, the performance corresponds to the entry labeled "Performance before Initiation Interval (II) Optimization" in Table 1. Building upon this initial architecture, the DSE algorithm then determines layer-specific tiling parameters to maximize the inference efficiency of each layer on its assigned accelerator core, yielding the optimized results reported as "Performance after II Minimization" in the same table.
The experiments show that, after computing the optimal architectural configuration, the DSE algorithm further refines the pipeline initiation intervals by adjusting tile sizes across layers. This optimization leads to a 31–64% improvement in overall accelerator throughput. Specifically, on the Xilinx U250 FPGA platform, the optimized implementation achieves a peak throughput of 4860.87 GOPS for VGG-16 at a clock frequency of 250 MHz, and 7528.4 GOPS for MobileNetV2 at 266 MHz. These results underscore the effectiveness of the DSE-driven co-optimization of tiling strategies and pipeline scheduling in unlocking high computational efficiency on FPGA-based accelerators.

3.3. Compare with GPU

To evaluate the computational performance gap between the 2.5D FPGA-based accelerator and a GPU, this work deploys various CNNs on both an NVIDIA A100 GPU and a Xilinx U250 FPGA platform, and measures inference latency and power consumption under identical experimental conditions. The results are analyzed to assess platform-specific trade-offs in efficiency and speed. Table 2 presents the performance comparison between the FPGA accelerator and the GPU for common CNN architectures. The GPU platform utilizes an NVIDIA A100 GPU with 40 GB VRAM, while the FPGA platform employs a Xilinx Alveo U250 accelerator card. Both platforms connect to a host server via PCIe x16 interfaces, with the server configured with an Intel E5-2620 CPU (2.10 GHz base frequency) and 128 GB of system memory. All benchmark networks are implemented using the Caffe deep learning framework [24]. For each model, we measure the end-to-end inference latency and average power consumption when processing a single input image. To ensure a fair comparison, both the GPU and FPGA implementations use INT8 quantization, preserving consistent numerical precision. The evaluation focuses on key metrics including per-image inference latency and average power draw, enabling a direct assessment of computational throughput and energy efficiency across the two hardware platforms.
Experimental results reveal significant disparities in neural network inference performance and energy efficiency between the proposed 2.5D FPGA multi-stage pipelined architecture and the NVIDIA A100 GPU under single-image inference scenarios. Regarding inference latency, the FPGA accelerator demonstrates superior performance for lightweight networks (e.g., MobileNetV3) and shallow architectures (e.g., VGG-16), achieving a 14% latency reduction compared to the GPU in VGG-16 deployment. This advantage stems from two key factors: (1) the multi-stage pipelined architecture leverages 2.5D FPGA’s abundant computational resources and parallel processing capabilities, (2) the DSE algorithm proves particularly effective in optimizing pipeline initiation intervals for shallow network structures. In energy efficiency metrics (FPS/W), the FPGA solution consistently outperforms the A100 across all tested networks, reaching 2× higher efficiency in DarkNet-53 deployment. This highlights the significant benefits of a pipelined architecture combined with an efficient dataflow strategy in substantially reducing redundant data accesses.
Experimental results also reveal certain exceptions. Specifically, for the extremely deep network DenseNet-264, the latency on the FPGA is slightly higher than that on the GPU, indicating that the DSE methodology has limited capability in optimizing the pipeline initiation interval for very deep neural networks. This phenomenon arises because the architectural characteristics of individual layers vary significantly across ultra-deep networks. In particular, the channel dimension (C) of later convolutional layers can be more than 60 times larger than that of earlier layers. Consequently, the number of channel-wise data tiles for feature maps in deeper layers typically increases by a factor of 4 to 8, and the bandwidth demand for weight switching rises by a factor of 2 to 4. Under such highly irregular layer-wise resource and bandwidth requirements, it becomes increasingly difficult for the accelerator core to balance computation and data transfer times through tile size tuning alone, thereby degrading its ability to maintain a consistent and optimal pipeline initiation interval.
The findings demonstrate that layer-pipelined accelerators combined with conventional DSE algorithms achieve higher energy efficiency and lower system latency when mapping shallow or lightweight neural networks. However, the current multi-pipeline DSE optimization strategy exhibits limited effectiveness in optimizing the initiation interval for deep neural networks, highlighting a need for more adaptive scheduling and resource allocation mechanisms tailored to heterogeneous layer structures.

3.4. Comparison with the Existing Pipeline Accelerator

To validate the performance differences between the DSE algorithm's automated neural network deployment and existing design methodologies, this paper selects recent accelerator designs employing layer-pipelined architectures as comparative benchmarks. As shown in Table 3, Huang et al. [25] implemented VGG-16 on the VX980T development board using matrix multiplications with a row-level pipeline, achieving 1000 GOPS throughput. Yu et al. [26] proposed a software-defined computational instruction set on a fixed accelerator architecture to deploy various CNN networks, optimizing DSP efficiency and bandwidth utilization and attaining 0.197 GOPS/DSP for VGG-16 on the XC7Z100. Meanwhile, Yuan et al. [27] and Kim et al. [18] mapped each convolutional layer to an individual processing element and assembled the complete network on a 2.5D FPGA, but both solutions neglected cross-die communication latency in 2.5D FPGAs, resulting in less than 30% utilization of DSP resources.
The experimental results demonstrate that our proposed 4-core pipelined architecture on the Alveo U250 platform exhibits significant performance advantages for VGG-16 deployment. With 8-bit quantization precision, the architecture achieves 4860.87 GOPS at 250 MHz, outperforming Kim et al.'s layer-pipelined solution on a 3-SLR FPGA by 5.8×. By employing cross-die deep pipelining to eliminate inter-die synchronization communication and reduce long-path dependencies, our design achieves 0.766 GOPS/DSP through DSE-optimized dataflow and related optimizations, significantly enhancing operating frequency compared with Yuan et al.'s mapping approach.
These findings confirm that our layer-pipelined accelerator architecture and DSE algorithm effectively mitigate cross-die data supply bottlenecks in multi-die FPGAs, validating the necessity of spatiotemporal joint optimization in high-throughput CNN accelerator design. The proposed 2.5D FPGA-oriented neural network accelerator fully leverages the advantages of high bandwidth and abundant resources in 2.5D FPGAs, expanding their application scenarios and providing valuable design insights for accelerating other computation-intensive algorithms.

4. Conclusions

This paper proposes a layer-pipelined neural network accelerator design for 2.5D FPGAs. We exploit the multi-die architecture of 2.5D FPGAs by implementing an accelerator core on each die, thereby creating a ring structure. Combined with a block-based dataflow, this approach enables the realization of a layer-pipeline accelerator. Additionally, we introduce a DSE algorithm that automatically searches for optimal configuration parameters, facilitating the deployment of various neural networks on specific development boards. Experimental results show that our DSE algorithm effectively maps different CNN architectures onto 2.5D FPGA-based accelerators. Notably, the implementation achieves a VGG-16 inference frame rate of 237.93 FPS on the U250 FPGA platform.

Author Contributions

Conceptualization, C.W.; Methodology, M.W.; Software, M.W.; Data curation, M.W.; Writing—original draft, M.W.; Writing—review & editing, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to legal and confidentiality agreements with industry partners.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242.
2. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
3. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
4. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
5. Miuccio, L.; Panno, D.; Riolo, S. A Wasserstein GAN Autoencoder for SCMA Networks. IEEE Wirel. Commun. Lett. 2022, 11, 1298–1302.
6. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
7. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966.
8. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609.
9. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
10. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
11. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
12. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. arXiv 2024, arXiv:2311.15599.
13. Ding, X.; Zhang, X.; Zhou, Y.; Han, J.; Ding, G.; Sun, J. Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs. arXiv 2022, arXiv:2203.06717.
14. Li, C.; Zhou, A.; Yao, A. Omni-Dimensional Dynamic Convolution. arXiv 2022, arXiv:2209.07947.
15. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. arXiv 2023, arXiv:2301.00808.
16. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545.
17. Blevec, H.L.; Léonardon, M.; Tessier, H.; Arzel, M. Pipelined Architecture for a Semantic Segmentation Neural Network on FPGA. In Proceedings of the 2023 30th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Istanbul, Türkiye, 4–7 December 2023; pp. 1–4.
18. Kim, D.; Jeong, S.; Kim, J.Y. Agamotto: A Performance Optimization Framework for CNN Accelerator with Row Stationary Dataflow. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 2487–2496.
19. Zhu, S.; Miao, M.; Zhang, Z.; Duan, X. Research on A Chiplet-based DSA (Domain-Specific Architectures) Scalable Convolutional Acceleration Architecture. In Proceedings of the 2022 23rd International Conference on Electronic Packaging Technology (ICEPT), Dalian, China, 10–13 August 2022; pp. 1–6.
20. Fang, S.; Zeng, S.; Wang, Y. Optimizing CNN Accelerator With Improved Roofline Model. In Proceedings of the 2020 IEEE 33rd International System-on-Chip Conference (SOCC), Las Vegas, NV, USA, 8–11 September 2020; pp. 90–95.
21. Park, G.; Taing, T.; Kim, H. High-Speed FPGA-to-FPGA Interface for a Multi-Chip CNN Accelerator. In Proceedings of the 2023 20th International SoC Design Conference (ISOCC), Jeju, Republic of Korea, 25–28 October 2023; pp. 333–334.
22. Liu, Y.; Ma, Y.; Zhang, B.; Liu, L.; Wang, J.; Tang, S. Improving the Computational Efficiency and Flexibility of FPGA-based CNN Accelerator through Loop Optimization. Microelectron. J. 2024, 147, 106197.
23. Wang, Z.; Sun, J.; Goksoy, A.; Mandal, S.K.; Liu, Y.; Seo, J.S.; Chakrabarti, C.; Ogras, U.Y.; Chhabria, V.; Zhang, J.; et al. Exploiting 2.5D/3D Heterogeneous Integration for AI Computing. In Proceedings of the 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), Incheon, Republic of Korea, 22–25 January 2024; pp. 758–764.
24. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv 2014, arXiv:1408.5093.
25. Huang, W.; Wu, H.; Chen, Q.; Luo, C.; Zeng, S.; Li, T.; Huang, Y. FPGA-Based High-Throughput CNN Hardware Accelerator with High Computing Resource Utilization Ratio. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4069–4083.
26. Yu, Y.; Wu, C.; Zhao, T.; Wang, K.; He, L. OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 35–47.
27. Yuan, T.; Liu, W.; Han, J.; Lombardi, F. High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 250–263.
Figure 1. The architecture of pipeline-based CNN accelerator. The U250 has four SLRs, each with an independent DDR4 memory. We implement an accelerator core on each SLR, writing the output results to the DDR of the adjacent SLR. This ultimately forms a ring structure.
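The ring organization of Figure 1 can be sketched as a toy simulation: four cores, one "DDR" buffer per SLR, with core i writing its results into the buffer of core (i + 1) mod 4. The stage functions and frame values below are arbitrary placeholders standing in for each core's layer group, not the paper's actual computation:

```python
# Toy simulation of the Figure 1 ring layer-pipeline: each core drains
# its own DDR buffer and writes results to the adjacent SLR's DDR.
NUM_CORES = 4

def run_pipeline(frames, stage_fns):
    ddr = [[] for _ in range(NUM_CORES)]  # one DDR bank per SLR
    ddr[0] = list(frames)                 # frames enter at core 0's DDR
    done = []
    while len(done) < len(frames):
        # Iterate cores in reverse so each frame advances one stage per step,
        # mimicking all four cores working on different frames concurrently.
        for i in reversed(range(NUM_CORES)):
            if ddr[i]:
                y = stage_fns[i](ddr[i].pop(0))  # this core's layer group
                if i == NUM_CORES - 1:
                    done.append(y)  # in hardware this write closes the ring
                else:
                    ddr[i + 1].append(y)  # write to adjacent SLR's DDR
    return done

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
out = run_pipeline([1, 2, 3], stages)  # -> [1, 9, 25]
```

Once the pipeline fills, all four cores process different frames in parallel, which is what makes the steady-state throughput roughly four times that of a single core.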
Figure 2. A process unit (PU) consists of a pair of ping-pong weight caches, a group of PE arrays, an adder tree, an ALU, and pooling logic, all organized within a 3-stage pipeline.
Figure 3. Pipeline initiation interval, DDR bandwidth utilization, and DSP efficiency results for VGG-16 deployment on U250.
Table 1. Implementation results for different CNNs on U250.

Configuration (pe, pu) / BRAM      | YOLOv3             | VGG-16             | ResNet-18          | MobileNetV2
Core0 parameters                   | (16, 64) / 2134.5  | (32, 32) / 1990    | (32, 32) / 1990    | (16, 64) / 2134.5
Core1 parameters                   | (32, 32) / 1990    | (32, 32) / 1990    | (16, 64) / 2134.5  | (32, 64) / 2002.5
Core2 parameters                   | (32, 32) / 1990    | (32, 64) / 2002.5  | (32, 32) / 1990    | (32, 32) / 1990
Core3 parameters                   | (64, 32) / 1954.5  | (32, 64) / 2002.5  | (16, 64) / 2134.5  | (32, 64) / 2002.5
Resolution                         | 416 × 416          | 224 × 224          | 224 × 224          | 1080 × 1080
Frequency (MHz)                    | 300                | 250                | 250                | 266
Performance before II optimization (GOPS) | 3055.6      | 2965.1             | 3255.6             | 4742.9
Performance after II minimization (GOPS) *| 4164.3 (↑ 36.28%) | 4860.87 (↑ 63.93%) | 4280.4 (↑ 31.48%) | 7528.4 (↑ 58.73%)

* ↑ represents the performance improvement (%) achieved by II optimization over the unoptimized design.
Table 2. Performance comparison between U250 layer pipeline accelerator and GPU.

Model        | A100 Delay (ms) | A100 Power (W) | A100 Energy Eff. (FPS/W) | U250 Delay (ms) | U250 Power (W) | U250 Energy Eff. (FPS/W)
VGG-16       | 10.8            | 186            | 0.50                     | 9.5             | 61.7           | 1.7
ResNet-50    | 6.7             | 165            | 0.90                     | 7.2             | 62.8           | 2.22
DarkNet-53   | 14.4            | 198            | 0.35                     | 13.4            | 74.6           | 1.02
DenseNet-264 | 19.0            | 216            | 0.24                     | 21.9            | 94.6           | 0.48
MobileNetV3  | 3.6             | 145            | 1.92                     | 3.6             | 68.3           | 4.03
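The energy-efficiency column of Table 2 follows from the latency and power columns: FPS/W is the frame rate implied by the delay, divided by the measured power. A one-line check (the helper name is ours; small deviations from the table are rounding in the published figures):

```python
# Re-derive Table 2's energy efficiency: frames per second per watt.
def fps_per_watt(delay_ms, power_w):
    # 1000 / delay_ms converts per-frame latency into frame rate (FPS).
    return (1000.0 / delay_ms) / power_w

u250_vgg16 = fps_per_watt(9.5, 61.7)    # ~1.7 FPS/W, as reported
a100_vgg16 = fps_per_watt(10.8, 186.0)  # ~0.50 FPS/W, as reported
```

For VGG-16 this gives roughly a 3.4× energy-efficiency advantage for the U250 over the A100, consistent with the table.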
Table 3. Performance comparison of layer-pipelined accelerator deployment with existing designs for VGG-16.

                          | Huang et al. [25] | Yu et al. [26] | Yuan et al. [27] | Kim et al. [18] | This Work
Platform                  | VX980T            | XC7Z100        | VCU118           | VCU118          | Alveo U250
Precision                 | 8/16 bits         | 8 bits         | 8/4 bits         | 8 bits          | 8 bits
Technology                | 28 nm             | 28 nm          | 16 nm (3SLR)     | 16 nm (3SLR)    | 16 nm (4SLR)
Frequency                 | 150 MHz           | 200 MHz        | 150 MHz          | 200 MHz         | 250 MHz
BRAMs                     | 1492              | 755            | 2045             | 1662            | 7985
DSPs                      | 3395              | 2020           | 4096             | 2286            | 6343
Performance (GOPS)        | 1000              | 397            | 850.5            | 402             | 4860.87
Throughput/DSP (GOPS/DSP) | 0.295             | 0.197          | 0.208            | 0.176           | 0.766
Power                     | 14.36 W           | 17.7 W         | -                | -               | 61.7 W
Citation: Wang, M.; Wu, C. Layer-Pipelined CNN Accelerator Design on 2.5D FPGAs. Electronics 2025, 14, 4587. https://doi.org/10.3390/electronics14234587