1. Introduction
Neural networks are increasingly supplanting numerous traditional algorithms owing to their superior performance. For instance, the YOLO series of object detection networks introduced by Redmon et al. [1,2,3] has become a representative approach in real-time object detection, with the latest YOLOv10 [4] achieving a mean average precision (mAP) of 55.2 at IoU thresholds ranging from 0.5 to 0.95 on the COCO benchmark. Similarly, the Wasserstein Generative Adversarial Network (WGAN) proposed by Miuccio et al. [5] demonstrates a lower Symbol Error Rate (SER) under low signal-to-noise ratio (SNR) conditions, outperforming conventional cascaded modulation-demodulation and channel coding schemes. Furthermore, Transformer-based models such as DeepSeek [6], Qwen [7,8], and LLaMA [9,10,11] have exhibited superhuman efficiency in tasks including machine translation, code generation, and complex planning.
As applications continue to advance, neural network architectures are trending toward greater complexity and increased depth. Recent convolutional neural networks (CNNs) such as those presented in [12,13,14,15,16] employ convolutional kernels larger than 7 × 7 to enhance image recognition capabilities and achieve higher accuracy. However, this architectural evolution entails substantial computational and memory overhead. Consequently, there is growing impetus for the development of dedicated neural network accelerators, among which FPGA-based accelerators have emerged as a research hotspot owing to their flexible reconfigurability and high energy efficiency.
Deploying neural networks on FPGAs is fundamentally constrained by the finite on-chip compute (DSP) and on-chip storage (BRAM) resources. The limited counts of both DSP slices and BRAM blocks in conventional FPGA devices impose a tight upper bound on the attainable throughput, preventing accelerators from achieving their theoretical peak performance.
To achieve higher performance, many researchers have turned to more advanced 2.5D FPGAs with substantially greater resource budgets. These 2.5D FPGAs leverage 2.5D stacked silicon interconnect (SSI) technology to integrate multiple super logic regions (SLRs) within a single package and incorporate multiple DDR memory controllers. Compared to traditional monolithic-die FPGAs, 2.5D FPGAs provide approximately 3 to 4 times more DSP resources and significantly higher memory bandwidth. This architectural enhancement effectively mitigates the computational and memory bottlenecks inherent in conventional single-die FPGAs, thereby facilitating the rapid implementation and deployment of high-performance neural network accelerators.
Nevertheless, the abundance of compute and storage resources in 2.5D FPGAs also introduces new challenges for accelerator design. Although silicon interposer or advanced interconnect technologies significantly increase resource density, inter-SLR (Super Logic Region) communication latency and limited inter-SLR bandwidth have emerged as critical bottlenecks that constrain overall accelerator performance. Existing FPGA-based CNN accelerators are predominantly optimized for traditional monolithic-die FPGAs and have not been specifically tailored to exploit the architectural characteristics of 2.5D FPGAs. Consequently, when deployed on 2.5D FPGAs, these designs often suffer from suboptimal resource utilization and performance inefficiencies, leaving substantial room for improvement.
For example, Blevec et al. [17] proposed a scalable heterogeneous pipelined architecture employing multiple layer-specific hardware-accelerated cores. Although their approach aimed to harness the multi-resource capabilities of 2.5D FPGAs, the unstructured data transfers across different dies introduced significant pipeline startup latency, ultimately degrading overall throughput. Similarly, Kim et al. [18] implemented multiple compute cores distributed across distinct SLRs, with each core capable of executing an entire convolutional computation task. These SLRs were connected to a shared external DRAM controller via a common bus. Their VGG-16 implementation on the VCU118 development platform achieved a throughput of 402 GOPS. While mapping each core entirely within a single SLR helped mitigate timing closure issues associated with cross-SLR placement, their reliance on a shared bus for both inter-SLR data synchronization and DRAM access failed to address the system-level latency arising from weight caching across multiple SLRs.
Moreover, other studies deploying neural networks on 2.5D FPGAs, such as those in [19,20,21,22,23], have largely overlooked the clock frequency degradation induced by cross-SLR data transfer architectures. This oversight has led to severely underutilized DSP resources; for instance, Park et al. [21] reported DSP utilization as low as 7%, highlighting a significant opportunity for architectural co-design and optimization specific to 2.5D FPGA platforms.
To address this issue, in this paper, we propose a layer-pipelined neural network accelerator design for 2.5D FPGAs. The primary contributions of this paper are as follows:
We propose a hierarchical pipeline accelerator tailored for 2.5D FPGA characteristics, implementing one acceleration core per SLR. The silicon interposer between SLRs enables cross-layer transmission of output feature maps, reducing the impact of inter-die communication latency on accelerator performance.
We design a partitioned convolution scheme, leveraging a block-based convolutional computing dataflow to enhance DDR bandwidth utilization in the accelerator. Furthermore, we establish a cross-SLR data transfer temporal model to quantitatively analyze the impact of the convolution partitioning strategy on system performance.
We propose a design space exploration algorithm for the 2.5D layer pipeline. By defining mathematical models for configuration parameters, resource consumption, and computation time, we optimize the accelerator with the objectives of minimizing overall inference time and ensuring consistent pipeline initiation intervals. This algorithm searches for the optimal configuration and block instruction parameters for each acceleration core, maximizing pipeline efficiency.
The rest of this paper is organized as follows.
Section 2 presents the proposed neural network accelerator and its design details.
Section 3 analyzes the experimental results.
Section 4 summarizes this paper.
2. The Multi-Core Accelerator Architecture for 2.5D FPGA
In this section, we first present the computing architecture of the proposed CNN accelerator. Then, we introduce a block convolution scheme for the computing dataflow. Finally, we present the design space exploration to set the optimal configuration in detail.
2.1. Computing Architecture
This section introduces the architecture of a pipeline-based CNN accelerator. As a case study, we consider a 2.5D FPGA comprising four dies to illustrate the design of a layer-pipelined convolutional neural network accelerator. Each of the four SLRs in the 2.5D FPGA is equipped with a dedicated DDR4 controller. Adjacent SLRs are interconnected via thousands of Super Logic Links (SLLs), whose latency is approximately 4 to 8 times higher than that of intra-SLR communication.
As shown in Figure 1a, a configurable accelerator core is deployed on each SLR. For each core, illustrated in Figure 1b, the input feature maps and weights are stored in the local DDR of the current SLR, while the output feature maps are written to the DDR of the adjacent SLR. In particular, the output feature maps generated by the accelerator core on the final SLR (SLR3) are stored in the DDR0 of SLR0. As a result, the inputs and outputs of the accelerator cores across the four dies are connected in an end-to-end manner, forming a ring structure. This ring architecture serves as the hardware foundation for implementing the layer pipeline. The rationale behind adopting this ring structure is that the output bandwidth requirement of each accelerator core is relatively low. By confining each core to its respective SLR, we can maximize resource utilization within each die while minimizing the performance impact caused by the higher latency of inter-SLR communication through SLLs.
Data in any DDR can be accessed via the PCIe bus. During CNN model deployment, convolutional layers are assigned sequentially to each die. Each accelerator core processes one convolutional layer at a time, and data for the subsequent layer is loaded only after the computation of the current layer is completed.
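The ring mapping described above can be sketched in a few lines of Python. Note that the function name, the dictionary layout, and the strictly round-robin layer assignment are our own illustrative assumptions; the text only states that layers are assigned sequentially to dies and that each core writes its output to the adjacent SLR's DDR.

```python
NUM_SLRS = 4  # one accelerator core per SLR on the four-die device

def assign_layers(num_layers, num_slrs=NUM_SLRS):
    """Deal convolutional layers to SLR cores in order (illustrative).

    Core s reads its IFM and weights from the local DDR of SLR s and
    writes its OFM to the DDR of the next SLR, so the four cores form
    a ring: SLR0 -> SLR1 -> SLR2 -> SLR3 -> SLR0.
    """
    plan = []
    for layer in range(num_layers):
        slr = layer % num_slrs
        plan.append({
            "layer": layer,
            "core_slr": slr,                     # core that runs this layer
            "input_ddr": slr,                    # IFM + weights: local DDR
            "output_ddr": (slr + 1) % num_slrs,  # OFM: adjacent SLR's DDR
        })
    return plan
```

Because the output of the core on SLR3 wraps around to DDR0, the next group of layers can start from SLR0 without any extra data movement.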
Each accelerator core comprises a set of processing units (PUs), an input feature map (IFM) buffer, multiple weight buffers, an output feature map (OFM) buffer, and a control module. The IFM and OFM buffers are shared among all PUs and are managed by an instruction controller. Each PU is equipped with its own dedicated weight buffer. Inside each PU, a three-stage processing pipeline is implemented, consisting of convolution, activation, and pooling units, as illustrated in Figure 2.
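The three-stage PU pipeline can be sketched functionally as follows. The activation function (here ReLU), the pooling type (max pooling), and the valid-convolution, stride-1 assumption are our own illustrative choices; the text names the three stages but not their exact variants.

```python
import numpy as np

def pu_forward(ifm_block, weight, pool=2):
    """Sketch of one PU's three-stage pipeline: conv -> ReLU -> max pool.

    ifm_block: (C, H, W) input feature map block
    weight:    (C, KH, KW) one filter (single output channel)
    """
    C, H, W = ifm_block.shape
    _, KH, KW = weight.shape
    OH, OW = H - KH + 1, W - KW + 1
    conv = np.zeros((OH, OW))
    for i in range(OH):                      # stage 1: convolution unit
        for j in range(OW):
            conv[i, j] = np.sum(ifm_block[:, i:i+KH, j:j+KW] * weight)
    act = np.maximum(conv, 0.0)              # stage 2: activation unit (ReLU assumed)
    ph, pw = OH // pool, OW // pool          # stage 3: max-pooling unit (assumed)
    pooled = act[:ph*pool, :pw*pool].reshape(ph, pool, pw, pool).max(axis=(1, 3))
    return pooled
```

In hardware the three stages operate concurrently on successive windows; the sequential version above only illustrates the dataflow through one PU.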
In our design, the number of PEs within each accelerator core, the size of each buffer, and the number of DSPs within each PE are all adjusted based on the characteristics of the convolutional layers running on each accelerator core. This customization helps balance the diverse computational demands across layers and effectively mitigates DSP idle time issues.
2.2. Computational Dataflow for Multi-Stage Pipelines
In our design, we propose the use of block convolution to address data dependencies across dies. The core idea of this block convolution optimization is to partition the feature maps and weights of convolutional layers into independent blocks. Block convolution adopts a split-and-concatenate computing mechanism, where the convolution results of adjacent spatial blocks are independent and free of data dependencies.
We divide each convolutional layer into multiple convolution blocks following the procedure outlined in Algorithm 1, and then perform computations block by block. Specifically, the IFMs are pre-partitioned into fixed-size data blocks, and the weights are divided into correspondingly sized blocks, with the block dimensions determined by the layer parameters. Using these data blocks as basic computation units, the entire convolutional layer is partitioned into several independent convolution blocks. At runtime, one convolution block is computed at a time, and the next block is loaded only after the current one is completed. Since it is not feasible to load all parameters of an entire convolutional layer into on-chip buffers simultaneously, block-wise partitioning significantly reduces buffer space requirements. Moreover, this block-based execution enables each accelerator core to begin computation with only a subset of the input feature maps, thereby facilitating efficient pipelined processing.
| Algorithm 1 Partition the convolution layer into several convolution blocks |

for each block along the H dimension do
    for each block along the W dimension do
        for each block along the N dimension do
            for each block along the C dimension do
                ▷ Partition IFM data blocks and load to the IFM buffer;
                ▷ Partition weight data blocks and load to the weight buffers; each weight buffer loads one data block;
                ▷ We set the number of weight blocks equal to the number of PEs; all PE units share the IFM block;
                if the C dimension computation is complete then
                    ▷ The OFM is stored in the adjacent SLR's DDR;
                else
                    ▷ It is stored as a partial sum in the OFM buffer;
                end if
            end for
        end for
    end for
end for
We adopt a computation priority order in which the C dimension is traversed first, followed by the N, W, and H dimensions. This ordering is chosen because the partial results along the C dimension must be accumulated to generate a complete output along the N dimension, which enables the subsequent layer to initiate its computation. In practice, after loading a data block of the input feature map (IFM), we repeatedly load the corresponding weights required for that block. By leveraging the high memory bandwidth provided by the 2.5D FPGA, we are able to sustain continuous pipeline execution without stalling.
At the beginning of computation, the IFM buffer loads one input data block, while the weight buffer loads the corresponding weight block. Computation proceeds by keeping the spatial window fixed and iterating over new channel slices until the entire C dimension is processed. Once this is complete, a new weight block is loaded, the IFM block is reloaded, and the resulting output feature map is written from the output buffer to DDR memory. After completing computation along the N dimension, the input feature map window shifts, first along the W axis and then along the H axis, and the same computation process is repeated. At this point, the convolution core on the adjacent SLR can begin processing the next convolutional layer. Through this iterative mechanism, a block-based layer pipeline is established across SLRs.
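The accumulation order above (C innermost for partial sums, then a new weight block along N) can be sketched as follows. Spatial (H/W) tiling is omitted for brevity, and the tiling parameters are illustrative; the key property is that the result is independent of the tile sizes, which is what makes the blocks free of data dependencies.

```python
import numpy as np

def block_conv(ifm, weights, tile_c, tile_n):
    """Block convolution sketch: C tiled innermost (partial sums held in
    the OFM buffer), N tiled next (one weight block per N tile).

    ifm:     (C, H, W)
    weights: (N, C, KH, KW)
    Returns the OFM of shape (N, H-KH+1, W-KW+1) (valid conv, stride 1).
    """
    C, H, W = ifm.shape
    N, _, KH, KW = weights.shape
    OH, OW = H - KH + 1, W - KW + 1
    ofm = np.zeros((N, OH, OW))              # stands in for the OFM buffer
    for n0 in range(0, N, tile_n):           # load a new weight block
        for c0 in range(0, C, tile_c):       # accumulate partial sums over C
            for n in range(n0, min(n0 + tile_n, N)):
                for c in range(c0, min(c0 + tile_c, C)):
                    for i in range(OH):
                        for j in range(OW):
                            ofm[n, i, j] += np.sum(
                                ifm[c, i:i+KH, j:j+KW] * weights[n, c])
    return ofm
```

Running this with any tile sizes yields identical outputs, mirroring the split-and-concatenate property of block convolution.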
2.3. Design Space Exploration
Since each acceleration core processes distinct convolutional layers, the variance in computational workload among convolutional layers leads to divergent startup times across acceleration cores, thereby impacting the overall efficiency of the accelerator. To minimize the overall inference cycle of the accelerator, we develop a resource-constrained model and an inference cycle model, and propose a dedicated design space exploration algorithm to identify the optimal set of configurable parameters. Our optimization objective is to minimize the total number of computation cycles required for inference. To this end, we take a holistic approach that considers the computation latency of individual convolutional layers, the synchronization overhead incurred by inter-die data communication, and the impact of resource utilization. Accordingly, we construct a global computation cycle model for the entire accelerator, as well as a per-die resource consumption model for each accelerator core.
2.3.1. Pipeline Stage Time Modeling
Assuming each acceleration core corresponds to a pipeline stage, its computational cycle count T_s is expressed in Equation (1), where P_s denotes the number of PEs/PUs allocated to the s-th core, and f_s represents the function mapping the resource allocation to the computational cycles of each layer.
Given that adjacent dies exhibit data dependencies between convolutional layers, we define the data preparation time between neighboring dies as II_k, which corresponds to the initiation interval between consecutive pipeline stages. The block-based convolution scheme adopted in our design requires that the subsequent layer's computation can only commence once the current input feature map block is no longer involved in computation. Consequently, II_k is quantified by dividing the total number of multiplications required for that block by the parallel computing resources of the acceleration core. As formulated in Equation (2), C_k and N_k denote the channel and batch parameters of the k-th convolutional layer, respectively, while K_h and K_w represent the convolutional kernel dimensions.
The total computation cycle count is jointly determined by the bottleneck stage of the pipeline and the initiation intervals between different layers, as formulated in Equation (3), where N denotes the total number of layers in the convolutional neural network.
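The three models described above can be sketched in LaTeX as follows. This is a reconstruction from the surrounding prose: the symbols T_s, P_s, f_s, II_k, and the IFM tile sizes T_h, T_w are our own notation, and the exact original forms of Equations (1)-(3) may differ.

```latex
% Sketch of Equations (1)-(3), reconstructed from the prose.
\begin{align}
  T_s &= f_s(P_s), &&\text{(1) cycles of the pipeline stage on core } s,\\
  II_k &= \frac{T_h T_w \, C_k N_k \, K_{h,k} K_{w,k}}{P_k},
        &&\text{(2) initiation interval after layer } k,\\
  T_{\mathrm{total}} &= \max_{1 \le s \le S} T_s \;+\; \sum_{k=1}^{N-1} II_k,
        &&\text{(3) total inference cycles.}
\end{align}
```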
2.3.2. Resource Constraint
In FPGA-based convolutional neural network acceleration designs, the resource consumption of each die must strictly adhere to the physical limitations of its BRAM and DSP resources. For DSP utilization, the total consumption D_s in a single die is governed by the product of the parallelism in the channel (C) and batch (N) dimensions, as expressed by Equation (4). This product determines the number of DSP units required to sustain concurrent multiply–accumulate operations, ensuring D_s ≤ D_max, where D_max represents the die's maximum DSP capacity.
Similarly, the BRAM resources per die must adhere to the constraint B_s ≤ B_max. The BRAM consumption of each core, B_s, corresponds to the cumulative allocation for the input feature map (IFM), output feature map (OFM), and weight buffers. The quantitative relationship for this allocation is given by Equation (5), and the individual components B_IFM, B_OFM, and B_W are further defined in Equations (6), (7), and (8), respectively.
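Under the same caveat as above, the resource constraints can be sketched as follows; the parallelism factors P_C and P_N, the tile sizes T_*, and the data widths b and b_acc are our own assumed notation rather than the original symbols.

```latex
% Sketch of Equations (4)-(8); tile symbols and bit widths are assumed.
\begin{align}
  D_s &= P_C \times P_N \le D_{\max}, &&\text{(4) DSP budget per die,}\\
  B_s &= B_{\mathrm{IFM}} + B_{\mathrm{OFM}} + B_{W} \le B_{\max}, &&\text{(5)}\\
  B_{\mathrm{IFM}} &= T_h \, T_w \, T_c \cdot b, &&\text{(6) IFM buffer,}\\
  B_{\mathrm{OFM}} &= T'_h \, T'_w \, T_n \cdot b_{\mathrm{acc}}, &&\text{(7) OFM buffer,}\\
  B_{W} &= T_c \, T_n \, K_h K_w \cdot b. &&\text{(8) weight buffers.}
\end{align}
```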
2.3.3. Objective Optimization Function
Both resource types (BRAM and DSP) must avoid over-subscription during hardware mapping while maximizing computational throughput. Since resource allocation and computation cycles are subject to coupled constraints, we formulate the revised objective optimization function as shown in Equation (9).
The optimized objective function promotes uniform initiation intervals across network layers, which minimizes variations in critical path delays. Under the premise of maintaining a fixed accelerator core architecture, the initiation intervals can be adjusted solely by modifying the tiling parameters of each layer. To enhance computational tractability, we constrain the tiling and parallelism parameters to powers of two and employ a simulated annealing algorithm to obtain the global optimum of the objective function. The solution yields the optimal PE/PU counts per accelerator core and specifies the input feature map tiling parameters for each convolutional layer.
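A minimal sketch of this search is given below. The power-of-two choices follow the text, but the cost functions, the linear cooling schedule, and the single-parameter mutation move are our own simplifications; the actual DSE couples the objective to the resource and timing models of Equations (1)-(9).

```python
import math
import random

def total_cycles(tiles, stage_cycles, init_interval):
    """Objective sketch: bottleneck stage plus the sum of initiation intervals."""
    return max(stage_cycles(t) for t in tiles) + sum(
        init_interval(t) for t in tiles[:-1])

def anneal_tiling(num_layers, stage_cycles, init_interval,
                  choices=(1, 2, 4, 8, 16, 32), steps=5000, t0=1e4, seed=0):
    """Simulated annealing over power-of-two tile sizes, one per layer."""
    rng = random.Random(seed)
    tiles = [rng.choice(choices) for _ in range(num_layers)]
    cur = total_cycles(tiles, stage_cycles, init_interval)
    best_tiles, best = list(tiles), cur
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9          # linear cooling
        cand = list(tiles)
        cand[rng.randrange(num_layers)] = rng.choice(choices)  # mutate one layer
        cost = total_cycles(cand, stage_cycles, init_interval)
        # accept improvements always; uphill moves with Boltzmann probability
        if cost < cur or rng.random() < math.exp((cur - cost) / temp):
            tiles, cur = cand, cost
        if cur < best:
            best_tiles, best = list(tiles), cur
    return best_tiles, best
```

With toy cost models that trade buffer reuse against initiation interval, the annealer converges to a tiling that beats the naive all-ones configuration by a wide margin.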
3. Evaluation Results
3.1. Experimental Setup
Our accelerator design is generated by the DSE algorithm and implemented with Vivado 2023.2 on an AMD Alveo™ U250 data center acceleration card. All implemented models adopt 8-bit quantization with single-image testing. The U250 accelerator card communicates with the host via a PCIe x16 interface, while the data preprocessing and computational flow control software run on an Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz.
3.2. FPGA Implementation Results
Figure 3 illustrates the DSP efficiency, bandwidth utilization, and launch interval latency cycles between accelerator cores for each convolutional layer of VGG-16 deployed on the U250 platform, along with an analysis of the relationship between inter-pipeline launch intervals, DSP efficiency, and bandwidth utilization. The DSP utilization curve in the figure demonstrates that, despite the varying bandwidth and computational demands across layers of the VGG-16 model, the DSE algorithm effectively optimizes the convolutional dataflow to maintain DSP utilization above 80% for each accelerator core. Furthermore, by tailoring the dataflow schedule, the DSE algorithm confines the initiation interval deviation among different accelerator cores to within 18K clock cycles, thereby enhancing pipeline efficiency and consequently boosting DSP utilization. As the network depth increases, the frequency of weight switching rises, leading to higher DDR bandwidth utilization per accelerator core. The experimental results confirm that the proposed DSE algorithm significantly improves both DSP efficiency and DDR bandwidth utilization, underscoring its effectiveness in co-optimizing computation and memory access for deep CNNs on FPGA platforms.
Table 1 presents the key accelerator configuration parameters generated by DSE when deploying CNN architectures including YOLOv3, VGG-16, ResNet-18, and MobileNetV2. It demonstrates that, based on the layer-wise characteristics of different CNN models, DSE adaptively adjusts the PE/PU allocation ratio and BRAM utilization for the convolutional layers processed on each accelerator core to achieve efficient inference. Experimental results demonstrate that the DSE algorithm is capable of automatically generating an optimal pipelined architecture tailored to diverse CNN models. In the baseline configuration (using default tiling parameters), the performance corresponds to the entry labeled "Performance before Initiation Interval (II) Optimization" in Table 1. Building upon this initial architecture, the DSE algorithm dynamically determines layer-specific tiling parameters to maximize the inference efficiency of each layer on its assigned accelerator core, yielding the optimized results reported as "Performance after II Minimization" in the same table.
The experiments show that, after computing the optimal architectural configuration, the DSE algorithm further refines the pipeline initiation intervals by adjusting tile sizes across layers. This optimization leads to a 31–64% improvement in overall accelerator throughput. Specifically, on the Xilinx U250 FPGA platform, the optimized implementation achieves a peak throughput of 4860.87 GOPS for VGG-16 at a clock frequency of 250 MHz, and 7528.4 GOPS for MobileNetV2 at 266 MHz. These results underscore the effectiveness of the DSE-driven co-optimization of tiling strategies and pipeline scheduling in unlocking high computational efficiency on FPGA-based accelerators.
3.3. Compare with GPU
To evaluate the computational performance gap between the 2.5D FPGA-based accelerator and a GPU, this work deploys various CNNs on both an NVIDIA A100 GPU and a Xilinx U250 FPGA platform, and measures inference latency and power consumption under identical experimental conditions. The results are analyzed to assess platform-specific trade-offs in efficiency and speed.
Table 2 presents the performance comparison between the FPGA accelerator and the GPU for common CNN architectures. The GPU platform utilizes an NVIDIA A100 GPU with 40 GB VRAM, while the FPGA platform employs a Xilinx Alveo U250 accelerator card. Both platforms connect to a host server via PCIe x16 interfaces, with the server configured with an Intel E5-2620 CPU (2.10 GHz base frequency) and 128 GB system memory. All benchmark networks are implemented using the Caffe deep learning framework [24]. For each model, we measure the end-to-end inference latency and average power consumption when processing a single input image. To ensure a fair comparison, both the GPU and FPGA implementations use INT8 quantization, preserving consistent numerical precision. The evaluation focuses on key metrics including per-image inference latency and average power draw, enabling a direct assessment of computational throughput and energy efficiency across the two hardware platforms.
Experimental results reveal significant disparities in neural network inference performance and energy efficiency between the proposed 2.5D FPGA multi-stage pipelined architecture and the NVIDIA A100 GPU under single-image inference scenarios. Regarding inference latency, the FPGA accelerator demonstrates superior performance for lightweight networks (e.g., MobileNetV3) and shallow architectures (e.g., VGG-16), achieving a 14% latency reduction compared to the GPU in the VGG-16 deployment. This advantage stems from two key factors: (1) the multi-stage pipelined architecture leverages the 2.5D FPGA's abundant computational resources and parallel processing capabilities, and (2) the DSE algorithm proves particularly effective in optimizing pipeline initiation intervals for shallow network structures. In terms of energy efficiency (FPS/W), the FPGA solution consistently outperforms the A100 across all tested networks, reaching 2× higher efficiency in the DarkNet-53 deployment. This highlights the significant benefits of a pipelined architecture combined with an efficient dataflow strategy in substantially reducing redundant data accesses.
Experimental results also reveal certain exceptions. Specifically, for the extremely deep network DenseNet-264, the latency on the FPGA is slightly higher than that on the GPU, indicating that the DSE methodology has limited capability in optimizing the pipeline initiation interval for very deep neural networks. This phenomenon arises because the architectural characteristics of individual layers vary significantly across ultra-deep networks. In particular, the channel dimension (C) of later convolutional layers can be more than 60 times larger than that of earlier layers. Consequently, the number of channel-wise data tiles for feature maps in deeper layers typically increases by a factor of 4 to 8, and the bandwidth demand for weight switching rises by 2 to 4 times. Under such highly irregular layer-wise resource and bandwidth requirements, it becomes increasingly difficult for the accelerator core to balance computation and data transfer times through tile size tuning alone, thereby degrading its ability to maintain a consistent and optimal pipeline initiation interval.
The findings demonstrate that layer-pipelined accelerators combined with conventional DSE algorithms achieve higher energy efficiency and lower system latency when mapping shallow or lightweight neural networks. However, the current multi-pipeline DSE optimization strategy exhibits limited effectiveness in optimizing the initiation interval for deep neural networks, highlighting a need for more adaptive scheduling and resource allocation mechanisms tailored to heterogeneous layer structures.
3.4. Comparison with the Existing Pipeline Accelerator
To validate the performance differences between the DSE algorithm's automated neural network deployment and existing design methodologies, this paper selects recent accelerator designs employing layer-pipelined architectures as comparative benchmarks. As shown in Table 3, the performance comparison reveals that Huang et al. [25] implemented VGG-16 on the VX980T development board using matrix multiplications and a row-level pipeline, achieving 1000 GOPS throughput. Yu et al. [26] proposed a software-defined computational instruction set on a fixed accelerator architecture to deploy various CNN networks, optimizing DSP efficiency and bandwidth utilization and attaining 0.197 GOPS/DSP for VGG-16 on the XC7Z100. Meanwhile, Yuan et al. [27] and Kim et al. [18] mapped each convolutional layer to an individual processing element and assembled the complete network on a 2.5D FPGA, but both solutions neglected cross-die communication latency in 2.5D FPGAs, resulting in less than 30% utilization of DSP resources.
The experimental results demonstrate that our proposed 4-core pipelined architecture on the Alveo U250 platform exhibits significant performance advantages for VGG-16 deployment. With 8-bit quantization precision, the architecture achieves 4860.87 GOPS at 250 MHz, outperforming Kim et al.'s layer-pipelined solution on a 3-SLR FPGA by 5.8×. By employing cross-die deep pipelining to eliminate inter-die synchronization communication and reduce long-path dependencies, our design achieves 0.766 GOPS/DSP through DSE-optimized dataflow and other technical optimizations, significantly enhancing the operating frequency compared to Yuan et al.'s mapping approach.
These findings confirm that our layer-pipelined accelerator architecture and DSE algorithm effectively mitigate cross-die data supply bottlenecks in multi-die FPGAs, validating the necessity of spatiotemporal joint optimization in high-throughput CNN accelerator design. The proposed 2.5D FPGA-oriented neural network accelerator fully leverages the advantages of high bandwidth and abundant resources in 2.5D FPGAs, expanding their application scenarios and providing valuable design insights for accelerating other computation-intensive algorithms.