Article

AHA: Design and Evaluation of Compute-Intensive Hardware Accelerators for AMD-Xilinx Zynq SoCs Using HLS IP Flow

by David Berrazueta-Mena 1 and Byron Navas 1,2,3,*
1 Departamento de Eléctrica, Electrónica y Telecomunicaciones, Universidad de las Fuerzas Armadas ESPE, Sangolqui 171103, Ecuador
2 Grupo de Investigación en Sistemas Inteligentes (WiCOM-Energy), Universidad de las Fuerzas Armadas ESPE, Sangolqui 171103, Ecuador
3 Grupo de Investigación EmbSys, Universidad de las Fuerzas Armadas ESPE, Sangolqui 171103, Ecuador
* Author to whom correspondence should be addressed.
Computers 2025, 14(5), 189; https://doi.org/10.3390/computers14050189
Submission received: 4 March 2025 / Revised: 29 March 2025 / Accepted: 1 April 2025 / Published: 13 May 2025

Abstract:
The increasing complexity of algorithms in embedded applications has amplified the demand for high-performance computing. Heterogeneous embedded systems, particularly FPGA-based systems-on-chip (SoCs), enhance execution speed by integrating hardware accelerator intellectual property (IP) cores. However, traditional low-level IP-core design presents significant challenges. High-level synthesis (HLS) offers a promising alternative, enabling efficient FPGA development through high-level programming languages. Yet, effective methodologies for designing and evaluating heterogeneous FPGA-based SoCs remain crucial. This study surveys HLS tools and design concepts and presents the development of the AHA IP cores, a set of five benchmarking accelerators for rapid Zynq-based SoC evaluation. These accelerators target compute-intensive tasks, including matrix multiplication, Fast Fourier Transform (FFT), Advanced Encryption Standard (AES), Back-Propagation Neural Network (BPNN), and Artificial Neural Network (ANN). We establish a streamlined design flow using AMD-Xilinx tools for rapid prototyping and testing FPGA-based heterogeneous platforms. We outline criteria for selecting algorithms to improve speed and resource efficiency in HLS design. Our performance evaluation across various configurations highlights performance–resource trade-offs and demonstrates that ANN and BPNN achieve significant parallelism, while AES optimization increases resource utilization the most. Matrix multiplication shows strong optimization potential, whereas FFT is constrained by data dependencies.

1. Introduction

Heterogeneous computing systems can enhance overall computing performance and energy efficiency by leveraging the specific architectural features of various processing cores, such as central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs) [1]. Consequently, software tasks are allocated to the hardware cores where execution time is minimized due to the particular characteristics of the core architecture. Over the past decade, the emergence of heterogeneous systems embedded in a system-on-chip (SoC) has attracted attention in academia and industry. This interest arises from the feasibility of utilizing a small-form-factor chip while simultaneously meeting the growing demand for compute-intensive applications, which exhibit limited performance when executed on traditional architectures (e.g., symmetric multicore CPUs). For instance, heterogeneous embedded systems are employed in the implementation of applications such as autonomous machines, artificial intelligence, and the Internet of Things [2].
Arguably, FPGAs may be considered the preferred type of heterogeneous processing core. For example, in a survey of 201 papers on hardware implementations (FPGA, GPU, and application-specific integrated circuit (ASIC)) of AI and ML algorithms from 2009 to 2019, Talib et al. [3] demonstrated that the majority of papers (66.7%) utilized FPGA-based systems for complex AI algorithms, owing to their relatively accessible prototyping environment. FPGAs occupy an intermediate position among CPUs, GPUs, and ASICs in terms of flexibility, reconfigurability, and efficiency [4]. This characteristic renders FPGAs valuable for the design and evaluation of novel accelerators for AI applications.
Hardware acceleration in virtualized radio access networks (vRANs) enhances performance by offloading compute-intensive tasks, such as Layer 1 forward error correction (FEC), enabling flexibility for evolving 4G/5G workloads [5,6]. The transition towards vRAN and Open-RAN (O-RAN) architectures requires integrating FPGAs and other accelerators like ASICs and GPUs [7]. High-level synthesis (HLS) allows software developers to implement complex algorithms without deep hardware expertise, enabling faster deployment and scaling of telecom networks.
Continuous advancements in integration scales within the semiconductor industry have facilitated the incorporation of more processing cores of diverse types (e.g., CPUs, FPGA hardware accelerator cores, and GPUs) within a single FPGA chip. Consequently, the development of FPGA-based multiprocessor systems-on-chip (MPSoCs) has become feasible. However, design complexity in such systems has also increased, particularly when traditional methodologies and hardware description languages (HDLs) remain the primary approach. For instance, it is necessary to incorporate interconnection ports and protocol logic into hardware accelerator cores that communicate with host processor cores through an advanced interconnection fabric (e.g., AMBA AXI). While this process is essential, it is also intricate and time-intensive. Consequently, identifying an automated method to instantiate this additional supporting logic remains an ongoing research challenge.
This increasing complexity has rendered traditional design approaches less feasible. Consequently, academia and industry have dedicated efforts to investigating and developing new design methodologies and tools to elevate the design abstraction level, particularly through high-level languages (HLLs) for hardware design [8,9]. HLS represents an approach in electronic design automation (EDA) that facilitates efficient modeling and design of systems by describing behavior using HLLs (e.g., C, C++, SystemC, OpenCL) [10,11]. Furthermore, HLS can enhance the automation of verification and optimization processes in designs utilizing register transfer level (RTL) descriptions, reducing development time [12].
HLS tools facilitate the automatic or semi-automatic generation of register transfer level (RTL) architectures from HLLs [8]. The resultant RTL design comprises a datapath, which incorporates functional units such as multipliers, adders, multiplexers, and registers for data processing, alongside a finite state machine (FSM) that governs execution flow and synchronizes operations [13]. The HLS process consists of several stages. Initially, compilation and modeling transform the high-level input description into an intermediate representation, typically a control and data flow graph (CDFG), where nodes denote operations and edges delineate data dependencies [8]. Scheduling then assigns each operation within the CDFG to a specific clock cycle [14], identifying opportunities for parallelism while adhering to data dependencies [15]. Allocation involves selecting the necessary hardware resources, such as registers, arithmetic units, and memory, while considering trade-offs in performance, area, and power [8]. Binding maps the scheduled operations to specific hardware components, facilitating resource sharing through multiplexing techniques [13]. These processes are crucial for converting software-defined algorithms into efficient hardware implementations. Nonetheless, performance optimization remains challenging, as HLS-generated designs require careful application of pragmas and directives to enhance pipelining, loop unrolling, and memory partitioning, thereby improving parallelism and resource utilization [16].
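As a minimal illustration of how such directives steer these stages, consider the following hedged Vivado HLS C++ sketch; the function, loop bound, and labels are illustrative and do not correspond to any of the IP cores presented later. The PIPELINE directive asks the scheduler to overlap loop iterations, while ARRAY_PARTITION gives the allocation and binding stages additional memory ports.

```cpp
// Hypothetical dot-product kernel illustrating how HLS directives influence
// scheduling (pipelining) and binding (number of memory ports).
#define N 32

int dot_product(const int a[N], const int b[N]) {
    // Splitting the arrays into individual elements exposes more read ports,
    // so the multiply-accumulate operations of different iterations can overlap.
#pragma HLS ARRAY_PARTITION variable=a complete
#pragma HLS ARRAY_PARTITION variable=b complete
    int acc = 0;
MAC_LOOP:
    for (int i = 0; i < N; i++) {
        // Ask the scheduler to start a new iteration every clock cycle,
        // provided data dependencies allow it.
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}
```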
Prior to the late 1960s, integrated circuits (ICs) were designed manually. Subsequently, gate-level simulations emerged, and techniques like place-and-route, formal verification, and static timing analysis were introduced in the 1980s. Hardware description languages (HDLs), such as Verilog and the very high-speed integrated circuit HDL (VHDL), facilitated the adoption of simulation tools and logic synthesis. The 1990s saw commercial HLS tools and research in hardware–software codesign. The concepts of the intellectual property (IP) core and platform-based design also emerged. In the 2000s, there was a shift towards electronic system-level (ESL) design, facilitating the exploration, synthesis, and verification of complex systems-on-chip (SoCs). This involved using languages with system-level abstractions, such as SystemC, SpecC, or SystemVerilog, and introducing transaction-level modeling (TLM) [8].
In heterogeneous SoCs based on FPGAs, compute-intensive kernels are synthesized as accelerators. Prior to HLS, hardware developers manually converted compute-intensive HLL functions into custom accelerators using HDLs (e.g., VHDL or Verilog). Academic researchers have since described hardware accelerators at more abstract levels using HLLs and have developed tools to automate HLL-to-RTL conversion. Currently, compute-intensive code sections are written as functions using a subset of an HLL for FPGA implementation. C and C-based languages are prevalent due to their widespread adoption.
Recent research focuses on expressing applications in HLLs suitable for various platforms, including domain-specific languages (DSLs) and data-flow representations [17]. DSLs facilitate development by enabling the identification and exploitation of customizations for specific application domains [18]. In the DSL approach, the designer rewrites compute-intensive HLL code using a DSL embedded in an HLL (e.g., C++, Python, Scala, Lua) that generates HLS code (e.g., C++, CUDA, OpenCL). A DSL compiler or synthesizable code generator then translates it into HLS code for RTL description generation [17]. Notable DSLs for image processing include HeteroHalide [19], GENESIS [20], and Rigel [21]; for ML applications, they include OptiML [22] and HeteroCL [23].
The ultimate goal of HLS, a tool that automatically searches the design space without the designer's guidance and produces optimal solutions for various design constraints and technologies [8], remains under investigation. No existing tool can fully automate the deployment of general C/C++ code on a multicore heterogeneous system of FPGAs and CPUs. The generation of synthesizable HLS code from HLL code that uses pointers to pointers, function pointers, recursive functions, or dynamic memory allocation exemplifies these capability gaps [17].
However, HLS tools (the terms HLS tool and HLS framework are often used interchangeably, but a framework is generally a structured software solution that integrates several components, such as compilers, libraries, and HLS tools, into its design flow), methodologies, languages, and models continue to undergo development. Some HLS tools are open source and may be considered academic tools [18,24], such as LegUp, DWARV, BAMBU, Dynamatic [25], ScaleHLS [26], and MLIR [27]. Other solutions are proprietary and expensive but provide more robust design flows and tools (e.g., Stratus HLS) [28]. EDA tools employing HLS with automated flows are limited to a restricted set of libraries and third-party FPGA boards (e.g., MathWorks HDL Coder and National Instruments LabVIEW FPGA) [29].
AMD-Xilinx Vivado HLS is a promising commercial solution that facilitates design, C validation, RTL verification, exploration, and optimization of hardware modules in an FPGA; however, it still requires manual interaction during initial prototype development. It is necessary to explore its design flow and devise methods to reuse and automate parts of this flow to expedite the development process for creating hardware accelerator IP-cores, enabling a wider audience to target FPGA-based heterogeneous platforms.
In this context, this study presents an improved methodology to efficiently develop high-performance algorithm hardware accelerator (AHA) cores and heterogeneous SoCs using Zynq devices and Vivado HLS. Consequently, time can be invested in exploring advanced architectural features of heterogeneous SoCs. The main contributions of this study are as follows:
  • A suite of five AHA IP-cores for rapid evaluation and benchmarking of Zynq-based heterogeneous SoCs. These IP-cores accelerate a group of customized, compute-intensive algorithms: (i) matrix multiplication, (ii) fast Fourier transform (FFT), (iii) advanced encryption standard (AES), (iv) back-propagation neural network (BPNN), and (v) artificial neural network (ANN).
  • Performance evaluation and optimization of hardware and software implementation for these algorithms using various configurations of IP-cores in Zynq FPGA-based SoCs.
  • Development of custom Tool Command Language (Tcl) scripts for the Vivado Design Suite to automate the generation of hardware accelerator IP-cores and Zynq FPGA-based SoCs, reducing the non-recurring engineering (NRE) cost for less experienced designers.
  • Documented and streamlined design flow for developing heterogeneous SoCs and hardware accelerator IP-cores in Zynq FPGAs using Vivado HLS.
  • A concise survey of the state-of-the-art in HLS, summarizing current prominent tools and principal HLS concepts.
The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 describes the proposed methodology. Our semi-automated design flow for creating SoCs and hardware accelerator IP cores in Zynq FPGAs is presented in Section 4. A suite of five benchmarking AHA IP cores is introduced in Section 5. The experimental setup, encompassing the configuration and evaluation parameters of the implemented SoC architectures, is presented in Section 6. The performance exploration and results are explicated in Section 7. The discussion and design recommendations are presented in Section 8. Finally, the conclusions and future work are summarized in Section 9.

2. HLS Tools and Related Work

HLS tools may mitigate the problem of generating RTL hardware descriptions. However, the automatic generation of these descriptions from HLLs remains an arduous process. HLS tools have to deal with generating custom and optimized hardware architectures as fully timed implementations for each behavioral description [8], in contrast with compilers that perform the less difficult task of converting HLLs into assembly language for specific computer architectures. Therefore, several methodologies have been developed, and new solutions have been proposed [24,30]. In Section 2.1, we analyze the relevant current tools. In Section 2.2, we discuss studies that evaluate existing HLS tools or developed methods to generate FPGA-based SoCs.

2.1. Relevant High-Level Synthesis Tools

An overview of representative HLS tools is presented in Table 1, highlighting their main features and release years. Seminal works on HLS produced the initial tools in the 1990s. However, these tools failed to achieve widespread adoption due to several limitations, including the use of behavioral HDLs as design entries, poor quality of results (QoRs) (e.g., excessive hardware resource utilization), and complexity in the functional verification of hardware implementations [31]. Nevertheless, during the 2000s, the increasing interest and necessity for developing tools for efficient hardware design facilitated the rapid growth of the HLS market, focusing on ASICs and FPGAs (e.g., AutoESL AutoPilot, Bluespec, Cadence C-to-Silicon, Catapult C Synthesis, Celoxica Agility, Synopsys Synplicity Synplify DSP, and Xilinx AccelDSP) [31].
Currently, HLS is gaining acceptance due to three primary factors. (i) Most tools utilize HLLs or extensions as inputs. These high-level languages are readily accessible to most engineers and researchers, enabling them to create FPGA designs without being limited to HDLs. (ii) The quality of the final implementation is high, and it is possible to optimize the generated RTL description (e.g., performance, resource utilization, and energy consumption) through manipulation of the synthesis process. (iii) The design and verification times have been significantly reduced, which is a crucial factor in addressing challenges such as time-to-market [31]. At present, numerous tools can produce high-quality implementations, and their performance results vary according to their internal optimizations [24]. Some of the prominent commercial tools include Bluespec, Catapult HLS, FPGA SDK for OpenCL, HDL Coder, Stratus HLS, Synphony HLS, and Vivado HLS. Conversely, notable academic tools include Kiwi, LegUp (Acquired by Microchip Technology in 2020), PandA-Bambu, and ROCCC (Table 1).
AMD-Xilinx and Intel (Altera) are currently the two primary vendors of FPGA technology and employ two distinct HLS flow approaches. Intel OpenCL aims to develop accelerators by utilizing software and abstracting away hardware details behind a high-level view. Xilinx's Vivado HLS offers a more efficient method for developing hardware using a familiar software language. Xilinx employs OpenCL for the FPGA–host interface, implementing the compiler as a minimal wrapper around the C++ compiler, while Intel adopts the OpenCL paradigm as its frontend. Vivado HLS enhances control by coupling the HLS source code with the hardware, necessitating additional annotations and boilerplate code. Intel OpenCL provides abstracted views, reduces boilerplate code, and implements efficient substitutions by identifying common patterns [32].
In recent years, the advancement of HLS technology has not only been characterized by the development of conventional tools or compilers but has also diversified into several key components to enhance productivity, such as models of computation, architectural templates, HLL, DSL, methodologies, and comprehensive frameworks. Novel frameworks and techniques have been frequently utilized to accelerate design space exploration (DSE), improve optimization, and enhance formal verification, metrics, and power estimation. The design flow of most frameworks incorporates off-the-shelf and academic HLS tools. A comprehensive survey of DSE techniques is available in [33]. A survey of models, methodologies, and frameworks for metric estimation, FPGA-based DSE, and power consumption estimation of FPGA/SoC was presented in [34]. A recent survey of HDLs, HLS tools, and DSLs is presented in [30].
Next, we analyze some of these frameworks, organized according to their primary focus.
  • Methods for efficient HLS design space exploration: The application of various tool-specific synthesis directives results in an extensive design space of unique microarchitectures. The challenge of identifying an optimal design that provides an appropriate balance among performance, power, and resource utilization is complex. Consequently, several researchers have proposed novel solutions, including a microarchitecture as an accelerator design template [35] and predictive models and methods to determine optimal microarchitectures [36,37,38].
  • Further optimization techniques: In numerous instances, initial HLS implementations require substantial optimization before they can effectively utilize the capabilities of hardware acceleration. Typically, the application of synthesis directives is based on a heuristic approach. Researchers have established guidelines and principles for optimizing HLS designs that can be applied manually [32,39] or by a specialized compiler based on a library of abstractions [40].
  • Formal verification methods: Formal verification of the C/C++ code against the RTL implementation is essential to complement HLS tools. For instance, Vivado HLS simulation-based verification can only identify a finite number of errors. Furthermore, utilizing frameworks that introduce additional hardware transformations and are not closely integrated may result in the synthesis of RTL designs that are susceptible to errors. Methodologies for formal verification employing equivalence and model checking have been proposed in [41,42].
  • Hardware security methods: Protection of IP cores generated by HLS is a significant concern in the semiconductor industry. Recent studies have proposed watermarking [43] and birthmarking [44] techniques to safeguard against hardware theft, utilizing either synthesis directives or code transformations that facilitate the detection of violated IP copies.
  • DSE frameworks for FPGA SoCs: Various frameworks for FPGA SoCs are based on the study of HLS directives because they can exponentially affect design space exploration. The primary objective is to identify a set of Pareto-optimal hardware solutions that account for the finite resources (e.g., LUT, BRAM, DSP, and FF) available in the reconfigurable fabric of the FPGA [45]. Research in DSE has undergone continuous evolution and now incorporates novel approaches, including the utilization of ML. Examples of these DSE frameworks include Prospector [46], MPSeeker, COMBA [47], IronMan-Pro [48], Sherlok [49], FlexWalker [50], VeriPy [51], HLSFactory [52], and CollectiveHLS [53].
In this section, we present an overview of the most relevant classical and contemporary HLS tools and novel approaches along with their salient characteristics. A comprehensive evaluation and survey of these tools is not within the scope of our work; however, the reader may find additional information in [17]. Although HLS tools are currently widely adopted by experienced designers and researchers, the complexity of the automatic generation of heterogeneous multi-core systems remains substantial and presents significant challenges. Consequently, the enhancement of HLS tools or methodologies that optimize or facilitate the utilization of existing tools to simplify the design, verification, and integration of FPGA-based accelerator IP cores and SoCs is a current research topic [15].

2.2. Studies That Evaluate or Develop HLS Tools

This section summarizes relevant studies that either evaluate the performance, usability, and effectiveness of current HLS tools or introduce new HLS design frameworks. In the article “A Survey and Evaluation of FPGA High-Level Synthesis Tools”, Nane et al. [24] presented a comprehensive review of several HLS tools and evaluated the performance and resource utilization of one commercial tool and three academic tools using a set of benchmarks in C language. In the article “An overview of today’s high-level synthesis tools”, Meeus et al. [10] conducted an evaluation of a wide selection of HLS tools in terms of their usability and quality of results. Numan et al. [17] conducted an extensive study on the automaticity level of several HLS tools. In “Are We There Yet? A Study on the State of High-Level Synthesis”, Lahti et al. [54] reviewed the scientific literature that compared QoRs between HLS and RTL design flows. In “High-Level Synthesis for FPGAs: From Prototyping to Deployment”, Cong et al. [9] utilized the AutoESL AutoPilot HLS tool in conjunction with Xilinx development platforms to demonstrate the effectiveness of the C to RTL high-level synthesis. In “Automating the Design of Processor/Accelerator Embedded Systems with LegUp High-Level Synthesis”, Fort et al. [55] presented an open-source HLS tool. This tool accepts a C language description as input and automatically generates a hybrid architecture that combines a hard-core processor and custom accelerators in an FPGA, which executes the most compute-intensive parts. In the article “High-Level Synthesis in the Delft Workbench Hardware/Software Co-design Tool-Chain”, Nane et al. [56] introduced the Delft Workbench toolchain, which accepts a behavioral description in C language as input and semi-automatically implements a heterogeneous architecture.
In the studies reviewed above, significant results were obtained; however, we identified the following limitations in comparison to our work: (i) In [9,10,17,24,54], the authors evaluated HLS tools in terms of synthesis performance, hardware resource utilization, and supported features. Nevertheless, they did not assess the performance of the hardware and software implementations with respect to execution time and speedup. Furthermore, the actual performance evaluation of compute-intensive algorithms accelerated in the hardware of FPGA-based heterogeneous SoCs was not conducted, which could have provided valuable insights. In contrast, our study conducted a comprehensive performance evaluation of the software and hardware implementations of compute-intensive algorithms, both independently and integrated into various configurations of Zynq-based SoCs. (ii) In [55,56], the authors utilized academic HLS tools and analyzed the results of function accelerators (IP cores) implemented in FPGA-based systems in conjunction with a hard-core processor. However, open-source solutions are typically less reliable and lack technical support. Therefore, in our research, we explore a commercial off-the-shelf (COTS) HLS tool, devise a method to optimize its design flow, and enhance its reusability. This approach can be particularly beneficial for less experienced designers and researchers.

2.3. AMD-Xilinx HLS Tools

2.3.1. Vivado HLS

Vivado HLS is a design environment for generating RTL descriptions in VHDL and Verilog from synthesizable C, C++, SystemC, or OpenCL HLL codes [17]. It employs a modified intermediate representation (IR) for synthesis and interconnect-centric optimization, taking into account user-specified constraints. Vivado HLS incorporates annotations for optimizing performance, resource utilization, and data communication between the CPU and custom hardware. Additionally, it provides hardware-optimized libraries and APIs for hardware developers. The IR is synthesized into RTL implementations for Xilinx FPGAs, and the generated RTL can be stored in an IP library for subsequent use [17].
The heterogeneous system development tools SDSoC [57] and SDAccel [58], which are currently deprecated and superseded by the Vitis Unified Software Platform [59], were utilized in conjunction with the Vivado HLS tool. The Vivado HLS tool has been transformed into Vitis HLS and encompasses several new features [17].

2.3.2. Vitis Unified Software Platform (USP)

The Vitis software development platform facilitates the development of accelerated applications on heterogeneous hardware platforms, including AMD Versal adaptive SoCs. This platform provides a unified programming model for the accelerated host, embedded, and hybrid (host + embedded) applications [59].
The central components of the Vitis USP are as follows. The Vitis AI development environment accelerates the AI inference on AMD embedded platforms, Alveo accelerator cards, and FPGA instances, utilizing deep learning frameworks. The Xilinx Runtime Library (XRT) facilitates communication between the application code and accelerators installed on PCIe interface-based AMD accelerator cards, MPSoC-based embedded platforms, or adaptive SoCs. The Vitis target platform delineates the hardware and software architecture, as well as the application environment. Vitis Model Composer is an AMD toolbox for MATLAB and Simulink. The Vitis HLS tool, integrated with the Vivado Design Suite and the Vitis USP, enables complex FPGA algorithm development by synthesizing C/C++ functions into RTL [59].

2.3.3. Vitis HLS and Development Flows

Vitis HLS is the successor to the Vivado HLS technology and features several enhancements over its predecessor. It has a new compiler based on an updated version of LLVM and compiles/simulates code written in C/C++11/14. When migrating a kernel module or IP, it is critical to understand the variations between HLS versions, which include behavioral differences, deprecated commands, and unsupported features [60]. The Vitis HLS tool facilitates the conversion of C/C++ functions into RTL code for FPGA devices, Zynq MPSoCs, or Versal adaptive SoCs [60]. It is integrated with the Vivado Design Suite and the Vitis unified software platform, enabling users to apply directives to the C code to generate specific RTL for the desired implementation. The design can be validated utilizing C simulation, which enables more rapid iterations than RTL-based simulations. The tool incorporates analysis and debugging tools to facilitate design optimization [61].
The Vitis HLS tool synthesizes C/C++ code by treating top-level function arguments as RTL I/O ports and sub-functions as RTL blocks. Arrays can be directed to any memory resource, and the function loops can be rolled or pipelined. Performance metrics can be reviewed through reports, and outcomes can be customized using pragmas and optimization directives [61].
Vitis HLS offers two development flows for producing different output products [60]: Vivado IPs for hardware designs using the Vivado IDE or Vivado IP Integrator in the Vivado IP Flow, and Vitis kernels (compiled object files, .xo) for software applications in the Vitis Application Acceleration Development Flow. The application acceleration development flow requires a Unix/Linux operating system and is not supported on Windows [60].

2.3.4. Vitis Libraries

Vitis Libraries is a collection of open-source, performance-optimized libraries that provide out-of-the-box acceleration for existing applications written in C, C++, or Python with minimal to zero code modifications [59,62]. Invoking a Vitis accelerated-library API or kernel can enhance the performance of specific portions of existing x86 host application code or can be used to develop accelerators for deployment on Xilinx embedded platforms [62,63].
Common Vitis libraries (e.g., Math, Statistics, Linear Algebra, Utilities, and DSP) provide core functionality for diverse applications, whereas domain-specific Vitis libraries offer acceleration for various workloads (e.g., Vision and Image Processing, Quantitative Finance, HPC, Database and Data Analytics, and Data Compression) [62,63].
The Vitis library comprises three levels of functions: L1 Primitives (HLS functions), L2 Kernels (performance-optimized kernels with interfaces and compiler directives), and L3 Software APIs (high-level software) [62].

2.3.5. Vivado HLS Examples

Vitis HLS provides introductory small-code examples to demonstrate effective design practices, coding guidelines, design patterns, and optimization techniques to maximize application performance [60].
These synthesizable examples encompass C/C++ code sources for the top function and test bench, README, and Tcl files. They are categorized into groups such as Dataflow, Pipelining, Interface, Modeling, and Misc. Execution of the examples requires a Tcl file, run_hls.tcl, which establishes the project and specifies the steps of the flow to be executed. By modifying the value of hls_exec, it is feasible to conduct a C-RTL co-simulation and Vivado implementation [62]. In comparison to our work, the Vivado HLS examples do not address SoC creation or integration with IPs, and certain examples are not compute intensive.

2.4. Benchmarking IP Cores for Zynq-Based SoCs

This section summarizes related studies on hardware accelerators for user evaluation and HLS benchmark suites. In [64], the authors proposed a microarchitecture template for integer multiplication accelerators to explore performance and resource utilization without considering HLS implementation and synthesis details. In [65,66,67,68], the authors proposed extensive benchmark suites of C programs for rapid performance prediction of HLS implementations. We identify the following gaps: (i) Ref. [64] focuses on one standalone accelerator; investigation of several IP-core accelerators for common embedded applications and exploration of IP-cores integrated into Zynq SoCs is necessary. (ii) Refs. [65,68] provided platform-independent C programs suitable for hardware acceleration across CPUs, GPUs, and FPGAs, but lack performance comparison between embedded platforms. (iii) Refs. [66,67] provide synthesis results for standalone IP cores and evaluate FPGA acceleration against a high-end general-purpose processor. However, no benchmark suite assesses the performance of exemplary algorithms on embedded processors, HLS standalone IP-cores, and FPGA-SoCs. The reproducibility of these designs would enable users to replicate accelerator and SoC templates and investigate system-level enhancements. In [69], Zhou et al. presented Rosetta, an FPGA benchmark suite with realistic performance constraints and advanced HLS optimizations using SDSoC and SDAccel environments. They reported results on an embedded FPGA Xilinx ZC706 device and an AWS F1 cloud FPGA platform with a Xilinx VU9P FPGA connected to a host CPU. This study focuses on the software view of kernel accelerators on cloud platforms without providing architectural details of Zynq SoC and IP-core accelerators. In contrast, we focus on hardware-embedded SoC designs at the system architecture level using Vivado HLS or Vitis HLS IP development flow to address different design questions.
In conclusion, in this study, we investigated and defined a simplified design flow to create Zynq SoCs with high-performance AHA IP cores using a COTS tool such as Vivado HLS. In addition, we present a suite of five benchmarking AHA IP cores for rapid performance evaluation of Zynq-based SoCs, which accelerate a selected group of compute-intensive algorithms. Additionally, we semi-automate the Vivado HLS design flow using a customized Tcl script to simplify the generation of IP cores (including the benchmarking IP cores) and SoCs. The script's options allow the user to select parameters, such as the type of FPGA, a benchmarking IP core, or a new IP core. This approach can offer designers a method for the fast prototyping and testing of a baseline platform, which can be further improved by exploring the advanced features of heterogeneous SoCs. Finally, we present the exploration and performance evaluation of these benchmarks implemented as standalone accelerator IP cores, embedded software in a hard-core processor, and accelerator IP cores integrated into heterogeneous SoCs.

3. AHA Design Methodology

Figure 1 illustrates both the manual and semi-automated (Section 4) design flows for developing Zynq-based SoCs and algorithm hardware accelerator (AHA) IP cores, which are derived from the Vivado HLS design flow. Based on our experience, we conducted an analysis of the original design flow and presented a simplified representation (manual design flow). Notably, we devised a semi-automated and reusable design flow (semi-automated design flow) through the enhancement of Vivado HLS Tcl scripts.
Tcl commands are generally used in EDA to speed up the design or verification process without using a GUI. However, the learning curve for non-hardware designers may be considerable. In Vivado HLS, when any process is performed in the GUI, the Tcl window prints the corresponding Tcl command. Therefore, it may be relatively easy to reproduce these commands. However, for this purpose, the designer must be a Tcl expert or conduct the entire design flow manually at least once to obtain the appropriate Tcl commands. When creating AHA IP cores, the design, verification, and optimization times can be long. Therefore, our approach allows the designer to generate any of the five compute-intensive benchmarking IP cores or a new IP core, together with a primary Zynq-based SoC. Thus, the designer has a baseline to start exploring the performance or improving the architectural features of an FPGA-based heterogeneous platform.
Our analysis and methodology are divided into four main processes: (A) algorithm development in C/C++, (B) creation of AHA IP cores, (C) integration of AHA IP cores in SoCs, and (D) development of embedded software for the hard-core processor. A thorough explanation of these processes is provided in the following paragraphs.

3.1. Process A: Algorithm Development in C/C++

The algorithms require validation prior to their synthesis using Vivado HLS. Consequently, the compute-intensive algorithms (Section 5) were written, adapted, and verified in C/C++ utilizing the Eclipse IDE for C/C++. A hierarchical structure of sub-functions with a main top-level function was maintained, wherein the arguments and return values correspond to the inputs and outputs of the synthesized IP core [16]. Figure 2 illustrates the relationship between the function and its implementation as an IP core in the programmable logic. This IP core incorporates an AXI4 wrapper that contains the protocol and interface logic, as well as a single consolidated AXI4 slave port facilitating its interconnection with the host CPU core.
Regarding functional verification, a C-language specification to be synthesized into an HDL description requires a test bench function that compares the results of the RTL implementation against the ideal results, referred to as golden results. A C-language specification is a behavioral description in C/C++, which may include a top-level function and a hierarchy of sub-functions. This post-synthesis RTL verification is a feature in Xilinx Vivado HLS [16] termed “C/RTL Co-simulation”. These golden results were obtained from software executions utilizing Eclipse IDE C/C++ and a GNU compiler. As shown in Figure 1A, each benchmark is composed of four files: (i) fcn.cpp is the file where the top-level function and the required sub-functions are defined. (ii) hd.h is a header file containing the declarations of the top-level function, variable types, and identifiers. (iii) tb.cpp is the test bench written in C, which validates the accuracy of the RTL simulation results against golden results. (iv) gold.dat stores golden results.
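A minimal sketch of such a test bench is shown below; it follows the file convention just described, but the function name top_function, the array size SIZE (assumed to be declared in hd.h), and the integer format of gold.dat are illustrative assumptions rather than the actual benchmark code.

```cpp
// tb.cpp -- hypothetical test bench sketch: runs the top-level function and
// compares its outputs against the pre-computed golden results in gold.dat.
#include <cstdio>
#include "hd.h"   // assumed to declare top_function() and SIZE

int main() {
    int in[SIZE], out[SIZE];

    // Simple deterministic stimulus (the real benchmarks use their own inputs).
    for (int i = 0; i < SIZE; i++) in[i] = i;

    top_function(in, out);   // function under test (assumed name)

    // Compare against the golden results produced by the validated software run.
    std::FILE *fp = std::fopen("gold.dat", "r");
    if (fp == nullptr) return 1;
    int golden = 0, errors = 0;
    for (int i = 0; i < SIZE; i++) {
        if (std::fscanf(fp, "%d", &golden) != 1 || out[i] != golden) errors++;
    }
    std::fclose(fp);

    // Vivado HLS interprets a zero return value as a passing C/RTL co-simulation.
    return (errors == 0) ? 0 : 1;
}
```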

3.2. Process B: Generation of AHA IP Cores

The implementation of the software algorithms in the hardware IP cores is illustrated in process (B) of Figure 1 and is executed utilizing Vivado HLS and the manual design flow. Comparable results were obtained employing our custom semi-automated design flow. Process (B) comprises the following six steps.

3.2.1. Importing the C Specification

The algorithm, written in C/C++ and previously validated (Section 3.1), is now imported into Vivado HLS. This C specification must adhere to the following conditions: (i) it must not contain typical OS operations that access non-existent resources (e.g., constructors for dynamic memory allocation, reading, and writing files) and (ii) the anticipated inputs/outputs of the IP-core must be defined as arguments in the top-level function.

3.2.2. C Simulation

Software-level verification is conducted to validate the functionality of the algorithm. In high-level synthesis (HLS), executing the compiled C program is referred to as "C simulation" [16]. Initially, the C specification is compiled utilizing a C/C++ compiler. Subsequently, the executable is run, and the results are stored in a .dat file. Finally, these results are verified using a test bench, which compares them with the set of golden results. This investigation was conducted utilizing Vivado/Vitis versions ranging from v2017.4 to v2020.1, which have lower storage requirements than more recent releases. A key objective of this study was to utilize widely accessible and general tools to ensure broad applicability and reproducibility.

3.2.3. C Synthesis

Vivado HLS infers the logic of the C specification and synthesizes it to obtain an RTL hardware description, according to the main HLS processes explained in Section 1. For more details, please refer to [16].

3.2.4. C/RTL Co-Simulation

The C test bench was reused to perform the RTL-level simulation. Using the Vivado Simulator (XSim), this process verifies that the results obtained by simulating the hardware functionality of the RTL description have the same accuracy as the golden results [15].

3.2.5. Evaluation and Optimization

The hardware implementation was evaluated by considering the following: (i) performance indicators, such as latency and initiation intervals and (ii) indicators of resource utilization, such as the number of LUTs, FFs, Block RAMs, and DSPs. The results are presented in Section 7.1. The purpose of this evaluation was to verify whether the implementation adhered to the design goals and metrics (e.g., specific execution latency and maximum resource utilization). The objective of this study was to develop high-performance accelerators. Consequently, the focus was primarily on the execution performance metrics rather than programmable logic resource utilization.
Optimization was conducted by inserting directives in the C files to instruct the Vivado HLS to generate a microarchitecture that satisfies the desired performance and area goals [16]. Based on this evaluation, directives were applied to optimize sections of the algorithms to enhance the overall performance of the system. Directives are special statements in C language, which are used to influence how the C synthesis implements a specification. As illustrated in Figure 1B, the optimization process requires several iterations where different implementations, termed solutions, of the same specification are created. Each new solution was synthesized into an RTL description, verified, and evaluated.
Performance optimizations in Vivado HLS are designed to maximize hardware efficiency through parallelization techniques. A key method is pipelining (using the PIPELINE directive), which enables concurrent operation execution and reduces the initiation intervals while maintaining latency. For instance, in a function with sequential operations, pipelining can decrease the initiation interval from three clock cycles to one, significantly improving throughput. Another critical technique is array partitioning (ARRAY_PARTITION), which splits arrays into smaller subarrays or individual elements to increase the number of read/write ports, enabling parallel operations and enhancing pipelining. Additionally, loop unrolling (UNROLL) allows iterations to be executed in parallel, either partially or fully, thereby reducing latency at the cost of increased resource usage. Finally, dataflow optimization (DATAFLOW) parallelizes hierarchical function execution, decreasing both the latency and initiation intervals by running subfunctions concurrently. Vivado HLS also provides techniques to fine-tune latency and area, which are crucial for designs in which hardware efficiency is critical, enabling a balance between performance and resource consumption based on application-specific requirements. Latency optimization (LATENCY) allows designers to set maximum or minimum timing constraints for functions or loop directives, employing methods such as loop flattening to merge nested loops into a single loop and reduce clock cycles. On the other hand, area optimizations aim to minimize resource utilization through techniques like array mapping (ARRAY_MAP) and array reshaping (ARRAY_RESHAPE), which consolidate smaller arrays into larger structures or optimize memory layout. Resource sharing can be enforced using the ALLOCATION directive, which limits the number of instances of a specific operation or resource, or the INLINE directive, which flattens the function hierarchy to enable logic reuse.
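The sketch below illustrates two of these directives (DATAFLOW and UNROLL, combined with PIPELINE) on a hypothetical producer–consumer kernel; the function names, array size, and unroll factor are illustrative and are not part of the AHA suite.

```cpp
// Hypothetical producer/consumer kernel illustrating the DATAFLOW, UNROLL,
// and PIPELINE directives described above.
#define N 64

static void produce(const int in[N], int tmp[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=4      // execute four iterations in parallel
        tmp[i] = in[i] * 3;
    }
}

static void consume(const int tmp[N], int out[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1        // accept one element per clock cycle
        out[i] = tmp[i] + 1;
    }
}

void top(const int in[N], int out[N]) {
#pragma HLS DATAFLOW             // let produce() and consume() run concurrently
    int tmp[N];
    produce(in, tmp);
    consume(tmp, out);
}
```

Under DATAFLOW, Vivado HLS inserts channel buffers (ping-pong buffers or FIFOs) for tmp so that the two sub-functions overlap instead of executing back to back.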
Several synthesis and optimization techniques were explored for the development of the AHA IP cores. We applied a set of Vivado HLS optimization directives tailored to the computation patterns and architectural constraints of each application. For the matrix multiplication IP core, four iterative solutions were explored. The final solution employs the PIPELINE directive on the inner computation loop, ARRAY_PARTITION for the full partitioning of matrices A and B, and INTERFACE directives to optimize the AXI communication, resulting in a one-clock-cycle initiation interval. The FFT IP core required adaptation due to its variable loop bounds, which prevented aggressive unrolling. Optimization focused on loop structure modification, the use of fixed bounds, and the PIPELINE directive to maximize throughput under architectural constraints. The AES IP core achieved its best performance by applying PIPELINE to the encryption rounds and subfunctions, optimizing data throughput across the MixColumns, ShiftRows, and SubBytes transformations. In addition, the INLINE and ARRAY_PARTITION directives were applied to reduce memory access bottlenecks and allow concurrent execution. In the back-propagation IP core, initial attempts at PIPELINE and UNROLL generated synthesis errors due to the complexity and interdependencies. Instead, the final version used ARRAY_RESHAPE, limited PIPELINE usage, and a simplified architecture to enable a successful HLS synthesis. Finally, the ANN IP core for digit recognition integrates PIPELINE, ARRAY_PARTITION, and UNROLL in a feedforward structure to maximize layer-to-layer propagation speed while balancing the area constraints. Table 2 summarizes the optimization techniques, directives, goals, and results for the final solutions of the AHA IP cores.
For didactic purposes, we use the matrixmul function to illustrate the synthesis and optimization processes. The matrixmul function is the least demanding of our suite of benchmarking algorithms and IP cores, but it is widely used in high-performance applications. Figure 3a,c show the C specification and the optimization directives, respectively. The directives focus on increasing the performance of the accelerators using optimization strategies that improve the throughput of the implementations (e.g., function and loop pipelining, dataflow, array partitioning, and loop unrolling). For a broad treatment of traditional pragma directives and further optimization techniques, we recommend consulting [32]. We begin with a naive implementation that has a latency of 85 clock cycles and an initiation interval of 86 clock cycles. Latency is the total number of clock cycles required to compute the output values after the input is accepted, whereas the initiation interval is the number of clock cycles before a new input can be accepted. Using pipelining and array-partitioning transformations, we obtained a solution with a latency of seven cycles and an initiation interval of one cycle. The final pipelined implementation considerably improves throughput, increasing the rate at which new inputs are accepted and new outputs are generated. Figure 3b,d illustrate the synthesis results and how the execution of operations is scheduled in each clock cycle, both in the default implementation generated by Vivado HLS and in the implementation optimized for performance.
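The following sketch is consistent with the matrixmul specification and directives just described (PIPELINE on the output loop and complete partitioning of the input matrices); the data types, loop labels, and interface pragmas of the actual benchmark may differ.

```cpp
// Hypothetical sketch of the matrixmul C specification and directives of
// Figure 3 (2 x 2 matrices, int elements assumed).
#define DIM 2

void matrixmul(const int a[DIM][DIM], const int b[DIM][DIM], int res[DIM][DIM]) {
    // Full partitioning exposes every matrix element on its own port,
    // so all reads needed by one output element can occur in parallel.
#pragma HLS ARRAY_PARTITION variable=a complete dim=0
#pragma HLS ARRAY_PARTITION variable=b complete dim=0

ROW_LOOP:
    for (int i = 0; i < DIM; i++) {
COL_LOOP:
        for (int j = 0; j < DIM; j++) {
            // Pipelining lets a new (i, j) output start every clock cycle;
            // the inner product loop is unrolled automatically inside the
            // pipelined region.
#pragma HLS PIPELINE II=1
            int acc = 0;
PRODUCT_LOOP:
            for (int k = 0; k < DIM; k++) {
                acc += a[i][k] * b[k][j];
            }
            res[i][j] = acc;
        }
    }
}
```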

3.2.6. Exporting the IP-Core

Upon completion of the optimization process and acquisition of the optimal solution, the VHDL description of the AHA IP-core was exported from Vivado HLS. Subsequently, the IP core can be integrated into the hardware system of an SoC.
Figure 3. Example of the synthesis and optimization processes of the matrixmul function in Vivado HLS. (a) Function matrixmul written in C. (b) Default implementation in Vivado HLS, which sequentially executes three nested loops. (c) Optimization directives to increase performance. (d) Optimized implementation based on parallel execution of reading, multiplication, addition, and writing operations. The #pragma HLS PIPELINE directive enables loop pipelining to initiate new iterations every clock cycle, reducing overall latency. The #pragma HLS ARRAY_PARTITION directive fully partitions the input matrices to allow parallel access to memory elements, improving throughput. This pipelined implementation reduced latency and initiation interval, increasing overall throughput.

3.3. Process C: Integration of AHA IP-Cores in SoCs

The integration of each algorithm accelerator IP-core into an FPGA-based SoC was performed utilizing the Vivado IP Integrator, which is a component of the Vivado Design Suite toolset. Figure 4 illustrates a simplified block diagram of the proposed heterogeneous architecture. A heterogeneous architecture incorporates different types of processing units with distinct instruction architectures; in this case, the ARM processor and IP cores. Rather than implementing an advanced design that would limit the generality of our work, we developed a simple yet functional SoC architecture to provide researchers with a foundation for further architectural enhancements (e.g., AXI-stream interfaces, DMA, and vectorized instructions). Consequently, the utilization of the benchmarking IP cores or the semi-automated methodology was not constrained by specific architectural limitations.
The Zynq SoC architecture is divided into two regions: the processing system (PS) and the programmable logic (PL). We connect the PS, which contains the ARM processor, caches, and controllers, to the AHA IP core in the PL through the AXI interconnection fabric. The AXI slave interface of the accelerator IP core is connected to an AXI master interface of the PS. The advanced extensible interface (AXI) is a multi-master, multi-slave protocol for the interconnection of on-chip modules and is part of the ARM advanced microcontroller bus architecture (AMBA) standard. We use AXI4-Lite interfaces and rely on the AXI Interconnect IP for data transactions. (The AXI protocol can offer high data throughput through burst transmissions, data downsizing and upsizing, and the insertion of pipeline stages [70]. Profiling techniques for exploring the performance of these features are beyond the scope of this study.)
The integration process concludes with the exportation of crucial files: (i) the bitstream file (.bit), which contains the configuration of the programmable logic, and (ii) system.hdf, which holds the hardware platform specifications. This specification encompasses information about the target device, the address map of the selected processor, the IP blocks, and the data sheets of the system's peripherals.

3.4. Process D: Embedded Software Development

The embedded software development for SoCs was conducted by utilizing Xilinx SDK/Vitis. Figure 1D illustrates a three-layer software stack that represents the software architecture of custom SoCs developed in this study, in accordance with Xilinx [71]. These layers comprise the following.

3.4.1. Hardware Platform

This layer encompasses all information pertaining to the hardware created by the Vivado IP Integrator. Its function is to facilitate access to the upper software layers of the SoC hardware. The primary component of this layer is the system.hdf file.

3.4.2. Board Support Package (BSP)

This layer consists of a fundamental operating system (OS) termed standalone. This OS constitutes a semi-hosted and single-threaded environment that provides access to and control of the processing system functions. The system.mss file serves as the primary component of the BSP, containing libraries and drivers for the management of the processing system and caches, interruptions, exceptions, timers, and other configurations [72].

3.4.3. Application

This layer represents the uppermost layer of the software stack, specified in C/C++. In this study, a file designated as arm.c was employed, which incorporates the xparameters.h file, wherein the parameters of the processor peripherals are defined, and the .h header file, which contains the device drivers for the IP core.
The embedded software development process culminates in the generation of the .bit file for configuring the programmable logic, and the .elf file for executing software embedded in a processing system.
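As an illustration of this application layer, the sketch below shows a hypothetical arm.c that controls an AHA IP core through the driver generated by Vivado HLS for a top-level function assumed to be named matrixmul. The driver and macro names follow the usual X<TopFunction>_* convention, but the argument-access calls depend on the generated AXI4-Lite register map and are therefore only indicated as comments.

```c
/* arm.c -- hypothetical bare-metal control sketch for an AHA IP core
 * generated from a top-level function named "matrixmul". Names follow the
 * usual Vivado HLS driver convention and are assumptions, not the exact
 * code used in this study. */
#include "xparameters.h"   /* device IDs exported by the hardware platform  */
#include "xmatrixmul.h"    /* driver generated by Vivado HLS (assumed name) */

int main(void) {
    XMatrixmul accel;

    /* Look up and initialize the accelerator mapped on AXI4-Lite. */
    XMatrixmul_Initialize(&accel, XPAR_XMATRIXMUL_0_DEVICE_ID);

    /* ... write the input operands to the IP core's AXI4-Lite registers ... */

    XMatrixmul_Start(&accel);            /* start the accelerator           */
    while (!XMatrixmul_IsDone(&accel)) { /* poll the block-level done flag  */
        ;
    }

    /* ... read back the results and compare with the software execution ... */
    return 0;
}
```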

4. Rapid Generation of AHA IP-Cores + SoC

To reduce the non-recurring engineering (NRE) cost associated with the design and performance evaluation of both SoCs and AHA IP-cores, we customized the HLS Tcl scripts. This approach adheres to the design flow proposed in Section 3. Figure 5 presents excerpts of the Tcl scripts and the user menu, which serves as a simplified interface for Vivado HLS, enabling the selection of one of the available options and the target Zynq device.
The subsequent sections describe the advantages (Section 4.1), processes (Section 4.2), requirements (Section 4.3), and limitations (Section 4.4) of our semi-automated design flow utilizing the HLS Tcl scripts.

4.1. Advantages

The Tcl scripts are designed to execute the steps necessary for creating IP cores in Vivado HLS and SoC creation in the Vivado IP Integrator. Consequently, they reduce the non-recurring engineering cost associated with design time and manual effort required to test new algorithms. To this end, the generated scripts serve two primary purposes.
(i) Replication of reference designs: The aim is to provide designers with a method to efficiently evaluate their Zynq platforms based on the proposed AHA IP-core benchmark suite. These scripts recreate the IP cores (i.e., matrix multiplication, FFT, AES, BPNN, and ANN) and SoCs generated in this study. (ii) Creation of new IP cores and SoCs based on Zynq: The objective is to enable a broader audience to target heterogeneous platforms. To achieve this, versatile Tcl scripts are provided to guide designers with limited or no familiarity with Xilinx tools, allowing them to transform their C/C++ algorithms into hardware IP cores integrated on Zynq using AXI4-Lite. These IP cores can subsequently be controlled from a bare-metal C/C++ application running on ARM, an environment familiar to many engineers and software developers, without requiring knowledge of HDLs, synthesis tools, or IP-integration environments. The scripts assume that the C/C++ code has been validated by the designer in the software environment.

4.2. Semi-Automated Design Flow

4.2.1. Creation of IP-Cores (Script_IP.tcl)

This script generates an AHA IP-core by initiating the C simulation, C synthesis, C/RTL co-simulation, and export IP-core (i.e., generating a VHDL description of the IP-core) processes, as illustrated in Figure 1. Figure 5a presents an excerpt of this script, which was executed in the Tcl console of Vivado HLS.
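The following condensed Tcl sketch indicates the kind of commands such a script issues; the project name, file names, target part, and clock period are placeholders, and the actual Script_IP.tcl additionally implements the user menu shown in Figure 5.

```tcl
# Hypothetical excerpt in the style of Script_IP.tcl.
# Project name, files, and FPGA part are placeholders.
open_project -reset aha_matrixmul
set_top matrixmul
add_files fcn.cpp
add_files -tb tb.cpp
add_files -tb gold.dat

open_solution -reset "solution1"
set_part {xc7z020clg484-1}          ;# Zynq device selected through the user menu
create_clock -period 10 -name default

csim_design                          ;# C simulation against the golden results
csynth_design                        ;# C synthesis to an RTL description
cosim_design                         ;# C/RTL co-simulation reusing the test bench
export_design -format ip_catalog     ;# package the AHA IP core for the Vivado IP Catalog
exit
```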

4.2.2. SoC Creation (Script_SoC.tcl)

This script facilitates the creation of SoCs that integrate the accelerator (IP-core) previously developed in Section 4.2.1. This process must be executed in the Tcl console of Vivado. Figure 5b presents a segment of the script, which imports the IP-core into the Vivado IP Catalog and generates a block design in the Vivado IP Integrator. The block design incorporated a fixed SoC architecture, as depicted in Figure 4. The process concludes with the generation of an SoC hardware platform (.hdf) and a bitstream file (.bit), which were subsequently exported.
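The sketch below indicates, in hedged form, the kind of Vivado Tcl commands involved in this step; the part number, repository path, IP identifiers, and interface names are placeholders, and the actual Script_SoC.tcl also handles the user-selected device and the export of the .hdf and .bit files.

```tcl
# Hypothetical excerpt in the style of Script_SoC.tcl, run in the Vivado
# Tcl console. Part numbers, paths, and IP versions are placeholders.
create_project aha_soc ./aha_soc -part xc7z020clg484-1

# Make the exported AHA IP core visible in the Vivado IP Catalog.
set_property ip_repo_paths ./aha_matrixmul/solution1/impl/ip [current_project]
update_ip_catalog

# Build the fixed block design: Zynq PS + AHA IP core over AXI4-Lite.
create_bd_design "design_1"
create_bd_cell -type ip -vlnv xilinx.com:ip:processing_system7:5.5 ps7
create_bd_cell -type ip -vlnv xilinx.com:hls:matrixmul:1.0 matrixmul_0
apply_bd_automation -rule xilinx.com:bd_rule:processing_system7 \
    -config {make_external "FIXED_IO, DDR"} [get_bd_cells ps7]
apply_bd_automation -rule xilinx.com:bd_rule:axi4 \
    -config {Master "/ps7/M_AXI_GP0" Clk "Auto"} \
    [get_bd_intf_pins matrixmul_0/s_axi_AXILiteS]

validate_bd_design
save_bd_design
# The HDL wrapper, synthesis, implementation, bitstream generation, and
# hardware-platform export follow, exactly as in the manual flow of Section 3.3.
```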

4.3. Requirements of the C Specifications for Creating New IP-Cores and SoCs, Using Tcl Scripts

The Tcl scripts presented in this work require that the C specification fulfills the following requirements to execute the processes described in Section 4.2.
  • The C specification and golden results of the test bench must be verified utilizing software.
  • The C specification must not contain operating system operations that cannot be synthesized into hardware.
  • The IP-core inputs and outputs must be defined as arguments in the top-level function.
  • The C specification must incorporate a test bench that invokes the top-level function and verifies its results. The test bench must return a zero-integer value upon validation of the results.
  • The AXI interfaces must be specified for the top-level function using the #pragma HLS INTERFACE directive to ensure that the generated IP core can subsequently be integrated into the SoC (a minimal example is sketched after this list).
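A minimal hypothetical top-level function satisfying these requirements is sketched below; the function, port, and bundle names are illustrative only.

```cpp
// Hypothetical top-level function illustrating the AXI4-Lite interface
// requirement above; names and sizes are illustrative.
#define N 16

void vadd(const int a[N], const int b[N], int c[N]) {
    // Bundle every data port and the block-level control (port=return)
    // into one AXI4-Lite slave interface, matching the SoC template of Figure 4.
#pragma HLS INTERFACE s_axilite port=a      bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=b      bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=c      bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=return bundle=CTRL_BUS

    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        c[i] = a[i] + b[i];
    }
}
```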

4.4. Limitations of Semi-Automated Design Using Tcl Scripts

The development of Tcl scripts for the generation of IP cores utilizing Vivado HLS and their subsequent integration into SoCs based on Zynq devices using the Vivado IP Integrator presents several limitations.
  • While the scripts automate the creation of a single solution for each IP-core implementation, performance and resource analyses must be conducted manually. This requires the designer to iteratively examine the synthesis report files and incorporate appropriate optimization directives directly into the C specification files, after which the scripts are executed again to generate a new solution.
  • The scripts generate SoCs with a fixed architecture, which integrates the IP core generated by Vivado HLS into a Zynq SoC through AXI4-Lite interfaces (Figure 4). Any modifications to this architecture require manual intervention in the IP-integrator GUI of Vivado.

5. AHA IP-Core Suite

The AHA (algorithm hardware accelerator) IP-Core Suite is a set of five compute-intensive IP cores that can be readily integrated into Zynq SoCs. These IP cores were developed utilizing the Vivado HLS Embedded Flow and facilitate the rapid evaluation of the performance of Zynq heterogeneous SoCs. Furthermore, Tcl scripts were created to enhance the reusability of these IP cores across various boards and applications (Section 4).
The algorithms employed in this study were selected to establish a set of reference embedded applications. These algorithms were either adapted from existing benchmark suites or custom-developed. Additionally, the benchmarks were modified to adhere to Vivado HLS requirements. Figure 6 illustrates this concept, and Table 3 provides a summary of the description and characteristics of the algorithm utilized in each IP core.

5.1. Matrix Multiplication

This AHA IP core performs standard multiplication of two 2 × 2 matrices. Figure 6a illustrates the fundamental principle for obtaining an element of the resulting matrix C. A matrix multiplier is an algorithm utilized in various performance benchmarks due to its straightforward mathematical structure, facilitating comprehension of the algorithm logic and IP-core implementation in Vitis HLS. The C++ function executes matrix multiplication using three nested loops, with a multiplier–accumulator (MAC) component performing the multiplication–accumulation operation, which is compute-intensive and crucial for numerous algorithms.
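A minimal sketch of such a kernel is shown below; the identifiers, loop labels, and the DIM macro are illustrative rather than the exact AHA source code.

```cpp
#define DIM 2   // 2 x 2 matrices, as in the AHA benchmark description

// Standard matrix multiplication with three nested loops; the innermost loop
// performs the multiply-accumulate (MAC) operation.
void matmul(const int A[DIM][DIM], const int B[DIM][DIM], int C[DIM][DIM]) {
    Row: for (int i = 0; i < DIM; i++) {
        Col: for (int j = 0; j < DIM; j++) {
            int acc = 0;                         // MAC accumulator
            Prod: for (int k = 0; k < DIM; k++) {
                acc += A[i][k] * B[k][j];        // multiply-accumulate
            }
            C[i][j] = acc;
        }
    }
}
```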

5.2. FFT (Fast Fourier Transform)

This AHA IP-core implements a 1024-point fast Fourier transform (FFT) utilizing the Cooley–Tukey method. The FFT is an efficient implementation of the Discrete Fourier Transform (DFT), which entails substantial computational complexity. The Cooley–Tukey method, one of the most efficient FFT algorithms, recursively divides each new DFT into even samples $X_e[k]$ and odd samples $X_o[k]$ until small DFTs of two points each are obtained, as illustrated in Figure 6b. The butterfly diagram for a 4-point FFT demonstrates the data flow of the samples x, their weights w, and the summation points (green dots). The FFT serves as a standard benchmark for evaluating embedded systems and has been successfully applied in various fields of engineering (e.g., communications systems and radars). The algorithm for this IP core is adapted from MachSuite [68].
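As a conceptual illustration of this even/odd decomposition, the sketch below shows a recursive radix-2 Cooley–Tukey FFT in plain C++. It is written recursively for clarity only; it is neither the MachSuite code nor the loop-based formulation used for synthesis.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Recursive radix-2 decimation-in-time FFT (in place, power-of-two size).
// Each call splits the DFT into even- and odd-indexed halves and recombines
// them with a butterfly using twiddle factors w.
void fft(std::vector<std::complex<double>>& x) {
    const std::size_t N = x.size();
    if (N <= 1) return;                              // 1-point DFT: nothing to do

    std::vector<std::complex<double>> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {        // split into even/odd samples
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    fft(even);
    fft(odd);

    const double PI = std::acos(-1.0);
    for (std::size_t k = 0; k < N / 2; ++k) {        // butterfly recombination
        std::complex<double> w =
            std::polar(1.0, -2.0 * PI * (double)k / (double)N) * odd[k];
        x[k]         = even[k] + w;
        x[k + N / 2] = even[k] - w;
    }
}
```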

5.3. AES (Advanced Encryption Standard)

The AES IP core encrypts and decrypts a 128-bit vector utilizing a 128-bit key. AES is a symmetric encryption algorithm that employs the same key for both data encryption and decryption. It operates on bytes, facilitating comprehension and implementation in software and hardware. AES is based on permutations and substitutions and is structured into repeated steps termed rounds. Figure 6c illustrates the principle of the proposed algorithm. AES encryption necessitates a 128-bit state vector and a 128-, 192-, or 256-bit key. The key length determines the number of rounds to be executed, with 128-, 192-, and 256-bit keys requiring 10, 12, and 14 rounds, respectively. Four types of operations, namely, SubBytes, ShiftRows, MixColumns, and AddRoundKey, constitute each round. The AES algorithm is highly relevant for security applications (e.g., encryption of financial, communication, or military information) and is one of the most widely accepted standards for information protection. This benchmark is adapted from the CHStone [66] suite.
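The skeleton below illustrates only the round-loop structure for a 128-bit key (10 rounds): AddRoundKey is shown in full as a byte-wise XOR, whereas SubBytes, ShiftRows, and MixColumns are left as placeholder comments and the key schedule is assumed to be expanded beforehand. It is a structural sketch, not the CHStone implementation.

```cpp
#include <stdint.h>

// Byte-wise XOR of the 16-byte state with a round key.
static void add_round_key(uint8_t state[16], const uint8_t round_key[16]) {
    for (int i = 0; i < 16; i++) state[i] ^= round_key[i];
}

// AES-128 encryption round structure: an initial key addition followed by
// 10 rounds; the final round omits MixColumns.
void aes128_encrypt_sketch(uint8_t state[16], const uint8_t round_keys[11][16]) {
    add_round_key(state, round_keys[0]);
    for (int round = 1; round <= 10; round++) {
        // SubBytes(state);                      // byte substitution via the S-box
        // ShiftRows(state);                     // cyclic shift of the state rows
        // if (round < 10) MixColumns(state);    // omitted in the final round
        add_round_key(state, round_keys[round]);
    }
}
```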

5.4. BPNN (Back-Propagation Neural Network)

The BPNN IP-core trains a two-layer artificial neural network with ten neurons each. Machine learning techniques utilize neural networks for classification, pattern recognition, and control. Training these networks is compute-intensive due to the iterative processes and extensive parameter calculations. The backpropagation method is widely employed for training neural networks, which comprise three layers of neurons (input, hidden, and output) connected by synaptic connections, as Figure 6d illustrates. By modifying the weights and biases in proportion to the gradient of the error function, the backpropagation method drives the network error toward a minimum. BPNN is the predominant supervised training technique for neural networks across numerous machine learning applications, such as classification and pattern recognition. The benchmark algorithm is selected from MachSuite [68].
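The sketch below shows one backpropagation training step for a small fully connected network with sigmoid activations and a mean-squared-error objective; the layer sizes, identifiers, and learning rate are illustrative and do not reproduce the MachSuite benchmark parameters.

```cpp
#include <cmath>

const int    NI = 4;     // input neurons (illustrative)
const int    NH = 10;    // hidden neurons
const int    NO = 10;    // output neurons
const double LR = 0.1;   // learning rate (illustrative)

static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// One training step: forward pass, error gradients, and weight/bias updates
// proportional to the gradient of the error function.
void bpnn_train_step(const double x[NI], const double target[NO],
                     double w1[NI][NH], double b1[NH],
                     double w2[NH][NO], double b2[NO]) {
    double h[NH], y[NO], d_out[NO], d_hid[NH];

    // Forward pass
    for (int j = 0; j < NH; j++) {
        double z = b1[j];
        for (int i = 0; i < NI; i++) z += w1[i][j] * x[i];
        h[j] = sigmoid(z);
    }
    for (int k = 0; k < NO; k++) {
        double z = b2[k];
        for (int j = 0; j < NH; j++) z += w2[j][k] * h[j];
        y[k] = sigmoid(z);
    }

    // Backward pass: output and hidden deltas
    for (int k = 0; k < NO; k++)
        d_out[k] = (y[k] - target[k]) * y[k] * (1.0 - y[k]);
    for (int j = 0; j < NH; j++) {
        double e = 0.0;
        for (int k = 0; k < NO; k++) e += w2[j][k] * d_out[k];
        d_hid[j] = e * h[j] * (1.0 - h[j]);
    }

    // Gradient-descent updates of weights and biases
    for (int j = 0; j < NH; j++)
        for (int k = 0; k < NO; k++) w2[j][k] -= LR * d_out[k] * h[j];
    for (int k = 0; k < NO; k++) b2[k] -= LR * d_out[k];
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NH; j++) w1[i][j] -= LR * d_hid[j] * x[i];
    for (int j = 0; j < NH; j++) b1[j] -= LR * d_hid[j];
}
```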

5.5. ANN (Artificial Neural Network)

The algorithm implements an artificial neural network that classifies handwritten numerals. This benchmark was specifically developed for this study. The artificial neural network was previously trained in software, and the benchmark executes only the classification (inference) stage. The dataset employed to train the network comprised 20 × 20 pixel images obtained from the MNIST dataset [73]. In Figure 6e, each pixel of the image is represented as $x_i$, the network outputs as $y_i$, and the biases as $b$. Thus, we validated our algorithm and accelerator in a typical machine learning application. In recent years, the research and utilization of FPGA-based heterogeneous systems to accelerate the execution of machine learning algorithms have increased substantially [74,75].
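A minimal sketch of the inference stage is shown below; the single fully connected layer, the identifiers, and the float data type are illustrative assumptions and do not reproduce the exact topology of the AHA network.

```cpp
const int N_PIX = 400;   // 20 x 20 input pixels x_i
const int N_CLS = 10;    // digit classes y_i

// Classification of one image: one weighted sum (plus bias b) per class,
// followed by an argmax over the class scores.
int ann_classify(const float x[N_PIX],
                 const float W[N_CLS][N_PIX], const float b[N_CLS]) {
    float y[N_CLS];
    for (int c = 0; c < N_CLS; c++) {
        float acc = b[c];
        for (int p = 0; p < N_PIX; p++) acc += W[c][p] * x[p];
        y[c] = acc;                      // activation omitted: only the argmax matters
    }
    int best = 0;
    for (int c = 1; c < N_CLS; c++)
        if (y[c] > y[best]) best = c;
    return best;                         // predicted digit
}
```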

6. Performance Evaluation

6.1. Experiments Setup

In this section, we delineate the methodology and experimental setup employed to evaluate the performance of the standalone algorithm accelerator IP cores and various hardware and software SoC architectures. Figure 7 illustrates the configurations of these experiments, which are elucidated in the subsequent paragraphs.

6.1.1. Experiment 1

This experiment is a comparative analysis of the performance of each IP core in isolation (i.e., prior to integration into an SoC) based on latency cycles. It compares (i) accelerator IP cores obtained by default in Vivado HLS, which prioritizes resource optimization, and (ii) accelerator IP cores optimized in Vivado HLS through synthesis directives to achieve maximum execution performance. The performance metrics measured in this experiment are (i) latency cycles, the number of clock cycles required to compute the accelerator output values, and (ii) clock frequency, the programmable-logic fabric clock frequency of the accelerator IP core.
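For reference, these two metrics jointly determine the standalone execution time of an IP core, independently of any particular tool:

$t_{IP} = \text{latency cycles} \times T_{clk} = \text{latency cycles} / f_{clk}$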

6.1.2. Experiment 2

This experiment is a comparative analysis of performance based on the execution time of the algorithms, implemented both in software and as AHA IP cores integrated into a Zynq SoC. Specifically, it compares (i) software implementations on an ARM Cortex-A9 core, (ii) SoCs implementing the accelerator IP cores generated by default in Vivado HLS, and (iii) SoCs implementing the optimized accelerator IP cores. The performance metric is the algorithm execution time in microseconds, measured by the software running on the ARM processor core from the initiation of the algorithm execution in the IP core until its completion, which is signaled to the ARM processor together with the availability of the results.
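As an illustration of how such a measurement could be taken in a bare-metal application, the sketch below reads the Cortex-A9 global timer through the Xilinx standalone BSP (xtime_l.h); ip_start() and ip_wait_done() are hypothetical stand-ins for the AXI4-Lite driver calls of the IP core, and the exact mechanism used in this study may differ.

```c
#include "xtime_l.h"     /* XTime, XTime_GetTime(), COUNTS_PER_SECOND (standalone BSP) */
#include "xil_printf.h"

void ip_start(void);      /* hypothetical: write the start bit via AXI4-Lite */
void ip_wait_done(void);  /* hypothetical: poll the done flag via AXI4-Lite  */

void measure_ip_execution(void)
{
    XTime t_start, t_end;

    XTime_GetTime(&t_start);   /* timestamp before starting the IP core      */
    ip_start();
    ip_wait_done();
    XTime_GetTime(&t_end);     /* timestamp after completion is signaled     */

    /* COUNTS_PER_SECOND is the global-timer frequency provided by the BSP.  */
    double elapsed_us = 1.0e6 * (double)(t_end - t_start) / (double)COUNTS_PER_SECOND;
    xil_printf("Execution time: %d us\r\n", (int)elapsed_us);
}
```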

6.2. Configuration of SoC Architectures for Experiment 2

Table 4 summarizes the hardware and software configurations of the SoC architectures implemented in Experiment 2 for the performance evaluation. In total, there were three architectures: $A_{base}$ ($A_b$), $A_{default}$ ($A_d$), and $A_{optimized}$ ($A_o$). These are described as follows.

6.2.1. $A_b$ (ARM Cortex-A9)

The $A_{base}$ ($A_b$) architecture served as the base or reference system for all performance comparisons in Experiment 2. For each of the five algorithm benchmarks (Table 4), an embedded software application was developed in C language and executed on a single-threaded OS (BSP standalone). This application executes the algorithm and measures the execution time. In the conducted tests, only one of the two available ARM Cortex-A9 cores was utilized, and the system’s cache memory was disabled. (The aim was to conduct an objective evaluation of the computational performance of the two processing systems, particularly the IP cores, without the additional benefits of cache memory, multi-threading, multi-processing, or other techniques to enhance processing performance.)
The experimental setup involved two FPGA boards based on Zynq: the Digilent Zybo and the Xilinx ZC702 Evaluation Kit. The results obtained on both boards exhibited minimal differences. (The Digilent Zybo and the Xilinx ZC702 Evaluation Kit have the same ARM Cortex-A9 processor operating at slightly different frequencies.) Consequently, only the results for the Digilent Zybo board (Z-7010) are presented. One hardware architecture, $A_b$, executes five different software configurations.

6.2.2. $A_d$ and $A_o$ (ARM Cortex-A9 + AHA IP-Core)

Both architectures comprise Zynq SoCs that integrate a single algorithm accelerator IP-core connected to the hard-core processor through AXI interfaces, as illustrated in Figure 4. Architecture $A_{default}$ ($A_d$) implements the default IP cores generated by Vivado HLS, whereas architecture $A_{optimized}$ ($A_o$) implements IP cores optimized for execution performance. Each IP core was synthesized and optimized, as described in Section 3, and it accelerated one of the five algorithm benchmarks (Table 3). The host processor, with a single-threaded OS (BSP standalone), executes a C software application that controls the accelerator IP core and measures the total algorithm execution time. Consequently, the execution performance of each algorithm accelerator IP core was evaluated in both architectures for each algorithm benchmark.
This experiment was conducted on two FPGA boards based on Zynq, namely, the Digilent Zybo and the Xilinx ZC702 Evaluation Kit. Therefore, architecture $A_d$ is implemented on two Zynq chips, Z-7010 and Z-7020, designated as architectures $A_{d10}$ and $A_{d20}$, respectively. Similarly, the architecture $A_o$ implementations on these boards are designated $A_{o10}$ and $A_{o20}$.

6.3. Speedup Factor

The speedup is computed utilizing the formula $S = t_{old} / t_{enhanced}$, which is derived from Amdahl’s law. Equation (1) calculates the speedup factor of the optimized IP cores in comparison to the IP cores generated by default in Vivado HLS (Experiment 1). Equation (2) calculates the speedup factor of the SoCs that incorporate the default and optimized accelerator IP cores; the software executions in architecture $A_b$ serve as the reference system.
Experiment 1: $S_1 = t_{IP_{default}} / t_{IP_{optimized}}$ (1)
Experiment 2: $S_2 = \begin{cases} t_{A_b} / t_{A_d}, & \text{for } A_{default} \\ t_{A_b} / t_{A_o}, & \text{for } A_{optimized} \end{cases}$ (2)

7. Results

This section presents the performance results obtained from Experiments 1 and 2 (Section 6). The corresponding analysis and discussion are presented in Section 8.

7.1. Experiment 1

Table 5 illustrates the results of the default and optimized standalone IP core implementations. Table 6 shows the type and quantity of FPGA resources utilized for each implementation. Figure 8 provides a comparative resource-utilization graph.

7.2. Experiment 2

Table 7 presents a comparative analysis of performance based on the execution times of the benchmarks in architectures $A_b$, $A_d$, and $A_o$. In this table, the speedup is calculated by considering $A_b$ as the base architecture according to Equation (2). Figure 9 illustrates a comparative graph of the speedup of architectures $A_d$ and $A_o$ relative to the baseline architecture $A_b$. Figure 10 depicts the FPGA physical layout (floorplanning) of the final implementations of the algorithm accelerator IP cores in architecture $A_{o20}$, with the exception of FFT, which corresponds to architecture $A_{d20}$.

8. Discussion

This section discusses our contributions in comparison to other studies, the primary challenges encountered in utilizing Vivado HLS, the methodologies employed to address these challenges, the outcomes and limitations of the benchmarking IP cores, and potential optimizations to enhance our designs.
Contemporary HLS research focuses on the following objectives: providing new frameworks for design space exploration [35,36,37], optimization [32,39,40], and formal verification [41,42]; evaluating HLS tools in terms of synthesis performance, resource efficiency, and supported features, without comparing hardware and software implementations in FPGA-based heterogeneous SoCs [9,10,17,24,54]; and developing or using academic HLS tools to analyze the effects of accelerators (IP cores) [55,56]. A fair comparison of implementations is challenging due to the lack of benchmark standardization. This paper presents a set of IP cores, their characterization, and a rapid methodology for integrating them into Zynq-based SoCs, such that they can also be utilized for benchmarking. This enables novices to readily incorporate these cores into their designs and explore architectural improvements.
Benchmarking with IP cores is challenging because the characteristics of each benchmark differ and not all algorithms are parallelizable, as demonstrated in Table 7. The achievable speedup factors depend on the intrinsic degree of parallelism of each algorithm.
Furthermore, Vivado HLS does not automatically perform hardware-aware transformations based on the specific characteristics of the code [32]. Hardware designers must apply specific pragma directives for tasks such as deep pipeline generation, and the Vivado HLS synthesis directives may be insufficient for addressing variable loop bounds, data dependencies, and resource constraints.
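As an illustration of the kind of directives involved, the sketch below applies PIPELINE, UNROLL, and ARRAY_PARTITION to a matrix-multiplication-style loop nest; the sizes are illustrative, and the directive set actually applied to each AHA IP core differs per algorithm.

```cpp
#define DIM 8

void matmul_opt(const int A[DIM][DIM], const int B[DIM][DIM], int C[DIM][DIM]) {
#pragma HLS ARRAY_PARTITION variable=A complete dim=2   // parallel access to a row of A
#pragma HLS ARRAY_PARTITION variable=B complete dim=1   // parallel access to a column of B
    Row: for (int i = 0; i < DIM; i++) {
        Col: for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE II=1                // attempt one C[i][j] result per clock cycle
            int acc = 0;
            Prod: for (int k = 0; k < DIM; k++) {
#pragma HLS UNROLL                       // fully unroll the MAC loop
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```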
Limited resources, particularly in cost-effective low-end platforms, pose challenges in Vivado HLS due to the lack of support for specifying resource constraints (e.g., on DSP48s, FFs, and LUTs). As a result, designers lack comprehensive control over the performance–resource trade-off. A resource-oriented HLS methodology is presented in [76].
The synthesis process is highly automated; however, design space exploration is time-consuming, and finding optimized solutions can require several days. In contrast, we rely on pragma directives to create the AXI interfaces, avoiding manual RTL design and reducing verification time. In summary, performance optimization is constrained by the algorithm’s parallelism, the designer’s expertise in overcoming tool limitations and modifying algorithms accordingly, and the programmable-logic resources available in the FPGA device.
Consequently, we believe that our work can provide a functional off-the-shelf starting point that allows users to explore further system-level performance optimizations or other objectives without addressing the time-consuming challenges described above.
Our investigation reveals that most benchmarking algorithms have inherent limitations. However, algorithms such as ANN and BPNN are highly parallel and suitable for hardware acceleration, despite the implicit increase in resource utilization. Consequently, we optimized the algorithm sections that, in turn, improve the overall IP-core performance.
Experiment 1 (Table 5) evaluated only IP core performance and demonstrated that higher optimizations are feasible for matrix multiplication. Conversely, no further optimization can be achieved in the FFT IP core due to the high data dependencies in the algorithm, specifically the variable loop bounds. Table 6 and Figure 8 indicate that the optimized AES IP core exhibited the highest increase in resource utilization with respect to its default implementation, whereas BPNN displayed the lowest increase.
Experiment 2 (Table 7, Figure 9) evaluated the SoC + IP-core performance and demonstrated that the AES IP-core achieved a higher speedup than the other IP-cores. The speedup was more pronounced when the algorithm was executed as an IP core in the Zynq SoC. In contrast, the matrix multiplication IP core provides the lowest speedup.
Selecting hardware-compatible algorithms for HLS implementation can be challenging, as it is not immediately apparent which algorithms are suitable for FPGA kernels with maximum speedup and minimal resource utilization. Our research proposes a set of criteria for selecting algorithms that significantly improve performance when implemented as FPGA-based accelerator IP cores using Vivado HLS. These guidelines are based on the properties of the algorithm.
  • Data type: Algorithms that perform byte-level or integer operations are recommended (e.g., matrix multiplication and AES), whereas algorithms that perform floating-point operations may require additional resources. Utilizing hardware-efficient data types that support arbitrary bit widths produces more compact and faster operators [16], which are suitable for applications that can reduce data precision without adversely affecting the results. This approach can potentially yield superior speedup factors in applications such as deep neural networks and image processing.
  • Data dependencies: High data dependencies render an algorithm poorly suited for hardware implementation because they force sequential execution (e.g., FFT), thus limiting FPGA parallelism. Manual transformation is necessary to mitigate these dependencies, and variable loop bounds cannot be efficiently parallelized with conventional transformations. This limitation also affects the capacity of Vivado HLS to accurately estimate latency cycles and resource utilization. To address these issues in FFT and AES, a combination of source-code transformations and traditional synthesis directives was employed. For further information, refer to [39], which presents a framework for optimizing applications with variable loop bounds.
  • Multiple iterations or rounds: Iterative executions of compute-intensive operations benefit from hardware acceleration, particularly with minimal data dependencies and deep pipelines (e.g., AES, BPNN, and ANN), whereas sequential algorithms with few or no iterations do not exhibit significant performance improvements.
In conclusion, Figure 11 presents a recommendation diagram for identifying appropriate algorithms for FPGA-based accelerator implementation. The polygon illustrates characteristics such as byte-level operations, integer data, floating-point data, number of rounds, and data dependencies. An optimal scenario occurs when the algorithm’s representative characteristics are situated on one or several sides, with darker regions indicating increased accelerator performance. Designers are advised to avoid algorithms whose characteristics fall in poorly recommended or non-colored areas.
In terms of system-level performance, the results indicated that the contribution of the optimized versions to the total SoC performance was algorithm dependent. In Figure 9, for instance, the optimized (OPT) IP cores of several algorithms (e.g., AES and ANN) demonstrated superior speedups compared to the non-optimized (DEF) IP cores. Conversely, both versions of IP cores yielded identical speedups for specific algorithms (i.e., matrix multiplication).
As illustrated in Figure 8, the low-cost Zynq-7000 all-programmable SoC family devices (i.e., Z-7010 and Z-7020) were unable to provide sufficient resources to implement various versions of the IP cores. This unanticipated finding, however, assists the reader in determining whether to utilize Zynq devices with greater resources, such as UltraScale+.
One of the objectives of our study is to develop benchmarking IP cores that are sufficiently generic, optimized, and reusable to be readily integrated into Zynq SoCs but are not necessarily required to achieve optimal system-level performance. Nevertheless, we propose the following recommendations for system-level enhancements and tight PS-PL integration.
  • Enabling cache memory in hardware and software can enhance system performance, particularly when external memory is utilized; however, its impact on BRAM on-chip systems is limited.
  • The AXI SmartConnect IP [77] provides maximum system throughput at low latency by synthesizing a low-area custom interconnect, offering a more scalable and flexible network-on-chip (NoC) architecture compared to the AXI Interconnect IP [70].
  • AXI HP (high-performance) interfaces are PS slave ports that facilitate high-bandwidth transfers between PL masters (e.g., memory-mapped PL-based accelerators or DMA engines) and the OCM and DDR memories; however, these ports are not hardware-coherent, so cache coherency must be managed using software controls.
  • The AXI accelerator coherency port (ACP) interface has an interface structure similar to that of the AXI HP ports but exhibits the highest throughput for a single interface. Its hardware coherency and connectivity within the PS, through the snoop control unit (SCU) and the L1 and L2 caches, enable ACP transactions to interact with the cache subsystem, potentially reducing the latency between the PS and PL-based accelerators. Comparisons between the HP and ACP ports are presented in [78,79]. Additionally, refs. [70,80] compare various techniques for linking the programmable logic to the processing system, focusing on data-movement tasks such as direct memory access (DMA).
  • Other methods that are less efficient than PL DMA using AXI HP or AXI ACP include CPU-controlled transactions, PS DMA controllers, and PL DMA using a general purpose (GP) AXI Slave [70].
While current implementations of AHA IP-cores use standard C/C++ data types (e.g., int, double) to facilitate synthesis, verification, and integration in early-stage hardware design, users may refine numerical precision to optimize resource usage and performance. Vivado and Vitis HLS support custom fixed-point types via ap_fixed<W, I> or ap_int<W>, where W is the total bit width, and I is the number of integer bits. These types can be substituted directly into the C/C++ source code to replace the default types in arithmetic operations, array declarations, or data structures. For example, changing a variable from double to ap_fixed<16, 8> reduces resource demand while maintaining moderate precision. This manual step allows designers to explore trade-offs between accuracy and area or latency, particularly in applications such as neural networks or embedded signal processing, where the dynamic range is known in advance.
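The sketch below illustrates this substitution on a generic MAC kernel; the type aliases, sizes, and identifiers are illustrative and do not correspond to the shipped AHA sources.

```cpp
#include "ap_fixed.h"    // arbitrary-precision fixed-point types (Vivado/Vitis HLS)

typedef double          acc_fp_t;   // original floating-point precision
typedef ap_fixed<16, 8> acc_fx_t;   // W = 16 total bits, I = 8 integer bits

// The same multiply-accumulate kernel instantiated with either precision.
template <typename T>
T dot4(const T a[4], const T b[4]) {
    T acc = 0;
    for (int i = 0; i < 4; i++) acc += a[i] * b[i];
    return acc;
}

// dot4<acc_fp_t>(...) synthesizes full double-precision operators, whereas
// dot4<acc_fx_t>(...) maps to much smaller fixed-point logic at the cost of
// reduced dynamic range and precision.
```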
Compared with the AMD-Xilinx LogiCORE FFT [81], which is a commercial IP core optimized for high configurability and performance, our FFT implementation offers a more accessible and modifiable alternative. LogiCORE supports a wide range of transform sizes, flexible data and phase widths, various arithmetic formats, architecture choices (e.g., Radix-2, Radix-4), and AXI4-Stream interfaces, making it well suited for production-grade applications. In contrast, our design is written in high-level C/C++ and synthesized with Vivado HLS using fixed parameters to simplify the synthesis and enhance clarity. Rather than functioning as a vendor-locked IP core, our FFT is fully open at the source level, allowing users (especially those new to HLS) to study, modify, and extend the architecture for custom or educational purposes.

9. Conclusions

This study investigates a solution to the increasing need for rapid evaluation and performance benchmarking of heterogeneous FPGA-based SoCs that integrate accelerator IP cores and Zynq SoCs. This research utilizes the Vivado HLS IP Flow, which produces IPs for hardware designs using the Vivado IDE or Vivado IP Integrator. In contrast to the Vitis Application Acceleration Development Flow, which is designed for software developers, the HLS IP Flow is not limited to specific operating systems, such as Unix, or to advanced yet costly target boards.
We present a suite of five AHA benchmarking IP cores (Table 3) that accelerate a selected group of computation-intensive algorithms. These algorithms include matrix multiplication, fast Fourier transform (FFT), advanced encryption standard (AES), back-propagation neural network (BPNN), and artificial neural network (ANN). By employing a simplified method (Figure 1) derived from Vivado HLS and based on Tcl scripts, these IP cores can be efficiently integrated into a Zynq-based SoC to rapidly evaluate the overall system performance.
Our interactive Tcl scripts were customized to semi-automate the Vivado HLS design flow and simplify the generation of Zynq SoCs, benchmarking IP cores, and other IP cores (Figure 5). This solution reduces the NRE cost compared with the standard Vivado HLS cycle. This technique may benefit novice hardware designers and software developers who are unfamiliar with Xilinx tools and seek to expedite the acceleration of their algorithms, as hardware IP cores are incorporated into a heterogeneous SoC running software on ARM processors.
We present a brief survey of HLS techniques and tools in Table 1. Although expert designers utilize HLS tools, the complexity of automatically developing heterogeneous multicore systems remains substantial. The development of HLS tools or methodologies that enhance or support existing tools in the design, verification, and integration of FPGA-based accelerator IP cores, SoCs, and GPU cores is a current research topic. Significant studies have assessed existing HLS tools or provided new HLS design frameworks. However, comprehensive performance evaluations of compute-intensive algorithms accelerated on FPGA-based heterogeneous SoCs are still lacking.
In contrast to other studies, this research aims to provide and characterize compute-intensive benchmarking IP cores and a methodology for expeditious construction of IP cores and SoCs, rather than developing high-performance architectures. To this end, we investigated the performance of various configurations of benchmarking IP cores and SoCs (Figure 7, Table 4), including (1) standalone accelerator IP cores, (2) embedded software executing in a Zynq SoC’s ARM hard-core processor, and (3) accelerator IP-cores integrated with Zynq SoCs utilizing AXI Interconnect. We employed two IP core solutions generated in Vivado HLS (i.e., the default and optimized execution performance) and two types of Zynq FPGAs (i.e., 7010 and 7020).
The performance evaluation (Table 5 and Table 7, and Figure 9) indicates that SoCs integrating algorithm accelerator IP cores designed in Vivado HLS can achieve speedup factors of one to three orders of magnitude over software execution. Nevertheless, this work is not intended as a fully optimized IP library, but as a practical, open-source baseline to accelerate learning and development.
The benchmarking results demonstrate that the ANN and BPNN algorithms achieve significant parallelism and hardware acceleration, whereas sequential execution constraints limit optimization in other algorithms. The optimized AES IP core exhibited the highest increase in resource utilization, whereas the BPNN had the lowest. The matrix multiplication IP core demonstrated a higher optimization potential, whereas the FFT IP core could not be further optimized due to its high data dependencies and variable loop bounds. The AES IP core achieved the highest speedup when executed as an IP core in a Zynq SoC, whereas the matrix multiplication IP core provided the lowest speedup. Vivado HLS does not automatically apply hardware-aware transformations, necessitating designer intervention via pragma directives. Furthermore, while the synthesis process is automated, design space exploration remains time-intensive. Optimized designs may require more FPGA resources than those available in cost-effective platforms, such as the Zynq Z-7010, limiting implementation feasibility.
According to our findings, the nature of each algorithm constrains the acceleration achievable in the IP cores and in the entire SoC. This limitation is determined by the extent of parallelizable code in the algorithm and by other factors, such as (1) the internal optimizations of the HLS tool to exploit parallelism and (2) the selection of appropriate algorithms or code sections to be implemented as AHA IP cores. Consequently, in Section 8, we present a set of recommendations for algorithm selection (Figure 11).
We observed that benchmarking IP cores should be as generalizable as possible, without incorporating specific features that reduce portability and flexibility. Utilizing our methodology, the designer generates a functional SoC baseline, which can be evaluated and subsequently enhanced with additional system-level improvements, such as cache memories, SmartConnect IP, AXI high-performance interfaces, AXI Accelerator Coherency Port interface, and PL DMA, which we elucidate in Section 8.
Future research could investigate precision-tuning methodologies such as wordlength optimization, quantization-aware design, or the use of fixed-point arithmetic types (e.g., ap_fixed, ap_int) to better balance accuracy and hardware efficiency, particularly in applications like neural networks, where the numerical range is well defined. Additionally, further studies may explore the implementation and performance of more advanced FPGA platforms (e.g., AMD-Xilinx UltraScale+ MPSoCs and RFSoCs) and heterogeneous SoCs that integrate GPU cores. The current suite of compute-intensive benchmarking IP cores may also be extended to target specific domains such as embedded vision, autonomous systems, edge computing, O-RAN, RADAR, and defense applications.

Author Contributions

Conceptualization, D.B.-M. and B.N.; methodology, D.B.-M. and B.N.; software, D.B.-M.; validation, D.B.-M.; formal analysis, D.B.-M.; investigation, D.B.-M. and B.N.; resources, D.B.-M. and B.N.; data curation, D.B.-M.; writing—original draft preparation, D.B.-M. and B.N.; writing—review and editing, D.B.-M. and B.N.; visualization, D.B.-M. and B.N.; supervision, B.N.; project administration, B.N.; funding acquisition, B.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Universidad de las Fuerzas Armadas ESPE grant numbers 2020-PIC-015-INV and 2020-PIC-013-CTE. The APC was funded by Universidad de las Fuerzas Armadas ESPE.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors extend their gratitude to Jaime Paúl Ayala Taco for providing feedback on select portions of the initial manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Brodtkorb, A.R.; Dyken, C.; Hagen, T.R.; Hjelmervik, J.M.; Storaasli, O.O. State-of-the-art in heterogeneous computing. Sci. Program. 2010, 18, 1–33. [Google Scholar] [CrossRef]
  2. Hao, C.; Zhang, X.; Li, Y.; Huang, S.; Xiong, J.; Rupnow, K.; Hwu, W.M.; Chen, D. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th Annual Design Automation Conference 2019, Dac ’19, New York, NY, USA, 2 June 2019. [Google Scholar] [CrossRef]
  3. Talib, M.A.; Majzoub, S.; Nasir, Q.; Jamal, D. A systematic literature review on hardware implementation of artificial intelligence algorithms. J. Supercomput. 2021, 77, 1897–1938. [Google Scholar] [CrossRef]
  4. Nechi, A.; Groth, L.; Mulhem, S.; Merchant, F.; Buchty, R.; Berekovic, M. FPGA-based Deep Learning Inference Accelerators: Where Are We Standing? ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 1–32. [Google Scholar] [CrossRef]
  5. Intel. Virtual RAN (vRAN) with Hardware Acceleration. Available online: https://builders.intel.com/solutionslibrary/virtual-ran-vran-with-hardware-acceleration (accessed on 31 March 2025).
  6. Azariah, W.; Bimo, F.A.; Lin, C.W.; Cheng, R.G.; Nikaein, N.; Jana, R. A Survey on Open Radio Access Networks: Challenges, Research Directions, and Open Source Approaches. Sensors 2024, 24, 1038. [Google Scholar] [CrossRef]
  7. Kundu, L.; Lin, X.; Agostini, E.; Ditya, V.; Martin, T. Hardware Acceleration for Open Radio Access Networks: A Contemporary Overview. IEEE Commun. Mag. 2024, 62, 160–167. [Google Scholar] [CrossRef]
  8. Coussy, P.; Gajski, D.D.; Meredith, M.; Takach, A. An introduction to high-level synthesis. IEEE Des. Test Comput. 2009, 26, 8–17. [Google Scholar] [CrossRef]
  9. Cong, J.; Liu, B.; Neuendorffer, S.; Noguera, J.; Vissers, K.; Zhang, Z. High-level synthesis for fpgas: From prototyping to deployment. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2011, 30, 473–491. [Google Scholar] [CrossRef]
  10. Meeus, W.; Van Beeck, K.; Goedemé, T.; Meel, J.; Stroobandt, D. An overview of today’s high-level synthesis tools. Des. Autom. Embedded Syst. 2012, 16, 31–51. [Google Scholar] [CrossRef]
  11. Crockett, L.H.; Elliot, R.A.; Enderwitz, M.A.; Stewart, R.W. The Zynq Book: Embedded Processing with the Arm Cortex-A9 on the Xilinx Zynq-7000 All Programmable Soc; Strathclyde Academic Media: Glasgow, UK, 2014.
  12. Gort, M.; Anderson, J. Design re-use for compile time reduction in FPGA high-level synthesis flows. In Proceedings of the 2014 International Conference on Field-Programmable Technology (FPT), Shanghai, China, 10–12 December 2014; pp. 4–11. [Google Scholar] [CrossRef]
  13. Koch, D.; Hannig, F.; Ziener, D. (Eds.) FPGAs for Software Programmers; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
  14. Zhu, J.; Sander, I.; Jantsch, A. HetMoC: Heterogeneous Modelling in SystemC. In Proceedings of the 2010 Forum on Specification & Design Languages (FDL 2010), Southampton, UK, 14–16 September 2010; pp. 1–6. [Google Scholar]
  15. Berrazueta, M.D. A Study and Application of High-Level Synthesis for the Design of High-Performance FPGA-Based System-on-Chip with Custom IP-Cores. Bachelor’s Thesis, Universidad de las Fuerzas Armadas ESPE, Sangolquí, Ecuador, 2019. [Google Scholar]
  16. AMD-Xilinx. Vivado Design Suite User Guide: High-Level Synthesis (UG902). Available online: https://docs.amd.com/v/u/2017.4-English/ug902-vivado-high-level-synthesis (accessed on 31 March 2025).
  17. Numan, M.W.; Phillips, B.J.; Puddy, G.S.; Falkner, K. Towards Automatic High-Level Code Deployment on Reconfigurable Platforms: A Survey of High-Level Synthesis Tools and Toolchains. IEEE Access 2020, 8, 174692–174722. [Google Scholar] [CrossRef]
  18. Cong, J.; Lau, J.; Liu, G.; Neuendorffer, S.; Pan, P.; Vissers, K.; Zhang, Z. FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Trans. Reconfigurable Technol. Syst. 2022, 15, 1–42. [Google Scholar] [CrossRef]
  19. Li, J.; Chi, Y.; Cong, J. HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’20), New York, NY, USA, 23–25 February 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 51–57. [Google Scholar]
  20. Ishikawa, A.; Fukushima, N.; Maruoka, A.; Iizuka, T. Halide and GENESIS for Generating Domain-Specific Architecture of Guided Image Filtering. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–5. Available online: https://ieeexplore.ieee.org/document/8702260 (accessed on 31 March 2025).
  21. Hegarty, J.; Daly, R.; DeVito, Z.; Ragan-Kelley, J.; Horowitz, M.; Hanrahan, P. Rigel: Flexible multi-rate image processing hardware. ACM Trans. Graph. 2016, 35, 85:1–85:11. [Google Scholar] [CrossRef]
  22. Sujeeth, A.K.; Lee, H.; Brown, K.J.; Chafi, H.; Wu, M.; Atreya, A.R.; Olukotun, K.; Rompf, T.; Odersky, M. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML’11), Madison, WI, USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 609–616. [Google Scholar]
  23. Lai, Y.-H.; Chi, Y.; Hu, Y.; Wang, J.; Yu, C.H.; Zhou, Y.; Cong, J.; Zhang, Z. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’19), New York, NY, USA, 24–26 February 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 242–251. [Google Scholar] [CrossRef]
  24. Nane, R.; Sima, V.; Pilato, C.; Choi, J.; Fort, B.; Canis, A.; Chen, Y.T.; Hsiao, H.; Brown, S.; Ferrandi, F.; et al. A survey and evaluation of FPGA high-level synthesis tools. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2016, 35, 1591–1604. [Google Scholar] [CrossRef]
  25. Josipović, L.; Guerrieri, A.; Ienne, P. Invited Tutorial: Dynamatic: From C/C++ to Dynamically Scheduled Circuits. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’20), New York, NY, USA, 23–25 February 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–10. [Google Scholar] [CrossRef]
  26. Ye, H.; Jun, H.; Jeong, H.; Neuendorffer, S.; Chen, D. ScaleHLS: A Scalable High-Level Synthesis Framework with Multi-Level Transformations and Optimizations. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC ’22), New York, NY, USA, 5–9 August 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1355–1358. [Google Scholar] [CrossRef]
  27. Lattner, C.; Amini, M.; Bondhugula, U.; Cohen, A.; Davis, A.; Pienaar, J.; Riddle, R.; Shpeisman, T.; Vasilache, N.; Zinenko, O. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Virtual Event, 27 February–3 March 2021; pp. 2–14. [Google Scholar] [CrossRef]
  28. Qin, S.; Berekovic, M. A comparison of high-level design tools for SoC-FPGA on disparity map calculation example. arXiv 2015, arXiv:1509.00036. [Google Scholar]
  29. Navas, B.; Oberg, J.; Sander, I. Towards the Generic Reconfigurable Accelerator: Algorithm Development, Core Design, and Performance Analysis. In Proceedings of the 2013 International Conference on Reconfigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 9–11 December 2013; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/6732334 (accessed on 31 March 2025).
  30. Sozzo, E.D.; Conficconi, D.; Zeni, A.; Salaris, M.; Sciuto, D.; Santambrogio, M.D. Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Comput. Surv. 2022, 55, 106:1–106:48. [Google Scholar] [CrossRef]
  31. Martin, G.; Smith, G. High-level synthesis: Past, present, and future. IEEE Des. Test Comput. 2009, 26, 18–25. [Google Scholar] [CrossRef]
  32. de Fine Licht, J.; Besta, M.; Meierhans, S.; Hoefler, T. Transformations of High-Level Synthesis Codes for High-Performance Computing. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 1014–1029. [Google Scholar] [CrossRef]
  33. Schafer, B.C.; Wang, Z. High-Level Synthesis Design Space Exploration: Past, Present, and Future. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 2628–2639. [Google Scholar] [CrossRef]
  34. Molina, R.S.; Gil-Costa, V.; Crespo, M.L.; Ramponi, G. High-Level Synthesis Hardware Design for FPGA-Based Accelerators: Models, Methodologies, and Frameworks. IEEE Access 2022, 10, 90429–90455. [Google Scholar] [CrossRef]
  35. Cong, J.; Wei, P.; Yu, C.H.; Zhang, P. Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture. In Proceedings of the 55th Annual Design Automation Conference (DAC ’18), San Francisco, CA, USA, 24–28 June 2018; Association for Computing Machinery: New York, NY, USA, 2018; p. 154. [Google Scholar] [CrossRef]
  36. Wang, Z.; Chen, J.; Schafer, B.C. Efficient and Robust High-Level Synthesis Design Space Exploration through Offline Micro-Kernels Pre-Characterization. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 145–150. Available online: https://ieeexplore.ieee.org/document/9116309 (accessed on 31 March 2025).
  37. Liu, S.; Lau, F.C.M.; Schafer, B.C. Accelerating FPGA Prototyping through Predictive Model-Based HLS Design Space Exploration. In Proceedings of the 56th Annual Design Automation Conference (DAC ’19), Las Vegas, NV, USA, 2–6 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; p. 97. [Google Scholar] [CrossRef]
  38. Sun, Q.; Chen, T.; Liu, S.; Miao, J.; Chen, J.; Yu, H.; Yu, B. Correlated Multi-Objective Multi-Fidelity Optimization for HLS Directives Design. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Virtual Event, 1–5 February 2021; pp. 46–51. Available online: https://ieeexplore.ieee.org/document/9474241 (accessed on 31 March 2025).
  39. Choi, Y.; Cong, J. HLS-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds. In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, CA, USA, 5–8 November 2018; IEEE Press: Piscataway, NJ, USA, 2018; pp. 1–8. [Google Scholar] [CrossRef]
  40. Özkan, M.A.; Pérard-Gayot, A.; Membarth, R.; Slusallek, P.; Leißa, R.; Hack, S.; Teich, J.; Hannig, F. AnyHLS: High-level synthesis with partial evaluation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 3202–3214. [Google Scholar] [CrossRef]
  41. Piccolboni, L.; Di Guglielmo, G.; Carloni, L.P. KAIROS: Incremental Verification in High-Level Synthesis through Latency-Insensitive Design. In Proceedings of the 2019 Formal Methods in Computer Aided Design (FMCAD), San Jose, CA, USA, 22–25 October 2019; pp. 105–109. Available online: https://ieeexplore.ieee.org/document/8894295 (accessed on 31 March 2025).
  42. Singh, E.; Lonsing, F.; Chattopadhyay, S.; Strange, M.; Wei, P.; Zhang, X.; Zhou, Y.; Chen, D.; Cong, J.; Raina, P.; et al. A-QED Verification of Hardware Accelerators. In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC), Virtual Event, 20–24 July 2020; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/9218715 (accessed on 31 March 2025).
  43. Chen, J.; Schafer, B.C. Watermarking of Behavioral IPs: A Practical Approach. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1266–1271. Available online: https://ieeexplore.ieee.org/document/9474071/ (accessed on 31 March 2025).
  44. Badier, H.; Pilato, C.; Le Lann, J.-C.; Coussy, P.; Gogniat, G. Opportunistic IP Birthmarking Using Side Effects of Code Transformations on High-Level Synthesis. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 52–55. Available online: https://ieeexplore.ieee.org/document/9474200 (accessed on 31 March 2025).
  45. Mo, L.; Zhou, Q.; Kritikakou, A.; Liu, J. Energy Efficient, Real-Time and Reliable Task Deployment on NoC-Based Multicores with DVFS. In Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), Antwerp, Belgium, 14–23 March 2022; pp. 1347–1352. Available online: https://ieeexplore.ieee.org/document/9774667 (accessed on 31 March 2025).
  46. Mehrabi, A.; Manocha, A.; Lee, B.C.; Sorin, D.J. Bayesian Optimization for Efficient Accelerator Synthesis. ACM Trans. Archit. Code Optim. 2021, 18, 1–25. [Google Scholar] [CrossRef]
  47. Zhao, J.; Feng, L.; Sinha, S.; Zhang, W.; Liang, Y.; He, B. COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, 13–16 November 2017; pp. 430–437. Available online: https://ieeexplore.ieee.org/document/8203809 (accessed on 31 March 2025).
  48. Wu, N.; Xie, Y.; Hao, C. IronMan-Pro: Multiobjective Design Space Exploration in HLS via Reinforcement Learning and Graph Neural Network-Based Modeling. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 900–913. [Google Scholar] [CrossRef]
  49. Gautier, Q.; Althoff, A.; Crutchfield, C.L.; Kastner, R. Sherlock: A Multi-Objective Design Space Exploration Framework. ACM Trans. Des. Autom. Electron. Syst. 2022, 27, 1–20. [Google Scholar] [CrossRef]
  50. Zou, Z.; Tang, C.; Gong, L.; Wang, C.; Zhou, X. FlexWalker: An Efficient Multi-Objective Design Space Exploration Framework for HLS Design. In Proceedings of the 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 2–6 September 2024; pp. 126–132. Available online: https://ieeexplore.ieee.org/document/10705443 (accessed on 31 March 2025).
  51. Rashid, M.I.; Schaefer, B.C. VeriPy: A Python-Powered Framework for Parsing Verilog HDL and High-Level Behavioral Analysis of Hardware. In Proceedings of the 2024 IEEE 17th Dallas Circuits and Systems Conference (DCAS), Dallas, TX, USA, 25–26 April 2024; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/10539889 (accessed on 31 March 2025).
  52. Abi-Karam, S.; Sarkar, R.; Seigler, A.; Lowe, S.; Wei, Z.; Chen, H.; Rao, N.; John, L.; Arora, A.; Hao, C. HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond. In Proceedings of the 2024 ACM/IEEE 6th Symposium on Machine Learning for CAD (MLCAD), Austin, TX, USA, 18–19 September 2024; pp. 1–9. Available online: https://ieeexplore.ieee.org/document/10740213 (accessed on 31 March 2025).
  53. Ferikoglou, A.; Kakolyris, A.; Masouros, D.; Soudris, D.; Xydis, S. CollectiveHLS: A Collaborative Approach to High-Level Synthesis Design Optimization. ACM Trans. Reconfigurable Technol. Syst. 2024, 18, 11:1–11:32. [Google Scholar] [CrossRef]
  54. Lahti, S.; Sjövall, P.; Vanne, J.; Hämäläinen, T.D. Are we there yet? A study on the state of high-level synthesis. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 898–911. [Google Scholar] [CrossRef]
  55. Fort, B.; Canis, A.; Choi, J.; Calagar, N.; Lian, R.; Hadjis, S.; Chen, Y.T.; Hall, M.; Syrowik, B.; Czajkowski, T.; et al. Automating the Design of Processor/Accelerator Embedded Systems with LegUp High-Level Synthesis. In Proceedings of the 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing, Milan, Italy, 26–28 August 2014; pp. 120–129. Available online: https://ieeexplore.ieee.org/document/6962276 (accessed on 31 March 2025).
  56. Nane, R.; Sima, V.M.; Pham Quoc, C.; Goncalves, F.; Bertels, K. High-Level Synthesis in the Delft Workbench Hardware/Software Co-Design Tool-Chain. In Proceedings of the 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing, Milan, Italy, 26–28 August 2014; pp. 138–145. Available online: https://ieeexplore.ieee.org/document/6962278 (accessed on 31 March 2025).
  57. AMD-Xilinx. SDSoC Development Environment Design Hub. Available online: https://docs.amd.com/v/u/en-US/dh0057-sdsoc-hub (accessed on 31 March 2025).
  58. AMD-Xilinx. SDAccel Development Environment Design Hub. Available online: https://docs.amd.com/v/u/en-US/dh0058-sdaccel-hub (accessed on 31 March 2025).
  59. AMD-Xilinx. AMD Vitis™ Unified Software Platform. Available online: https://www.amd.com/en/products/software/adaptive-socs-and-fpgas/vitis.html (accessed on 31 March 2025).
  60. AMD-Xilinx. Vitis High-Level Synthesis User Guide (UG1399). Available online: https://docs.amd.com/r/2023.1-English/ug1399-vitis-hls (accessed on 31 March 2025).
  61. AMD-Xilinx. AMD Vitis™ HLS. Available online: https://www.amd.com/en/products/software/adaptive-socs-and-fpgas/vitis/vitis-hls.html (accessed on 31 March 2025).
  62. AMD-Xilinx. Xilinx/Vitis-HLS-Introductory-Examples (GitHub). Available online: https://github.com/Xilinx/Vitis-HLS-Introductory-Examples (accessed on 31 March 2025).
  63. AMD-Xilinx. Xilinx/Vitis_Libraries (GitHub). Available online: https://github.com/Xilinx/Vitis_Libraries (accessed on 31 March 2025).
  64. Vitali, E.; Gadioli, D.; Ferrandi, F.; Palermo, G. Parametric Throughput Oriented Large Integer Multipliers for High Level Synthesis. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 38–41. Available online: https://ieeexplore.ieee.org/document/9473908 (accessed on 31 March 2025).
  65. Wu, N.; Yang, H.; Xie, Y.; Li, P.; Hao, C. High-Level Synthesis Performance Prediction Using GNNs: Benchmarking, Modeling, and Advancing. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 10–14 July 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 49–54. [Google Scholar]
  66. Hara, Y.; Tomiyama, H.; Honda, S.; Takada, H. Proposal and quantitative analysis of the chstone benchmark program suite for practical c-based high-level synthesis. J. Inf. Process. 2009, 17, 242–254. [Google Scholar] [CrossRef]
  67. Ndu, G.; Navaridas, J.; Luján, M. CHO: Towards a Benchmark Suite for OpenCL FPGA Accelerators. In Proceedings of the 3rd International Workshop on OpenCL (IWOCL ’15), Oxford, UK, 12–13 May 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1–10. [Google Scholar] [CrossRef]
  68. Reagen, B.; Adolf, R.; Shao, Y.S.; Wei, G.Y.; Brooks, D. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC), Raleigh, NC, USA, 26–28 October 2014; pp. 110–119. Available online: https://ieeexplore.ieee.org/document/6983050 (accessed on 31 March 2025).
  69. Zhou, Y.; Gupta, U.; Dai, S.; Zhao, R.; Srivastava, N.; Jin, H.; Featherston, J.; Lai, Y.-H.; Liu, G.; Velasquez, G.A.; et al. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’18), Monterey, CA, USA, 25–27 February 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 269–278. [Google Scholar]
  70. AMD-Xilinx. Vivado Design Suite: AXI Reference Guide (UG1037). Available online: https://docs.amd.com/api/khub/documents/1N~rJgeEMU28Fyv7kWcW6Q/content?Ft-Calling-App=ft%2Fturnkey-portal&Ft-Calling-App-Version=5.0.70# (accessed on 31 March 2025).
  71. AMD-Xilinx. Generating Basic Software Platforms: Reference Guide (UG1138). Available online: https://docs.amd.com/v/u/en-US/ug1138-generating-basic-software-platforms (accessed on 31 March 2025).
  72. AMD-Xilinx. OS and Libraries Document Collection (UG643). Available online: https://docs.amd.com/r/2020.2-English/oslib_rm (accessed on 31 March 2025).
  73. LeCun, Y.; Cortes, C. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 31 March 2025).
  74. Zhang, X.; Ramachandran, A.; Zhuge, C.; He, D.; Zuo, W.; Cheng, Z.; Rupnow, K.; Chen, D. Machine Learning on FPGAs to Face the IoT Revolution. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, 13–16 November 2017; pp. 819–826. Available online: https://ieeexplore.ieee.org/document/8203862 (accessed on 31 March 2025).
  75. Neshatpour, K.; Makrani, H.M.; Sasan, A.; Ghasemzadeh, H.; Rafatirad, S.; Homayoun, H. Design Space Exploration for Hardware Acceleration of Machine Learning Applications in MapReduce. In Proceedings of the 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Boulder, CO, USA, 29 April–2 May 2018; p. 221. Available online: https://ieeexplore.ieee.org/document/8457670 (accessed on 31 March 2025).
  76. Leipnitz, M.T.; Nazar, G.L. High-Level Synthesis of Resource-Oriented Approximate Designs for FPGAs. In Proceedings of the 56th Annual Design Automation Conference (DAC ’19), Las Vegas, NV, USA, 2–6 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–6. [Google Scholar] [CrossRef]
  77. AMD-Xilinx. SmartConnect v1.0 LogiCORE IP Product Guide (PG247). Available online: https://docs.amd.com/r/en-US/pg247-smartconnect/SmartConnect-v1.0-LogiCORE-IP-Product-Guide (accessed on 31 March 2025).
  78. Nayak, R.J.; Chavda, J.B. Proficient Design Space Exploration of ZYNQ SoC Using VIVADO Design Suite: Custom Design of High Performance AXI Interface for High Speed Data Transfer between PL and DDR Memory Using Hardware-Software Co-Design. Int. J. Appl. Eng. Res. 2018, 13, 8991–8997. [Google Scholar]
  79. Sklyarov, V.; Skliarova, I. Exploration of High-Performance Ports in Zynq-7000 Devices with Different Traffic Conditions. In Proceedings of the 2017 5th International Conference on Electrical Engineering—Boumerdes (ICEE-B), Boumerdes, Algeria, 29–31 October 2017; pp. 1–4. Available online: https://ieeexplore.ieee.org/document/8192204 (accessed on 31 March 2025).
  80. AMD-Xilinx. Zynq 7000 SoC Technical Reference Manual (UG585). Available online: https://docs.amd.com/r/en-US/ug585-zynq-7000-SoC-TRM (accessed on 31 March 2025).
  81. AMD-Xilinx. Fast Fourier Transform LogiCORE IP Product Guide (PG109). Available online: https://docs.amd.com/r/en-US/pg109-xfft/Introduction (accessed on 31 March 2025).
Figure 1. Design flow for developing Zynq SoCs with AHA IP-cores. This flow comprises four principal processes (A–D). The conventional manual design flow is depicted on the left, whereas the AHA semi-automated flow, employing Tcl scripts, is presented on the right. This methodology can be applied to Vivado HLS or Vitis HLS with minor adjustments.
Figure 2. Implementation of a C function as an AHA IP core. The inputs and outputs of the IP core (in1, in2, out) correspond to the arguments of the main function fx, and are consolidated into a single AXI4 slave port, which facilitates its interconnection with a host CPU core.
Figure 4. A simplified block diagram illustrating the fundamental heterogeneous SoC architecture used to evaluate the performance of the Zynq-7000 SoCs.
Figure 5. Use of semi-automated design flow: (a) execution of the Tcl script and options in Vivado HLS Tcl interface; (b) illustration of menu options to select IP-core, board (device), and PL clock frequency; and (c) summary of semi-automated flow.
Figure 6. AHA IP core Suite. Basic principle of the algorithm implemented by each IP core.
Figure 7. Experimental configurations for performance evaluation. Experiment 1: Benchmarks implemented as standalone IP cores generated using Vivado HLS (default and optimized). Experiment 2: Benchmarks implemented as functions in the ARM processor or as AXI IP-cores (default and optimized) integrated into a heterogeneous Zynq SoC architecture. The varying dimensions of the IP cores illustrate that optimized implementations enhance performance while simultaneously increasing area and resource utilization.
Figure 7. Experimental configurations for performance evaluation. Experiment 1: Benchmarks implemented as standalone IP cores generated using Vivado HLS (default and optimized). Experiment 2: Benchmarks implemented as functions in the ARM processor or as AXI IP-cores (default and optimized) integrated into a heterogeneous Zynq SoC architecture. Varying dimensions of the IP core illustrates that optimized implementations enhance performance while simultaneously increasing area and resource utilization.
Computers 14 00189 g007
Figure 8. Experiment 1—FPGA resource utilization of the default (DEF) and optimized (OPT) standalone IP cores generated in Vivado HLS. The figure uses a logarithmic scale in which each bar is represented as log10(x), where x is the count of each type of FPGA resource (i.e., BRAM, DSP48, FF, LUT). * Several techniques were applied to improve the performance of the default implementation of Vivado HLS; however, the results of these new solutions did not vary from the original solution.
Figure 9. Experiment 2—Speedup (S) of the Zynq SoC architectures A_d and A_o, which integrate the default and optimized IP cores, respectively, relative to the baseline architecture A_b. N/A*: multiple optimization techniques were applied to enhance the performance of the default implementation of Vivado HLS; however, the outcomes of these new solutions do not deviate significantly from the original solution. N/A†: these designs require more FPGA resources than are available in Zynq Z-7010 devices; consequently, they were not implemented in this architecture.
Figure 10. FPGA floorplanning of the SoC architectures A_o20 on a Zynq Z-7020 chip, with the exception of FFT, which corresponds to the architecture A_d20. The unmarked area represents the resources utilized for implementing the algorithm accelerator IP cores.
Figure 11. The recommendation diagram for algorithm selection serves as a tool for identifying which algorithm is suitable for acceleration with IP cores utilizing Vivado HLS. The polygon delineates typical algorithm characteristics, and a candidate algorithm should occupy the colored regions based on its most representative attributes. Highly recommended areas enhance accelerator performance, whereas poorly recommended or non-colored regions should be avoided. The color distribution further indicates that byte operations and rounds are more favorable than float or int operations.
Table 1. Overview of significant high-level synthesis tools and main features.

| Tool | Owner | Release Year * | Vendor-Specific Target | Other Input Languages |
|---|---|---|---|---|
| Catapult HLS | Mentor Graphics (Wilsonville, OR, USA) | 2004 | | |
| Bluespec | Bluespec, Inc. (Framingham, MA, USA) | 2004 | | BSV |
| Synphony HLS | Synopsys, Inc. (Sunnyvale, CA, USA) | 2009 | | |
| ROCCC | UC Riverside | 2010 | | |
| LegUp | University of Toronto/Microchip Technology | 2011 | Microchip (Chandler, AZ, USA) | |
| PandA-Bambu | Polytechnic University of Milan | 2012 | | |
| Vivado HLS | AMD-Xilinx (San Jose, CA, USA) | 2012 | AMD-Xilinx (San Jose, CA, USA) | |
| HDL Coder | The MathWorks, Inc. (Natick, MA, USA) | 2012 | Xilinx (San Jose, CA, USA), Altera (San Jose, CA, USA), and Microsemi (Aliso Viejo, CA, USA) | Matlab, Simulink |
| Stratus HLS | Cadence Design Systems (San Jose, CA, USA) | 2014 | | |
| Intel HLS Compiler | Intel (Altera) Corporation (San Jose, CA, USA) | 2017 | Intel (Santa Clara, CA, USA) | OpenCL † |
| Kiwi | University of Cambridge | 2017 | Xilinx (San Jose, CA, USA) and Altera (San Jose, CA, USA) | C# (.NET bytecode) |
| Vitis HLS | AMD-Xilinx (San Jose, CA, USA) | 2019 | AMD-Xilinx (San Jose, CA, USA) | OpenCL ‡ |

* The table lists tools in order of initial release year. † Using Intel FPGA SDK for OpenCL. ‡ Using Vitis Unified Software Platform.
Table 2. Summary of optimization techniques in final AHA IP Cores.

| AHA IP-Core | PIPELINE | UNROLL | Array Opt. | INLINE | Perf. Opt. | Area Opt. | Challenges | Cycle Reduction | Final II 1 | Resources | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Matrix Multiplication | Yes | Yes | PARTITION (Full) | No | High | Medium | Trade-off between initiation interval and resources | High | 1 | Mod. | Full loop pipelining and matrix partitioning enabled maximum throughput |
| FFT | Yes | No | PARTITION (Partial) 2 | No | Medium | Low | Unrolling limited by variable loop bounds | Moderate | N/A | Low | Performance constrained by loop structure |
| AES | Yes | Yes | PARTITION (S-box) 3 | Yes | High | Medium | Controlled reuse of sub-functions | High | 1 | Mod. | Round-level pipelining and S-box parallel access improved throughput |
| Backpropagation | Limited 4 | Failed 5 | RESHAPE | No | Low | Medium | Synthesis failed with full unroll due to complexity | Low | N/A | Low | Simplified structure with partial reshape allowed synthesis |
| ANN | Yes | Yes | PARTITION (Weights) 6 | No | High | Medium | Layered structure enabled pipeline across layers | High | 1 | Mod. | Efficient layer-wise acceleration through pipelining and unroll |

1 Final achieved initiation interval (II) in cycles; N/A = not reached due to synthesis issues. 2 Partial partitioning of butterfly-stage data arrays. 3 Partitioning applied to substitution box (S-box) arrays to enable concurrent access. 4 Limited pipelining due to internal data dependencies. 5 Full unrolling caused synthesis failure in Vivado HLS. 6 Partitioning applied to synaptic weight matrices to speed up feedforward propagation.
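As a concrete illustration of how the directives in Table 2 are typically combined, the sketch below applies PIPELINE, UNROLL, and ARRAY_PARTITION to a small matrix multiplication using standard Vivado HLS pragma syntax. The loop structure, names, and dimensions are illustrative and are not taken from the actual AHA sources.

```c
/* Illustrative use of the Table 2 directives on a small matrix multiplication. */
#define DIM 2

void mat_mult(int a[DIM][DIM], int b[DIM][DIM], int c[DIM][DIM]) {
#pragma HLS ARRAY_PARTITION variable=a complete dim=0  /* full partition -> parallel reads */
#pragma HLS ARRAY_PARTITION variable=b complete dim=0
  row: for (int i = 0; i < DIM; i++) {
    col: for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE II=1            /* target initiation interval of 1 (Table 2) */
      int acc = 0;
      prod: for (int k = 0; k < DIM; k++) {
#pragma HLS UNROLL                   /* unroll the dot-product loop */
        acc += a[i][k] * b[k][j];
      }
      c[i][j] = acc;
    }
  }
}
```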
Table 3. AHA IP core suite. A brief description of each algorithm is presented along with a summary of the main characteristics of the algorithm and its C source code. Hardware characteristics are represented on a four-level scale, where a greater number of black circles corresponds to higher intensity.

| AHA IP-Core | Source | Description | Data Type | Lines | Functions | Variables | Statements | Resource Utilization | Data Dependencies | Multiple Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| Matrix multiplication | Custom coded | Multiplication of matrices of order 2 × 2 | 2D array of 32-bit integers | 23 | 1 | 3 array, 3 scalar | 3 for | ●○○○ | ●○○○ | ○○○○ |
| Fast Fourier Transform (FFT) | MachSuite | 1024-point FFT by the Cooley–Tukey method | Array of 64-bit doubles | 35 | 1 | 4 array, 7 scalar | 2 for | ●●○○ | ●●●● | ○○○○ |
| Advanced Encryption Standard (AES) | CHStone | Encryption and decryption of a 128-bit vector with a 128-bit key | Array of 32-bit integers | 716 | 11 | 11 array, 345 scalar | 24 for, 26 if, 10 switch, 37 goto | ●●●○ | ●●○○ | ●●●○ |
| Back-Propagation Neural Network (BPNN) | MachSuite | Training a two-layer artificial neural network with ten neurons each | Array of 64-bit doubles | 314 | 14 | 7 array, 21 scalar | 94 for, 11 if | ●●●● | ●●○○ | ●●●○ |
| Artificial Neural Network (ANN) | Custom coded | Classification of handwritten numbers in images of 20 × 20 pixels | Array of 64-bit doubles | 35 | 1 | 10 array, 4 scalar | 4 for | ●●●○ | ●●○○ | ●○○○ |
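For orientation, a single fully connected layer of the kind used by the ANN benchmark (20 × 20 pixel input, one output per digit class) might look like the following sketch. The layer sizes, names, and sigmoid activation are assumptions for illustration and are not the AHA implementation.

```c
/* Hedged sketch of one fully connected ANN layer; sizes and names are illustrative. */
#include <math.h>

#define N_IN  400   /* 20 x 20 pixels (Table 3)            */
#define N_OUT 10    /* assumed one output per digit class  */

void ann_layer(const double in[N_IN], const double w[N_OUT][N_IN],
               const double bias[N_OUT], double out[N_OUT]) {
  for (int o = 0; o < N_OUT; o++) {
    double acc = bias[o];
    for (int i = 0; i < N_IN; i++) {
      acc += w[o][i] * in[i];          /* multiply-accumulate dominates the runtime */
    }
    out[o] = 1.0 / (1.0 + exp(-acc));  /* sigmoid activation, assumed */
  }
}
```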
Table 4. Experiment 2—Summary of configurations of architectures A_b, A_d, and A_o.

| Configuration | SoC Architecture | Development Board | Zynq Device | CPU | CPU Freq (MHz) | AHA IP-Core | OS | App. Code | Compiler |
|---|---|---|---|---|---|---|---|---|---|
| A_base (A_b) | A_b10 | Digilent Zybo | Z-7010 | ARM Cortex-A9 * | 650 | None | BSP | C | GCC GNU 4.4 |
| A_default (A_d) | A_d10 | Digilent Zybo | Z-7010 | ARM Cortex-A9 * | 650 | Default | BSP | C | GCC GNU 4.4 |
| A_default (A_d) | A_d20 | Xilinx ZC702 Eval. Kit | Z-7020 | ARM Cortex-A9 * | 667 | Default | BSP | C | GCC GNU 4.4 |
| A_optimized (A_o) | A_o10 | Digilent Zybo | Z-7010 | ARM Cortex-A9 * | 650 | Optimized | BSP | C | GCC GNU 4.4 |
| A_optimized (A_o) | A_o20 | Xilinx ZC702 Eval. Kit | Z-7020 | ARM Cortex-A9 * | 667 | Optimized | BSP | C | GCC GNU 4.4 |

The A_base (A_b) architecture served as the base or reference system for all performance comparisons in Experiment 2. Architecture A_default (A_d) implements the default IP cores generated by Vivado HLS, whereas architecture A_optimized (A_o) implements IP cores optimized for execution performance. Each architecture was implemented on two Zynq chips, Z-7010 and Z-7020. * Only one of the two available CPU cores was utilized, and the cache memory was disabled for all designs.
Table 5. Experiment 1—Performance evaluation of AHA IP cores generated in Vivado HLS.

| Benchmark | Default (DEF) Cycles † | Default (DEF) Freq | Optimized (OPT) Cycles † | Optimized (OPT) Freq | Speedup |
|---|---|---|---|---|---|
| Matrix Mult. | 85 | 125 | 7 | 125 | 12.14× |
| FFT | 122,901 | 100 | N/A * | N/A * | N/A * |
| AES | 4118 | 100 | 928 | 100 | 4.44× |
| BPNN | 464,387,001 | 100 | 82,890,001 | 100 | 5.60× |
| ANN | 92,193 | 100 | 32,298 | 100 | 2.85× |

Freq = frequency in MHz. * Multiple techniques were implemented to enhance the performance of the default (DEF) implementation of Vivado HLS. However, the newly optimized (OPT) implementation did not yield substantial improvements. † Latency cycles.
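For reference, these speedups are consistent with the ratio of execution times derived from latency cycles and clock frequency, S = (C_DEF / f_DEF) / (C_OPT / f_OPT). For AES, for example, S = (4118 / 100 MHz) / (928 / 100 MHz) = 4118 / 928 ≈ 4.44×, matching the tabulated value.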
Table 6. Experiment 1—Summary of FPGA resources used by AHA IP cores generated in Vivado HLS.

| Benchmark | DEF BRAM | DEF DSP48 | DEF FF | DEF LUT | OPT BRAM | OPT DSP48 | OPT FF | OPT LUT |
|---|---|---|---|---|---|---|---|---|
| Matrix Mult. | 6 | 4 | 434 | 350 | 0 | 32 | 1012 | 936 |
| FFT | 16 | 56 | 4544 | 7783 | N/A * | N/A * | N/A * | N/A * |
| AES | 16 | 8 | 4358 | 9128 | 32 | 0 | 35,165 | 53,379 |
| BPNN | 19 | 148 | 28,561 | 40,262 | 26 | 168 | 61,546 | 53,345 |
| ANN | 19 | 34 | 7083 | 10,296 | 18 | 132 | 16,410 | 23,705 |

BRAM = block RAMs of 18 Kb. The resources shown correspond to the default (DEF) and optimized (OPT) implementations in Vivado HLS. These are the general resources obtained from the synthesis reports of Vivado HLS for Zynq Z-7020 (xc7z020clg484-1) and Zynq Z-7010 (xc7z010clg400-1) devices. * Several techniques were applied to improve the performance of the default (DEF) implementation of Vivado HLS. However, the newly optimized (OPT) implementations did not produce significant improvements.
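Assuming the published Zynq Z-7020 device capacities (53,200 LUTs, 106,400 flip-flops, 220 DSP48 slices, and 280 BRAMs of 18 Kb), these synthesis estimates translate directly into device utilization. For example, the optimized AES core uses roughly 53,379 / 53,200 ≈ 100% of the LUTs and 35,165 / 106,400 ≈ 33% of the flip-flops, which is consistent with it fitting only the larger Z-7020 in Experiment 2 (Table 7).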
Table 7. Experiment 2—Performance evaluation based on execution time and speed-up factors of different Zynq SoC architectures that integrate the benchmarking AHA IP cores with the processor system (ARM processors) using AXI interfaces.

| Benchmark | A_b10 t (µs) | A_d10 t (µs) | A_d10 S | A_d20 t (µs) | A_d20 S | A_o10 t (µs) | A_o10 S | A_o20 t (µs) | A_o20 S |
|---|---|---|---|---|---|---|---|---|---|
| Matrix Mult. | 1.52 × 10^1 | 2.80 × 10^0 | 5.43× | 2.63 × 10^0 | 5.78× | 1.97 × 10^0 | 7.72× | 1.95 × 10^0 | 7.79× |
| FFT | 2.04 × 10^4 | 1.23 × 10^3 | 16.57× | 1.23 × 10^3 | 16.57× | N/A * | N/A * | N/A * | N/A * |
| AES | 2.14 × 10^3 | 4.46 × 10^1 | 47.98× | 4.35 × 10^1 | 49.18× | N/A † | N/A † | 1.26 × 10^1 | 169.84× |
| BPNN | 2.00 × 10^7 | N/A † | N/A † | 4.64 × 10^6 | 4.31× | N/A † | N/A † | 8.29 × 10^5 | 24.13× |
| ANN | 8.45 × 10^3 | 9.24 × 10^2 | 9.14× | 9.24 × 10^2 | 9.14× | N/A † | N/A † | 3.26 × 10^2 | 25.92× |

t = time (µs); S = speedup. * Multiple techniques were implemented to enhance the performance of the default implementation of Vivado HLS. Nevertheless, the outcomes of these new solutions did not demonstrate significant variation from those of the original solution. † These designs require the use of more FPGA resources than those available in the Digilent Zybo (Zynq Z-7010); they were not implemented in this architecture.
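Each speedup in Table 7 is the ratio of the baseline execution time to the accelerated execution time, S = t(A_b10) / t(A_x). For example, for AES on A_o20: S = (2.14 × 10^3 µs) / (1.26 × 10^1 µs) ≈ 169.8×, matching the tabulated 169.84×.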
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
