Extending a Moldable Computer Architecture to Accelerate DL Inference on FPGA

Mirko Mariotti; Giulio Bianchini; Igor Neri; Daniele Spiga; Diego Ciangottini; Loriano Storchi

doi:10.3390/electronics14173518

,

and

¹

Dipartimento di Fisica e Geologia, Università Degli Studi di Perugia, Via Pascoli, 06123 Perugia, Italy

²

INFN Sezione di Perugia, Via Pascoli, 06123 Perugia, Italy

³

Dipartimento di Farmacia, Università Degli Studi G. D’Annunzio, Via dei Vestini, 66100 Chieti, Italy

^*

Authors to whom correspondence should be addressed.

Electronics2025, 14(17), 3518;https://doi.org/10.3390/electronics14173518

This article belongs to the Special Issue Advancements in Hardware-Efficient Machine Learning

Version Notes

Order Reprints

Abstract

Over Over the past years, the field of Machine Learning (ML) and Deep Learning (DL) has seen strong developments both in terms of software and hardware, with the increase of specialized devices. One of the biggest challenges in this field is the inference phase, where the trained model makes predictions of unseen data. Although computationally powerful, traditional computing architectures face limitations in efficiently managing requests, especially from an energy point of view. For this reason, the need arose to find alternative hardware solutions, and among these, there are Field Programmable Gate Arrays (FPGAs): their key feature of being reconfigurable, combined with parallel processing capability, low latency and low power consumption, makes those devices uniquely suited to accelerating inference tasks. In this paper, we present a novel approach to accelerate the inference phase of a multi-layer perceptron (MLP) using BondMachine framework, an OpenSource framework for the design of hardware accelerators for FPGAs. Analysis of the latency, energy consumption, and resource usage, as well as comparisons with respect to standard architectures and other FPGA approaches, is presented, highlighting the strengths and critical points of the proposed solution. The present work represents an exploratory study to validate the proposed methodology on MLP architectures, establishing a crucial foundation for future work on scalability and the acceleration of more complex neural network models.

Keywords:

BondMachine; reconfigurable computing; low latency DL inference on FPGA; low-power DL inference on FPGA; hardware-software co-design; DL accelerator

1. Introduction

Machine learning (ML) algorithms [1], particularly Deep Neural Networks (DNNs), have witnessed remarkable achievements across diverse domains, ranging from computer vision and natural language processing to autonomous systems and recommendation systems [2]. Today, ML techniques, especially DNNs, are widely used in many scientific fields [3,4,5,6]. However, these advances often come at the cost of increased computational complexity, memory requirements, and power consumption. Moreover, the demand for ML in edge applications, where computational resources are typically limited, has further increased these requirements [7]. To address these challenges, researchers have been exploring techniques to optimize and streamline ML models. The main target of these techniques is the choice of data representation. On one hand, it has to be chosen considering the available hardware optimization on the edge device. On the other hand, it has to be sufficiently precise not to compromise the accuracy of the model and as small as possible to reduce memory and computation requirements [8]. Indeed, it is a common technique to decrease the numerical precision of computations within ML tasks with respect to the standard 32-bit floating-point representation. By reducing the number of bits used to represent numerical values, it is possible to achieve more efficient storage, faster computation, and potentially lower energy consumption [9,10,11]. However, this approach is not without its drawbacks, and a careful analysis of the pros and cons is crucial to understand the implications of decreased numerical precision.

All the above-mentioned aspects of hardware optimization can be effectively addressed using FPGA devices. In recent years, many studies have explored the use of FPGAs not only for accelerating specific computational workloads but also as central components in heterogeneous computing systems. These works highlight how FPGAs enable fine-grained control over resource allocation, energy efficiency, and deterministic latency, making them well-suited for edge AI, data center inference, and high-throughput scientific computing. Recent advances include frameworks for dynamic partial reconfiguration, integration with Central Processing Unit (CPU)—Graphics Processing Unit (GPU) architectures, and high-level programming environments to simplify deployment across heterogeneous platforms [12,13,14,15].

In the present work, we aim to use the BondMachine (BM) framework [16,17] (BondMachine Project, [https://www.bondmachine.it], accessed on 28 August 2025) to implement ML models on FPGA devices [18,19,20,21,22,23] and to analyze the effects of reduced numerical precision on the performance of the models, on the accuracy of the results, on the resource utilization, and on the power consumption.

This study is part of a larger research project aimed at developing a new generation of heterogeneous computing systems for different applications, including edge AI. Our approach is to create a multi-core and heterogeneous ‘register machine’ abstraction that can be used as an intermediate layer between the software applications and the FPGA hardware. From the software point of view, the BM offers a common interface that hides the specific FPGA hardware details, allowing a vendor-independent development of the applications. From the FPGA point of view, the BM inherits the flexibility and parallelism of the FPGA devices. A similar approach has been used in software development with the introduction of the Low Level Virtual Machine (LLVM) compiler infrastructure [24] (LLVM Project, [https://llvm.org/], accessed on 28 August 2025). The LLVM VM is used as an intermediate layer between the high-level language and the architecture-specific machine code. Most optimizations are performed at the LLVM level, and the final machine code is generated by the LLVM compiler. The BM is designed to be used in the same way, but for FPGA devices. Most optimizations are performed at the BM level, and the final FPGA configuration is generated by the BM tools.

Our focus is to provide a comprehensive evaluation of the benefits and limitations associated with this approach, shedding light on its impact on model performance, resource utilization, and overall efficiency. The increasing demand for edge AI applications, combined with the need for efficient and low-power computing systems, makes this research particularly relevant and timely, as it provides a new perspective on the design and implementation of AI accelerators for edge computing.

It is important to underline that while ongoing efforts are dedicated to expanding our framework’s compatibility with a broader spectrum of Artificial Neural Networks (ANN) architectures, the present work concentrates on multi-layer perceptrons (MLPs). MLPs retain widespread application across various domains, including High Energy Physics (HEP), recommendation systems, and tabular data modeling, and serve as integral components within more extensive architectures (e.g., transformers). Consequently, we contend that our current evaluation, grounded in MLPs, furnishes a significant and pertinent demonstration of the BM capabilities, simultaneously establishing a foundation for expanded support in subsequent developments.

The present paper is organized as follows. In Section 2, we will first recall the basic components of the BM architecture, together with the basic tools of the BM software framework. Then, we describe the improvements we made to the BM ecosystem to allow the use of DNN models. In the subsequent section that is Section 3, we will describe how a BM can be used to implement a DNN model on an FPGA. Specifically, we will give here all the details related to the mapping of a DNN on a BM, using a simple test model. In this section, we will report all the tests performed to deeply understand the impact of reduced precision algebra both in terms of computational deficiency and in terms of the numerical precision. Finally, in Section 4, we present the results obtained using the Large Hadron Collider (LHC) jet-tagging datasets on an Xilinx/AMD Alveo FPGA (Advanced Micro Devices, Inc., Santa Clara, CA, USA) [25]. We then draw some general conclusions based on the observed performance metrics.

2. Overview of the BondMachine Architecture

The BondMachine (BM) project’s fundamental goal is to integrate the creation of domain-specific hardware with the development of its corresponding software stack. This is achieved by utilizing FPGA technology, which enables the implementation of independent processing units on a single low-power board. Furthermore, it allows for the design of their interconnections “in silicon” to ensure that the hardware is maximally suited to the design’s specific needs. In addition, as we will show, the BM’s architecture fits particularly well with the computational structures of the DNNs and tensor processing models. In the following, we will firstly recall (Section 2.1 and Section 2.2) all the basic logical components of the BM architecture as well as the basic tools of the BM software framework already detailed in our previous work [16]. We will instead describe all the improvements we made to the BM ecosystem to allow the implementation of DNN models in Section 3.

2.1. Architecture Specification

As already stated, the main goal of the BM is to start from a high-level specification of a generic computational task and to produce both the hardware, specified as an Hardware Description Language (HDL) to be used for the FPGA, and the software that will run on the hardware. The basic logical elements of each BM architecture are the Connecting Processor (CP) and the Shared Objects (SOs).

The CPs are the computational cores of the BM. Each CP implements all the needed instructions to solve the given computational task that has been assigned. Thus, it should be clear as that the heterogeneity of the BM is pushed down to the level of the computational core. Indeed, each CP can be highly specialized to execute only those operations strictly needed that is, a CP is as simple as possible, and it is specialized to perform a task and eventually specially optimized for it.

Clearly, to solve complex tasks, many CPs must collaborate and share information and resources. This is exactly where the SOs start to play a fundamental role in increasing the processing capability and functionality of the BM. Currently, the project provides several types of Shared Objects (SOs) to manage communication and resource sharing between processors. These available SOs include: Channels, Shared Memories, Barriers, a pseudo-random number generator, Shared queues, and stacks. A comprehensive description of each of these components can be found in our previous work [26].

2.2. Architecture Handling

A dedicated software toolchain was developed to manage the creation and configuration of a BondMachine (BM) architecture. This toolchain is structured with distinct levels of abstraction, where at the base level, the procbuilder tool controls the creation and management of each individual Connecting Processor (CP). Alongside it, the bondmachine builder is used to configure the interconnections between all CPs and Shared Objects (SOs) that make up the final BM. Layered on top of this framework is the bondgo arch compiler. This powerful tool takes source code written in the Go language [27] and performs a co-design process, generating both the specific, optimal BM hardware configuration and the application software that will run on it. This allows a user to produce a complete, tailored solution—both the machine and the application—from Go source code.

3. Strategy and Ecosystem Improvements to Accelerate DL Inference Tasks

Our strategy to accelerate DL inference tasks within the BM ecosystem is based on a dynamic and adaptable architecture design. Each CP executes an Instruction Set Architecture (ISA) that combines statically defined operations with dynamic instructions generated at run time, such as those for Floating-Point (FloPoCo) opcodes and Linear Quantization. This approach, along with a fragment-based assembly methodology via the BM Assembler (BASM), enables the creation of specialized computing units tailored to diverse neural network-related operations.

Additionally, the BM ecosystem offers a unified handling of heterogeneous numerical representations through the bmnumbers package and provides tools such as Neuralbond, TensorFlow Translator, and Neural Network Exchange Format (NNEF) composer to translate DNN models into BM architectures. Support for multiple FPGA platforms (e.g., Digilent Zedboard, Xilinx/AMD Alveo) and a high-level Python library (pyBondMachine [28], version 1.1.70, https://pypi.org/project/pybondmachine/, (accessed on 28 August 2025)) further simplify interaction, prototyping, and deployment of FPGA-based accelerators.

3.1. A Flexible ISA: Combining Static and Dynamic Instructions

Each CP of a BM is able to execute some instructions, forming a specific ISA. The ISAs, one for each CP, are composed by choosing from a set of available instructions. For example, to perform a sum between two numbers, the CP must be able to execute the add instruction. Some instructions act on specific data types; for example, the add instruction can be used to sum two integers, addf to sum two floating-point numbers, and so on.

Apart from the basic instructions, which are statically defined within the project, each CP also executes dynamic instructions. A Dynamic Instruction is an opcode that is not statically defined within the project, but is dynamically generated at runtime. It can be, for example, HDL code generated by an external tool or an instruction that changes according to the input. Examples of dynamic instructions are Floating-Point Cores (FloPoCo) (i.e., a generator of Floating-Point, but not only, cores for FPGAs) [29] opcodes that are generated by the FloPoCo tool and Linear Quantization [30] opcodes.

3.2. Extending Numerical Representations in the BM Ecosystem: FloPoCo, Linear Quantization, and the Bmnumbers Package

FloPoCo [29,31] is an open-source software tool primarily used to automatically generate efficient hardware implementations of floating-point operators for FPGAs. FloPoCo has been integrated within the BM ecosystem to allow the use of optimized floating-point operators. We aim to provide a comprehensive evaluation of the benefits and limitations associated with this approach, shedding light on its impact on model performance, resource utilization, and overall efficiency. One of the main features of this framework is the representation of floating-point numbers using an arbitrary precision, i.e., the number of bits. In fact, it is possible to arbitrarily choose the number of bits assigned to both the exponent and the mantissa. This leads to great freedom in selecting numerical precision, allowing us to perform various tests by varying the number of bits used to represent a floating-point number.

Linear Quantization is a technique that allows one to represent a floating-point number using a fixed number of bits and integer arithmetic [30]. The BM ecosystem has been extended to allow the use of Linear Quantization operators. The main advantage of this approach is the possibility of using integer arithmetic, which is much more efficient than floating-point arithmetic on FPGA devices. In order to achieve this, special dynamic instructions were developed for this type of operator.

Finally, given the heterogeneity of the BM and the different numerical representations that can be used, it is necessary to have a common way to represent numerical values in both the software layer and the hardware. The bmnumbers package has been developed exactly to achieve this goal. This package is a Go library that manages all the numerical aspects of the BM. In particular, it is able to parse a numerical representation (i.e., a string) and to produce the corresponding binary representation (i.e., a slice of bits). This is achieved using a prefix notation, where the first characters of the string are used to identify the numerical representation. For example, the string “0f0.5” is parsed as a floating point number, or “0lq0.5<8,1>” is parsed as a linear quantized number of 8 bits with ranges given by type 1 (the type is stored as an external data structure within bmnumbers). Table 1 reports the available numerical representations and the corresponding prefix.

Table 1. Numerical representations and corresponding prefixes used within the bmnumbers package.

It is important to underline as bmnumbers is also responsible for the conversion, casting, and export of numerical representations throughout the BM ecosystem.

3.3. BASM and Fragments

The BM ecosystem has been extended to allow the use of a new tool called BASM (BondMachine Assembler). BASM uses a low-level language that allows one to write BM code directly in assembly. Similar to the BondGo compiler, BASM is able to generate both the BM architecture and the application that will run on it.

A central concept of BASM is the fragment. A fragment is a small piece of assembly code that can be combined in different ways and thought of as a function with its input and output parameters. An example of a fragment is shown in Listing 1.

Listing 1. Fragment template for a linear neuron.

Fragments can be linked together to form a computational graph, where each fragment represents a node in the graph. Fragments can also be parameterized, allowing for the creation of generic code that can be instantiated with different parameters, like in the example above, where the weight is passed as a parameter. Also, the instructions can be parameterized, like the multop instruction in the example above. This is useful to allow the use of different numerical representations, such as floating point or fixed point.

Connecting Processors (CPs) can also be interconnected to form a computational graph, where each CP functions as a distinct processor within the BM architecture. In the most straightforward implementation, the relationship between fragment graphs and CPs is a direct one-to-one mapping, with each fragment being assigned to a single, dedicated CP. However, the BM architecture is flexible enough to allow for more complex mappings. By acting on some metadata, it is possible to change how the fragments are mapped to the CPs and how the CPs are connected to each other, allowing for the creation of different BM architectures starting from the same fragments and thus computing the same task with differences in resource usage, latency, and power consumption. For example, a group of fragments can be mapped to a single CP, or each fragment can be mapped to a different CP. In the first case, the fragments will be executed sequentially; in the latter, they will be executed in parallel on different CPs. The resource usage of the first case will be lower, but the latency will be higher, while in the second case, the resource usage will be higher, but the latency will be lower.

3.4. Mapping a DNN as a Heterogeneous Set of CPs

A BM can be used to solve any computational task, but in fact, the interconnected heterogeneous processors model, on which any BM is based, seems to be ideal for mapping a DNN [32]. Thus, we developed three different tools to be used to map a DNN model to a proper BM architecture: Neuralbond, Tensorflow Translator, and NNEF Composer [16]. In the present work, we used the Neuralbond tool, which, starting from the NN architecture, builds a BM composed of several CPs acting as one or more neuron-like computing units. The Neuralbond approach is fragment-based; thus, each generated CP is the composition of one or more BASM fragments taken from a library we created, and each fragment contains the code to perform the computation of a single neuron.

An example of the softmax neuron written in Basm is shown in Listing 2.

Listing 2. Fragment template for a softmax neuron.

BASM takes this fragment and generates a complete computational unit capable of performing the specified operation. Neuralbond then iterates over the neurons and weights of the DNN, leveraging BASM to construct the final BM architecture that implements the entire DNN. This methodology enables the customization of neurons according to the specific requirements of the use case. As an example, consider the fragment mentioned previously that implements the softmax function. This fragment was developed following the corresponding mathematical formulation.

σ {(\vec{z})}_{i} = \frac{e^{z_{i}}}{\sum_{j = 1}^{L} e^{z_{j}}}

(1)

where L in the equation is the number of elements on which the softmax is applied.

e^{x} = \sum_{l = 0}^{K} \frac{x^{l}}{l!}

(2)

where K in the Formula (2) is the number of iterations of the Taylor series expansion of the exponential function. We called expprec this number. As expprec increases, the series includes more terms, making the approximation of

e^{x}

more accurate. Each term in the series brings the approximation closer to the true value of

e^{x}

. However, calculating more terms also increases the computational cost. For example, if

K = 1

, the Taylor expansion is truncated after the first two terms:

e^{x} \approx \sum_{l = 0}^{1} \frac{x^{l}}{l!} = 1 + x

(3)

than

e^{z_{i}}

is poorly approximated and the Softmax output will also be inaccurate since it is sensitive to the exponential value (i.e., it determines how the probabilities are distributed among the classes). The trade-off between accuracy (percentage difference between predictions from the standard architecture and our FPGA implementation; 100% means identical predictions, which is the desired outcome) and computational cost is an important consideration when selecting the value of K, as it affects the performance of the model. A detailed analysis of the impact of the expprec parameter has been conducted using a simplified test model (such as that reported in Figure 1) and a related dataset, and all results of this investigation are reported in Supplementary Materials. Based on these analyses, it was observed that the softmax function is highly sensitive to the number of terms used in the Taylor series expansion of the exponential function, particularly under conditions of reduced numerical precision. For example, analyzing this simplified model, when using the standard 32-bit IEEE 754 [33] floating-point representation, setting the expprec parameter to 1 is generally sufficient to preserve model accuracy. In contrast, when 16-bit precision FloPoCo operators are used, a higher value of expprec, typically between 2 and 10, is required to achieve comparable accuracy. Deviating from this optimal range, either by decreasing or excessively increasing the value of expprec, leads to a significant degradation in model performance. This finding underscores the importance of balancing computational efficiency and numerical accuracy, especially in resource-constrained environments or hardware-limited implementations. In the benchmarks described in Section 4, we reduced this exponent to its minimum value, trading off accuracy for improved performance using the Xilinx/AMD Alveo FPGA platform [25].

Figure 1. A DNN with one hidden layer, four inputs, and two outputs.

To better understand how the DNN mapping on a BM is performed, consider a multi-layer perceptron (MLP) designed to solve a classification problem. This network has four input features, a single hidden layer with one neuron using a linear activation function, and an output layer with two neurons employing the softmax activation function, as shown in Figure 1.

The resulting BM architecture, reported in Figure 2, is made up of four inputs (that is, the number of features) and two outputs (i.e., the classes). It consists of thirty-three connecting processors, including four CPs for reading the incoming inputs, twelve CPs corresponding to the “weights” and performing the multiplication between the input and its associated weight, one CP that executes the linear function, two CPs performing the softmax functions, and two CPs to write the final computed output values. The total number of registers in the CPs is 80, while the total number of bonds is 30.

Figure 2. The BM architecture is a computational graph of the mapped DNN with four inputs, one hidden node (node_1_0) applying a linear transformation, and two outputs computed via softmax (node_3_0, node_3_1). Weight blocks (weightf) represent learned connections between nodes, and the figure shows each individual weight multiplication and signal transfer in detail.

To give more details to the reader, Figure 3 below illustrates the mapping of a specific computation node from the neural network graph to its execution on a processor, showing the corresponding input/output registers and the generated instruction sequence.

Figure 3. Mapping of a softmax neuron from the neural network computation graph to its execution on a processor. The left part shows the neuron (node_3_0) within the expanded graph, while the right part presents its corresponding implementation on the processor, including input/output registers and the sequence of generated low-level instructions. This provides a detailed view of how high-level neural computations are executed in hardware.

Theoretically, by implementing every possible type of neuron within the library, any DNN can be mapped to a BM architecture and synthesized on an FPGA.

3.5. BM as Accelerator

In addition to the core modifications made to the BM framework, further enhancements have extended the automation mechanisms to support heterogeneous hardware platforms, including the Digilent Zedboard (Soc: Xilinx Zynq-7000 [34] (Xilinx, San Jose, CA, USA)) and Xilinx/AMD Alveo [25] boards. The latter are well known for their high resource availability and performance in accelerated computing. In particular, they are increasingly being adopted for the deployment of advanced DL systems [35].

To deploy the architecture created as an accelerator for a host running a standard Linux operating system, the BM ecosystem leverages the Advanced eXtensible Interface (AXI) protocol [36] for communication. AXI is a widely adopted standard for interfacing with FPGA devices and facilitates the integration of all available tools for FPGA accelerator development.

In particular, both the AXI Memory Mapped (AXI-MM) protocol and the AXI Stream protocol have been successfully implemented and can be used to interact with the BM architecture running on the FPGA, depending on the specific requirements and the use case. For example, AXI-MM can be preferred when memory access patterns dominate the application, while AXI Stream is more suitable for continuous data-flow processing.

Furthermore, a high-level Python library, pybondmachine [28], has been developed to facilitate interaction with the BM framework, as well as with FPGA devices programmed using the BM toolkit, since it provides a high-level Application Programming Interface (API) for configuring kernels and exchanging data without directly invoking Xilinx Runtime (XRT) [37] calls. Indeed, when deployed on Xilinx Alveo boards, the BM framework uses the XRT to instantiate kernels, manage execution, and transfer data over PCIe. This is conceptually similar to the Vitis Runtime Library model, where kernels are launched from a host application and the runtime handles synchronization and memory access.

3.6. Real-Time Power Measurement and Energy Profiling

As noted in the Introduction, power consumption is nowadays a critical factor when designing a DNN accelerator on an FPGA. Even if energy consumption in DL involves both training and inference phases, in production environments, inference dominates running costs, accounting for up to 90% of total compute cost [10,11]. Deploying specialized, energy-efficient hardware, such as FPGAs, therefore, represents an effective strategy for reducing this overhead.

The tools supplied by Xilinx provide the ability to analyze the energy usage of a given design by producing a detailed report with estimated power figures. Within the BM framework, we have implemented the necessary automation to extract these data and generate a concise summary report. However, while Vivado’s summary reports are useful for obtaining a general overview of the power profile of a design, direct real-time measurement of energy consumption is essential to evaluate its true impact.

To this end, we implemented a real-time power-measurement setup on the ZedBoard powered by a 12 V stabilized supply and monitored via a high-resolution digital multimeter in order to separate static (leakage) from dynamic (switching) power of the inference (Intellectual Property) IP alone. Dynamic power is modeled as

P_{dyn} = \frac{1}{2} \sum_{n} C_{n} α_{n} f_{clk} V^{2}

Analysis across FloPoCo floating-point precisions of 12, 16, 19, and 32 bits reveals a marked increase in both static and dynamic power with bit-width, yielding energy per inference from 0.8 ×

10^{- 7}

J (12 bits) to 3.8 ×

10^{- 7}

J (32 bits).

In addition, a comparative study on CPU platforms has been conducted, and the results are summarized in Table 2.

Table 2. Comparison of the energy consumption of the ZedBoard BM with the ARM Cortex A9 and the Intel i7-1260P CPUs. The values are expressed in terms of their order of magnitude to highlight the relative differences in performance and energy efficiency across the systems. The System column indicates the system used for the inference, while the Time/Inf (s) and En./Inf (J) columns show the time taken for a single inference (calculated from clock cycles, counted using the benchmark for the FPGA) and the energy consumption per inference (measured using the Perf tool for the CPUs). All measurements were averaged over 166,000 inference runs, corresponding to the size of the test dataset.

The study on CPU platforms demonstrates that FPGAs deliver inference times on the order of

10^{- 6}

s, comparable to the Intel (Intel Corporation, Santa Clara, CA, USA) CPU but with up to three orders of magnitude lower energy consumption; relative to the ARM processor, FPGA inference is slower (

10^{- 2}

s) yet still substantially more energy-efficient. Additional detailed results of the power consumption and of the setup used are reported in Supplementary Materials.

4. Benchmarking the DL Inference FPGA-Based System for Jet Classification in LHC Experiments

DNNs are widely used in the field of high-energy physics [38]; for instance, they are used for the classification of jets produced in the collisions of the Large Hadron Collider (LHC) at CERN [39]. However, in this context, FPGAs are primarily used in the trigger system of the LHC experiments to make a real-time selection of the most interesting events to be stored. Thus, the use of DNN in FPGA is a promising solution to improve the performance of the trigger system and to find novel ways to reduce both the latency of the trigger system and energy consumption, which are crucial factors in this scenario.

The complexity of the DNN involved is balanced to meet the rigid requirements of FPGA-based trigger systems at LHC. Specifically, the Level-1 Trigger system must process data at a rate of 40 MHz and make decisions within a latency of approximately 12.5 microseconds, necessitating the use of DNNs that are optimized for low latency, and efficient resource utilization [40]. Methods such as compressing DNNs by reducing numerical precision have been used to reduce processing times and resource utilization while preserving model accuracy [41].

To evaluate the effectiveness of the proposed solution, benchmarks were performed using the LHC jet tagging dataset [42], a well-known resource in the field of high-energy physics, which is used to classify jets produced in collisions at the LHC. All benchmarks were executed with a batch size of 1024, and data transfers between the host and the FPGA were performed using the AXI Stream protocol with a 32-bit data bus width. The dataset is composed of 830,000 jets, each with 16 features, and the classification task is to identify if a jet is a gluon, a quark, a W boson, a Z boson, or a top quark.

The selected DNN architecture consists of sixteen input neurons, three hidden layers with sixty-four, thirty-two, and thirty-two neurons, respectively, employing the Linear activation function, and five output neurons using the Softmax activation function. To test the proposed model, the Alveo U55C board has been chosen, which offers high throughput and low latency for DL inference tasks. Since the Alveo U55C is produced by AMD/Xilinx, the tool used behind the scenes to generate the firmware is Vivado [43] (version 2023.2, although any version later than 2019.2 is compatible). However, it is important to note that the BM framework autonomously generates the HDL code, making it vendor-independent and portable to other FPGA boards. For interacting with the FPGA programmed with the firmware generated by our system to perform inference, we use the Python library pybondmachine, which leverages PYNQ [44] under the hood to simplify the development and control of hardware-accelerated applications within a Python-based environment.

The benchmarking process involved systematically varying the numerical precision of the model, made possible through the integration of the FloPoCo library. The evaluation focused on three critical metrics: resource utilization, measured in terms of Look-Up Tables (LUTs), Registers (REGs), and Digital Signal Processors (DSPs), particularly important in this context, as also emphasized in [45], where the authors propose a methodology to estimate resource usage during the early stages of FPGA design; latency, defined as the time required to perform a single inference, including host-to-FPGA PCIe transfer time; and precision, quantified as the deviation of the predictions of the FPGA-implemented model from the ground truth. Adjusting numerical precision is a commonly adopted technique for reducing a model’s energy consumption. In this context, a real-world power consumption analysis was performed to assess the effect of FloPoCo numerical precision on the model’s power usage, as mentioned in Section 3.6. The detailed results of the power consumption [10,11,46,47,48,49,50,51] analysis are presented in Supplementary Materials.

The benchmark results, summarized in Table 3, reveal critical points in the performance and resource utilization of various numerical precision formats when implemented on the Alveo U55C FPGA. All latencies were obtained over 166,000 inference runs, corresponding to the number of test samples. Concerning resource utilization, the float32 format, which adhered to the IEEE754 standard, exhibits the highest utilization of LUTs at 476,416, accounting for 36.54% of the FPGA’s available resources. As the precision is reduced to lower bit widths (e.g., float16, flpe5f11, flpe6f10), LUT usage decreases, with the flpe6f4 format requiring only 274,523 LUTs, which corresponds to 21.06% of the available resources. This suggests that lower-precision formats tend to be more efficient in terms of LUT utilization.

Table 3. Resource utilization, including Look-Up Tables (LUTs), Registers (REGs), and Digital Signal Processors (DSPs), along with latency and accuracy for different data types implemented on the Alveo U55C FPGA. The float32 and float16 formats adhere to the IEEE754 standard, while the flpe variants employ custom precision formats from the FloPoCo library. The fixed<16,8>format represents a fixed-point configuration. The accuracy refers to the system’s ability to produce classification outputs that match the expected results obtained with the standard architecture.

Overall trends across formats are visualized in Figure 4.

Figure 4. Relationship between FPGA resource utilization and performance for different numerical formats implemented on the Alveo U55C. The x-axis shows LUTs usage, while the y-axis reports inference latency in microseconds. Point color encodes classification accuracy, and point size is proportional to REG’s usage. Data types include IEEE754 formats (float32, float16), custom FloPoCo floating-point formats (flpe*), and fixed-point representation (fixed<16,8>). This visualization highlights the trade-offs between resource efficiency, accuracy, and execution speed.

Similar trends are observed for the utilization of REGs, with the float32 format requiring the largest number of registers (456,235, or 17.50%). As precision is reduced, the number of registers used decreases, with the fixed<16,8> format utilizing only 207,670 REGs, or 7.94%. This reduction in resource usage is consistent with the general behavior expected from lower-precision formats, which require fewer resources for computation. DSP utilization remains relatively low for most formats. The float32 and float16 formats require 954 and 479 DSPs, respectively. In contrast, the lower-precision flpe formats (e.g., flpe6f4) utilize a minimal number of DSPs (just 4), highlighting the significant reduction in DSP resource requirements when lower-precision custom formats are employed.

The results of the latency show that the float32 format exhibits the highest latency (12.29 ± 0.15

μ

s

), which is expected as higher precision computations generally involve more complex operations, leading to increased computation burden. As precision decreases, latency also decreases. For example, the latency for the flpe6f4 format is reduced to 2.72 ± 0.23

μ

s

, demonstrating a performance improvement with lower precision. Moreover, the fixed<16,8> format shows the lowest latency (1.39 ± 0.06

μ

s

), indicating that fixed-point precision can offer significant speed advantages for certain tasks.

The accuracy of the models is measured as the deviation between the predictions of the FPGA-implemented model and the ground truth. In general, all the floating-point precision formats, including float32, float16, and the various FloPoCo-based formats, exhibit high accuracy, with values approaching or exceeding 99%. Among the FloPoCo-based formats, accuracy is further improved. The flpe7f22, flpe5f11, and flpe6f10 formats each achieve 100% accuracy, demonstrating that FloPoCo-based representations can match or even surpass the precision of conventional floating-point formats, resulting in predictions exactly identical to those of a standard architecture. Furthermore, these formats are particularly advantageous, as they reduce resource utilization and latency while maintaining optimal accuracy, demonstrating the effectiveness of FloPoCo in custom DNN implementations. On the other hand, the fixed<16,8> format, which uses fixed-point arithmetic and is still under development, shows a significant reduction in accuracy, achieving only 86.03%, which corresponds to a 13.97% drop compared with the float32 baseline. This decrease can be attributed to the inherent limitations of fixed-point arithmetic, where rounding errors accumulate, leading to a less precise model.

4.1. Comparing HLS4ML and BM

Performing ML inference on an FPGA is a goal pursued by several frameworks. Among these, one of the most well-known is High-Level Synthesis for Machine Learning (HLS4ML) [52], designed to translate ML models trained in high-level languages like Python [53] or TensorFlow [54] into custom hardware code for FPGA hardware acceleration. Therefore, we chose this framework as a reference point and to evaluate our solution by benchmarking it against the HLS4ML approach, using the same DL model introduced in Section 4, the LHC jet dataset, and the Alveo U55C FPGA. HLS4ML also offers various options in terms of numerical precision, and for our comparison, we chose to use the default numerical precision set by the library itself during the high-level configuration phase, specifically using a fixed-point representation with a total of 16 bits, 6 of which are allocated for the fractional part. Moreover, HLS4ML offers multiple strategies for firmware generation. We selected the default option, called Latency, which prioritizes performance over resource efficiency. HLS4ML, thanks to the drivers it provides for interaction between the PS (Processing System) and PL (Programmable Logic) parts, returns metrics about the inference time. Although the resulting timing depends on the batch size selected, the best results obtained for a single classification are reported in Table 4. For the sake of completeness, it is important to underline that both systems have been evaluated at fixed-point precision with the same total bit-width (i.e., 16 bits); indeed, the BM uses fixed<16,8>, while the HLS4ML run used its default fixed<16,6>. Thus, the total bit-width was therefore identical across frameworks; the difference lies only in the integer/fractional split. We also tested the BM design at fixed<16,6> and found the system’s precision to be about 2% worse than with fixed<16,8>, which is the reason why we are reporting the fixed<16,8> results for the BM, while HLS4ML used its default fixed<16,6>.

Table 4. The table presents the LUTs, REGs, and DSPs used, along with their respective percentages, for the HLS4ML implementation using the Alveo U55C. It also includes the latency and accuracy.

Compared with the system presented, the HLS4ML solution is more efficient in terms of overall resource utilization and performance compared with the BM solution. Specifically, the HLS4ML implementation with fixed<16,6>precision achieves a low inference latency of 0.17 ± 0.01

μ

s

, which is an order of magnitude faster than the best latency achieved by the BM approach (1.39 ± 0.06

μ

s

) using the same fixed-point configuration. While the BM design using floating-point precision formats achieves accuracy values close to or even at 100%, and the fixed-point BM implementation reaches 81.73%, the HLS4ML fixed-point design achieves an intermediate accuracy of 95.11%. This indicates that while HLS4ML’s optimizations for resource usage and latency result in excellent performance and hardware efficiency, they also introduce some loss of precision, likely due to the quantization effects associated with its default fixed-point configuration.

The comparison with HLS4ML was made to assess the overhead introduced by the BM layer, which requires more resources for the same numerical precision while maintaining similar performance in terms of latency. In order to interpret these results well, we need to underline here, once more, that the BM provides a new kind of computer architecture, where the hardware dynamically adapts to the specific computational problem rather than being static and generic, and the hardware and software are co-designed, guaranteeing a full exploitation of fabric capabilities. Although the HLS4ML solution exceeds the proposed approach in terms of resource efficiency, it lacks the same level of flexibility and general-purpose adaptability. In contrast, the BM ecosystem offers greater versatility. As an open-source, vendor-independent platform, BM enables the deployment of the same NN model across different boards while autonomously generating HDL code. This presents a significant advantage over other solutions, which are more constrained in terms of hardware compatibility and are often vendor-dependent, relying on board-specific tools to generate low-level HDL code. Additionally, BM’s modular design ensures seamless integration with other frameworks and libraries, as demonstrated in this work with the integration of the FloPoCo library, making it highly adaptable to diverse hardware configurations. While HLS4ML proves to be more efficient for the current application, BM’s flexibility and extensibility establish it as a more versatile and scalable solution since it can seamlessly adapt to diverse hardware platforms, integrate specialized libraries, and meet evolving application requirements without being tied to vendor-specific tools or architectures.

To further compare our solution with other frameworks, our system achieves an inference latency on the order of 1

μ

s

per sample. While a direct comparison is not strictly fair due to differences in FPGA boards and DL models used in the literature, this latency remains significantly lower than that reported for Vitis AI (

0.652

m

s

for VGG16 on an Avnet Ultra96-V2 [55]) and is of the same order of magnitude as FINN’s best reported performance (

0.3

μ

s

for a quantized MLP on an embedded FPGA [56]). Other frameworks, such as DeepBurning [57] support automatic RTL-based accelerator generation, but their original publications do not provide explicit latency or throughput measurements, preventing quantitative comparison.

4.2. Technical Limitations and Future Work

While the proposed solution based on the BM framework shows competitive performance and accuracy for certain types of DL models, particularly small to medium MLPs and domain-specific configurations, it still presents technical limitations when applied to larger and more complex neural network architectures. First, even if the modular and heterogeneous design enables fine-grained customization and efficient mapping of operations, the current implementation has been primarily validated on relatively compact models. When scaling to architectures with a significantly higher number of layers and parameters (e.g., large CNNs, Transformers), resource utilization can grow rapidly, leading to LUT, REG, and DSP saturation on typical FPGA devices. Second, while the integration of custom numerical formats via FloPoCo enables accuracy levels comparable to IEEE 754 in many cases, as shown in Table 3, model performance in more complex scenarios may still suffer from quantization effects, particularly in deeper networks where the accumulation of rounding errors can lead to significant deviations from expected outputs. Finally, the current implementation focuses on inference tasks, and while it demonstrates the potential for efficient deployment of MLPs, it does not yet support more advanced NN architectures such as convolutional or recurrent networks, which are commonly used in many applications. To address these limitations, future work will focus on several key areas. One direction is to enhance the framework’s ability to handle larger and more complex models, possibly by introducing dynamic directives that allow the framework to collapse multiple layers into a single computational block, thereby reducing overall resource requirements while accepting a potential trade-off in latency. Another priority will be the integration of advanced precision management strategies, including mixed-precision arithmetic as well as more sophisticated quantization techniques, to mitigate the impact of numerical errors in deeper networks. In parallel, compiler and toolchain optimizations will aim to reduce the overhead introduced by the abstraction layer, improving code generation efficiency and minimizing inter-core communication latency. Finally, we are currently testing a multi-FPGA implementation designed to distribute the computational complexity of large-scale DL models across multiple devices, which could significantly extend the scalability of the approach.

5. Conclusions

In the present work, we describe the developments and updates done in the BondMachine OpenSource framework for the design of hardware accelerators for FPGAs, in particular for the implementation of machine learning models, highlighting its versatility and customizability.

Although this work demonstrates the potential of the BondMachine framework as a tool that can also be used to develop and port ML-based models to FPGAs (mainly MLP architectures) and has the capability of integrating different existing libraries for arithmetic precision, we want to emphasize its current exploratory nature. The focus on MLP architectures was a deliberate choice to establish a robust methodology and validate the core principles of our hardware-software co-design approach. We recognize that extending this framework to more complex, large-scale models is a nontrivial step that will be the primary focus of our future research efforts.

The paper begins with a brief introduction to the BM framework, describing its main features and the main components. Then, we detailed how the framework has been extended to map a neural network multi-layer perceptron model into a BondMachine architecture to perform inference on an FPGA. We tested the solution proposed, discussing the results obtained and the optimizations done to improve the performance. Next, we have reported in detail the analyses of our implementation varying the numerical precision, both by integrating external libraries such as FloPoCo and by using well-known techniques like fixed-point precision, evaluating variations in terms of latency and resource utilization. We also reported some of the results about the energy consumption measurements, a key aspect in this context, describing the setup used and providing real measurements of the power consumption of the FPGA board, giving a comparison with the energy consumption of the same task performed on a standard CPU architecture.

Finally, we have compared our solution with respect to the HLS4ML one, a well-known framework for implementing machine learning inference on FPGA. Overall, this work establishes a methodology of testing and measurements for future developments of bigger and more complex models (i.e., we are indeed working on the implementation of Convolution NN), as well as for different FPGA devices, and it represents a first step towards the implementation of a complete and efficient solution for the inference of machine learning models on FPGA.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics14173518/s1, Figure S1: Clock cycles distribution. Figure S2: Average clock cycles as a function of the K values using 32bit, 19bit, 16bit and 12bit precision. Figure S3: LUT and DSP distribution on the FPGA fabric. The map illustrates the nonclustered, sparse arrangement of logic resources, indicating low to moderate congestion levels. The localized congestion near HBM interfaces is expected due to numerous I/O connections. Table S1: Latency, clock cycles, accuracy and the average error in probabilities per inference are analyzed as the exponent of the exponential K varies. The results were obtained using the ZedBoard, with the latency column representing the mean latency (in

μ

s) for classifying 275 samples. Clock cycles for each prediction are measured using the benchcore system. Table S2: FloPoCo 32-bits, 7 bits for the exponent and 22 bits for the mantissa. Table S3: FloPoCo 19-bits, 5 bits for the exponent and 11 bits for the mantissa. Table S4: FloPoCo 16-bits, 5 bits for the exponent and 8 bits for the mantissa. Table S5: FloPoCo 12-bits, 4 bits for the exponent and 5 bits for the mantissa. Table S6: Resource usage comparison as a function of the number of bits used for the floating-point representation (see text for details). Table S7: Power consumption as a function of the number of bits used for the floating-point representation.

Author Contributions

Software, M.M. and G.B.; Validation, I.N.; Writing—original draft, M.M., G.B., I.N. and L.S.; Writing—review & editing, D.S. and D.C.; Supervision, M.M., D.S. and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by ICSC—Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by the European Union—NextGenerationEU. L. S. acknowledges funding from “PaGUSci—Parallelization and GPU Porting of Scientific Codes” CUP: C53C22000350006 within the Cascading Call issued by Fondazione ICSC, Spoke 3 Astrophysics and Cosmos Observations. National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) Project ID CN_00000013 “Italian Research Center on High-Performance Computing, Big Data and Quantum Computing” funded by MUR Missione 4 Componente 2 Investimento 1.4: Potenziamento strutture di ricerca e creazione di “campioni nazionali di R&S (M4C2-19)”—Next Generation EU (NGEU).

Data Availability Statement

The data presented in this study are openly available in [github] [https://github.com/BondMachineHQ], accessed on 28 August 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
Abiodun, O.I.; Jantan, A.; Omolara, A.E.; Dada, K.V.; Mohamed, N.A.; Arshad, H. State-of-the-art in artificial neural network applications: A survey. Heliyon 2018, 4, e00938. [Google Scholar] [CrossRef]
Storchi, L.; Cruciani, G.; Cross, S. DeepGRID: Deep Learning Using GRID Descriptors for BBB Prediction. J. Chem. Inf. Model. 2023, 63, 5496–5512. [Google Scholar] [CrossRef]
Hong, Q.; Storchi, L.; Bartolomei, M.; Pirani, F.; Sun, Q.; Coletti, C. Inelastic N₂ + H₂ collisions and quantum-classical rate coefficients: Large datasets and machine learning predictions. Eur. Phys. J. D 2023, 77, 128. [Google Scholar] [CrossRef]
Hong, Q.; Storchi, L.; Sun, Q.; Bartolomei, M.; Pirani, F.; Coletti, C. Improved Quantum–Classical Treatment of N₂–N₂ Inelastic Collisions: Effect of the Potentials and Complete Rate Coefficient Data Sets. J. Chem. Theory Comput. 2023, 19, 8557–8571. [Google Scholar] [CrossRef]
Tedeschi, T.; Baioletti, M.; Ciangottini, D.; Poggioni, V.; Spiga, D.; Storchi, L.; Tracolli, M. Smart Caching in a Data Lake for High Energy Physics Analysis. J. Grid Comput. 2023, 21, 42. [Google Scholar] [CrossRef]
Hua, H.; Li, Y.; Wang, T.; Dong, N.; Li, W.; Cao, J. Edge computing with artificial intelligence: A machine learning perspective. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
Capra, M.; Bussolino, B.; Marchisio, A.; Masera, G.; Martina, M.; Shafique, M. Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead. IEEE Access 2020, 8, 225134–225180. [Google Scholar] [CrossRef]
Ngadiuba, J.; Loncar, V.; Pierini, M.; Summers, S.; Di Guglielmo, G.; Duarte, J.; Harris, P.; Rankin, D.; Jindariani, S.; Liu, M.; et al. Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml. Mach. Learn. Sci. Technol. 2020, 2, 015001. [Google Scholar] [CrossRef]
Thomas, D. Reducing Machine Learning Inference Cost for Pytorch Models. 2020. Available online: https://pages.awscloud.com/Reducing-Machine-Learning-Inference-Cost-for-PyTorch-Models_2020_0406-MCL_OD.html (accessed on 12 June 2025).
Plumed, F.; Avin, S.; Brundage, M.; Dafoe, A.; hÉigeartaigh, S.; Hernandez-Orallo, J. Accounting for the Neglected Dimensions of AI Progress; Centre for the Governance of AI, Inc.: Oxford, UK, 2018. [Google Scholar]
Samayoa, W.F.; Crespo, M.L.; Cicuttin, A.; Carrato, S. A Survey on FPGA-based Heterogeneous Clusters Architectures. IEEE Access 2023, 11, 67679–67706. [Google Scholar] [CrossRef]
Zhao, T. FPGA-Based Machine Learning: Platforms, Applications, Design Considerations, Challenges, and Future Directions. Highlights Sci. Eng. Technol. 2023, 62, 96–101. [Google Scholar] [CrossRef]
Liu, X.; Ounifi, H.A.; Gherbi, A.; Li, W.; Cheriet, M. A hybrid GPU-FPGA based design methodology for enhancing machine learning applications performance. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 2309–2323. [Google Scholar] [CrossRef]
Ghanathe, N.P.; Seshadri, V.; Sharma, R.; Wilton, S.; Kumar, A. MAFIA: Machine learning acceleration on FPGAs for IoT applications. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 347–354. [Google Scholar]
Mariotti, M.; Magalotti, D.; Spiga, D.; Storchi, L. The BondMachine, a moldable computer architecture. Parallel Comput. 2022, 109, 102873. [Google Scholar] [CrossRef]
BondMachineHQ. GitHub Organization: BondMachineHQ. 2025. Available online: https://github.com/BondMachineHQ (accessed on 12 June 2025).
Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-based accelerators of deep learning networks for learning and classification: A review. IEEE Access 2018, 7, 7823–7859. [Google Scholar] [CrossRef]
Monmasson, E.; Cirstea, M.N. FPGA design methodology for industrial control systems—A review. IEEE Trans. Ind. Electron. 2007, 54, 1824–1842. [Google Scholar] [CrossRef]
Faizan, M.; Intzes, I.; Cretu, I.; Meng, H. Implementation of Deep Learning Models on an SoC-FPGA Device for Real-Time Music Genre Classification. Technologies 2023, 11, 91. [Google Scholar] [CrossRef]
Enériz, D.; Medrano, N.; Calvo, B. An FPGA-Based Machine Learning Tool for In-Situ Food Quality Tracking Using Sensor Fusion. Biosensors 2021, 11, 366. [Google Scholar] [CrossRef] [PubMed]
Yang, X.; Zhuang, C.; Feng, W.; Yang, Z.; Wang, Q. FPGA Implementation of a Deep Learning Acceleration Core Architecture for Image Target Detection. Appl. Sci. 2023, 13, 4144. [Google Scholar] [CrossRef]
Perticaroli, P.; Ammendola, R.; Biagioni, A.; Chiarini, C.; Ciardiello, A.; Cretaro, P.; Frezza, O.; Lo Cicero, F.; Martinelli, M.; Piandani, R.; et al. Achieving Low-Latency, High-Throughput Online Partial Particle Identification for the NA62 Experiment Using FPGAs and Machine Learning. Electronics 2025, 14, 1892. [Google Scholar] [CrossRef]
Lattner, C.; Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, San Jose, CA, USA, 20–24 March 2004; pp. 75–88. [Google Scholar]
Xilinx. Alveo FPGA. 2025. Available online: https://www.xilinx.com/products/boards-and-kits/alveo.html (accessed on 21 May 2025).
Mariotti, M.; Storchi, L.; Spiga, D.; Salomonie, D.; Boccalif, T.; Bonacorsid, D. The BondMachine toolkit: Enabling Machine Learning on FPGA. In Proceedings of the International Symposium on Grids & Clouds 2019, Taipei, Taiwan, 31 March–5 April 2019; p. 20. [Google Scholar]
Meyerson, J. The go programming language. IEEE Softw. 2014, 31, 104. [Google Scholar] [CrossRef]
BondMachineHQ. pybondmachine: Python Interface for BondMachine FPGA Framework. 2025. Available online: https://github.com/BondMachineHQ/pybondmachine (accessed on 20 March 2025).
Dinechin, F.D.; Lauter, C.; Tisserand, A. FloPoCo: A generator of floating-point arithmetic operators for FPGAs. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 2009, 2, 10. [Google Scholar]
Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 291–326. [Google Scholar]
de Dinechin, F.; Pasca, B. Designing Custom Arithmetic Data Paths with FloPoCo. IEEE Des. Test Comput. 2011, 28, 18–27. [Google Scholar] [CrossRef]
Haykin, S. Neural Networks: A Comprehensive Foundation; Macmillan Coll Div: New York, NY, USA, 1994. [Google Scholar]
IEEE 754-2019; IEEE Standard for Floating-Point Arithmetic. IEEE: New York, NY, USA, 2019.
Xilinx. ZedBoard FPGA. Available online: https://www.xilinx.com/products/boards-and-kits/1-8dyf-11.html (accessed on 28 August 2025).
Kljucaric, L.; George, A.D. Deep learning inferencing with high-performance hardware accelerators. ACM Trans. Intell. Syst. Technol. 2023, 14, 1–25. [Google Scholar] [CrossRef]
AMBA AXI Protocol Specification; Arm Limited: Cambridge, UK, 2023.
AMD Xilinx. About Xilinx Runtime (XRT), version UG1451; AMD Xilinx: San Jose, CA, USA, 2024.
Denby, B. The Use of Neural Networks in High-Energy Physics. Neural Comput. 1993, 5, 505–549. [Google Scholar] [CrossRef]
Cagnotta, A.; Carnevali, F.; De Iorio, A. Machine Learning Applications for Jet Tagging in the CMS Experiment. Appl. Sci. 2022, 12, 10574. [Google Scholar] [CrossRef]
Savard, C. Overview of the HL-LHC Upgrade for the CMS Level-1 Trigger. EPJ Web of Conf. 2024, 295, 02022. [Google Scholar] [CrossRef]
Aarrestad, T.; Loncar, V.; Ghielmetti, N.; Pierini, M.; Summers, S.; Ngadiuba, J.; Petersson, C.; Linander, H.; Iiyama, Y.; Di Guglielmo, G.; et al. Fast convolutional neural networks on FPGAs with hls4ml. Mach. Learn. Sci. Technol. 2021, 2, 045015. [Google Scholar] [CrossRef]
Pierini, M.; Duarte, J.M.; Tran, N. HLS4ML LHC Jet Dataset (30 Particles), Version v1; Zenodo: Geneva, Switzerland, 2020. [CrossRef]
Feist, T. Vivado Design Suite. White Paper 2012, 5, 30. [Google Scholar]
PYNQ: The Xilinx Platform for Python on FPGAs. Available online: https://www.pynq.io/ (accessed on 28 August 2025).
Kokkinis, A.; Siozios, K. Fast Resource Estimation of FPGA-Based MLP Accelerators for TinyML Applications. Electronics 2025, 14, 247. [Google Scholar] [CrossRef]
Wiltgen, A.; Escobar, K.A.; Reis, A.I.; Ribas, R.P. Power consumption analysis in static CMOS gates. In Proceedings of the 2013 26th Symposium on Integrated Circuits and Systems Design (SBCCI), Curitiba, Brazil, 2–6 September 2013; pp. 1–6. [Google Scholar]
García, A.D.G.; Pérez, L.F.G.; Acuña, R.F. Power consumption management on FPGA. In Proceedings of the 15th International Conference on Electronics, Communications and Computers (CONIELECOMP’05), Puebla, Mexico, 28 February–2 March 2005; pp. 240–245. [Google Scholar]
Agarwal, A.; Mukhopadhyay, S.; Raychowdhury, A.; Roy, K.; Kim, C.H. Leakage power analysis and reduction for nanoscale circuits. IEEE Micro 2006, 26, 68–80. [Google Scholar] [CrossRef]
Chang, L.; Frank, D.J.; Montoye, R.K.; Koester, S.J.; Ji, B.L.; Coteus, P.W.; Dennard, R.H.; Haensch, W. Practical strategies for power-efficient computing technologies. Proc. IEEE 2010, 98, 215–236. [Google Scholar] [CrossRef]
Melo, A.C.D. The new linux’perf’tools. Slides Linux Kongr. 2010, 18, 1–42. [Google Scholar]
David, H.; Gorbatov, E.; Hanebutte, U.R.; Khanna, R.; Le, C. RAPL: Memory power estimation and capping. In Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, Austin, TX, USA, 18–20 August 2010; pp. 189–194. [Google Scholar]
Fahim, F.; Hawks, B.; Herwig, C.; Hirschauer, J.; Jindariani, S.; Tran, N.; Carloni, L.P.; Di Guglielmo, G.; Harris, P.; Krupa, J.; et al. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices. arXiv 2021, arXiv:2103.05579. [Google Scholar]
Rossum, G.V.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009; ISBN 1441412697. [Google Scholar]
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 28 August 2025).
Khaki, A.M.Z.; Choi, A. Optimizing Deep Learning Acceleration on FPGA for Real-Time and Resource-Efficient Image Classification. Appl. Sci. 2025, 15, 422. [Google Scholar] [CrossRef]
Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. arXiv 2016, arXiv:1612.07119. [Google Scholar] [CrossRef]
Xie, Y.; Liang, H.; Wang, S.; Huang, S.; Wang, B.; Xie, Y.; Chen, D. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. In Proceedings of the 53rd Annual Design Automation Conference (DAC), Austin, TX, USA, 5–9 June 2016; pp. 1–6. [Google Scholar] [CrossRef]

Figure 1. A DNN with one hidden layer, four inputs, and two outputs.

Figure 2. The BM architecture is a computational graph of the mapped DNN with four inputs, one hidden node (node_1_0) applying a linear transformation, and two outputs computed via softmax (node_3_0, node_3_1). Weight blocks (weightf) represent learned connections between nodes, and the figure shows each individual weight multiplication and signal transfer in detail.

Figure 3. Mapping of a softmax neuron from the neural network computation graph to its execution on a processor. The left part shows the neuron (node_3_0) within the expanded graph, while the right part presents its corresponding implementation on the processor, including input/output registers and the sequence of generated low-level instructions. This provides a detailed view of how high-level neural computations are executed in hardware.

Figure 4. Relationship between FPGA resource utilization and performance for different numerical formats implemented on the Alveo U55C. The x-axis shows LUTs usage, while the y-axis reports inference latency in microseconds. Point color encodes classification accuracy, and point size is proportional to REG’s usage. Data types include IEEE754 formats (float32, float16), custom FloPoCo floating-point formats (flpe*), and fixed-point representation (fixed<16,8>). This visualization highlights the trade-offs between resource efficiency, accuracy, and execution speed.

Table 1. Numerical representations and corresponding prefixes used within the bmnumbers package.

Numerical Representation	Prefix	Description
Binary	`0b`	Binary number
Decimal	`0d`	Decimal number
Hexadecimal	`0x`	Hexadecimal number
Floating point 32 bits	`0f<32>`	IEEE 754 single precision floating point number
Floating point 16 bits	`0f<16>`	IEEE 754 half precision floating point number
Linear Quantization	`0lq<s,t>`	Linear quantized number with size `s` and type `t`
FloPoCo	`0flp<e,f>`	FloPoCo floating point number with exponent `e` and mantissa `f`
Fixed-point	`0fps<s,f>`	Fixed-point number with `s` total bits and `f` fractional bits

Table 2. Comparison of the energy consumption of the ZedBoard BM with the ARM Cortex A9 and the Intel i7-1260P CPUs. The values are expressed in terms of their order of magnitude to highlight the relative differences in performance and energy efficiency across the systems. The System column indicates the system used for the inference, while the Time/Inf (s) and En./Inf (J) columns show the time taken for a single inference (calculated from clock cycles, counted using the benchmark for the FPGA) and the energy consumption per inference (measured using the Perf tool for the CPUs). All measurements were averaged over 166,000 inference runs, corresponding to the size of the test dataset.

System	Time/Inf (s)	En./Inf (J)
ARM Cortex A9	$10^{- 2}$	$10^{- 6}$
Intel i7-1260P	$10^{- 6}$	$10^{- 4}$
ZedBoard BM	$10^{- 6}$	$10^{- 8}$

Table 3. Resource utilization, including Look-Up Tables (LUTs), Registers (REGs), and Digital Signal Processors (DSPs), along with latency and accuracy for different data types implemented on the Alveo U55C FPGA. The float32 and float16 formats adhere to the IEEE754 standard, while the flpe variants employ custom precision formats from the FloPoCo library. The fixed<16,8>format represents a fixed-point configuration. The accuracy refers to the system’s ability to produce classification outputs that match the expected results obtained with the standard architecture.

Data Type	LUTs		REGs		DSPs
Data Type	Count	(%)	Count	(%)	Count	(%)
float32	476,416	36.54	456,235	17.50	954	10.57
float16	288,944	22.16	298,191	11.44	479	5.31
flpe7f22	423,915	32.52	352,113	13.50	950	10.53
flpe5f11	393,657	30.20	318,821	12.23	477	5.29
flpe6f10	442,809	33.97	334,414	12.83	4	0.04
flpe4f9	347,633	26.67	275,653	10.57	4	0.04
flpe5f8	299,033	22.94	261,403	10.03	4	0.04
flpe6f4	274,523	21.06	236,429	9.07	4	0.04
fixed<16,8>	205,071	15.73	207,670	7.94	477	5.29
Data Type	Latency (μs)			Accuracy (%)
float32	12.29 ± 0.15			100
float16	8.65 ± 0.15			99.17
flpe7f22	6.23 ± 0.18			100.00
flpe5f11	4.46 ± 0.21			100.00
flpe6f10	4.49 ± 0.18			100.00
flpe4f9	2.80 ± 0.15			97.78
flpe5f8	3.31 ± 0.12			99.74
flpe6f4	2.72 ± 0.23			96.39
fixed<16,8>	1.39 ± 0.06			86.03

Table 4. The table presents the LUTs, REGs, and DSPs used, along with their respective percentages, for the HLS4ML implementation using the Alveo U55C. It also includes the latency and accuracy.

Data Type	LUTs	Luts %	REGs	REGs %	DSPs	DSPs %	Latency ( $μ$ $s$ )	Acc %
fixed<16,6>	134,373	10.31	136,113	6.89	255	2.83	0.17 ± 0.01	95.11

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Extending a Moldable Computer Architecture to Accelerate DL Inference on FPGA

Abstract

1. Introduction

2. Overview of the BondMachine Architecture

2.1. Architecture Specification

2.2. Architecture Handling

3. Strategy and Ecosystem Improvements to Accelerate DL Inference Tasks

3.1. A Flexible ISA: Combining Static and Dynamic Instructions

3.2. Extending Numerical Representations in the BM Ecosystem: FloPoCo, Linear Quantization, and the Bmnumbers Package

3.3. BASM and Fragments

3.4. Mapping a DNN as a Heterogeneous Set of CPs

3.5. BM as Accelerator

3.6. Real-Time Power Measurement and Energy Profiling

4. Benchmarking the DL Inference FPGA-Based System for Jet Classification in LHC Experiments

4.1. Comparing HLS4ML and BM

4.2. Technical Limitations and Future Work

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics