1. Introduction
Machine learning (ML) algorithms [1], particularly Deep Neural Networks (DNNs), have witnessed remarkable achievements across diverse domains, ranging from computer vision and natural language processing to autonomous systems and recommendation systems [2]. Today, ML techniques, especially DNNs, are widely used in many scientific fields [3,4,5,6]. However, these advances often come at the cost of increased computational complexity, memory requirements, and power consumption. Moreover, the demand for ML in edge applications, where computational resources are typically limited, has further increased these requirements [7]. To address these challenges, researchers have been exploring techniques to optimize and streamline ML models. The main target of these techniques is the choice of data representation. On one hand, it has to be chosen considering the hardware optimizations available on the edge device. On the other hand, it has to be precise enough not to compromise the accuracy of the model, yet as compact as possible to reduce memory and computation requirements [8]. Indeed, it is a common technique to decrease the numerical precision of computations within ML tasks with respect to the standard 32-bit floating-point representation. By reducing the number of bits used to represent numerical values, it is possible to achieve more efficient storage, faster computation, and potentially lower energy consumption [9,10,11]. However, this approach is not without its drawbacks, and a careful analysis of the pros and cons is crucial to understand the implications of decreased numerical precision.
All the above-mentioned aspects of hardware optimization can be effectively addressed using FPGA devices. In recent years, many studies have explored the use of FPGAs not only for accelerating specific computational workloads but also as central components in heterogeneous computing systems. These works highlight how FPGAs enable fine-grained control over resource allocation, energy efficiency, and deterministic latency, making them well-suited for edge AI, data center inference, and high-throughput scientific computing. Recent advances include frameworks for dynamic partial reconfiguration, integration with Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures, and high-level programming environments that simplify deployment across heterogeneous platforms [12,13,14,15].
In the present work, we aim to use the BondMachine (BM) framework [16,17] (BondMachine Project, https://www.bondmachine.it, accessed on 28 August 2025) to implement ML models on FPGA devices [18,19,20,21,22,23] and to analyze the effects of reduced numerical precision on the performance of the models, on the accuracy of the results, on the resource utilization, and on the power consumption.
This study is part of a larger research project aimed at developing a new generation of heterogeneous computing systems for different applications, including edge AI. Our approach is to create a multi-core and heterogeneous ‘register machine’ abstraction that can be used as an intermediate layer between the software applications and the FPGA hardware. From the software point of view, the BM offers a common interface that hides the specific FPGA hardware details, allowing a vendor-independent development of the applications. From the FPGA point of view, the BM inherits the flexibility and parallelism of the FPGA devices. A similar approach has been used in software development with the introduction of the Low Level Virtual Machine (LLVM) compiler infrastructure [24] (LLVM Project, https://llvm.org/, accessed on 28 August 2025). The LLVM intermediate representation is used as an intermediate layer between the high-level language and the architecture-specific machine code: most optimizations are performed at the LLVM level, and the final machine code is generated by the LLVM compiler. The BM is designed to be used in the same way, but for FPGA devices: most optimizations are performed at the BM level, and the final FPGA configuration is generated by the BM tools.
Our focus is to provide a comprehensive evaluation of the benefits and limitations associated with this approach, shedding light on its impact on model performance, resource utilization, and overall efficiency. The increasing demand for edge AI applications, combined with the need for efficient and low-power computing systems, makes this research particularly relevant and timely, as it provides a new perspective on the design and implementation of AI accelerators for edge computing.
It is important to underline that while ongoing efforts are dedicated to expanding our framework’s compatibility with a broader spectrum of Artificial Neural Network (ANN) architectures, the present work concentrates on multi-layer perceptrons (MLPs). MLPs retain widespread application across various domains, including High Energy Physics (HEP), recommendation systems, and tabular data modeling, and serve as integral components within more extensive architectures (e.g., transformers). Consequently, we contend that our current evaluation, grounded in MLPs, furnishes a significant and pertinent demonstration of the BM's capabilities, while simultaneously establishing a foundation for expanded support in subsequent developments.
The present paper is organized as follows. In Section 2, we first recall the basic components of the BM architecture, together with the basic tools of the BM software framework, and then describe the improvements we made to the BM ecosystem to allow the use of DNN models. In Section 3, we describe how a BM can be used to implement a DNN model on an FPGA. Specifically, we give all the details related to the mapping of a DNN onto a BM, using a simple test model, and we report all the tests performed to understand in depth the impact of reduced-precision algebra, both in terms of computational efficiency and in terms of numerical precision. Finally, in Section 4, we present the results obtained using the Large Hadron Collider (LHC) jet-tagging dataset on a Xilinx/AMD Alveo FPGA (Advanced Micro Devices, Inc., Santa Clara, CA, USA) [25], and we draw some general conclusions based on the observed performance metrics.
3. Strategy and Ecosystem Improvements to Accelerate DL Inference Tasks
Our strategy to accelerate DL inference tasks within the BM ecosystem is based on a dynamic and adaptable architecture design. Each CP executes an Instruction Set Architecture (ISA) that combines statically defined operations with dynamic instructions generated at run time, such as the Floating-Point Cores (FloPoCo) opcodes and the Linear Quantization opcodes. This approach, along with a fragment-based assembly methodology via the BM Assembler (BASM), enables the creation of specialized computing units tailored to diverse neural-network-related operations.
Additionally, the BM ecosystem offers a unified handling of heterogeneous numerical representations through the bmnumbers package and provides tools such as Neuralbond, the TensorFlow Translator, and the Neural Network Exchange Format (NNEF) composer to translate DNN models into BM architectures. Support for multiple FPGA platforms (e.g., Digilent Zedboard, Xilinx/AMD Alveo) and a high-level Python library (pyBondMachine [28], version 1.1.70, https://pypi.org/project/pybondmachine/, accessed on 28 August 2025) further simplify interaction, prototyping, and deployment of FPGA-based accelerators.
3.1. A Flexible ISA: Combining Static and Dynamic Instructions
Each CP of a BM is able to execute some instructions, forming a specific ISA. The ISAs, one for each CP, are composed by choosing from a set of available instructions. For example, to perform a sum between two numbers, the CP must be able to execute the add instruction. Some instructions act on specific data types; for example, the add instruction can be used to sum two integers, addf to sum two floating-point numbers, and so on.
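As an illustration of how a per-CP ISA is assembled, the following Python sketch selects only the opcodes a CP actually needs from a catalog of available instructions; the catalog and the opcode names other than add and addf are illustrative assumptions, not the real BM instruction set.

```python
# Illustrative sketch: composing a minimal per-CP ISA from a catalog of opcodes.
# Only "add" and "addf" come from the text above; the rest are assumed examples.
AVAILABLE_OPCODES = {
    "add":   "sum of two integers",
    "addf":  "sum of two floating-point numbers",
    "mult":  "product of two integers",
    "multf": "product of two floating-point numbers",
}

def compose_isa(required_ops):
    """Return the subset of the catalog needed by one CP, failing on unknown opcodes."""
    unknown = set(required_ops) - AVAILABLE_OPCODES.keys()
    if unknown:
        raise ValueError(f"opcodes not available: {sorted(unknown)}")
    return {op: AVAILABLE_OPCODES[op] for op in required_ops}

# A CP performing a floating-point multiply-accumulate needs only two opcodes.
print(compose_isa(["multf", "addf"]))
```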
Apart from the basic instructions, which are statically defined within the project, each CP can also execute dynamic instructions. A dynamic instruction is an opcode that is not statically defined within the project but is dynamically generated at runtime; it can be, for example, HDL code produced by an external tool or an instruction that changes according to the input. Examples of dynamic instructions are the opcodes generated by the FloPoCo tool (Floating-Point Cores, a generator of floating-point, but not only, cores for FPGAs) [29] and the Linear Quantization opcodes [30].
3.2. Extending Numerical Representations in the BM Ecosystem: FloPoCo, Linear Quantization, and the Bmnumbers Package
FloPoCo [29,31] is an open-source software tool primarily used to automatically generate efficient hardware implementations of floating-point operators for FPGAs. It has been integrated within the BM ecosystem to allow the use of optimized floating-point operators. One of the main features of this framework is the representation of floating-point numbers with arbitrary precision: the number of bits assigned to both the exponent and the mantissa can be chosen freely. This gives great freedom in selecting the numerical precision, allowing us to perform various tests by varying the number of bits used to represent a floating-point number.
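To illustrate what choosing the exponent and mantissa widths means in practice, the short Python sketch below rounds a value to the nearest number representable with a given bit budget; the exponent/mantissa pairs mirror some of the formats used later in the paper. It is a simplified model (no subnormals, infinities, or FloPoCo's exception field), intended only to show how precision degrades as bits are removed.

```python
import math

def quantize_custom_float(x, exp_bits, mant_bits):
    """Round x to the nearest value representable with the given exponent and
    mantissa widths (simplified: no subnormals or special values)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    bias = 2 ** (exp_bits - 1) - 1
    e = math.floor(math.log2(abs(x)))
    e = max(min(e, bias), 1 - bias)                  # clamp to the representable range
    m = abs(x) / 2 ** e                              # normalized mantissa in [1, 2)
    m = round(m * 2 ** mant_bits) / 2 ** mant_bits   # keep mant_bits fractional bits
    return sign * m * 2 ** e

w = 0.7853981633974483  # an example weight
for eb, mb in [(8, 23), (5, 11), (6, 10), (6, 4)]:
    print(f"exponent={eb:2d} mantissa={mb:2d} -> {quantize_custom_float(w, eb, mb)!r}")
```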
Linear Quantization is a technique that represents a floating-point number using a fixed number of bits and integer arithmetic [30]. The BM ecosystem has been extended to allow the use of Linear Quantization operators. The main advantage of this approach is the possibility of using integer arithmetic, which is much more efficient than floating-point arithmetic on FPGA devices. To achieve this, dedicated dynamic instructions were developed for this type of operator.
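The following Python sketch shows the basic idea of linear quantization as it is commonly formulated (a scale and zero-point map a real range onto integers); it is not the BM dynamic-instruction implementation, and the 8-bit width and the [-1, 1] range are arbitrary example choices.

```python
import numpy as np

def linear_quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    """Map real values in [x_min, x_max] onto signed integers with a linear scale."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

w, x = 0.42, -0.13
qw, sw, zw = linear_quantize(np.array([w]))
qx, sx, zx = linear_quantize(np.array([x]))

# The multiply-accumulate is carried out on integers only; the result is
# rescaled back to a real value at the very end.
acc = int(qw[0] - zw) * int(qx[0] - zx)
print(acc * sw * sx, w * x)   # both values are close to -0.0546
```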
Finally, given the heterogeneity of the BM and the different numerical representations that can be used, it is necessary to have a common way to represent numerical values in both the software layer and the hardware. The bmnumbers package has been developed exactly for this purpose. This package is a Go library that manages all the numerical aspects of the BM. In particular, it is able to parse a numerical representation (i.e., a string) and to produce the corresponding binary representation (i.e., a slice of bits). This is achieved using a prefix notation, where the first characters of the string identify the numerical representation. For example, the string “0f0.5” is parsed as a floating-point number, while “0lq0.5<8,1>” is parsed as an 8-bit linear quantized number with ranges given by type 1 (the type is stored as an external data structure within bmnumbers).
Table 1 reports the available numerical representations and the corresponding prefix.
It is important to underline that bmnumbers is also responsible for the conversion, casting, and export of numerical representations throughout the BM ecosystem.
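To make the prefix notation concrete, the Python sketch below parses the two example strings from the text; it is only an illustrative re-implementation of the idea, since the actual bmnumbers package is a Go library that supports many more representations and performs the full conversion to bits.

```python
import re

# Illustrative prefix parser covering only the two examples given in the text.
PATTERNS = {
    "float32":      re.compile(r"^0f(?P<value>-?[0-9.]+)$"),
    "linear_quant": re.compile(r"^0lq(?P<value>-?[0-9.]+)<(?P<bits>\d+),(?P<range_type>\d+)>$"),
}

def parse_bm_number(text):
    for kind, pattern in PATTERNS.items():
        match = pattern.match(text)
        if match:
            return kind, match.groupdict()
    raise ValueError(f"unknown numerical representation: {text}")

print(parse_bm_number("0f0.5"))
print(parse_bm_number("0lq0.5<8,1>"))
```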
3.3. BASM and Fragments
The BM ecosystem has been extended to allow the use of a new tool called BASM (BondMachine Assembler). BASM uses a low-level language that allows one to write BM code directly in assembly. Similar to the BondGo compiler, BASM is able to generate both the BM architecture and the application that will run on it.
A central concept of BASM is the fragment. A fragment is a small piece of assembly code that can be combined in different ways and thought of as a function with its input and output parameters. An example of a fragment is shown in Listing 1.
Listing 1. Fragment template for a linear neuron.
Fragments can be linked together to form a computational graph, where each fragment represents a node in the graph. Fragments can also be parameterized, allowing for the creation of generic code that can be instantiated with different parameters, as in the example above, where the weight is passed as a parameter. The instructions themselves can also be parameterized, like the multop instruction in Listing 1; this is useful to allow the use of different numerical representations, such as floating point or fixed point.
Connecting Processors (CPs) can also be interconnected to form a computational graph, where each CP functions as a distinct processor within the BM architecture. In the most straightforward implementation, the relationship between fragment graphs and CPs is a direct one-to-one mapping, with each fragment being assigned to a single, dedicated CP. However, the BM architecture is flexible enough to allow for more complex mappings. By acting on some metadata, it is possible to change how the fragments are mapped to the CPs and how the CPs are connected to each other, allowing for the creation of different BM architectures starting from the same fragments and thus computing the same task with different resource usage, latency, and power consumption. For example, a group of fragments can be mapped to a single CP, or each fragment can be mapped to a different CP. In the first case, the fragments are executed sequentially; in the second, they are executed in parallel on different CPs. The resource usage of the first case is lower, but the latency is higher, while in the second case, the resource usage is higher, but the latency is lower.
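The trade-off just described can be captured with a toy cost model: the Python sketch below assigns each fragment an assumed cycle count and LUT cost plus a fixed per-CP overhead, and compares the one-fragment-per-CP mapping with the all-fragments-on-one-CP mapping. All numbers are illustrative assumptions, not measured BM figures.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    name: str
    cycles: int   # assumed execution time of the fragment
    luts: int     # assumed logic cost of the fragment's instructions

CP_OVERHEAD_LUTS = 250    # assumed fixed cost of instantiating one CP

def map_fragments(fragments, groups):
    """Toy model: each group becomes one CP; fragments inside a CP run
    sequentially, while distinct CPs run in parallel."""
    cps = [[fragments[i] for i in group] for group in groups]
    latency = max(sum(f.cycles for f in cp) for cp in cps)
    luts = sum(CP_OVERHEAD_LUTS + sum(f.luts for f in cp) for cp in cps)
    return {"cps": len(cps), "latency_cycles": latency, "luts": luts}

frags = [Fragment(f"neuron_{i}", cycles=12, luts=300) for i in range(4)]

print(map_fragments(frags, [[0], [1], [2], [3]]))  # parallel: low latency, more LUTs
print(map_fragments(frags, [[0, 1, 2, 3]]))        # grouped: fewer LUTs, higher latency
```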
3.4. Mapping a DNN as a Heterogeneous Set of CPs
A BM can be used to solve any computational task, but the interconnected heterogeneous processors model, on which any BM is based, seems ideal for mapping a DNN [32]. Thus, we developed three different tools that can be used to map a DNN model onto a proper BM architecture: Neuralbond, TensorFlow Translator, and NNEF Composer [16]. In the present work, we used the Neuralbond tool, which, starting from the NN architecture, builds a BM composed of several CPs acting as one or more neuron-like computing units. The Neuralbond approach is fragment-based; thus, each generated CP is the composition of one or more BASM fragments taken from a library we created, and each fragment contains the code to perform the computation of a single neuron.
An example of the softmax neuron written in BASM is shown in Listing 2.
Listing 2. Fragment template for a softmax neuron.
BASM takes this fragment and generates a complete computational unit capable of performing the specified operation. Neuralbond then iterates over the neurons and weights of the DNN, leveraging BASM to construct the final BM architecture that implements the entire DNN. This methodology enables the customization of neurons according to the specific requirements of the use case. As an example, consider the fragment mentioned previously that implements the softmax function. This fragment was developed following the corresponding mathematical formulation

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{L} e^{x_j}}, \qquad i = 1, \ldots, L, \tag{1}$$

where L in the equation is the number of elements on which the softmax is applied. The exponential function is approximated by its truncated Taylor series

$$e^{x} \approx \sum_{k=0}^{K} \frac{x^{k}}{k!}, \tag{2}$$

where K in Formula (2) is the number of terms retained in the Taylor series expansion of the exponential function. We called this number expprec. As expprec increases, the series includes more terms, making the approximation of $e^{x}$ more accurate: each term in the series brings the approximation closer to the true value of $e^{x}$. However, calculating more terms also increases the computational cost. For example, if expprec = 1, the Taylor expansion is truncated after the first two terms,

$$e^{x} \approx 1 + x,$$

so $e^{x}$ is poorly approximated and the softmax output will also be inaccurate, since it is sensitive to the exponential value (i.e., it determines how the probabilities are distributed among the classes). The trade-off between accuracy (here measured as the percentage agreement between the predictions of the standard architecture and those of our FPGA implementation; 100% means identical predictions, which is the desired outcome) and computational cost is an important consideration when selecting the value of K, as it affects the performance of the model. A detailed analysis of the impact of the expprec parameter has been conducted using a simplified test model (such as that reported in Figure 1) and a related dataset; all results of this investigation are reported in the Supplementary Materials. Based on these analyses, it was observed that the softmax function is highly sensitive to the number of terms used in the Taylor series expansion of the exponential function, particularly under conditions of reduced numerical precision. For example, analyzing this simplified model, when using the standard 32-bit IEEE 754 [33] floating-point representation, setting the expprec parameter to 1 is generally sufficient to preserve model accuracy. In contrast, when 16-bit precision FloPoCo operators are used, a higher value of expprec, typically between 2 and 10, is required to achieve comparable accuracy. Deviating from this optimal range, either by decreasing or excessively increasing the value of expprec, leads to a significant degradation in model performance. This finding underscores the importance of balancing computational efficiency and numerical accuracy, especially in resource-constrained environments or hardware-limited implementations. In the benchmarks described in Section 4, we reduced this exponent to its minimum value, trading off accuracy for improved performance on the Xilinx/AMD Alveo FPGA platform [25].
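The behavior described above can be reproduced with a few lines of Python that apply the truncated Taylor series to the softmax input; the logits are arbitrary example values, and the code mirrors the role of expprec only in double precision (the additional effect of reduced-precision operators is not modeled here).

```python
import math

def exp_taylor(x, expprec):
    """Truncated Taylor series of e**x, keeping terms k = 0 .. expprec."""
    return sum(x ** k / math.factorial(k) for k in range(expprec + 1))

def softmax_approx(logits, expprec):
    exps = [exp_taylor(x, expprec) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.3, 1.1, -0.7]                              # arbitrary example values
reference = [math.exp(x) / sum(math.exp(y) for y in logits) for x in logits]

for k in (1, 2, 4, 10):
    approx = softmax_approx(logits, k)
    err = max(abs(a - r) for a, r in zip(approx, reference))
    print(f"expprec={k:2d}  max abs error = {err:.2e}")
```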
To better understand how the DNN mapping on a BM is performed, consider a multi-layer perceptron (MLP) designed to solve a classification problem. This network has four input features, a single hidden layer with one neuron using a linear activation function, and an output layer with two neurons employing the softmax activation function, as shown in Figure 1.
The resulting BM architecture, reported in Figure 2, is made up of four inputs (that is, the number of features) and two outputs (i.e., the classes). It consists of thirty-three connecting processors, including four CPs for reading the incoming inputs, twelve CPs corresponding to the “weights” and performing the multiplication between each input and its associated weight, one CP that executes the linear function, two CPs performing the softmax function, and two CPs that write the final computed output values. The total number of registers in the CPs is 80, while the total number of bonds is 30.
To give more details to the reader, Figure 3 below illustrates the mapping of a specific computation node from the neural network graph to its execution on a processor, showing the corresponding input/output registers and the generated instruction sequence.
Theoretically, by implementing every possible type of neuron within the library, any DNN can be mapped to a BM architecture and synthesized on an FPGA.
3.5. BM as Accelerator
In addition to the core modifications made to the BM framework, further enhancements have extended the automation mechanisms to support heterogeneous hardware platforms, including the Digilent Zedboard (SoC: Xilinx Zynq-7000 [34] (Xilinx, San Jose, CA, USA)) and Xilinx/AMD Alveo [25] boards. The latter are well known for their high resource availability and performance in accelerated computing; in particular, they are increasingly being adopted for the deployment of advanced DL systems [35].
To deploy the generated architecture as an accelerator for a host running a standard Linux operating system, the BM ecosystem leverages the Advanced eXtensible Interface (AXI) protocol [36] for communication. AXI is a widely adopted standard for interfacing with FPGA devices and facilitates the integration of all available tools for FPGA accelerator development.
In particular, both the AXI Memory Mapped (AXI-MM) protocol and the AXI Stream protocol have been successfully implemented and can be used to interact with the BM architecture running on the FPGA, depending on the specific requirements and the use case. For example, AXI-MM can be preferred when memory access patterns dominate the application, while AXI Stream is more suitable for continuous data-flow processing.
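As an example of the streaming flow on a Zynq-class board such as the ZedBoard, the sketch below uses the standard PYNQ API to push a batch of input features through an AXI-Stream DMA and read back the results; the overlay file name, the DMA instance name, and the buffer shapes are assumptions for illustration, not names produced by the BM toolchain.

```python
import numpy as np
from pynq import Overlay, allocate

# Load a bitstream containing the accelerator and an AXI DMA
# (file and IP instance names are assumptions for this sketch).
overlay = Overlay("bm_accelerator.bit")
dma = overlay.axi_dma_0

# Physically contiguous buffers reachable by the DMA engine.
inputs = allocate(shape=(1024, 16), dtype=np.float32)
outputs = allocate(shape=(1024, 5), dtype=np.float32)
inputs[:] = np.random.rand(1024, 16).astype(np.float32)  # example feature batch

# Stream the batch in and the predictions out over AXI Stream.
dma.sendchannel.transfer(inputs)
dma.recvchannel.transfer(outputs)
dma.sendchannel.wait()
dma.recvchannel.wait()

print(outputs[:3])
```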
Furthermore, a high-level Python library, pybondmachine [28], has been developed to facilitate interaction with the BM framework, as well as with FPGA devices programmed using the BM toolkit, since it provides a high-level Application Programming Interface (API) for configuring kernels and exchanging data without directly invoking Xilinx Runtime (XRT) [37] calls. Indeed, when deployed on Xilinx Alveo boards, the BM framework uses the XRT to instantiate kernels, manage execution, and transfer data over PCIe. This is conceptually similar to the Vitis Runtime Library model, where kernels are launched from a host application and the runtime handles synchronization and memory access.
3.6. Real-Time Power Measurement and Energy Profiling
As noted in the Introduction, power consumption is nowadays a critical factor when designing a DNN accelerator on an FPGA. Even though energy consumption in DL involves both the training and inference phases, in production environments inference dominates the running costs, accounting for up to 90% of the total compute cost [10,11]. Deploying specialized, energy-efficient hardware, such as FPGAs, therefore represents an effective strategy for reducing this overhead.
The tools supplied by Xilinx provide the ability to analyze the energy usage of a given design by producing a detailed report with estimated power figures. Within the BM framework, we have implemented the necessary automation to extract these data and generate a concise summary report. However, while Vivado’s summary reports are useful for obtaining a general overview of the power profile of a design, direct real-time measurement of energy consumption is essential to evaluate its true impact.
To this end, we implemented a real-time power-measurement setup on the ZedBoard, powered by a 12 V stabilized supply and monitored via a high-resolution digital multimeter, in order to separate the static (leakage) power from the dynamic (switching) power of the inference Intellectual Property (IP) block alone. Dynamic power is modeled as the difference between the power drawn during inference and the static power measured while the design is idle, i.e., $P_{\mathrm{dyn}} = P_{\mathrm{total}} - P_{\mathrm{static}}$.
The analysis across FloPoCo floating-point precisions of 12, 16, 19, and 32 bits reveals a marked increase in both static and dynamic power with the bit-width, yielding an almost five-fold growth in energy per inference between the 12-bit and the 32-bit configurations (the detailed figures are reported in the Supplementary Materials).
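For clarity, the sketch below shows one common way of turning bench readings into static/dynamic power and energy-per-inference figures, with dynamic power taken as the increase over the idle draw; the supply current values and the per-sample latency are invented example numbers, not the measurements reported in the Supplementary Materials.

```python
# Toy energy accounting from bench measurements (all numeric values are assumed examples).
supply_voltage = 12.0         # V, stabilized bench supply
idle_current = 0.300          # A, reading with the design loaded but idle
active_current = 0.345        # A, reading during sustained inference
latency_per_inference = 5e-6  # s, assumed per-sample inference time

static_power = supply_voltage * idle_current                       # leakage + idle logic
dynamic_power = supply_voltage * (active_current - idle_current)   # switching activity
energy_per_inference = (static_power + dynamic_power) * latency_per_inference

print(f"static power  : {static_power:.2f} W")
print(f"dynamic power : {dynamic_power:.2f} W")
print(f"energy/infer. : {energy_per_inference * 1e6:.2f} uJ")
```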
In addition, a comparative study on CPU platforms has been conducted, and the results are summarized in Table 2. The study demonstrates that the FPGA delivers inference times comparable to those of the Intel (Intel Corporation, Santa Clara, CA, USA) CPU but with up to three orders of magnitude lower energy consumption; relative to the ARM processor, FPGA inference is slower yet still substantially more energy-efficient. Additional detailed results on the power consumption and on the setup used are reported in the Supplementary Materials.
4. Benchmarking the DL Inference FPGA-Based System for Jet Classification in LHC Experiments
DNNs are widely used in the field of high-energy physics [38]; for instance, they are used for the classification of jets produced in the collisions of the Large Hadron Collider (LHC) at CERN [39]. However, in this context, FPGAs are primarily used in the trigger system of the LHC experiments to make a real-time selection of the most interesting events to be stored. Thus, the use of DNNs on FPGAs is a promising solution to improve the performance of the trigger system and to find novel ways to reduce both its latency and its energy consumption, which are crucial factors in this scenario.
The complexity of the DNNs involved is balanced to meet the rigid requirements of FPGA-based trigger systems at the LHC. Specifically, the Level-1 Trigger system must process data at a rate of 40 MHz and make decisions within a latency of approximately 12.5 microseconds, necessitating DNNs that are optimized for low latency and efficient resource utilization [40]. Methods such as compressing DNNs by reducing numerical precision have been used to reduce processing times and resource utilization while preserving model accuracy [41].
To evaluate the effectiveness of the proposed solution, benchmarks were performed using the LHC jet-tagging dataset [42], a well-known resource in the field of high-energy physics, which is used to classify jets produced in collisions at the LHC. All benchmarks were executed with a batch size of 1024, and data transfers between the host and the FPGA were performed using the AXI Stream protocol with a 32-bit data bus width. The dataset is composed of 830,000 jets, each described by 16 features, and the classification task is to identify whether a jet originates from a gluon, a quark, a W boson, a Z boson, or a top quark.
The selected DNN architecture consists of sixteen input neurons; three hidden layers with sixty-four, thirty-two, and thirty-two neurons, respectively, employing the linear activation function; and five output neurons using the softmax activation function. To test the proposed model, the Alveo U55C board was chosen, as it offers high throughput and low latency for DL inference tasks. Since the Alveo U55C is produced by AMD/Xilinx, the tool used behind the scenes to generate the firmware is Vivado [43] (version 2023.2, although any version later than 2019.2 is compatible). However, it is important to note that the BM framework autonomously generates the HDL code, making it vendor-independent and portable to other FPGA boards. To interact with the FPGA programmed with the firmware generated by our system and perform inference, we use the Python library pybondmachine, which leverages PYNQ [44] under the hood to simplify the development and control of hardware-accelerated applications within a Python-based environment.
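For reference, the architecture just described can be written in a few lines of TensorFlow/Keras; the training hyperparameters are not specified in the text, so the compile settings below are placeholders.

```python
import tensorflow as tf

# The benchmark MLP: 16 inputs, three linear hidden layers (64, 32, 32), softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(64, activation="linear"),
    tf.keras.layers.Dense(32, activation="linear"),
    tf.keras.layers.Dense(32, activation="linear"),
    tf.keras.layers.Dense(5, activation="softmax"),   # gluon, quark, W, Z, top
])

# Placeholder training configuration (not taken from the paper).
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```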
The benchmarking process involved systematically varying the numerical precision of the model, made possible through the integration of the FloPoCo library. The evaluation focused on three critical metrics: resource utilization, measured in terms of Look-Up Tables (LUTs), Registers (REGs), and Digital Signal Processors (DSPs), which is particularly important in this context, as also emphasized in [45], where the authors propose a methodology to estimate resource usage during the early stages of FPGA design; latency, defined as the time required to perform a single inference, including the host-to-FPGA PCIe transfer time; and precision, quantified as the deviation of the predictions of the FPGA-implemented model from the ground truth. Adjusting the numerical precision is a commonly adopted technique for reducing a model’s energy consumption. In this context, a real-world power-consumption analysis was performed to assess the effect of the FloPoCo numerical precision on the model’s power usage, as mentioned in Section 3.6. The detailed results of the power-consumption analysis [10,11,46,47,48,49,50,51] are presented in the Supplementary Materials.
The benchmark results, summarized in Table 3, reveal critical points in the performance and resource utilization of the various numerical precision formats when implemented on the Alveo U55C FPGA. All latencies were obtained over 166,000 inference runs, corresponding to the number of test samples. Concerning resource utilization, the float32 format, which adheres to the IEEE 754 standard, exhibits the highest LUT usage at 476,416, accounting for 36.54% of the FPGA’s available resources. As the precision is reduced to lower bit-widths (e.g., float16, flpe5f11, flpe6f10), LUT usage decreases, with the flpe6f4 format requiring only 274,523 LUTs, corresponding to 21.06% of the available resources. This suggests that lower-precision formats tend to be more efficient in terms of LUT utilization.
Overall trends across formats are visualized in Figure 4.
Similar trends are observed for the utilization of REGs, with the float32 format requiring the largest number of registers (456,235, or 17.50%). As precision is reduced, the number of registers used decreases, with the fixed<16,8> format utilizing only 207,670 REGs, or 7.94%. This reduction in resource usage is consistent with the general behavior expected from lower-precision formats, which require fewer resources for computation. DSP utilization remains relatively low for most formats. The float32 and float16 formats require 954 and 479 DSPs, respectively. In contrast, the lower-precision flpe formats (e.g., flpe6f4) utilize a minimal number of DSPs (just 4), highlighting the significant reduction in DSP resource requirements when lower-precision custom formats are employed.
The latency results show that the float32 format exhibits the highest latency (12.29 ± 0.15 µs), which is expected, as higher-precision computations generally involve more complex operations, leading to an increased computational burden. As precision decreases, latency also decreases: for example, the latency of the flpe6f4 format is reduced to 2.72 ± 0.23 µs, demonstrating a performance improvement with lower precision. Moreover, the fixed<16,8> format shows the lowest latency (1.39 ± 0.06 µs), indicating that fixed-point precision can offer significant speed advantages for certain tasks.
The accuracy of the models is measured as the agreement between the predictions of the FPGA-implemented model and the ground truth. In general, all the floating-point precision formats, including float32, float16, and the various FloPoCo-based formats, exhibit high accuracy, with values approaching or exceeding 99%. Among the FloPoCo-based formats, accuracy is further improved. The flpe7f22, flpe5f11, and flpe6f10 formats each achieve 100% accuracy, demonstrating that FloPoCo-based representations can match or even surpass the precision of conventional floating-point formats, resulting in predictions exactly identical to those of a standard architecture. Furthermore, these formats are particularly advantageous, as they reduce resource utilization and latency while maintaining optimal accuracy, demonstrating the effectiveness of FloPoCo in custom DNN implementations. On the other hand, the fixed<16,8> format, which uses fixed-point arithmetic and is still under development, shows a significant reduction in accuracy, achieving only 86.03%, which corresponds to a 13.97% drop compared with the float32 baseline. This decrease can be attributed to the inherent limitations of fixed-point arithmetic, where rounding errors accumulate, leading to a less precise model.
4.1. Comparing HLS4ML and BM
Performing ML inference on an FPGA is a goal pursued by several frameworks. Among these, one of the most well-known is High-Level Synthesis for Machine Learning (HLS4ML) [52], designed to translate ML models trained in high-level languages like Python [53] or TensorFlow [54] into custom hardware code for FPGA hardware acceleration. We therefore chose this framework as a reference point and evaluated our solution by benchmarking it against the HLS4ML approach, using the same DL model introduced in Section 4, the LHC jet dataset, and the Alveo U55C FPGA. HLS4ML also offers various options in terms of numerical precision; for our comparison, we chose the default numerical precision set by the library itself during the high-level configuration phase, namely a fixed-point representation with a total of 16 bits, 6 of which are allocated to the integer part. Moreover, HLS4ML offers multiple strategies for firmware generation. We selected the default option, called Latency, which prioritizes performance over resource efficiency. Thanks to the drivers it provides for the interaction between the Processing System (PS) and the Programmable Logic (PL), HLS4ML returns metrics about the inference time. Although the resulting timing depends on the selected batch size, the best results obtained for a single classification are reported in Table 4. For the sake of completeness, it is important to underline that both systems have been evaluated at fixed-point precision with the same total bit-width of 16 bits: the BM uses fixed<16,8>, while the HLS4ML run used its default fixed<16,6>, so the difference lies only in the integer/fractional split. We also tested the BM design at fixed<16,6> and found its precision to be about 2% worse than with fixed<16,8>, which is why we report the fixed<16,8> results for the BM.
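The effect of the integer/fractional split can be seen with a minimal fixed-point rounding sketch; here fixed<W,F> is read as W total bits with F fractional bits, which is an assumption about the notation (HLS4ML's ap_fixed<W,I> counts I integer bits instead), and the sample value is arbitrary.

```python
def to_fixed(x, total_bits=16, frac_bits=8):
    """Round x to a two's-complement fixed-point value with the given widths."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))   # saturate instead of wrapping
    return q / scale

for frac in (8, 6):
    print(f"fixed<16,{frac}>: 0.3 -> {to_fixed(0.3, frac_bits=frac)}")
# More fractional bits give a finer resolution (2**-8 vs 2**-6) at the cost of a
# smaller representable integer range.
```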
The HLS4ML solution is more efficient than the BM solution in terms of overall resource utilization and performance. Specifically, the HLS4ML implementation with fixed<16,6> precision achieves a low inference latency of 0.17 ± 0.01 µs, which is an order of magnitude faster than the best latency achieved by the BM approach (1.39 ± 0.06 µs) using the same fixed-point configuration. While the BM design using floating-point precision formats achieves accuracy values close to or even at 100%, and the fixed-point BM implementation reaches 81.73%, the HLS4ML fixed-point design achieves an intermediate accuracy of 95.11%. This indicates that while HLS4ML’s optimizations for resource usage and latency result in excellent performance and hardware efficiency, they also introduce some loss of precision, likely due to the quantization effects associated with its default fixed-point configuration.
The comparison with HLS4ML was made to assess the overhead introduced by the BM layer, which requires more resources for the same numerical precision while maintaining similar performance in terms of latency. To interpret these results correctly, we underline once more that the BM provides a new kind of computer architecture, in which the hardware dynamically adapts to the specific computational problem rather than being static and generic, and in which hardware and software are co-designed, guaranteeing full exploitation of the fabric capabilities. Although the HLS4ML solution exceeds the proposed approach in terms of resource efficiency, it lacks the same level of flexibility and general-purpose adaptability. In contrast, the BM ecosystem offers greater versatility. As an open-source, vendor-independent platform, the BM enables the deployment of the same NN model across different boards while autonomously generating the HDL code. This is a significant advantage over other solutions, which are more constrained in terms of hardware compatibility and are often vendor-dependent, relying on board-specific tools to generate low-level HDL code. Additionally, the BM's modular design ensures seamless integration with other frameworks and libraries, as demonstrated in this work with the integration of the FloPoCo library, making it highly adaptable to diverse hardware configurations. While HLS4ML proves to be more efficient for the current application, the BM's flexibility and extensibility establish it as a more versatile and scalable solution, since it can seamlessly adapt to diverse hardware platforms, integrate specialized libraries, and meet evolving application requirements without being tied to vendor-specific tools or architectures.
To further place our solution among other frameworks, our system achieves an inference latency on the order of 1 µs per sample. While a direct comparison is not strictly fair due to differences in the FPGA boards and DL models used in the literature, this latency remains significantly lower than that reported for Vitis AI for VGG16 on an Avnet Ultra96-V2 [55] and is of the same order of magnitude as FINN’s best reported performance for a quantized MLP on an embedded FPGA [56]. Other frameworks, such as DeepBurning [57], support automatic RTL-based accelerator generation, but their original publications do not provide explicit latency or throughput measurements, preventing a quantitative comparison.
4.2. Technical Limitations and Future Work
While the proposed solution based on the BM framework shows competitive performance and accuracy for certain types of DL models, particularly small to medium MLPs and domain-specific configurations, it still presents technical limitations when applied to larger and more complex neural network architectures. First, even if the modular and heterogeneous design enables fine-grained customization and efficient mapping of operations, the current implementation has been primarily validated on relatively compact models. When scaling to architectures with a significantly higher number of layers and parameters (e.g., large CNNs, Transformers), resource utilization can grow rapidly, leading to LUT, REG, and DSP saturation on typical FPGA devices. Second, while the integration of custom numerical formats via FloPoCo enables accuracy levels comparable to IEEE 754 in many cases, as shown in Table 3, model performance in more complex scenarios may still suffer from quantization effects, particularly in deeper networks, where the accumulation of rounding errors can lead to significant deviations from the expected outputs. Finally, the current implementation focuses on inference tasks, and while it demonstrates the potential for efficient deployment of MLPs, it does not yet support more advanced NN architectures such as convolutional or recurrent networks, which are commonly used in many applications. To address these limitations, future work will focus on several key areas. One direction is to enhance the framework’s ability to handle larger and more complex models, possibly by introducing dynamic directives that allow the framework to collapse multiple layers into a single computational block, thereby reducing overall resource requirements while accepting a potential trade-off in latency. Another priority will be the integration of advanced precision-management strategies, including mixed-precision arithmetic as well as more sophisticated quantization techniques, to mitigate the impact of numerical errors in deeper networks. In parallel, compiler and toolchain optimizations will aim to reduce the overhead introduced by the abstraction layer, improving code-generation efficiency and minimizing inter-core communication latency. Finally, we are currently testing a multi-FPGA implementation designed to distribute the computational complexity of large-scale DL models across multiple devices, which could significantly extend the scalability of the approach.
5. Conclusions
In the present work, we described the developments and updates made to the BondMachine open-source framework for the design of hardware accelerators for FPGAs, in particular for the implementation of machine learning models, highlighting its versatility and customizability.
Although this work demonstrates the potential of the BondMachine framework as a tool that can also be used to develop and port ML-based models to FPGAs (mainly MLP architectures) and has the capability of integrating different existing libraries for arithmetic precision, we want to emphasize its current exploratory nature. The focus on MLP architectures was a deliberate choice to establish a robust methodology and validate the core principles of our hardware-software co-design approach. We recognize that extending this framework to more complex, large-scale models is a nontrivial step that will be the primary focus of our future research efforts.
The paper began with a brief introduction to the BM framework, describing its main features and components. We then detailed how the framework has been extended to map a multi-layer perceptron neural network model onto a BondMachine architecture to perform inference on an FPGA. We tested the proposed solution, discussing the results obtained and the optimizations made to improve the performance. Next, we reported detailed analyses of our implementation while varying the numerical precision, both by integrating external libraries such as FloPoCo and by using well-known techniques like fixed-point arithmetic, evaluating the variations in terms of latency and resource utilization. We also reported results of the energy-consumption measurements, a key aspect in this context, describing the setup used and providing real measurements of the power consumption of the FPGA board, together with a comparison with the energy consumption of the same task performed on a standard CPU architecture.
Finally, we compared our solution with HLS4ML, a well-known framework for implementing machine learning inference on FPGAs. Overall, this work establishes a methodology of testing and measurement for the future development of bigger and more complex models (we are indeed working on the implementation of Convolutional NNs), as well as for different FPGA devices, and it represents a first step towards a complete and efficient solution for the inference of machine learning models on FPGAs.