Article

Generating Direct Logic Circuit Implementations of Deeply Quantized Neural Networks Using Chisel4ml

Jure Vreča and Anton Biasizzo
1 Jožef Stefan Institute, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School (IPS), 1000 Ljubljana, Slovenia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(5), 849; https://doi.org/10.3390/electronics14050849
Submission received: 15 January 2025 / Revised: 10 February 2025 / Accepted: 17 February 2025 / Published: 21 February 2025
(This article belongs to the Special Issue Embedded AI)

Abstract

Deploying deeply quantized neural networks on FPGA devices can be a time-consuming task. This has led to research on tools to automate this procedure, specifically for the case of fast machine learning. This is a specialized field concerned with the very low latency processing of machine learning algorithms as opposed to the more usual task where throughput is the main goal. Existing automated solutions are mainly based on high-level synthesis, which tends to be inefficient for larger neural networks due to the use of polynomial time algorithms. In this paper, we present chisel4ml, a tool for generating fully parallel and very low latency hardware implementations of deeply quantized neural networks for FPGA devices. The circuits generated by chisel4ml utilize up to 40% fewer look-up tables for a reasonably sized neural network compared to fully parallel implementations based on high-level synthesis. As chisel4ml uses structural descriptions of deeply quantized neural networks in the form of Chisel generators, it is able to generate the hardware in two orders of magnitude less time than solutions based on high-level synthesis.

1. Introduction

Artificial neural networks (ANNs) are important machine learning models that can be used in a plethora of different applications. However, the requirements for ANN implementations vary greatly depending on the application, particularly in terms of the computing resources used, processing latency, throughput, and energy efficiency. Our focus in this paper is the field of fast machine learning [1], which is mainly concerned with very low latency inference of neural networks. An example application of fast machine learning is the triggering system at the CERN Large Hadron Collider (LHC) [2]. The LHC sensors generate data at rates of terabytes per second. To deal with such high data rates, a neural network triggering system is used to detect events of interest and to filter out the irrelevant data. Due to the limited capacity of temporary storage, the required latency of the triggering system is on the order of a few microseconds. Other applications of fast machine learning include machine learning-based particle accelerator control [3], the fast processing of radio signals [4], and network intrusion detection [5].
High-performance ANN processing is typically performed using GPGPUs or custom ASIC solutions, which are often based on systolic arrays. However, these solutions are optimized for high throughput rather than for low latency. It has been shown that the required very low latency ANN processing can be achieved with custom-designed fully parallel structures, with the weights hardwired into the circuit, implemented on FPGA devices [2]. A straightforward implementation of an ANN model using floating-point operations results in large digital circuits that are difficult or even impossible to implement on FPGA devices. The complexity of ANN models, and consequently of their hardware implementations, can be reduced by pruning, a technique in which less significant neuron connections are eliminated. Another technique is quantization of the ANN model, where floating-point parameters and/or activations are transformed into a compact fixed-point representation. The resulting ANNs are called quantized neural networks (QNNs).
Most approaches to implementing QNNs in hardware are based on high-level synthesis (HLS) techniques. Two examples of such HLS-based tools are the FINN framework [6,7] and hls4ml [2,8,9]. The FINN framework is based on AMD Vitis HLS, while hls4ml is more general and supports several HLS tools, e.g., AMD Vitis HLS, the Intel HLS compiler, and Catapult HLS. Both frameworks require a trained QNN as input. Training QNNs requires an appropriate implementation of the quantization functions. Currently, there are two established libraries that provide such implementations and are used to describe QNNs: QKeras [10] and brevitas [11]. Initially, hls4ml supported only QKeras, while brevitas was better integrated with the FINN framework. However, both tools have recently adopted the Quantized Open Neural Network Exchange (QONNX) [12,13] representation as a QNN description format. The use of QONNX enables interoperability, as training frameworks such as QKeras and brevitas can convert their internal model representations to QONNX, and the hardware generation tools only need to understand QONNX.
HLS is a family of techniques that aims to raise the level of abstraction in hardware design by synthesizing untimed behavioral descriptions, i.e., algorithms, into timed structural descriptions of hardware, e.g., registers and gates. This task is roughly separated into the following subtasks: scheduling, resource allocation or binding, and control generation. Initially, a high-level behavioral description, e.g., C code, is ingested and a parallelized data-flow graph is constructed. Next, the scheduling subtask tries to fit the various operations into clock periods based on the desired clock frequency. The scheduled data-flow graph is then passed to the binding subtask, which determines the hardware resources used to implement each scheduled operation. Finally, a finite-state machine is synthesized to control the circuit [14,15]. These subtasks are not completely independent of each other. Furthermore, many subtasks of HLS are known to be NP-hard [16]. In practice, polynomial time algorithms that give sub-optimal solutions are used [17]. However, even a polynomial time algorithm leads to long running times for large designs such as neural networks.
In this paper, we present chisel4ml [18], an open source tool for generating hardware implementations of deeply quantized neural networks. It uses structural descriptions in the form of Chisel generators. We compare chisel4ml with the established hls4ml tool and show that chisel4ml can generate direct logic circuit implementations comparable to those of hls4ml in significantly less time.
The remainder of this paper is organized as follows. In Section 2, we provide the required background knowledge on QNNs, HLS-based approaches, and the Chisel hardware construction language. In Section 3, we describe the chisel4ml tool in detail. In Section 4, we compare our tool with hls4ml on a variety of different layer configurations and neural networks. In Section 5, we give an overview of the results and discuss their implications. Finally, in Section 6, we draw a conclusion and point out some possible future directions.

2. Background

2.1. Artificial Neural Networks

Artificial neural networks are a subgroup of machine learning models that are constructed from a series of interconnected layers, which are generally groupings of neurons. A single neuron is a computational model in which a series of weighted input features is summed together with a bias value and passed through an activation function to produce a single output feature. Equation (1) shows how to compute the scalar output feature y of a neuron:
$$y = f\Big(b + \sum_{i=1}^{N} w_i \cdot x_i\Big) = f(\mathbf{w} \cdot \mathbf{x} + b), \qquad (1)$$
where $x_i$ is the i-th element of the input feature vector $\mathbf{x}$, $w_i$ is the i-th weight of the weight vector $\mathbf{w}$, $b$ is a bias value, and $f$ is a non-linear activation function. Some of the common layers in ANNs are fully connected layers, convolutional layers, and various pooling layers such as the maximum pooling layer.
A fully connected layer, i.e., a dense or linear layer, is a grouping of neurons where all layer input features are connected to all neurons of the layer. Mathematically, it can be expressed as
$$\mathbf{y} = f(\mathbf{W} \cdot \mathbf{x} + \mathbf{b}), \qquad (2)$$
where $\mathbf{x}$ is the vector of input features, $\mathbf{y}$ is the vector of output features, $\mathbf{W}$ is a weight matrix, $\mathbf{b}$ is the bias vector, and $f$ is a non-linear transformation function.
Another common layer is a convolutional layer that performs operations on multi-dimensional input features and kernel tensors. Equation (3) shows a basic equation to compute a convolutional layer over a three-dimensional input feature tensor X, e.g., an RGB image, to produce a three-dimensional output feature tensor Y:
$$Y[k, i, j] = f\Big(\sum_{c=0}^{C-1} \sum_{x=0}^{H_F-1} \sum_{y=0}^{W_F-1} X[c, i+x, j+y] \cdot F[k, c, x, y] + b[k]\Big), \qquad (3)$$
where the input feature tensor $X$ has the dimensions $[C, H_X, W_X]$, the weight tensor $F$ has the dimensions $[K, C, H_F, W_F]$, the vector $b$ has $K$ elements, and the output feature tensor $Y$ has the dimensions $[K, H_O, W_O]$. The weight tensor is composed of a set of $K$ filters that span all channels $C$ of the input but span only a subset of the height and width: $H_F < H_X$, $W_F < W_X$. Figure 1 shows a diagram of the convolution operation.
A depthwise convolutional layer is a variant of the convolutional layer where the convolution operation is applied to each input feature channel separately. A side effect of this is that the number of output feature channels is a multiple of the number of input feature channels. The depthwise convolutional layer is less computationally intensive, as each kernel only considers a single input feature channel, and it is usually preferred in computationally constrained environments [19].
The maximum pooling layer differs from the previously discussed layers in that it does not perform neuron computations and has no weight parameters. The operation of maximum pooling resembles that of a convolutional layer in that it operates on a given window of input features; however, instead of computing a dot product, the largest element is selected. It is usually used as a down-sampling layer. Figure 2 shows a simple example of a maximum pooling operation.
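The paper does not state the stride and padding explicitly; assuming unit stride and no padding, which is consistent with the layer sizes listed in Table 1, the output spatial dimensions follow as
$$H_O = H_X - H_F + 1, \qquad W_O = W_X - W_F + 1,$$
so that, for example, a 28 × 28 input convolved with a 3 × 3 kernel yields a 26 × 26 output.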

2.2. Quantized Neural Networks

The neuron model described in Equation (1) is defined for real numbers and is typically computed on digital computers using various forms of floating-point number representation. However, as the number of neurons, and consequently the number of weight parameters, in a neural network becomes very large, storing the parameters in floating-point representation vastly increases the memory consumption. This has motivated research into quantized neural networks (QNNs), which use a fixed-point representation of both the parameters and the activations instead of a floating-point one. The quantization methods used in QNN models can be broadly divided into two categories:
  • Post-Training Quantization (PTQ);
  • Quantization-Aware Training (QAT).
PTQ methods, as the name implies, convert a trained floating-point neural network directly into a QNN. Optionally, they may use some sample input data for calibration. On the other hand, QAT methods perform quantization during the training process. During QAT, the parameters are still stored internally in floating-point representation. However, special functions, called fake quantization functions, are inserted into the graph that represents the neural network computation. These functions round the floating-point parameters to a limited set of values that can be interpreted as fixed-point numbers. This means that although the parameters are stored in floating-point representation during training, they can be converted to fixed-point representation for inference. QAT techniques typically achieve better results and are able to train neural networks quantized to lower bitwidths and/or with less loss of accuracy. Figure 3a shows a normal computational graph of a fully connected layer, and Figure 3b shows a graph with fake quantization functions inserted.
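As an illustration of what such a fake quantization function looks like (this is a common uniform formulation, not necessarily the exact one used by QKeras or brevitas), a floating-point value $x$ is mapped to
$$\hat{x} = s \cdot \mathrm{clip}\Big(\mathrm{round}\Big(\frac{x}{s}\Big),\, q_{\min},\, q_{\max}\Big),$$
where $s$ is the quantization step size and $[q_{\min}, q_{\max}]$ is the integer range determined by the target bitwidth. During the backward pass, the non-differentiable rounding is typically bypassed with a straight-through estimator.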
An important consideration in training quantized neural networks is that parameter value ranges tend to differ significantly between the various layers. To allow the use of even lower bitwidths while simultaneously representing large values, we introduce a scaling parameter for the weights of a neuron:
$$w \approx \frac{w_q}{S} \qquad (4)$$
This maps the quantized weight values $w_q$ to the effective weight value $w$. The scaling factor can significantly affect the QNN model accuracy and can be determined as part of the training process [20]. To simplify the scaling operation, $S$ is limited to a power of two, which reduces the division to a shift operation. By combining Equation (4) and Equation (1), we obtain
$$y = f(\mathbf{w} \cdot \mathbf{x} + b) \approx f\Big(\frac{\mathbf{w}_q \cdot \mathbf{x}}{S} + b\Big). \qquad (5)$$
This transforms all operations into efficient integer operations if $\mathbf{w}_q$, $\mathbf{x}$, and $b$ are all integers and $S$ is a power of two.
Another aspect of quantization is its granularity. Per-tensor quantization implies a common S and a common bitwidth of the tensor. A common configuration for convolutional layers is per-kernel quantization, where the scaling parameters and/or bitwidths differ for each kernel.
Neural networks with weights and/or activations quantized below 8 bits are called deeply quantized neural networks (DQNNs). They are distinguished from general quantized neural networks (QNN) because most modern computer systems cannot process DQNNs efficiently. This is because most computer systems are built using a byte as the minimum bit grouping. On such systems, special care must be taken to ensure that parameters with smaller bitwidths are stored efficiently. Despite the low bitwidth, such neural networks can also be used for complex tasks such as the semantic segmentation of video [21] and for large language models [22].
For a more comprehensive introduction to neural network quantization, the interested reader is referred to the book chapter by Gholami et al. [23].

2.3. HLS-Based Approach

DQNNs can be implemented using a high-level synthesis approach. An example of such a tool is hls4ml. hls4ml is developed in Python 3 and uses its own internal model representation of a DQNN. It has a series of frontends and backends. The frontends read the various DQNN representation formats and translate them into the internal DQNN representation. For example, the QKeras frontend translates QKeras models directly into the internal DQNN representation, while brevitas models are supported through the intermediate QONNX format. The backends process the internal DQNN representation and generate C/C++ implementations that are suitable for HLS. The hls4ml backends support a plethora of different HLS tools, including Vitis HLS, Intel HLS, and Catapult HLS.
In order to direct the HLS C/C++ code generation, the user of hls4ml can, using the HLSConfig class, specify additional attributes of the desired hardware configuration, such as the following:
  • The desired parallelization factors for each layer or globally;
  • The implementation strategy: to minimize latency or to minimize resource utilization;
  • The name of the generated project and the project directory.
The HLSConfig object is merged with the internal representation of a DQNN into the ModelGraph object. Using the ModelGraph object, hls4ml generates a set of C++ kernels that are suitable for the various high-level synthesis tools. This generation is based on pre-written C++ parameterized implementations.
It is also possible to convert floating-point models with hls4ml; however, that is beyond the scope of this paper.

2.4. Chisel Hardware Construction Language

In contrast to HLS, the Chisel hardware construction language operates at the Register-Transfer Level (RTL) of abstraction. This means that the designer directly instantiates and connects hardware components, e.g., registers and multiplexers, instead of deriving them from an algorithmic description using HLS. However, as Chisel is embedded in the general-purpose programming language Scala, the designer writes a program that, when executed, constructs an interconnected graph representation of a hardware structure. This graph representation can be exported as Verilog code. To illustrate the Chisel design flow, an example of a FIR filter implementation is given below.
First, the structure of a simple 2nd-order FIR filter is depicted in Figure 4, and the Chisel code that generates the corresponding hardware implementation is given in Listing 1. In the hardware implementation, the delays ($z^{-1}$) are realized with registers (lines 7–8). The input signal and its delayed values are multiplied by the corresponding coefficients and summed to compute the output (line 10).
Listing 1. A simple FIR filter [24].
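Since the listing is reproduced only as an image in this version, the following sketch, written in the spirit of the Chisel bootcamp FIR example that the listing is based on [24], conveys the same structure; the 8-bit width and the coefficient values are illustrative, and the line numbers referenced above refer to the original listing.

```scala
import chisel3._

// A 2nd-order (3-tap) FIR filter with hardwired coefficients.
class SimpleFirFilter extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(8.W))
    val out = Output(UInt(8.W))
  })
  // The z^-1 delays are realized with registers.
  val z1 = RegNext(io.in)
  val z2 = RegNext(z1)
  // The input and its delayed values are scaled by the coefficients and summed.
  val sum = io.in * 1.U + z1 * 2.U + z2 * 3.U
  io.out := sum(7, 0) // truncate the full-precision sum to the 8-bit output
}
```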
The FIR structure can be generalized to an arbitrary N-th-order FIR filter, shown in Figure 5. The Scala language enables us to generalize the Chisel description in a concise and precise manner. A generic FIR filter implementation in Chisel is given in Listing 2. This generic implementation represents an arbitrary N-th-order FIR filter, where the filter is determined by the provided coefficients coeff. Note that, when given the appropriate coefficients, this generic Chisel implementation generates a hardware design equivalent to the 2nd-order FIR filter of Listing 1; it does not introduce any additional hardware structures. While a detailed explanation of the code in Listing 2 is beyond the scope of this paper, a brief description follows:
  • A sequence of registers of the same size as the input sequence of coefficients is created (line 7).
  • The registers are sequentially connected and the input is connected to the first register (lines 8–10).
  • Next, a series of multipliers that multiply each register output with the corresponding coefficient are instantiated (line 14).
  • Finally, a series of adders that sum the N products are instantiated. The sum is the FIR output (line 17).
Listing 2. Generic FIR Filter design [24].
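Again, because the listing is an image in this version, a sketch following the Chisel bootcamp FirFilter generator [24] is shown below; the exact code in Listing 2 may differ in minor details.

```scala
import chisel3._

// A generic FIR filter generator: the order and the coefficients are
// elaboration-time parameters, so the generated hardware is fixed at build time.
class FirFilter(bitWidth: Int, coeffs: Seq[UInt]) extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(bitWidth.W))
    val out = Output(UInt(bitWidth.W))
  })
  // A register for every coefficient (the delay line).
  val zs = Reg(Vec(coeffs.length, UInt(bitWidth.W)))
  // Connect the registers sequentially and feed the input into the first one.
  zs(0) := io.in
  for (i <- 1 until coeffs.length) {
    zs(i) := zs(i - 1)
  }
  // One multiplier per tap.
  val products = VecInit.tabulate(coeffs.length)(i => zs(i) * coeffs(i))
  // The sum of all products is the filter output.
  val acc = products.reduce(_ + _)
  io.out := acc(bitWidth - 1, 0) // truncate to the declared output width
}
```

For instance, instantiating FirFilter(8, Seq(1.U, 2.U, 3.U)) elaborates a 3-tap filter.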

3. Chisel4ml

Translating a DQNN into hardware for fast inference is commonly performed using high-level synthesis. This technique is employed by the hls4ml tool. The main drawback of HLS-based approaches is the complexity of some of the underlying problems, which are NP-hard in general and are in practice addressed with polynomial time heuristics [17]. This results in a prohibitive running time of the hls4ml tool for large designs. Additionally, it may produce sub-optimal designs [25].
To overcome these shortcomings, we developed the chisel4ml tool, a solution based on Chisel. Chisel is flexible enough to describe the circuits concisely, and since it uses a structural description of the hardware, it is able to generate circuits much faster. chisel4ml uses an internal model representation called the Low-Bitwidth Intermediate Representation (LBIR). LBIR is able to represent quantized neural networks of arbitrary bitwidths. It supports different granularities of quantization, including per-tensor, per-kernel, and per-channel granularity.
chisel4ml is composed of two main parts: a Python frontend that provides the user interface and a Scala backend that generates and simulates the circuits. The high-level software architecture is given in Figure 6.
The source code of chisel4ml is available at https://github.com/cs-jsi/chisel4ml, (accessed on 12 January 2025).

3.1. Python Frontend

The chisel4ml frontend is implemented in Python. This is because the DQNN training libraries brevitas and QKeras both have a Python interface. Thus, the designer developing the neural network will not have to change the language environment. The Python frontend has the following tasks:
  • Transforming DQNN representations
    The Python frontend transforms DQNN model descriptions in the form of QKeras or QONNX models into LBIR.
  • Simulation Interface
    An LBIR model is sent from the Python frontend to the Chisel backend. The Chisel backend then generates the circuit and stores it internally. It returns a reference to the frontend, which can be used to simulate the circuit. Simulations can be driven from Python by using numpy arrays as the stimulus. These are sent to the Chisel backend, simulated, and the simulation results are sent back to the Python frontend. The simulation results can also be saved in a waveform file (e.g., VCD). This waveform file can then be examined with a waveform viewer such as GTKWave.
  • Circuit Packaging
    The generated circuit description in the form of a Verilog file can be retrieved from internal storage by the Python frontend. This can then be used to integrate the circuit into a larger design.
 Listing 3 demonstrates the functionality described above in a short Python code snippet.
Listing 3. Demonstration of chisel4ml frontend.

3.2. Chisel Backend

The Chisel backend receives an LBIR DQNN model description and generates the corresponding hardware implementation. The generated hardware is stored internally, and a circuit identification number is sent back to the Python frontend, which can be used to trigger a simulation.

3.2.1. Representing Quantization Information

Scala, and consequently Chisel, is a strongly typed language. This helps us avoid many mistakes; however, it also makes it more difficult to create designs that can use different data types. This is important because there are many different quantization types, and it would be impractical to provide a different implementation of neurons and layers for each of them. To solve this, we define the abstract NeuronCompute class, described in Listing 4, which encapsulates the quantization type information. It provides an interface to a Chisel function for implementing a generic quantized neuron. An abridged example of a concrete NeuronCompute implementation, NCUIntSIntUInt, is shown in Listing 4. It represents a quantized neuron with signed weights and unsigned inputs and outputs. Using such a scheme, chisel4ml is able to represent other types of quantization as well. Binarized quantization is a different example, where the inputs, weights, and outputs are quantized to a single bit. In binarized quantization, the multiplication operation is the XNOR operation, and the vector addition is simplified to a population count operation [26], as shown by the NCUBoolBoolBool class in Listing 4.
Listing 4. The NeuronCompute class.
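Listing 4 is reproduced as an image in this version. The sketch below is a hypothetical reconstruction that only mirrors the behaviour described in the text; the type parameters, method names, signatures, and the particular activation functions are assumptions and do not correspond exactly to the chisel4ml source.

```scala
import chisel3._
import chisel3.util.PopCount

// Hypothetical sketch of the NeuronCompute abstraction (not the actual chisel4ml API).
abstract class NeuronCompute[I <: Data, W <: Data, A <: Data, O <: Data] {
  def mul(in: I, weight: W): A      // elementwise "multiplication"
  def addVec(xs: Seq[A]): A         // reduction of the products (and the bias)
  def actFn(acc: A, shift: Int): O  // requantization (shift) and activation
}

// Unsigned inputs/outputs with signed weights: an ordinary integer MAC with a ReLU.
class NCUIntSIntUInt(outWidth: Int) extends NeuronCompute[UInt, SInt, SInt, UInt] {
  def mul(in: UInt, weight: SInt): SInt = in.zext * weight
  def addVec(xs: Seq[SInt]): SInt = xs.reduce(_ +& _)
  def actFn(acc: SInt, shift: Int): UInt = {
    val shifted = acc >> shift                          // power-of-two rescaling
    val relu = Mux(shifted > 0.S, shifted.asUInt, 0.U)  // ReLU
    relu.pad(outWidth)(outWidth - 1, 0)                 // truncate to the output bitwidth
  }
}

// Binarized case: multiplication is an XNOR and the vector sum a population count.
class NCUBoolBoolBool(threshold: Int) extends NeuronCompute[Bool, Bool, UInt, Bool] {
  def mul(in: Bool, weight: Bool): UInt = (!(in ^ weight)).asUInt
  def addVec(xs: Seq[UInt]): UInt = PopCount(xs.map(_(0)))
  def actFn(acc: UInt, shift: Int): Bool = acc >= threshold.U  // threshold/sign activation
}
```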

3.2.2. Direct Logic Implementation of a Neuron

A quantized neuron implementation in Chisel can be represented by a function. A simplified Chisel function that generates the direct logic implementation of a neuron is shown in Listing 5. This function represents an implementation of a neuron with arbitrary quantization and is functionally equivalent to the quantized neuron described in Equation (5). It is used by higher-level modules to generate the various layers of the neural network. The inputs to this function are the following:
  • nc—the NeuronCompute object that provides the methods used by the neuron function for a given quantization type,
  • in—input features,
  • weights—weights of the input features,
  • bias—bias value,
  • shift—a scaling factor exponent.
Listing 5. Simplified neuron implementation in Chisel.
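As with Listing 4, the code is rendered as an image here; a hypothetical sketch built on the NeuronCompute sketch above is given below. Names and signatures are illustrative, and the rounding logic of the real implementation is omitted.

```scala
import chisel3._

// Hypothetical sketch of a fully parallel neuron generator (cf. Listing 5).
object Neuron {
  def apply[I <: Data, W <: Data, A <: Data, O <: Data](
      nc:      NeuronCompute[I, W, A, O],
      in:      Seq[I],
      weights: Seq[W],
      bias:    A,
      shift:   Int): O = {
    // One multiplier per (input, weight) pair -- the neuron is fully parallel.
    val products = in.zip(weights).map { case (i, w) => nc.mul(i, w) }
    // Sum the products together with the bias.
    val acc = nc.addVec(products :+ bias)
    // Requantize (shift) the sum and apply the activation function.
    nc.actFn(acc, shift)
  }
}
```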
The generality of this implementation stems from the usage of the abstract class NeuronCompute described in the previous subsection. The operation of the neuron function can be summarized as follows:
  • Create the multipliers and connect them to their corresponding inputs and weights (lines 8–10).
  • Next, the multiplication results are summed together and the bias is added (line 11).
  • A requantization operation is performed on the sum. Essentially, this is a shift operation in which the bits shifted out are used to round the result. Special attention is paid to conform with the rounding behavior of the floating-point emulation of DQNNs (line 12).
  • Finally, the requantized value is passed to an activation function. The activation function also depends on the applied quantization, specifically on the output bitwidth used, in order to properly handle overflowing values.

3.2.3. Generating Layers of Neurons

The presented neuron implementation is general and can be used in several layer types. In fully connected layers, we simply connect all neurons to all input features, while in convolutional layers, we connect each neuron to its corresponding input feature window. Connecting the input features to the appropriate neuron is the task of the NeuronProcessingUnit (NPU) class. An abridged version of the NPU is shown in Listing 6.
Listing 6. Neuron Processing Unit.
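Listing 6 is likewise an image in this version. The following sketch shows the fully connected case only, reusing the Neuron and NCUIntSIntUInt sketches above; in chisel4ml, the getReceptiveField and getReceptiveFieldWeights functions generalize the input selection so that the same class also covers convolutional windows. All names and parameters below are illustrative.

```scala
import chisel3._

// Hypothetical sketch of a NeuronProcessingUnit for a fully connected layer.
class FullyConnectedNPU(
    weights:  Seq[Seq[Int]],  // weights(n)(i): weight of input i for neuron n
    biases:   Seq[Int],
    shift:    Int,
    inWidth:  Int,
    outWidth: Int) extends Module {
  val nc = new NCUIntSIntUInt(outWidth)
  val io = IO(new Bundle {
    val in  = Input(Vec(weights.head.length, UInt(inWidth.W)))
    val out = Output(Vec(weights.length, UInt(outWidth.W)))
  })
  for (n <- weights.indices) {
    // In a fully connected layer, the receptive field of every neuron is the whole input vector.
    io.out(n) := Neuron(nc, io.in, weights(n).map(_.S), biases(n).S, shift)
  }
}
```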
The class in Listing 6 implements an arbitrary convolutional layer or a fully connected layer. Each neuron in the layer is instantiated by calling the neuron function. For instance, to implement a fully connected layer with two neurons, depicted in Figure 7, the neuron function is called twice. The getReceptiveField and getReceptiveFieldWeights functions connect the neurons with the corresponding input features and weights. The operation of the getReceptiveField function in the case of a fully connected layer and a convolutional layer is illustrated in Figure 7 and Figure 8, respectively. These two figures illustrate the connections of the input feature vector in to the neurons and then to the output feature vector out. In the case of the fully connected layer in Figure 7, all four inputs are connected to both neurons. In the case of the convolutional layer in Figure 8, each neuron is only connected to the input features that correspond to the neuron window. Note that only the first and last neurons are depicted to simplify the presented figure.
The generation of the maximum pooling layer is similar to the generation of convolutional layers. The getReceptiveField functions are reused; however, instead of the NeuronCompute object, an analogous OrderCompute object is used.

3.2.4. Interface Generation

chisel4ml generates an AXI-Stream style interface that can then be used to integrate the neural network into a larger design. To support flexible integration of the generated DQNN hardware into higher-level designs, the user can choose the width of this interface. Input and output registers are added to the DQNN hardware structure to integrate it with the AXI-Stream interface. Currently, an additional pipeline stage is added for each DQNN layer in order to lower path delays and meet the clock requirements of the overall system. The pipeline registers are distributed throughout the combinational circuit by the synthesis tool (e.g., Vivado) using register retiming.
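For illustration, the sketch below shows a minimal ready/valid (AXI-Stream style) buffer of the kind such an interface is built from; it is not the chisel4ml interface generator, which is configurable in width and inserts per-layer pipeline registers.

```scala
import chisel3._
import chisel3.util.Decoupled

// Minimal single-entry stream buffer with an AXI-Stream style handshake.
class StreamBuffer(dataWidth: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Flipped(Decoupled(UInt(dataWidth.W)))  // TDATA with TVALID/TREADY
    val out = Decoupled(UInt(dataWidth.W))
  })
  val buf  = Reg(UInt(dataWidth.W))
  val full = RegInit(false.B)
  io.in.ready  := !full   // accept a word only when the buffer is empty
  io.out.valid := full    // present the buffered word until it is accepted
  io.out.bits  := buf     // a real wrapper would feed the DQNN core here
  when(io.in.fire)  { buf := io.in.bits; full := true.B }
  when(io.out.fire) { full := false.B }
}
```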

4. Results

The performance of the tools was evaluated on a set of experiments, and a comparative study of chisel4ml and hls4ml was performed. First, a set of experiments on individual layers was conducted. These experiments were performed on randomly generated fully connected, convolutional, and maximum pooling layers while varying the layer attributes. Additionally, similar experiments were conducted on entire neural networks composed of the aforementioned layers. In the experiments, the performance of chisel4ml version 0.3.3 was compared with that of hls4ml. Since we encountered some difficulties with hls4ml version 1.0.0, we used a slightly earlier version of hls4ml available on GitHub.
hls4ml allows setting the parallelization factor of the generated circuit. As chisel4ml generates fully parallel circuits, we set the hls4ml parallelization factor to its maximum value in order to obtain comparable designs. Furthermore, the maximum parallelization factor necessitates setting the strategy attribute to latency. In all cases, the hardware synthesis was performed by the Vivado 2023.1 and Vitis HLS 2023.1 tools. The target of the synthesis was an AMD Virtex UltraScale+ XCVU9P-L2FLGA2104E device. The experiments were performed on a high-performance computer with 1 TiB of DRAM and a 48-core Intel Xeon E5-2680 v3 CPU running at 2.50 GHz. An overview of the experiment flow is depicted in Figure 9.
The code to reproduce the experiments, as well as detailed information on the hls4ml version used, is available at https://github.com/jurevreca12/c4ml_test_runs (accessed on 12 January 2025).
The designs were evaluated using the following three performance metrics:
  • Look-Up Tables
    The number of look-up tables used by the implemented design; it indicates the combinational complexity of the circuit.
  • Path Delay
    The combination of logic and net delay; it determines the maximum achievable clock frequency.
  • Generation Time
    The total time required to translate the model from QONNX into a synthesized FPGA design, including the synthesis performed by Vivado.

4.1. Fully Connected Layer Experiments

To study the performance of chisel4ml and hls4ml on fully connected layers, we performed four experiments. In each of the four experiments, we randomly generated a quantized fully connected layer by varying a selected attribute while keeping the other attributes constant. The following layer attributes were varied:
  • Number of input features;
  • Number of output features (neurons);
  • Number of bits used to quantize the input features;
  • Number of bits used to quantize the weights of the neurons.
In all cases, the output was quantized to four bits and the bias to eight bits. The generated layer was implemented using both the chisel4ml and the hls4ml tool. The results for the employed configurations are shown in Figure 10. In these experiments, we evaluated a total of 24 different configurations of the fully connected layer. From the results presented, we can conclude that chisel4ml generates circuits of similar combinational complexity but with slightly longer path delays. However, chisel4ml is consistently faster at generating the hardware than hls4ml.

4.2. Convolutional Layer Experiments

The performance of the convolutional layer was evaluated by four experiments, where the following layer attributes were varied:
  • Number of input channels;
  • Number of output channels;
  • Number of bits used to quantize the input features;
  • Number of bits used to quantize the weights of the neurons.
In all cases, the output was quantized to four bits, the bias to eight bits, and the kernel size was set to 3 × 3. In the experiments, the input window size was 16 × 16, except in the experiment where the number of input channels was varied, in which the input window size was 8 × 8. We reduced the input resolution for this experiment because hls4ml had problems completing it when eight or sixteen channels were used. In these experiments, a total of 22 different configurations of the convolutional layer were evaluated. As can be seen in Figure 11a,b, the generation time of hls4ml increases drastically as the number of input or output channels increases. This prevents the use of the hls4ml tool in many applications, since convolutional layers are often employed. We attribute this increase to the use of polynomial time algorithms in HLS. On the other hand, chisel4ml is able to generate the design in a much shorter time, which increases linearly with the number of input or output channels. At the same time, the circuits generated by chisel4ml consume fewer hardware resources in most cases but have slightly larger path delays.

4.3. Maximum Pooling Layer Experiments

Four experiments were conducted to study the implementations of maximum pooling layers. In these experiments, we varied the following attributes of the layer:
  • Number of channels;
  • Input size (n); there were n × n × num_of_channels input features;
  • Kernel size (k); the kernel window size was k × k;
  • Number of bits used to quantize the input and output features.
In the experiments, depicted in Figure 12, 21 different maximum pooling layer configurations were evaluated. In all cases, chisel4ml was able to generate circuits with substantially fewer look-up tables and in less time. Similar to other layer types, the path delays were slightly larger.

4.4. Convolutional Neural Network Experiments

Two experiments were performed on a convolutional neural network trained on the MNIST dataset [27]. The topology of the test neural network, given in Table 1, was the same in both experiments. It is a simple convolutional neural network consisting of two convolutional layers, two maximum pooling layers, and two fully connected layers. One experiment focused on the effect of the activation/weight bitwidth, and the second on the effect of pruning. In the first experiment, 50% of the neural network connections were pruned, and the bitwidths of all activations and weights were varied simultaneously. In the second experiment, the bitwidths of all activations and weights were set to 4 bits, and the pruning rate was varied from 50% to 90%. In total, we trained and tested 12 different neural networks.
Each neural network was trained for 10 epochs using Quantization-Aware Training. Pruning was also performed during training. For the first five epochs of training, the neural networks used batch normalization, after which it was removed by folding it into the preceding layer [28]. After that, another five epochs of training were performed. The training was performed using the Adam optimizer [29] and a learning rate of 0.001. No hyper-parameter optimization was performed. Figure 13a,b show the accuracies achieved by the trained neural networks on the MNIST test dataset. As we can see, these neural networks are fairly insensitive to quantization. A substantial drop in accuracy occurred only when the weights and activations were quantized to two bits. Additionally, a fairly large rate of pruning is possible before the accuracy drops significantly. This happens when around 85% of the weights are pruned.
FPGA implementations of trained quantized and pruned convolutional neural networks were generated using the chisel4ml and the hls4ml tools. The results of the implementations are shown in Figure 14. The results reflect the observations made in previous experiments on the fully connected, the convolutional, and the maximum pooling layers. In general, chisel4ml is able to generate smaller fully parallel implementations in significantly less time. However, the implementations have a slightly higher path delay.

5. Discussion

A concise interpretation of the experimental results is given in Table 2. The comparison of look-up table utilization is given in terms of the range of relative change from hls4ml to chisel4ml. The path delay and the generation time are both given as a qualitative description of their growth rate. This is more important than the actual values, as it indicates the growth rate of the underlying algorithms. Since the bitwidths of the DQNN parameters are below eight, the experiments with varied bitwidth should be interpreted in terms of dependence rather than growth rate.
We see that the relative difference between the number of utilized look-up tables used by chisel4ml and hls4ml varies significantly depending on the layer type. For fully connected layers, chisel4ml and hls4ml generate nearly equivalent designs. For convolutional layers, chisel4ml produces circuits with up to 46% fewer look-up tables, although there is one outlier. However, for the maximum pooling layer, there is a significant difference, and chisel4ml uses around 80% fewer look-up tables. We were not able to find the reason for this discrepancy.
The path delay results show that the path delay of the hls4ml generated circuits varies only slightly in most cases and can be interpreted as a constant. This is not surprising, as a desired clock period is specified to the generation process, and HLS strives to meet this demand. In general, the path delay depends on the complexity of the circuit as well as the number of pipeline stages. Currently, chisel4ml uses a constant number of pipeline stages. Because of this, the path delay increases with the rising complexity of the circuit. While the path delays in chisel4ml are longer, they are still low enough for most applications.
The generation time results reflect the complexity of the applied algorithms. From the conducted experiments, we can conclude that the bitwidths of the parameters do not significantly affect the generation time of either hls4ml or chisel4ml. However, the growth rate of the hls4ml generation time is polynomial with respect to the number of input and/or output features. This limits the usability of the hls4ml tool in the case of medium and large DQNNs. On the other hand, the growth rate of the chisel4ml generation time is linear. As expected, the generation time of chisel4ml is shorter than that of hls4ml in all experiments. The difference in generation time increases significantly with the number of features and is most notable in the case of convolutional and maximum pooling layers.
In the CNN experiments, we used a fixed topology of the neural network so that the number of features and parameters is constant. We conducted two experiments: in the first experiment, we varied the bitwidth of the parameters and activations, and in the second experiment, we varied the pruning rate.
The CNN is composed of a series of layers, and the results are affected by each layer. The circuits generated by chisel4ml utilize up to 40% fewer look-up tables than the circuits generated by hls4ml as expected from the experiments performed on individual layer types. Similarly, the path delay dependency is linear in the case of chisel4ml and constant in the case of hls4ml. Likewise, the generation time increases linearly with the bitwidth. Note, however, that the chisel4ml generation time is significantly shorter.
Pruning can significantly reduce the complexity of the circuit while keeping the accuracy level relatively constant. While both chisel4ml and hls4ml can take advantage of pruning to produce smaller circuits, chisel4ml retains its advantages over hls4ml.

6. Conclusions

We presented chisel4ml, a tool for generating direct logic implementations of deeply quantized neural networks. It is a good solution for applications where very high parallelization is required. chisel4ml was compared to hls4ml, an established tool based on high-level synthesis techniques. chisel4ml was able to generate comparable or better hardware implementations in significantly less time because it uses a structural description of the circuits in the form of Chisel generators. This gives it an advantage over HLS-based tools such as hls4ml.
One drawback of chisel4ml designs is their higher path delays. To address this issue, we intend to give the designer the ability to control the number of pipeline stages directly. In this way, a designer will be able to reach a desired clock period by iteratively changing the number of register stages and re-running synthesis.
In our comparison, fully parallelized implementations were generated using hls4ml. However, an advantage of using hls4ml is that the designer can adjust the parallelization factor and by doing so select an acceptable trade-off between complexity, latency, and throughput of the circuit. We intend to extend chisel4ml with a set of parameterizable sequential accelerators, which will allow us to generate implementations that achieve better design trade-offs.

Author Contributions

Methodology, A.B.; Software, J.V.; Writing—original draft, J.V.; Writing—review and editing, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency under grant: No. P2-0098. This work is also part of projects that are funded by the ECSEL Joint Undertaking under grant agreement No 101007273 (DAIS) and by the Chips Joint Undertaking under grant agreement No 101139892 (EdgeAI-Trust).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://huggingface.co/datasets/ylecun/mnist, (accessed on 16 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FPGA: Field-Programmable Gate Array
ANN: Artificial Neural Network
LHC: Large Hadron Collider
GPGPU: General Purpose Graphics Processing Unit
ASIC: Application-Specific Integrated Circuit
QNN: Quantized Neural Network
QONNX: Quantized Open Neural Network Exchange
CHISEL: Constructing Hardware In a Scala Embedded Language
HLS: High-Level Synthesis
RTL: Register-Transfer Level
PTQ: Post-Training Quantization
QAT: Quantization-Aware Training
DQNN: Deeply Quantized Neural Network
FIR: Finite Impulse Response
LBIR: Low-Bitwidth Intermediate Representation
DRAM: Dynamic Random Access Memory
ReLU: Rectified Linear Unit

References

  1. Deiana, A.M.; Tran, N.; Agar, J.; Blott, M.; Di Guglielmo, G.; Duarte, J.; Harris, P.; Hauck, S.; Liu, M.; Neubauer, M.S.; et al. Applications and Techniques for Fast Machine Learning in Science. Front. Big Data 2022, 5, 787421.
  2. Duarte, J.; Han, S.; Harris, P.; Jindariani, S.; Kreinar, E.; Kreis, B.; Ngadiuba, J.; Pierini, M.; Rivera, R.; Tran, N.; et al. Fast inference of deep neural networks in FPGAs for particle physics. J. Instrum. 2018, 13, P07027.
  3. Shi, R.; Ogrenci, S.; Arnold, J.; Berlioz, J.; Hanlet, P.; Hazelwood, K.; Ibrahim, M.; Liu, H.; Nagaslaev, V.; Narayanan, A.; et al. ML-Based Real-Time Control at the Edge: An Approach Using hls4ml. In Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 27–31 May 2024; p. 191.
  4. Jentzsch, F.; Umuroglu, Y.; Pappalardo, A.; Blott, M.; Platzner, M. RadioML Meets FINN: Enabling Future RF Applications with FPGA Streaming Architectures. IEEE Micro 2022, 42, 125–133.
  5. Vreča, J.; Ivanov, I.; Papa, G.; Biasizzo, A. Detecting Network Intrusion Using Binarized Neural Networks. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14 June–31 July 2021; pp. 622–627.
  6. Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’17), Monterey, CA, USA, 22–24 February 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 65–74.
  7. Blott, M.; Preußer, T.B.; Fraser, N.J.; Gambardella, G.; O’brien, K.; Umuroglu, Y.; Leeser, M.; Vissers, K. FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks. ACM Trans. Reconfig. Technol. Syst. 2018, 11, 16.
  8. FastML Team. fastmachinelearning/hls4ml. 2023. Available online: https://zenodo.org/records/14344847 (accessed on 12 January 2025).
  9. Aarrestad, T.; Loncar, V.; Ghielmetti, N.; Pierini, M.; Summers, S.; Ngadiuba, J.; Petersson, C.; Linander, H.; Iiyama, Y.; Guglielmo, G.D.; et al. Fast convolutional neural networks on FPGAs with hls4ml. Mach. Learn. Sci. Technol. 2021, 2, 045015.
  10. Coelho, C.N.; Kuusela, A.; Li, S.; Zhuang, H.; Ngadiuba, J.; Aarrestad, T.K.; Loncar, V.; Pierini, M.; Pol, A.A.; Summers, S. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. Nat. Mach. Intell. 2021, 3, 675–686.
  11. Pappalardo, A. Xilinx/brevitas. Available online: https://zenodo.org/records/13912206 (accessed on 12 January 2025).
  12. Pappalardo, A.; Umuroglu, Y.; Blott, M.; Mitrevski, J.; Hawks, B.; Tran, N.; Loncar, V.; Summers, S.; Borras, H.; Muhizi, J.; et al. QONNX: Representing Arbitrary-Precision Quantized Neural Networks. arXiv 2022, arXiv:2206.07527.
  13. Umuroglu, Y.; Borras, H.; Loncar, V.; Summers, S.; Duarte, J. fastmachinelearning/qonnx. 2022. Available online: https://zenodo.org/records/14537023 (accessed on 12 January 2025).
  14. AMD Xilinx. Vitis High-Level Synthesis User Guide, v2023.1; AMD Xilinx: San Jose, CA, USA, 2023.
  15. Cong, J.; Liu, B.; Neuendorffer, S.; Noguera, J.; Vissers, K.; Zhang, Z. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2011, 30, 473–491.
  16. McFarland, M.C.; Parker, A.C.; Camposano, R. Tutorial on high-level synthesis. In Proceedings of the 25th ACM/IEEE Design Automation Conference (DAC ’88), Atlantic City, NJ, USA, 12–15 June 1988; IEEE Computer Society Press: Washington, DC, USA, 1988; pp. 330–336.
  17. Cong, J.; Zhang, Z. An efficient and versatile scheduling algorithm based on SDC formulation. In Proceedings of the 43rd Annual Design Automation Conference (DAC ’06), San Francisco, CA, USA, 24–28 July 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 433–438.
  18. Vreča, J.; Biasizzo, A. Towards Deploying Highly Quantized Neural Networks on FPGA Using Chisel. In Proceedings of the 26th Euromicro Conference on Digital System Design (DSD), Golem, Albania, 6–8 September 2023; pp. 161–167.
  19. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6105–6114.
  20. Bhalgat, Y.; Lee, J.; Nagel, M.; Blankevoort, T.; Kwak, N. LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 2978–2985.
  21. Ghielmetti, N.; Loncar, V.; Pierini, M.; Roed, M.; Summers, S.; Aarrestad, T.; Petersson, C.; Linander, H.; Ngadiuba, J.; Lin, K.; et al. Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml. Mach. Learn. Sci. Technol. 2022, 3, 045011.
  22. Ashkboos, S.; Markov, I.; Frantar, E.; Zhong, T.; Wang, X.; Ren, J.; Hoefler, T.; Alistarh, D. QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3355–3371.
  23. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. In Low-Power Computer Vision; Thiruvathukal, G.K., Lu, Y.H., Kim, J., Chen, Y., Chen, B., Eds.; Chapman and Hall/CRC: New York, NY, USA, 2022; Chapter 13; p. 36.
  24. Bailey, S.; Izraelevitz, A.; Lin, R.; Markley, C.; Rigge, P.; Wang, E. Chisel Bootcamp. Available online: https://github.com/freechipsproject/chisel-bootcamp (accessed on 13 November 2024).
  25. Alam, S.A.; Gregg, D.; Gambardella, G.; Preusser, T.; Blott, M. On the RTL Implementation of FINN Matrix Vector Unit. ACM Trans. Embed. Comput. Syst. 2023, 22, 94.
  26. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4114–4122.
  27. Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Process. Mag. 2012, 29, 141–142.
  28. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
  29. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
Figure 1. A diagram of the convolution operation.
Figure 2. A diagram of the maximum-pooling operation.
Figure 3. A computational graph of (a) a fully connected layer and (b) the same graph with quantization functions inserted.
Figure 4. Diagram of the generated FIR circuit.
Figure 5. Diagram of a generic FIR filter circuit.
Figure 6. Software architecture of chisel4ml.
Figure 7. Fully connected layer mapping of four inputs to two outputs.
Figure 8. Convolutional layer mapping with the input size of 3 × 3 and a kernel and output size of 2 × 2 .
Figure 9. Experiment flow.
Figure 10. Results of fully connected layer experiments. (a) With 32 output features. Inputs, weights, and outputs quantized to 4 bits. Bias quantized to 8 bits. (b) With 16 input features. Inputs, weights, and outputs quantized to 4 bits. Bias quantized to 8 bits. (c) With 32 input and output features. Weights and outputs quantized to 4 bits. Bias quantized to 8 bits. (d) With 32 input and output features. Inputs and outputs quantized to 4 bits. Bias quantized to 8 bits.
Figure 11. Results of convolutional layer experiments. (a) With the input size of 8 × 8 , 1 output channel, and a 3 × 3 kernel. Inputs, weights, and outputs quantized to 4 bits. Bias quantized to 8 bits. (b) With the input size of 16 × 16 , 1 input channel, and a 3 × 3 kernel. Inputs, weights, and outputs quantized to 4 bits. Bias quantized to 8 bits. (c) With the input size of 16 × 16 , 1 input and output channel, and a 3 × 3 kernel. Weights and outputs quantized to 4 bits. Bias quantized to 8 bits. (d) With the input size of 16 × 16 , 1 input and output channel, and a 3 × 3 kernel. Inputs and outputs quantized to 4 bits. Bias quantized to 8 bits.
Figure 12. Results of maximum pooling layer experiments. (a) With 3 channels, 3 × 3 kernel size, and inputs and outputs quantized to 4 bits. (b) With 8 × 8 input size, 3 × 3 kernel size, and inputs and outputs quantized to 4 bits. (c) With 3 channels, 8 × 8 input size, and inputs and outputs quantized to 4 bits. (d) With 3 channels, 8 × 8 input size, and 2 × 2 kernel size.
Figure 13. Accuracy of the trained quantized (a) and pruned (b) neural networks.
Figure 14. Results of convolutional neural network experiments. (a) With pruning rate of 0.5. (b) With weights and activations quantized to 4 bits.
Table 1. Topology of the test neural network.

Layer                 | Input Size      | Kernel Size | Output Size | Activation
Convolution           | 28 × 28 × 1     | 3 × 3       | 26 × 26 × 8 | ReLU
Maximum Pooling       | 26 × 26 × 8     | 2 × 2       | 13 × 13 × 8 | -
Depthwise Convolution | 13 × 13 × 8     | 3 × 3       | 11 × 11 × 8 | ReLU
Maximum Pooling       | 11 × 11 × 8     | 2 × 2       | 5 × 5 × 8   | -
Fully Connected       | 200 (5 × 5 × 8) | -           | 256         | ReLU
Fully Connected       | 256             | -           | 10          | -
Table 2. Experimental results.

Layer           | Varied Attribute | LUTs (rel. change) | Path Delay (c4ml) | Path Delay (hls4ml) | Generation Time (c4ml) | Generation Time (hls4ml)
Fully Connected | Input Features   | −5%/−1%            | lin               | const               | lin                    | poly
Fully Connected | Output Features  | −2%/13%            | const             | const               | lin                    | poly
Fully Connected | Input Bitwidth   | −3%/9%             | const             | const               | const                  | const
Fully Connected | Weight Bitwidth  | −22%/6%            | const             | const               | const                  | const
Convolutional   | Input Channels   | −55%/−32%          | poly              | const               | lin                    | poly
Convolutional   | Output Channels  | −32%/−6%           | const             | const               | lin                    | poly
Convolutional   | Input Bitwidth   | −46%/−11%          | const             | const               | const                  | const
Convolutional   | Weight Bitwidth  | −46%/19%           | const             | const               | const                  | const
Maximum Pooling | Input Size       | −83%/−81%          | const             | const               | lin                    | poly
Maximum Pooling | Channels         | −83%/−81%          | const             | const               | lin                    | poly
Maximum Pooling | Kernel Size      | −85%/−79%          | const             | const               | lin                    | lin
Maximum Pooling | Input Bitwidth   | −83%/−81%          | lin               | lin                 | lin                    | lin
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
