CNN2Gate: An Implementation of Convolutional Neural Networks Inference on FPGAs with Automated Design Space Exploration

Abstract: Convolutional Neural Networks (CNNs) have a major impact on our society because of the numerous services they provide. These services include, but are not limited to, image classification, video analysis, and speech recognition. Recently, the number of research efforts that use FPGAs to implement CNNs has been increasing rapidly. This is due to the lower power consumption and easy reconfigurability offered by these platforms. Because of the research efforts put into topics such as architecture, synthesis, and optimization, new challenges are arising for integrating suitable hardware solutions with high-level machine learning software libraries. This paper introduces an integrated framework (CNN2Gate) that supports compilation of a CNN model for an FPGA target. CNN2Gate is capable of parsing CNN models from several popular high-level machine learning libraries, such as Keras, Pytorch, and Caffe2. CNN2Gate extracts the computation flow of layers, in addition to weights and biases, and applies a “given” fixed-point quantization. Furthermore, it writes this information in the proper format for the FPGA vendor’s OpenCL synthesis tools, which are then used to build and run the project on FPGA. CNN2Gate performs design-space exploration and automatically fits the design on different FPGAs with limited logic resources. This paper reports results of automatic synthesis and design-space exploration of AlexNet and VGG-16 on various Intel FPGA platforms.


Introduction
The impact of machine learning and deep learning is rapidly growing in our society, due to their diverse technological advantages. Convolutional neural networks (CNNs) are among the most notable architectures and provide a very powerful tool for many applications, such as video and image analysis, speech recognition, and recommender systems [1]. On the other hand, CNNs require considerable computing power. In order to satisfy these requirements, it is possible to use high-performance processors, like graphic processing units (GPUs) [2]. However, GPUs have some shortcomings that limit their usability and suitability in day-to-day mission-critical and real-time scenarios. The first downside of using GPUs is their high power consumption. This makes GPUs hard to use in robotics, drones, self-driving cars, and the Internet of Things (IoT), even though these fields can benefit greatly from deep learning algorithms. The second downside is the lack of external Inputs/Outputs (I/Os). GPUs are typically accessible through some PCI-express bus on their host
The present paper proposes methods for tackling research challenges that are related to the integration of high-level synthesis tools for convolutional neural networks on FPGAs. The contributions of the present paper are:
1. Generalized model analysis. Most of the previous implementations of CNNs on FPGA fall short in supporting as many machine learning libraries as possible. CNN2Gate benefits from a generalized model transfer format called ONNX (Open Neural Network eXchange format) [18]. ONNX helps hardware synthesis tools to be decoupled from the framework in which a specific CNN was designed.
Using ONNX in CNN2Gate provides us the ability to focus on hardware synthesis without being concerned by details of specific machine learning tools. CNN2Gate integrates an ONNX parser that extracts the computation data-flow, as well as weights and biases from the ONNX representation of a CNN model. It then writes these data in a format that is more usable with hardware synthesis workflows.

2. Automated high-level synthesis tool. Several hardware implementation aspects of machine learning algorithms are not part of the skill set of many machine learning engineers and computer scientists. High-level synthesis is one of the solutions that aim at improving hardware design productivity. CNN2Gate offers a high-level synthesis workflow that can be used to implement CNN models on FPGA.
CNN2Gate is a Python library that automatically parses a CNN model, has the design synthesized using the vendor's OpenCL tools, and runs the model. Its embedded procedures aim to achieve the best throughput performance. Furthermore, CNN2Gate eliminates the need for FPGA experts to manually implement the CNN model targeting FPGA hardware during the early stages of the design. Being a Python library, CNN2Gate can be directly exploited by machine learning developers in order to implement a CNN model on FPGAs. CNN2Gate is built around an available open-source tool, notably PipeCNN [8], which exploits the capabilities of existing design tools to use OpenCL kernels from which high-level synthesis can be performed.

3. Hardware-aware design-space exploration. An important aspect of design-space exploration is choosing design parameters that achieve some desired performance prior to the generation of the physical design. CNN2Gate provides a design-space exploration tool that is able to adjust the level of parallelism of the algorithm to fit the design on various FPGA platforms. The design-space exploration methods that are proposed here use estimations of hardware resource requirements (e.g., DSPs, lookup tables, registers, and on-chip memory) in order to fit the design. These estimates are obtained by invoking the first stage of the synthesis tool provided by the FPGA vendor, which returns the estimated hardware resource utilization. In the next step, the tool tunes the design parameters according to the resource utilization feedback and iterates again to obtain the new hardware resource utilization. We have implemented three algorithms to undertake design-space exploration. The first algorithm is based on brute force and checks all possible parameter values. The second algorithm is based on a reinforcement learning (RL) agent that explores the design space with a set of defined policies and actions. Finally, the third algorithm uses the hill-climbing method in order to tackle the problem of design-space exploration in large convex design-spaces. The advantages and disadvantages of these three exploration algorithms will be discussed in the corresponding sections.
Note that CNN2Gate emphasizes software/hardware co-design methods. The goal of this paper is not to compare the performance of FPGAs and GPUs, but to explore the possibility of designing end-to-end generic frameworks that can leverage high-level model descriptions in order to implement FPGA-based CNN accelerators without human intervention. As shown in Figure 1, CNN2Gate is a Python library that can be used in order to perform inference of CNNs on FPGAs that is capable of:
Figure 1. The CNN2Gate overall architecture, comprising a front-end parser (Open Neural Network eXchange format (ONNX) parser), a design-space exploration module, and automated high-level synthesis.
The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 reviews the most relevant background knowledge on convolutional neural networks. Section 4 elaborates on how CNN2Gate extracts the computation flow, configures the kernels, and executes design-space exploration. Subsequently, Section 5 reports some results and compares them to other existing implementations.

Related Works
A great deal of research was conducted on implementing deep neural networks on FPGAs. Among those works, hls4ml [14], fpgaConvNet [9], and Caffeine [15] are the most similar to the present paper. hls4ml is a companion compiler package for machine learning inference on FPGA. It translates open-source machine learning models into high-level synthesizable (HLS) descriptions. hls4ml was specifically developed for an application in particle physics, with the purpose of reducing the development time on FPGA. However, as stated in the status page of the project [19], the package only supports Keras and Tensorflow for CNNs and the support for Pytorch is in development. In addition, to the best of our knowledge, hls4ml does not offer design-space exploration. fpgaConvNet is also an end-to-end framework for the optimized mapping of CNNs on FPGAs. It uses a symmetric multi-objective algorithm in order to optimize the generated design for either throughput, latency, or multi-objective criteria (e.g., throughput and latency). The front-end parser of fpgaConvNet can analyze models that are expressed in the Torch and Caffe machine-learning libraries. Caffeine is also a software-hardware co-design library that directly synthesizes Caffe models comprising convolutional layers and fully connected layers for FPGAs. CNN2Gate differs from the other cited works in three key features. First, as explained in Section 4.1, CNN2Gate leverages a model transfer layer (ONNX), which automatically brings support for most machine-learning Python libraries without binding the user to a specific machine-learning library. Second, CNN2Gate is based on OpenCL, unlike hls4ml and fpgaConvNet, which are based on C++. Third, CNN2Gate proposes an FPGA fitter algorithm that is based on reinforcement learning.
Using existing high-level synthesis technologies, it is possible to synthesize OpenCL Single Instruction Multiple Thread (SIMT) algorithms to RTL. It is worth mentioning some notable research efforts in that direction. In [20], the authors provided a deep learning accelerator targeting Intel's FPGA devices that are based on OpenCL. This architecture was capable of maximizing data-reuse and minimizing memory accesses. The authors of [21] presented a systematic design-space exploration methodology in order to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model. They used synthesis results to empirically model the FPGA resource utilization. Similarly, in [22], the authors analyzed the throughput and memory bandwidth quantitatively in order to tackle the problem of design-space exploration of a CNN design targeting FPGAs. They also applied various optimization methods, such as loop-tiling, to reach the best performance. A CNN RTL compiler is proposed in [23]. This compiler automatically generates sets of scalable computing primitives to form an end-to-end CNN accelerator.
Another remarkable OpenCL-based implementation of CNNs is PipeCNN [8]. PipeCNN is mainly based on the capability of currently available design tools in order to use OpenCL kernels in high-level synthesis. PipeCNN consists of a set of configurable OpenCL kernels to accelerate CNN and optimize memory bandwidth utilization. Data reuse and task mapping techniques have also been used in that design. Recently, in [24], the authors added a sparse convolution scheme to PipeCNN in order to further improve its throughput. Our work (CNN2Gate) follows the spirit of PipeCNN. CNN2Gate is built on top of a modified version of PipeCNN. CNN2Gate is capable of exploiting a library of primitive kernels needed to perform inference of a CNN. In addition, CNN2Gate also includes means to perform automated design-space exploration and can automatically translate CNN models that are provided by a wide variety of machine learning libraries. It should be noted that our research goal is introducing a methodology to design an end-to-end framework to implement CNN models on FPGA targets. This means, without a loss of generality, that CNN2Gate can be modified to support other hardware implementations or technologies ranging from RTL to high-level synthesis with two conditions. First, CNN layers can be expressed as pre-defined templates. Second, the amount of parallelism in the hardware templates can be controlled.
There are several reasons why we chose PipeCNN as the foundation of CNN2Gate. First, while using the OpenCL model, it is possible to fine-tune the amount of parallelism in the algorithm, as explained in Section 4.3. Second, it supports deeply pipelined designs, as explained in Section 3.2.2. Third, the library of primitive kernels can be easily adapted and re-configured based on the information that was extracted from a CNN model (Section 4.1).
Lately, reinforcement learning has been used in order to perform automated quantization for neural networks. HAQ [25] and ReLeQ [26] are examples of research efforts exploiting this technique. ReLeQ proposes a systematic approach for solving the problem of deep quantization automatically. This provides a general solution for the quantization of a large variety of neural networks. Likewise, HAQ suggests a hardware-aware automated quantization method for deep neural networks using an actor-critic reinforcement learning method [27]. Inspired by these two papers, we used a reinforcement learning algorithm to control the level of parallelism in CNN2Gate OpenCL kernels.
Finally, it is worth mentioning the challenges of implementing digital vision systems in hardware. In [28], the authors proposed a hardware implementation of a haze removal method exploiting adaptive filtering. Additionally, in [29], the authors provided an FPGA implementation of a novel method to recover clear images from degraded ones.

Background
This section provides the background knowledge that is required to understand CNNs and OpenCL-based high-level synthesis workflows.

Convolutional Neural Networks (CNNs)
CNNs are categorized as feedforward networks. Figure 2 shows the most common architecture of a CNN. A CNN consists of one or several convolutional, pooling, and fully connected layers. These layers can be connected together to shape a deep convolutional neural network model. The data enters from the input, is processed by various hidden layers, and the classification results are presented as outputs.

Convolutional Layers
Convolutional layers are used to extract features from the input data. A convolutional layer convolves the input with a convolutional kernel (filter) and passes the results to the next layer through a non-linear activation function. More specifically, each neuron in convolutional layers has a receptive field, and it is connected to other neurons in the adjacent layers through some learned weights and biases. The feature $F_k$ can be computed as:
$$F_k = f(W_k * I + b_k) \tag{1}$$
in which $I$ is the input from the preceding layer or input image, $W_k$ is the convolution kernel for feature $k$, and $b_k$ is the bias vector. The non-linear activation function is denoted $f(\cdot)$ in (1).
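As an illustration of the feature computation described above, the following minimal Python sketch computes one output feature map for a single-channel input. It assumes unit stride, no padding, and ReLU as the activation function; the function name `conv2d_feature` and the sample values are illustrative only and are not part of CNN2Gate.

```python
def conv2d_feature(image, kernel, bias):
    """Compute one feature map as f(W * I + b), with f = ReLU.

    Assumes stride 1 and no padding. As in most CNN libraries, the
    "convolution" is evaluated as a cross-correlation (no kernel flip).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(max(0, acc))  # ReLU activation
        feature.append(row)
    return feature


# Illustrative 3x3 input and 2x2 kernel (made-up values).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature = conv2d_feature(image, kernel, bias=5)
```

Every window of this input yields a weighted sum of -4, so the bias decides whether the ReLU passes the value or clamps it to zero.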

Pooling Layers
The pooling layers are used in CNNs to down-sample and reduce the spatial resolution of the extracted features. The reduction of spatial resolution can help a CNN model to overcome input distortion and angular transformations. Pooling layers are usually made of max-pooling kernels. A max-pooling kernel selects the maximum value of the region of interest. The max-pooling feature can be defined as:
$$\mathrm{MPF}_k = \max_{(i,j) \in R_{m,n}} F_k(i,j) \tag{2}$$
where $\mathrm{MPF}_k$ denotes the $k$-th max-pooling feature and $R_{m,n}$ is the region of interest around point $(m, n)$.
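The max-pooling operation can be sketched in a few lines of Python. The window size, the stride, and the name `max_pool2d` are assumptions chosen for illustration; the sample feature map is made up.

```python
def max_pool2d(feature, size=2, stride=2):
    """Return the maximum of each size x size region of interest."""
    out_h = (len(feature) - size) // stride + 1
    out_w = (len(feature[0]) - size) // stride + 1
    pooled = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            # Region of interest anchored at (m * stride, n * stride).
            region = [feature[m * stride + i][n * stride + j]
                      for i in range(size) for j in range(size)]
            row.append(max(region))
        pooled.append(row)
    return pooled


feature = [[1, 3, 2, 0],
           [4, 2, 1, 5],
           [0, 1, 7, 2],
           [3, 2, 1, 6]]
pooled = max_pool2d(feature)  # 4x4 input reduced to 2x2
```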

Fully Connected Layers
After extracting features of the input data by convolutional and pooling layers, the data are sent to a fully connected neural network that implements the decision making process. The fully connected layer interprets the extracted features and turns them into information represented with quantities. A softmax function is often used at the output of a fully connected layer in order to classify the output.
For more information on CNNs, readers are encouraged to read papers, such as [1,30].

OpenCL High-Level Synthesis on FPGAs
OpenCL can be used in order to write programs that can be executed across heterogeneous platforms. OpenCL offers the ability to describe a parallel algorithm to be implemented on FPGA, for example.
However, this parallel description is at a level of abstraction that is much higher than hardware description languages, such as VHDL. An OpenCL application consists of two parts. The OpenCL "host" program is written purely in C or C++ and can be executed on any type of processor. The "kernels", on the other hand, are accelerated functions that are implemented on some co-processor "device", such as an FPGA, while using fine-grained parallelism. The host can offload computation workload to kernels using sets of command queues (Figure 3).

OpenCL Pipes
In the OpenCL model, every kernel has access to the global memory of the device, as depicted in Figure 4a. In the case of massively parallel algorithms, memory transactions can be a bottleneck. In OpenCL 2.0 [13], "Pipes" were introduced to enable kernel-to-kernel communications. These tiny communication channels between kernels help to reduce the number of times kernels refer to memory and increase the memory access efficiency. In the case of FPGAs, these channels are implemented using FIFOs. Figure 4b shows that it is possible to stack OpenCL kernels on top of each other and pass data from one kernel to another. This feature makes FPGAs very well suited to implement stacked layers of CNNs as deeply pipelined kernels.

OpenCL High-Level Synthesis of CNNs for FPGAs
Leveraging the view from Figure 4b, a CNN model can be synthesized on FPGA while using deeply pipelined kernels as shown in Figure 4c. For this architecture, the following hardware accelerator kernels are needed: (1) Memory read (2) Memory write (3) Convolution (4) Pooling and (5) Fully connected. In many cases, convolution kernels and fully connected kernels can be fused together as a single 3-D matrix-matrix multiplication unit. Memory read/write kernels provide and store data for other kernels. This architecture has two main advantages: (1) Depending on the number of computing units in the convolution and pooling layer and the size of data fetched by memory read/write kernels, this architecture can be scalable (details will be discussed in the design-space exploration section) and (2) pipelined kernels can process data without storing the data moved between layers; this can significantly improve the memory access efficiency.

Generalized Model Analysis Using ONNX
ONNX is a library that makes deep learning models interoperable. ONNX represents the computation flow of a deep neural network as an extensible computation graph model while using its own built-in operators and data-types. This interoperable ecosystem eliminates the limitations that may stem from using any specific deep learning development framework. This makes it easier for hardware developers to exploit the most suited tool without being bound to the library with which the model was developed. ONNX provides the definition of a deep neural model while using extensible acyclic graphs. Nodes represent an operator and they have sets of inputs and outputs. ONNX can support a vast variety of tools, such as Pytorch, TensorFlow, Caffe, Keras, and more [18]. CNN2Gate offers a front-end ONNX parser for CNNs. Using ONNX as a transport layer decouples the high-level synthesis tool from the machine learning tool, as shown in Figure 1. CNN2Gate parses the computation dataflow (or the arrangement of layers) as well as the weights and biases for each layer. The CNN2Gate parser traverses the ONNX graph nodes and extracts the synthesis information of each node based on the following operator types:
• Convolution: for the convolution operator, CNN2Gate parses dilations, pads, kernel shape, and stride. The reader can refer to [31] for more information regarding these variables and how they affect the computation. It also extracts the learned weights and biases for convolutional kernels. CNN2Gate also computes the output tensor size of the layer using Equations (3) and (4). Let us assume the input of a two-dimensional convolutional kernel is of size $(c_{in}, h_{in}, w_{in})$, where $c$ denotes the number of features, $h$ denotes the height, and $w$ denotes the width. The output tensor size $(c_{out}, h_{out}, w_{out})$ can be written as:
$$h_{out} = \left\lfloor \frac{h_{in} + 2p - d\,(ks - 1) - 1}{st} + 1 \right\rfloor \tag{3}$$
$$w_{out} = \left\lfloor \frac{w_{in} + 2p - d\,(ks - 1) - 1}{st} + 1 \right\rfloor \tag{4}$$
where $ks$ is the kernel size, $st$ is the stride, while $p$ and $d$ are the padding and dilation parameters, respectively.
• Max-pooling: similar to the convolution, CNN2Gate parses dilations, pads, kernel size, and strides. However, as max-pooling is a down-sampling kernel, it does not have weights and biases. The output tensor size of a max-pooling node with input size of $(c_{in}, h_{in}, w_{in})$ is given by the same Equations (3) and (4).
• ReLu: CNN2Gate detects the presence of an activation function such as "Relu" after a convolutional or max-pooling layer.
• General Matrix Multiplication ("GEMM"): a fully connected layer appears as a GEMM operator in the ONNX dataflow graph. In CNN2Gate, there is no specific kernel for the fully connected layer.
• Softmax: CNN2Gate detects the presence of the softmax operator after a fully connected layer.
The front-end parser saves the information that specifies each layer's data in a linked structure in order to preserve the order. Later, this data structure is used by a high-level hardware synthesis tool. The preserved order serves as a guideline for the synthesizer to configure hardware pipelines.
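The output-size computation and the ordered layer record described above can be sketched as follows. This is a minimal illustration and not the CNN2Gate parser itself: `out_dim` and `LayerInfo` are hypothetical names, the formula is the commonly used one for convolution and pooling output sizes, and the layer parameters resemble the first two layers of the original AlexNet and are used only as an example.

```python
from dataclasses import dataclass


def out_dim(size_in, ks, st=1, p=0, d=1):
    """Output height or width of a convolution or max-pooling layer."""
    return (size_in + 2 * p - d * (ks - 1) - 1) // st + 1


@dataclass
class LayerInfo:
    """Hypothetical per-layer record kept in network order."""
    op_type: str      # e.g. "Conv" or "MaxPool"
    out_shape: tuple  # (c_out, h_out, w_out)


layers = []  # a plain list preserves the layer order
c, h, w = 3, 227, 227  # assumed RGB input size

# Convolution: 96 kernels of size 11x11, stride 4, no padding.
c, h, w = 96, out_dim(h, ks=11, st=4), out_dim(w, ks=11, st=4)
layers.append(LayerInfo("Conv", (c, h, w)))

# Max-pooling: 3x3 window, stride 2; the feature count is unchanged.
h, w = out_dim(h, ks=3, st=2), out_dim(w, ks=3, st=2)
layers.append(LayerInfo("MaxPool", (c, h, w)))
```

Walking `layers` from first to last gives the order in which a synthesizer would configure the hardware pipeline.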

Automated High-Level Synthesis Tool
Inspired by the Gajski-Kuhn chart [32], Figure 5 sketches the basic idea behind the CNN2Gate automated synthesis workflow. According to this diagram, the design of CNN2Gate is projected in three domains. The Algorithmic domain axis depicts the definition of concurrent algorithms and the Structural domain axis shows the building blocks to realize the Algorithmic domain. Finally, the Physical domain corresponds to the actual implementations in RTL. The lines connecting the dots show the relations between these domains. In the "Algorithmic Domain", CNN2Gate parses the information from an ONNX model, as explained in Section 4.1. In the "Structural Domain", CNN2Gate uses 8-bit fixed point arithmetic units to perform computations. In addition, it configures memories, pipes, and kernels that correspond to the information that is received from the ONNX model. In the "Physical Domain", if weights and biases are floating point numbers, CNN2Gate can quantize these values based on the information that the user provides from post-training quantization [3]. To clarify, CNN2Gate does not perform quantization itself; however, it can apply a given value that the user provides for a layer. This value can be expressed as an $(N, m)$ pair, where the fixed-point weights/biases values are represented as $N \times 2^{-m}$. Moreover, CNN2Gate performs design-space exploration and generates Register Transfer Level (RTL) models targeting FPGAs.
Note that CNN2Gate can automatically configure several memory buffers, depending on the layer operation type. For instance, if the next layer in the neural network is a fully connected layer, it writes the data to the memory buffer that is associated with the fully connected layers and, similarly, if the layer is convolutional, it writes the data to the convolution buffer.
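To make the (N, m) fixed-point representation concrete, the sketch below applies a user-given fractional length m to a floating-point weight. It assumes a signed 8-bit integer N with saturation and round-to-nearest; these choices, and the helper names `quantize` and `dequantize`, are assumptions for illustration and are not taken from the CNN2Gate sources.

```python
def quantize(value, m, bits=8):
    """Return the integer N such that N * 2**-m approximates value."""
    n = round(value * 2 ** m)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lo, min(hi, n))  # saturate to the signed range


def dequantize(n, m):
    """Recover the real value represented by the pair (N, m)."""
    return n * 2.0 ** -m


n = quantize(0.3, m=7)       # 0.3 * 128 = 38.4, rounded to N = 38
approx = dequantize(n, m=7)  # 38 / 128
```

A larger m gives finer resolution but a smaller representable range, which is why the value of m is chosen per layer from post-training quantization.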
CNN2Gate is also capable of building and running the CNN model in both emulation and full flow mode. For the emulation mode, CNN2Gate compiles the project for a CPU. In some cases, the user needs to verify whether the CNN model performs correctly in terms of computational accuracy before committing to long hours of synthesis. The compilation for the emulation mode is significantly faster (on the order of seconds) than synthesizing the full flow for the FPGA, which takes several hours. This feature makes the workflow more versatile for designers who want to iterate between an FPGA design and a CNN model to reach the best quantization parameters and accuracy. In order to synthesize the full flow on FPGA, CNN2Gate accepts the name of the FPGA board to perform design-space exploration (Section 4.3) and generates the RTL accordingly. In order to validate CNN2Gate, we tested this process by targeting three different Intel FPGA boards [33][34][35] and report the results later in this paper. In addition, we used Intel OpenCL SDK 16.1 to synthesize our designs.
Taking advantage of OpenCL flexibility, it is possible to design an architecture with several degrees of freedom. The first degree of freedom is the size of the pipes. A better throughput of the pipes means fewer congestion points for data moving from one kernel to another. The second degree of freedom is the bandwidth of data that memory write/read kernels provide. The third degree of freedom is the number of parallel convolutional (CONV) and RELU units that are used to implement convolution kernels. Figure 6 shows this concept in a simple example. These degrees of freedom for deeply pipelined kernels are adopted from what was proposed in [8]. The memory read kernel fetches $N_l$ vectors of size $N_i$ for features and weights. Tuning $N_i$ and $N_l$ can provide a better throughput for data write/read kernels. Note that the memory access schedule of where and when to read the features and weights is derived by the host program. The memory access schedule is configured by the front-end parser based on the CNN model. The number of computation lanes ($N_l$) shows the level of parallelism of the algorithm. The number of CONVs in a convolution kernel, the size of data pipes, and the number of max-pool operators in the max-pool kernel are tuned according to $N_l$. Changing $N_l$ and $N_i$ can result in different utilization ratios of FPGA resources. For instance, in the case of increasing $N_l$, (1) more on-chip memory in read/write kernels, (2) more registers for pipe FIFOs, (3) more DSP slices for CONVs, and (4) more LUTs for max-pooling kernels are needed in order to accommodate the design on FPGA.

Architectural Limitations
The architecture shown in Figure 6 is used to perform the calculation of all layers. In order to obtain practical implementations and allow targeting FPGAs of various sizes, it is necessary to fold (or time multiplex) the layers onto fewer hardware lanes and vectors [9,10]. Therefore, arbitrary choices for $N_l$ and $N_i$ are not always possible. $N_i$ should be a divisor of the features' width for all layers in order to avoid padding. Likewise, $N_l$ should be a divisor of the number of features for all layers to avoid idle lanes in some layers.
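The divisibility constraints above determine the set of admissible hardware options. A minimal sketch of how such a set could be enumerated is given below; the function names and the per-layer numbers are hypothetical, and the actual CNN2Gate enumeration may differ.

```python
from functools import reduce
from math import gcd


def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [k for k in range(1, n + 1) if n % k == 0]


def candidate_options(feature_widths, feature_counts):
    """Return every (N_i, N_l) pair that divides all layers evenly.

    N_i must divide each layer's feature width and N_l must divide
    each layer's number of features, so both are divisors of the
    greatest common divisor taken over all layers.
    """
    n_i_options = divisors(reduce(gcd, feature_widths))
    n_l_options = divisors(reduce(gcd, feature_counts))
    return [(n_i, n_l) for n_i in n_i_options for n_l in n_l_options]


# Made-up per-layer values, for illustration only.
options = candidate_options(feature_widths=[8, 16, 32],
                            feature_counts=[96, 256, 384])
```

In this example, the common divisors are those of 8 for N_i and of 32 for N_l, so the design space holds 4 x 6 = 24 options.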

Hardware-Aware Design-Space Exploration
CNN2Gate analyzes the computation flow layers that are extracted from the model by the ONNX front-end parser and determines all possible options for $N_l$ and $N_i$. CNN2Gate is also capable of interacting with the Intel OpenCL compiler to estimate resource usage for a specific choice of $N_l$ and $N_i$. Thus, there is a need to identify the best values for $N_l$ and $N_i$. The process of finding the best values of $N_l$ and $N_i$ is what we call design-space exploration. Many methods can be considered for this task, and three such methods are investigated in this section.

Brute Force Design Space Exploration (BF-DSE)
This method exhaustively searches all possible pairs of $N_l$ and $N_i$ and finds the feasible option that maximizes FPGA resource utilization. In our case, the solution maximizing resource utilization corresponds to the one providing the best throughput. This method is simple to execute and it always finds the best solution. However, for larger FPGAs with a large number of possible candidates for $N_l$ and $N_i$, an exhaustive brute force search could take excessive time. In the next section, a more efficient search strategy is described.
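The exhaustive search can be sketched as a single loop over the candidate options. In the sketch below, `estimate` stands in for the resource estimate returned by the vendor compiler; `toy_estimate` is a made-up resource model used only to make the example runnable, and the use of the average utilization as the score is an assumption.

```python
def brute_force_dse(options, estimate, thresholds):
    """Return the feasible option with the highest average utilization."""
    best, best_score = None, -1.0
    for option in options:
        usage = estimate(option)  # dict of utilization percentages
        if any(usage[k] >= thresholds[k] for k in thresholds):
            continue  # the design does not fit on the device
        score = sum(usage.values()) / len(usage)
        if score > best_score:
            best, best_score = option, score
    return best, best_score


def toy_estimate(option):
    """Made-up resource model; NOT real compiler feedback."""
    n_i, n_l = option
    return {"lut": 10 + 2 * n_l, "dsp": n_i * n_l,
            "mem": 20 + n_i + n_l, "reg": 5 + n_l}


options = [(n_i, n_l) for n_i in (4, 8, 16) for n_l in (8, 16, 32)]
thresholds = {"lut": 100, "dsp": 100, "mem": 100, "reg": 100}
best, score = brute_force_dse(options, toy_estimate, thresholds)
```

Each call to the estimator corresponds to one invocation of the synthesis tool, which is why the cost of this method grows with the number of candidate pairs.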

Reinforcement Learning Design Space Exploration (RL-DSE)
RL-DSE trains a reinforcement learning (RL) agent to find the best options for $N_l$ and $N_i$. Reinforcement learning for design-space exploration is of interest for two reasons. First, it can be faster than brute force. Second, it can be merged with other RL agents, such as HAQ [25] or ReLeQ [26], to determine the level of parallelism and the quantization of each layer. RL-DSE explores hardware options for $N_l$ and $N_i$, and finds the best fit for the specified hardware. The RL-DSE agent receives a reward signal that corresponds to the feedback provided by the Intel OpenCL compiler for FPGA resource utilization. Hence, the reward function for RL-DSE is designed in order to maximize FPGA resource utilization.
When CNN2Gate triggers the Intel OpenCL compiler to evaluate a hardware option, it receives the corresponding hardware resource utilization. This feedback comprises (1) the percentage of lookup table utilization, (2) the percentage of DSP utilization, (3) the percentage of on-chip memory block utilization, and (4) the percentage of register utilization. We denote these percentages $P_{lut}$, $P_{dsp}$, $P_{mem}$, and $P_{reg}$, respectively.
Given the percentages of resource utilization, the agent takes a series of actions. The agent starts from the minimum values of $N_l$ and $N_i$. RL-DSE can flexibly choose to (1) increase $N_l$, (2) increase $N_i$, or (3) increase both $N_i$ and $N_l$. If one of the variables reaches the maximum possible value that is based on the CNN topology, the variable is reset to its initial value.
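The three actions and the reset rule can be expressed as a small state-transition function. The sketch below is one possible interpretation: the state is a pair of indices into the sorted lists of admissible values, and an index that would move past the largest value wraps around to the smallest one. The action encoding and the name `apply_action` are assumptions.

```python
def apply_action(state, action, n_i_options, n_l_options):
    """Advance the indices of the current (N_i, N_l) choice.

    action 0: increase N_l; 1: increase N_i; 2: increase both.
    An index that would pass the largest option is reset to 0.
    """
    i_idx, l_idx = state
    if action in (1, 2):
        i_idx = (i_idx + 1) % len(n_i_options)
    if action in (0, 2):
        l_idx = (l_idx + 1) % len(n_l_options)
    return i_idx, l_idx


n_i_options = [4, 8]       # illustrative values
n_l_options = [8, 16, 32]  # illustrative values
state = (0, 0)             # start from the minimum values
state = apply_action(state, 2, n_i_options, n_l_options)
```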
Let us assume that the average usage factor is defined as:
$$F_{avg} = \frac{P_{lut} + P_{dsp} + P_{mem} + P_{reg}}{4}$$
Further assume that the user defines a vector of thresholds $T_{th}$ for the maximum usage that is tolerated for each quota. The reward function can be described as follows.
In Algorithm 1, $H_{best}$ is the best hardware option and $F_{max}$ is the maximum usage that is observed by the agent during the exploration of the search environment. In the reward shaping function, if at least one of the hardware utilization quotas ($P_{lut}$, $P_{dsp}$, $P_{mem}$, $P_{reg}$) exceeds the thresholds specified in the vector of threshold limits ($T_{th} = (T_{lut}, T_{dsp}, T_{mem}, T_{reg})$), the agent receives a negative reward for this choice and the chance of choosing this option in a later iteration decreases. The reward function keeps track of the maximum usage score $F_{max}$, and constantly updates $F_{max}$ and $H_{best}$ with the best hardware option observed by the agent in the environment. Because the best value of the reward function is unknown during exploration, there is no exact stop condition for agent exploration other than exhausting the search space. In this case, we used a variation of time-limited reinforcement learning [36], in which the number of iterations in each episode is limited. Finally, in our RL-DSE, a scaling factor $\beta = 0.01$ is applied to $F_{avg}$ in order to form the final reward function, converting it from a percentage scale to a number between 0 and 1.
Note that a discount factor $\gamma = 0.1$ is used in our RL agent design. The discount factor specifies that the agent does not have unlimited time to find the best option. Thus, the agent receives a down-scaled reward as it spends more time in the environment. The discount factor urges the agent to optimize the total discounted reward [37]:
$$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{\,t-t_0}\, r_t$$
where $r_t$ is the reward calculated at time $t$ according to Algorithm 1.
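As a small numerical illustration of the total discounted reward, the following sketch sums the rewards of one finite episode, weighting the reward at step t by the discount factor raised to the power t. The reward values are made up, and the finite horizon reflects the time-limited setting mentioned above.

```python
def discounted_return(rewards, gamma=0.1):
    """Sum of gamma**t * r_t over one finite episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Three made-up step rewards: 0.3 + 0.1 * 0.5 + 0.01 * 0.4
total = discounted_return([0.3, 0.5, 0.4])
```

With a discount factor as small as 0.1, rewards obtained after the first few steps contribute very little, which pushes the agent toward good options early in an episode.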

Algorithm 1: Reward shaping
Input: $T_{th}$, $F_{avg}$, $P_{lut}$, $P_{dsp}$, $P_{mem}$, $P_{reg}$, $N_i$ and $N_l$
Output: Reward, $H_{best}$
Variables: $F_{max}$, $T_{th} = (T_{lut}, T_{dsp}, T_{mem}, T_{reg})$
if $(P_{lut}, P_{dsp}, P_{mem}, P_{reg}) < (T_{lut}, T_{dsp}, T_{mem}, T_{reg})$ then
    if $F_{avg} > F_{max}$ then
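The reward shaping described in the text can be sketched in Python as follows. This is an illustrative reading of the description and not a transcription of Algorithm 1: the positive reward is assumed to be the scaled average usage, and the penalty value of -1 is an assumption, since the text only states that the reward is negative. The class name `RewardShaper` is hypothetical.

```python
BETA = 0.01  # scaling factor applied to the average usage


class RewardShaper:
    """Tracks the best hardware option seen so far and shapes rewards."""

    def __init__(self, thresholds):
        self.thresholds = thresholds  # (T_lut, T_dsp, T_mem, T_reg)
        self.f_max = 0.0              # highest average usage observed
        self.h_best = None            # option that achieved f_max

    def reward(self, usage, option):
        """usage: (P_lut, P_dsp, P_mem, P_reg); option: (N_i, N_l)."""
        if all(p < t for p, t in zip(usage, self.thresholds)):
            f_avg = sum(usage) / len(usage)
            if f_avg > self.f_max:
                self.f_max, self.h_best = f_avg, option
            return BETA * f_avg
        return -1.0  # assumed penalty when any threshold is exceeded


shaper = RewardShaper(thresholds=(100, 100, 100, 100))
r_fit = shaper.reward((40, 60, 50, 30), option=(8, 16))
r_over = shaper.reward((90, 120, 50, 30), option=(16, 32))
```

Here the first option fits and becomes the best option observed, while the second exceeds the DSP threshold and is penalized without changing the stored best option.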

Hill-Climbing Design Space Exploration (HC-DSE)
While exhaustive enumeration of all possible hardware options (BF-DSE) is very efficient in small design-spaces, the execution time increases proportionally to the number of possibilities in the design-space. There are other smart optimization algorithms that find the best solution significantly faster than brute force. Hill-climbing is an iterative numerical algorithm that starts from an arbitrary location in the design space and attempts to find a better solution by moving incrementally in order to optimize the target function. As shown in Figure 7, optimizer O examines all possible successors (in the 2D case A, B and C) and calculates the relative changes to the target function corresponding to those successors ($\Delta_A$, $\Delta_B$ and $\Delta_C$). Afterward, the optimizer moves toward the best choice. This process continues until there is no further improvement in the target function or, in our case, until the design no longer fits on the target FPGA. This simple algorithm is very efficient in finding a local optimum. Algorithm 2 shows the pseudo code for hill-climbing in our context.
Note that the hill-climbing algorithm is guaranteed to find the best solution in a convex design-space.
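The two-dimensional hill-climbing search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Algorithm 2: the function `hill_climb_2d`, the option lists, the single utilization limit, and the toy estimator `toy_estimate` (which stands in for the synthesizer's resource estimate) are all hypothetical. The search starts from the smallest option and treats estimated utilization as the target function.

```python
def hill_climb_2d(ni_options, nl_options, estimate, limit=100.0):
    """Climb over indices into the two option lists.

    estimate(n_i, n_l) returns an estimated utilization in percent.
    Returns the best fitting (n_i, n_l), or None if nothing fits.
    """
    i, j = 0, 0                                  # start at the smallest option
    best = estimate(ni_options[i], nl_options[j])
    if best >= limit:
        return None                              # even the smallest option does not fit
    while True:
        # Successors A, B, C: step in N_i, step in N_l, or step in both.
        candidates = []
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            a, b = i + di, j + dj
            if a < len(ni_options) and b < len(nl_options):
                score = estimate(ni_options[a], nl_options[b])
                if score < limit:                # keep only designs that fit
                    candidates.append((score, a, b))
        if not candidates:
            break                                # no successor fits on the target
        score, a, b = max(candidates)
        if score <= best:
            break                                # no further improvement
        best, i, j = score, a, b
    return ni_options[i], nl_options[j]


# Toy estimator (assumption): utilization grows with the product N_i * N_l.
def toy_estimate(n_i, n_l):
    return 10.0 + 0.3 * n_i * n_l


print(hill_climb_2d([4, 8, 16, 32], [4, 8, 16, 32], toy_estimate))
```

In this toy setting the search needs only a handful of estimator queries, whereas exhaustive enumeration would query all sixteen combinations.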
$H_{best}$ = $H_{current}$
return: $H_{best}$
end
end
Table 1 shows the execution times of AlexNet [38] and VGG-16 [39] on three platforms using CNN2Gate. As mentioned before, the user can verify the CNN model on a CPU using the CNN2Gate emulation mode in order to confirm the resulting numerical accuracy. Even though the emulation's execution time is rather large, this feature is very useful because it lets the developer verify the validity of the CNN design on the target hardware before going forward with synthesis, which is a very time-consuming process. Note that the emulation's execution time cannot serve as a reference for the throughput performance of a Core i7 processor; the emulation mode only serves the purpose of verifying the operation of the OpenCL kernels. In [40], the authors described the execution of AlexNet on desktop CPUs and reported an execution time as low as 2.15 s.
The reported results also show the scalability of this design. Indeed, results are reported for both the low-cost Cyclone V SoC and the much more expensive Arria 10 FPGA, which has far more resources. The design-space exploration algorithm automatically extended the exploited level of parallelism in order to obtain execution times that are commensurate with the capacity of the hardware platform. In order to maintain the scalability of the algorithm, it is not always possible to use arbitrary choices for the $(N_i, N_l)$ parameters. These parameters must be chosen such that the kernels can be used as the building block of all the layers. This limits the options for increasing the level of parallelism that can be exploited with a given network algorithm. Relaxing this limitation (i.e., manually designing kernels based on each layer's computation flow) could lead to better resource consumption and higher throughput, at the expense of losing the scalability that is needed to automate this process. The maximum operating frequency ($f_{max}$) varies for different FPGAs.
Indeed, it depends on the underlying technology and FPGA family. The Intel OpenCL compiler (synthesizer) automatically adjusts the PLLs on the board to use the maximum frequency for the kernels. It is of interest that the operating frequencies of the kernels supporting AlexNet and VGG-16 were the same, as they had essentially the same critical path, even though VGG-16 is a lot more complex. The larger complexity of VGG-16 was handled by synthesizing a core (i.e., Figure 6) that executes a greater number of cycles as the number of layers in the network increases.
Table 2 gives more details regarding the design-space exploration algorithms, which are coupled with the synthesis tool. All three algorithms use the synthesizer's resource utilization estimate to fit the design on the FPGA. This is important because the time consumed by design-space exploration is normally under 5 min, while the synthesis time for larger FPGAs, such as the Arria 10, can be close to 10 h. Experimenting with various DSE algorithms confirms that smart algorithms (such as reinforcement learning and hill-climbing) can provide advantages for design-space exploration when compared to brute-force enumeration. The first goal of suggesting the RL-DSE and HC-DSE methods is to demonstrate that it is possible to further decrease the exploration time by using better search methods. Analyzing the execution times shows that the reinforcement learning algorithm is almost 25 percent faster, and hill-climbing 50 percent faster, than the brute-force algorithm when optimizing the design for the Arria 10 FPGA. Note that these exploration times are significantly shorter than the synthesis time, since we use the estimate of resource usage provided by the synthesizer. Performing full synthesis as part of the exploration process is not advisable, because the exploration would then take weeks. Because HC-DSE follows the gradient of resource usage, it is always faster than BF-DSE.
This method is the best choice when the design-space is convex. However, if the search space is not convex (e.g., it has several local maxima), HC-DSE might settle on a suboptimal solution. In contrast, RL-DSE is less likely to be trapped in a local maximum, owing to the random nature of the reinforcement learning algorithm. Moreover, the RL-DSE algorithm would be more valuable if it could be exploited in conjunction with reinforcement learning quantization algorithms, such as ReLeQ [26].

Results
The goal of considering various DSE algorithms in our work is to provide the user with a versatile tool that works under different conditions and is not limited to a specific case. We included the brute-force algorithm in CNN2Gate in order to guarantee a successful exploration for small design-spaces. However, presuming that the design space is small is not always correct. For large convex design-spaces, we added the hill-climbing algorithm, which finds the best solution significantly faster than brute force. For large non-convex design-spaces, reinforcement learning was found to work well [41]. In Table 2, the design-space is small, which is why the differences in execution time between the various exploration algorithms are small. Nevertheless, HC-DSE was shown to be twice as fast as brute force when optimizing the design for the Arria 10 FPGA.
There are other model-based design-space exploration algorithms that are dedicated to a specific implementation of a library of primitives. For instance, in [24], the authors proposed a performance model for their design, with which they can predict the performance of the hardware implementation based on the resource requirements. The advantage of our proposed DSE algorithms is that they are model-agnostic: the CNN2Gate framework tunes the parallelism parameters of the design and directly queries the synthesizer for performance feedback.
We tried CNN2Gate on three platforms. The first one is a very small Cyclone V device with 15K adaptive logic modules (ALMs) and 83 DSPs. The fitter could not fit either AlexNet or VGG on this device. Clearly, the minimum space required to fit this design on an FPGA is fairly large due to the complexity of the control logic. CNN2Gate did not experience any difficulty fitting the design on bigger FPGAs, as demonstrated in Table 2. Resource utilization and hardware options $(N_i, N_l)$ are also provided; these correspond to the execution times shown in Table 1.
CNN2Gate resource consumption is very similar for AlexNet and VGG-16. For identical hardware options, CNN2Gate's synthesized core is nearly identical for all CNN architectures, as shown in Figure 6. The only difference is the size of the internal buffers that hold the data in the computation flow. More quantitatively, in our implementation, VGG-16 uses 8% more of the Arria 10 FPGA block RAMs than what is shown in Table 2 for AlexNet.
Revisiting Figure 6, the pipelined kernels are capable of reading data from global memory and processing the convolution and pooling kernels at the same time. In addition, for fully connected layers in CNNs, the convolution kernel acts as the main data processing unit and the pooling kernel is configured as a pass-through. With this hardware configuration, we can merge convolution and pooling layers into one layer. In the case of AlexNet, this leads to five fused convolution/pooling layers and three fully connected layers. Figure 8 reports the detailed execution time of these eight layers, including the memory read and write kernels for each layer. As the algorithm proceeds through the layers, the dimensions of the data (features) are reduced and the execution time decreases. Note that, in this figure, Layer 1 to Layer 5 are fused layers, each comprising a convolution, a pooling, and a ReLU layer. Additionally, note that, although convolutional layers have fewer parameters than fully connected layers, their memory consumption is far greater. Therefore, the performance of the first convolutional layers can be affected in a system accessing external memory (RAM).
Figure 8. Detailed breakdown of execution time for each layer of AlexNet (panels: Cyclone V SoC 5CSEMA5 and Arria 10 GX1150). A layer here means the execution of one round of pipelined kernels, as shown in Figure 6. In the case of AlexNet, Layer-1 to Layer-5 are combinations of memory read/write, convolution, and pooling kernels, while Layer-6 to Layer-8 are combinations of memory read/write and fully connected kernels.
Table 3 shows a detailed comparison of CNN2Gate to other existing works for AlexNet. CNN2Gate is faster than [21,22] in terms of latency and throughput. However, CNN2Gate uses more FPGA resources than [21]. To make a fair comparison, we can measure the relative performance density per DSP or per ALM.
In this case, the CNN2Gate performance density (0.266 GOp/s/DSP) is higher than the 0.234 GOp/s/DSP of [21]. There are other designs, such as [9,23], which are faster than our design in terms of pure performance. Nevertheless, the CNN2Gate method outlined above has two advantages over these designs. First, it is more scalable and automatable than the methods presented in [9,23], which require human intervention in order to reach high levels of performance. Second, the design entries of [9,23] are C and RTL, respectively, while our designs were automatically synthesized from ONNX using OpenCL. Thus, not surprisingly, our work does not achieve the maximum reported performance. This is partly due to the use of ONNX as a starting point and to keeping the algorithm scalable for both large and small FPGAs, which imposes some limitations, such as the maximum number of utilized CONV units per layer. There are also other latency reports in the literature, such as [8]. However, those latencies are measured with a favorable batch size (e.g., 16). Increasing the batch size can make more parallelism available to the algorithm, which can lead to higher throughput. Thus, for clarity, we limited the comparisons in Tables 3 and 4 to batch size = 1.
Table 4 shows a detailed comparison of CNN2Gate to other existing works for VGG-16. CNN2Gate performs comparatively better for larger neural networks, such as VGG. CNN2Gate achieves 18% lower latency than [9], despite using fewer DSPs, whereas for AlexNet, [9] was more than 50% faster than CNN2Gate. Finally, for VGG-16, we did find some hand-tailored custom RTL designs, such as [11], which are faster than CNN2Gate.

Discussions and Generalization
In this section, we explore CNN2Gate from the point of view of its software architecture. We also discuss some limitations of CNN2Gate that are caused by the library of primitives on which it is built. Finally, generalized forms of the design-space search algorithms are discussed, which can provide insight for future designs.

CNN2Gate as an External Controller with a Synthesizer in the Loop
It is helpful to see this framework in the context of a controller in order to better understand the design of CNN2Gate. CNN2Gate provides two sets of actions as a controller:

1. Arranging the basic hardware building blocks or library of primitives: CNN2Gate provides the framework in which the primitive blocks are connected to form the specific architecture of a convolutional neural network. This architecture is based on the model that is parsed from an ONNX representation of the neural network. As shown in Figure 9, CNN2Gate chooses and connects the building blocks of a convolutional neural network from a library of primitives. This library of primitives can be designed by the user or it can be adapted from other existing open-source designs. We assume that the level of parallelism is controllable in the library of primitives.

2. Providing parallelism control for algorithm or design-space exploration: after providing the layout of building blocks, CNN2Gate makes several queries to the hardware synthesizer and determines the best parameters, particularly regarding the level of parallelism, in order to obtain the best performance or performance/accuracy/cost trade-off for the hardware implementation. The performance characteristics of the hardware architecture can be defined by the user as the total logic utilization, the throughput, or a combination of them.
In our case, we adapted the library of primitives from PipeCNN [8]. The three DSE algorithms introduced in the previous sections are the control algorithms that work in conjunction with the synthesizer in order to fit the design to the desired FPGA target.
Figure 9. CNN2Gate as an external controller.
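The controller view above can be sketched as a loop that proposes hardware options and queries an estimator for feedback. The Python below shows the brute-force variant only, under stated assumptions: `brute_force_dse`, `fake_query`, the resource names, and the limits are hypothetical, and `fake_query` is a stand-in for the synthesizer's resource estimate rather than a call into any real tool. The objective (mean utilization of the fitting design) is one possible user-defined choice.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple

Option = Tuple[int, int]        # (N_i, N_l)
Report = Dict[str, float]       # resource name -> percent used


def brute_force_dse(options: Iterable[Option],
                    query: Callable[[Option], Report],
                    limits: Report) -> Optional[Option]:
    """Query every option and keep the fitting one with the highest mean usage."""
    best, best_score = None, float("-inf")
    for opt in options:
        report = query(opt)                      # one round trip to the estimator
        if any(report[k] >= limits[k] for k in limits):
            continue                             # does not fit on the target
        score = sum(report.values()) / len(report)
        if score > best_score:
            best, best_score = opt, score
    return best


# Stand-in for the synthesizer's estimate (assumption, not a real tool call).
def fake_query(opt: Option) -> Report:
    n_i, n_l = opt
    return {"lut": 5.0 + 0.2 * n_i * n_l, "dsp": 2.0 + 0.35 * n_i * n_l}


limits = {"lut": 100.0, "dsp": 100.0}
space = product([4, 8, 16], [4, 8, 16])
print(brute_force_dse(space, fake_query, limits))
```

Because the controller only sees the `query` callable, the same loop would work with a different library of primitives or a different estimator, which is the sense in which the exploration is independent of the underlying hardware model.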

Ablation Study of the Basic Hardware Building Blocks
Because CNN2Gate's primitive libraries are adapted from [8], it inherits the limitations of PipeCNN. Nevertheless, CNN2Gate is flexible and can be easily adapted to other primitive libraries. Being in the form of an external controller, CNN2Gate offers the possibility of continuous integration of new DSE algorithms and primitive hardware libraries. For instance, a more recent version of the PipeCNN primitive library was introduced to support ResNet [43]. In this case, the CNN2Gate parser block can be changed to support the new architecture, while the design-space algorithms can stay the same.
The methods that CNN2Gate uses are flexible enough to be adapted for the implementation of Recurrent Neural Networks (RNNs). This requires the user to provide the primitive libraries of the RNN. The number of parameters that control the parallelism in the algorithm might change when the library of primitives changes completely. This requires a generic N-dimensional design-space exploration. In the next section, possible generalizations of the design-space algorithms are explained.

Generalizations of Design-Space Exploration Algorithms
In Section 4.3, some algorithms are introduced in order to explore the two-dimensional design space. These algorithms can be easily generalized to N-dimensional explorations. The following sections demonstrate this generalization.

N-Dimensional Hill-Climbing Design Space Exploration
The hill-climbing algorithm is widely used in artificial intelligence to find local maxima. The relative simplicity of this algorithm makes it the first choice for our exploration algorithms. The hill-climbing algorithm will find the global maximum if the search space is convex.
A hill-climbing optimizer starts from an initial point $H_0 = (x_{10}, x_{20}, \ldots, x_{n0})$, as shown in Algorithm 3. This initial point can be chosen randomly in the search space. In each step, the optimizer visits all the neighbours of the current location and examines the objective function O. If the optimizer finds a better choice among the neighbours, it updates its position to that choice. This procedure continues iteratively until the hill climber reaches a local maximum and cannot find a better point in its neighbourhood. Note that the objective function O can be defined as the total logic utilization, the throughput, or a combination of them.
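The N-dimensional procedure described above can be sketched in Python as follows. This is an illustration, not the authors' Algorithm 3: the names `neighbours` and `hill_climb` are hypothetical, points are tuples of indices into per-dimension option lists, and a neighbour is assumed to differ from the current point by one step along exactly one dimension.

```python
from typing import Callable, Iterator, Sequence, Tuple

Point = Tuple[int, ...]


def neighbours(p: Point, sizes: Sequence[int]) -> Iterator[Point]:
    """Yield points that differ from p by +/-1 along exactly one dimension."""
    for d in range(len(p)):
        for step in (-1, 1):
            v = p[d] + step
            if 0 <= v < sizes[d]:                # stay inside the search space
                yield p[:d] + (v,) + p[d + 1:]


def hill_climb(start: Point, sizes: Sequence[int],
               objective: Callable[[Point], float]) -> Point:
    """Move to the best neighbour until no neighbour improves the objective."""
    current, value = start, objective(start)
    while True:
        best_n = max(neighbours(current, sizes), key=objective, default=None)
        if best_n is None or objective(best_n) <= value:
            return current                       # local maximum reached
        current, value = best_n, objective(best_n)


# Single-peak objective in three dimensions: the climber reaches the peak.
def peak(p: Point) -> float:
    return -float(sum((x - 3) ** 2 for x in p))


print(hill_climb((0, 0, 0), (6, 6, 6), peak))

# Two-peak objective in one dimension: the climber stops at the nearer,
# lower peak, which illustrates the local-maximum limitation.
values = [0, 2, 1, 0, 5, 0]
print(hill_climb((0,), (6,), lambda p: values[p[0]]))
```

The second call shows why the starting point matters when the search space has several local maxima: starting from index 0, the search never sees the higher peak at index 4.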

N-Dimensional Reinforcement Learning Design Space Exploration
A reinforcement learning (RL) software agent takes actions in the environment in order to maximize the reward, as shown in Figure 10. In the context of CNN2Gate, an RL agent is used to obtain the optimal control policy for choosing the best hardware options for the model in the synthesis process. Viewing CNN2Gate as an external controller (i.e., Figure 9) fits very well with the paradigm of a reinforcement learning agent. In this context, CNN2Gate is the 'Agent' that explores the available synthesis options of the convolutional neural networks (i.e., the 'Environment'). Controlling the various characteristics of a neural network using reinforcement learning has recently appeared in the literature. For instance, in [25], an RL agent is used to find the best quantization level for each layer. To set up the RL agent to explore the N-dimensional design-space, we need to take the following steps into consideration:
1. Defining the states: in the context of a hardware optimizer, states can be defined as the set of all possible hardware options. Taken together, the states represent the whole design-space.

2. Defining the actions: actions are the act of moving from one state to another. In the N-dimensional case, actions can be defined as the unit vectors of each dimension.

3. Forming a reward function: executing an action a in the state s provides the agent with a numerical reward score r. The agent tries to maximize this reward. In the context of this article, the reward can be throughput.

4. Considering a living penalty or discount factor for the agent: the agent should not have an infinite amount of time to explore the environment. The longer the agent stays in the environment, the less reward it should get. This trains the agent to find the best hardware option as fast as possible.
Having defined all of the previous steps, we are able to train our agent to find the best hardware option using common reinforcement learning techniques, such as Q-Learning [44].
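The four steps above can be sketched with tabular Q-learning in Python. This is a minimal illustration under stated assumptions, not the RL-DSE implementation: the function `q_learning_dse`, the epsilon-greedy exploration, the learning rate, the random start of each episode, and the toy reward `toy_throughput` are all hypothetical. States are tuples of option indices, actions are signed unit steps along one dimension, episodes are limited in length, and γ = 0.1 follows the discount factor quoted earlier in the paper.

```python
import random
from collections import defaultdict


def q_learning_dse(sizes, reward_fn, episodes=200, steps=20,
                   alpha=0.5, gamma=0.1, epsilon=0.3, seed=0):
    """Tabular Q-learning over an N-dimensional grid of option indices."""
    rng = random.Random(seed)
    # Actions: one step forward or backward along each dimension.
    actions = [(d, s) for d in range(len(sizes)) for s in (-1, 1)]
    q = defaultdict(float)                       # (state, action) -> value
    best_state, best_reward = None, float("-inf")

    def observe(state):
        """Evaluate a state and remember the best one seen so far."""
        nonlocal best_state, best_reward
        r = reward_fn(state)
        if r > best_reward:
            best_state, best_reward = state, r
        return r

    for _ in range(episodes):
        state = tuple(rng.randrange(k) for k in sizes)
        observe(state)
        for _ in range(steps):                   # time-limited episode
            if rng.random() < epsilon:
                action = rng.choice(actions)     # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            d, s = action
            v = min(max(state[d] + s, 0), sizes[d] - 1)   # clamp to the grid
            nxt = state[:d] + (v,) + state[d + 1:]
            r = observe(nxt)
            # Standard Q-learning update with discount factor gamma.
            target = r + gamma * max(q[(nxt, a)] for a in actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return best_state, q


# Toy reward (assumption): a throughput proxy that grows with both indices.
def toy_throughput(state):
    return float(2 * state[0] + state[1])


best, table = q_learning_dse((3, 3), toy_throughput)
print(best)
```

Tracking the best observed state separately from the Q-table mirrors the role of $H_{best}$ in the reward shaping: the answer returned to the user does not depend on where the agent happens to be when the last episode ends.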

Conclusions
This paper described the design and implementation of a general framework for developing convolutional neural networks on FPGAs. This framework takes the form of a Python library that can be integrated with a wide range of popular machine learning frameworks. CNN2Gate makes it easy for machine learning developers to program and use FPGAs in order to perform inference. CNN2Gate exploits the OpenCL synthesis workflow for FPGAs that is offered by commercial vendors. CNN2Gate is capable of parsing the data-flow of CNN models expressed in ONNX. This framework also has an integrated design-space exploration tool that helps developers find the best hardware option for synthesizing a given CNN model on an FPGA. CNN2Gate achieves a classification latency of 205 ms for VGG-16 and 18 ms for AlexNet on an Intel Arria 10 FPGA. These results are excellent considering that they were obtained by an automated design-space exploration and synthesis process that does not rely on an expert's low-level hardware knowledge.

44. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279-292.
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.