Applied Sciences
  • Article
  • Open Access

22 December 2021

MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms

1 VeriMake Innovation Lab, Nanjing Renmian Integrated Circuit Co., Ltd., Nanjing 210088, China
2 Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
3 Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA
4 National ASIC System Engineering Technology Research Center, Southeast University, Nanjing 210096, China
This article belongs to the Special Issue Hardware-Aware Deep Learning

Abstract

In Internet of Things (IoT) scenarios, it is challenging to deploy Machine Learning (ML) algorithms on low-cost Field Programmable Gate Arrays (FPGAs) in a real-time, cost-efficient, and high-performance way. This paper introduces Machine Learning on FPGA (MLoF), a series of ML IP cores implemented on low-cost FPGA platforms, which aims to help more IoT developers achieve well-rounded performance across various tasks. Using Verilog, we deploy and accelerate Artificial Neural Networks (ANNs), Decision Trees (DTs), K-Nearest Neighbors (k-NNs), and Support Vector Machines (SVMs) on 10 different FPGA development boards from seven producers. Additionally, we analyze and evaluate our design with six datasets, and compare the best-performing FPGAs with traditional SoC-based systems including NVIDIA Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo. The results show that Lattice’s ICE40UP5 achieves the best overall performance with low power consumption; on this device, MLoF reduces power by 891% and increases performance by 9 times on average. Moreover, its Cost Power Latency Production (CPLP) outperforms SoC-based systems by 25 times, which demonstrates the significance of MLoF in endpoint deployment of ML algorithms. Furthermore, we make all of the code open-source in order to promote future research.

1. Introduction

Machine Learning (ML) algorithms are effective and efficient in processing Internet of Things (IoT) endpoint data with good robustness []. As data volumes grow, IoT endpoint ML implementations have become increasingly important. Compared to traditional cloud-based approaches, they can compute in real time and reduce communication overhead []. Both industry and academia have studied the deployment of ML algorithms on System on Chip (SoC) based endpoint devices. For example, TensorFlow Lite [], X-CUBE-AI [], and the Cortex Microcontroller Software Interface Standard Neural Network (CMSIS-NN) [] are three frameworks proposed by Google, STMicroelectronics, and ARM for pre-trained models in embedded systems. However, these solutions cannot simultaneously balance power consumption, cost-efficiency, and high performance for IoT endpoint ML implementations.
Field Programmable Gate Array (FPGA) devices have an inherently parallel architecture that makes them suitable for ML applications []. Moreover, the cost and leakage power of some FPGAs are close to those of a Microcontroller Unit (MCU). Therefore, FPGAs can be an ideal target platform for IoT endpoint ML implementations. Several research efforts and vendor offerings already support the deployment of ML algorithms on FPGAs. For instance, the ongoing AITIA project, led by the Technical University of Dresden, KUL, IMEC, VUB, etc. [], is a preliminary project that investigates the feasibility of ML implementations on FPGAs. In industry, Xilinx and Intel provide some machine learning based Intellectual Property (IP) cores [,,,]. Unfortunately, these solutions are based on high-end FPGAs and demand a high level of professional expertise, which makes it difficult for many small and medium-sized corporations and individual developers to deploy them on their hardware platforms. Therefore, an ML library that can run on any FPGA platform is needed [], especially on low-cost FPGAs. Here, low-cost FPGAs are not necessarily the cheapest of all FPGAs in absolute terms; instead, the term refers to the lowest-priced FPGA series of the most representative manufacturers. Meanwhile, the lack of comprehensive comparisons makes it difficult to demonstrate the benefits of FPGA ML implementations over conventional SoC-based solutions.
Therefore, we introduce Machine Learning on FPGA (MLoF), a series of IP cores dedicated to low-cost FPGAs. The MLoF IP cores are developed in the Verilog Hardware Description Language (HDL) and can be used to implement popular machine learning algorithms on FPGAs, including Support Vector Machines (SVMs), K-Nearest Neighbors (k-NNs), Decision Trees (DTs), and Artificial Neural Networks (ANNs). We thoroughly evaluate low-cost platforms from seven FPGA producers (Anlogic ×2, Gowin ×1, Intel ×2, Lattice ×2, Microsemi ×1, Pango ×1, and Xilinx ×1). As far as we know, MLoF is the first work to implement machine learning algorithms on nearly every low-cost FPGA platform. Compared with typical implementations of machine learning algorithms on embedded systems, including NVIDIA Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo, the advantage of MLoF is that it balances cost, performance, and power consumption. Moreover, these IP cores are open-source, helping developers and researchers implement machine learning algorithms on their endpoint devices more efficiently.
The contributions of this paper are as follows:
(1) To the best of the authors' knowledge, this is the first time that four ML hardware accelerator IP cores, namely SVM, k-NN, DT, and ANN, have been implemented in Verilog HDL. The source code of all IP cores is fully disclosed at github.com/verimake-team/MLonFPGA;
(2) The proposed IP cores are deployed and validated on 10 mainstream low-cost and low-power FPGAs from seven producers to show their broad compatibility;
(3) Our designs are comprehensively evaluated on FPGA boards and embedded system platforms. The results prove that low-cost FPGAs are ideal platforms for IoT endpoint ML implementations.
The rest of this paper is organized as follows: Section 2 reviews prior work on the hardware implementations of machine learning algorithms. Section 3 introduces details of the proposed MLoF IP series. Section 4 provides a comparison of various FPGA development platforms. Section 5 contains experiments and analyses. Section 6 summarizes the results and outlines future work.

3. Machine Learning Algorithms Implementation on Low-Cost FPGAs

Normally, the post-processing of IoT data mainly focuses on two tasks: classification and regression []. By exploiting the parallelism and low power consumption of FPGAs, MLoF offers a superior solution for these workloads. Drawing on past designs, MLoF is designed to use fewer computation resources, and it includes a variety of common machine learning algorithms, namely ANN, DT, k-NN, and SVM. Details are presented in this section.
As shown in Figure 1, the system consists of the training module and the MLoF module. Since most machine learning models in IoT devices are used for inference and evaluation without requiring extensive training [], the entire model training process is completed on a PC. First, the IoT dataset is collected from the endpoints and sent to the training module. Then, an ML library (e.g., Scikit-Learn or TensorFlow Lite) [] and an ML algorithm are chosen as the first level of parameters. Thereafter, a set range of hyperparameters (the same set as in the MLoF module) is used to constrain the PC training process, as an FPGA has limited local resources. Since hyperparameters are key features in ML algorithms [,,], users can set them to different values according to Table 1 to find the best set for training. Next, with all the labeled data and hyperparameters, the best model and the best parameters (including the updated hyperparameters) are generated and sent to the MLoF module. After the parameters are received and stored in ROM or external Flash, the algorithm is deployed on an FPGA. Because the IP cores are written in Verilog HDL, the algorithms can in principle be deployed on any known FPGA platform.
Figure 1. Block diagram of the MLoF system architecture.
Table 1. Configurable parameters.
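The hand-off from the training module to ROM can be illustrated with a minimal sketch. It assumes a 16-bit two's-complement fixed-point format with 8 fractional bits and a one-word-per-line hexadecimal image; the bit widths, the function names `to_fixed_word` and `rom_image`, and the sample weights are illustrative assumptions, not the format used by the MLoF repository.

```python
def to_fixed_word(value, frac_bits=8, total_bits=16):
    """Quantize a float to a two's-complement fixed-point word (saturating).

    The Q8.8 default is an assumption made for this sketch only.
    """
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    scaled = max(lowest, min(highest, scaled))   # saturate instead of wrapping
    return scaled & ((1 << total_bits) - 1)      # two's-complement encoding


def rom_image(params, frac_bits=8, total_bits=16):
    """Return one upper-case hex string per parameter, ready to be written to a file."""
    digits = total_bits // 4
    return [format(to_fixed_word(p, frac_bits, total_bits), "0{}X".format(digits))
            for p in params]


# Hypothetical trained weights exported by the PC-side training step.
weights = [0.5, -1.25, 3.0, -0.00390625]
image = rom_image(weights)
```

Saturating rather than wrapping keeps an out-of-range weight at the nearest representable value, which is usually the safer choice when the data width is small.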
In this paper, we selected six datasets, the four most representative ML algorithms, and deployed them on 10 low-cost FPGAs, with a total of 240 combinations. The experiments and evaluations are described in detail in Section 5.

3.1. Artificial Neural Networks (ANN)

3.1.1. Overall Structure of ANN

Figure 2a shows an example ANN implementation with eight inputs and two hidden layers (eight neurons in each layer). This ANN model includes a Memory Unit (Mem), a Finite State Machine (FSM), eight Multiply-Accumulate (MAC) computing units, Multiplexers (MUX), an Activation Function Unit (AF), and a Buffer Unit (BUFFER). The Memory Unit stores the weights and biases after training. The FSM manages the computation order and the data stream. The MAC units are built from multipliers, adders, and buffers for the multiply and add operations within each neuron. Here, we implement eight MAC blocks to process the multiplications of the eight neurons in parallel. Initially, the features are entered serially and registered. Then, the MAC units sequentially process the first hidden layer from the first input feature to the eighth (Figure 2b). Next, the second hidden layer is processed sequentially from the first hidden neuron to the eighth. Finally, the output is processed by the activation function. As demonstrated in Figure 3, the multiplexers (MUX) route the data stream. The AF computes the activation functions, which are discussed in Section 3.1.2. The buffer stores the data computed by each neuron.
Figure 2. ANN algorithm implementation. (a) Block diagram of ANN. (b) Scheduling of the ANN processing.
Figure 3. Block diagram of MAC.
The entire procedure of the hardware-based ANN architecture is as follows. First, eight features are serially input to the ANN model. Each feature is fed to the eight MAC units simultaneously, so the multiplications for the eight neurons in the first layer are completed in parallel. An inner buffer stores the multiplication results. Next, each of the eight accumulated results is passed through a user-specified activation function. The outputs are stored in the buffer unit as the inputs of the next hidden layer. Finally, the results are exported from the output layer following the second hidden layer.
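The schedule described above can be mimicked in software as a rough behavioral model. In this sketch the outer loop stands for the serial arrival of features and the inner loop stands for the MAC units that work in parallel in hardware. Preloading the accumulators with the biases and applying the same activation to every layer are simplifying assumptions, and the names `dense_layer_serial` and `ann_forward` are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0


def dense_layer_serial(inputs, weights, biases, activation):
    """Model one layer: inputs arrive one per step and are broadcast to all MAC units.

    weights[n][i] is the weight from input i to neuron n.
    """
    acc = list(biases)                      # one accumulator per MAC unit (assumed bias preload)
    for i, x in enumerate(inputs):          # serial over input features
        for n in range(len(acc)):           # parallel across MAC units in hardware
            acc[n] += weights[n][i] * x
    return [activation(a) for a in acc]     # activation function unit


def ann_forward(features, layers, activation=relu):
    """Run the layers in order; the buffer unit is modeled by the `data` list."""
    data = list(features)
    for weights, biases in layers:
        data = dense_layer_serial(data, weights, biases, activation)
    return data


# Tiny hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    ([[1.0, 2.0]], [0.5]),
]
result = ann_forward([2.0, 3.0], layers)
```

With eight inputs and eight neurons per layer, the inner loop corresponds to the eight MAC blocks, so one layer takes about eight accumulation steps rather than sixty-four.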

3.1.2. Activation Function

An activation function is required within each neuron; it introduces non-linearity into the ANN, which improves its ability to model abstract relationships. Three typical activation functions are the Rectified Linear Unit (ReLU), the Hyperbolic Tangent Function (Tanh), and the Sigmoid Function []. All three are implemented in hardware, as described below.

Rectified Linear Unit (ReLU)

The mathematical representation of the ReLU is described as Equation (1):
$$\mathrm{ReLU}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \tag{1}$$
The hardware implementation is shown in Figure 4 with a comparator and a multiplexer []:
Figure 4. Block diagram of ReLU.
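As a small illustration of the comparator-plus-multiplexer idea, the sketch below applies ReLU to a two's-complement word. It assumes a 16-bit signed representation, in which case the comparison against zero reduces to inspecting the sign bit; the width and the name `relu_word` are assumptions for illustration.

```python
def relu_word(word, total_bits=16):
    """ReLU on a two's-complement word of `total_bits` bits."""
    sign_bit = (word >> (total_bits - 1)) & 1   # comparator: is the value negative?
    return 0 if sign_bit else word              # multiplexer: select 0 or the input
```

For example, `relu_word(0xFF00)` treats the input as a negative value and returns zero, while a word with a clear sign bit passes through unchanged.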

Hyperbolic Tangent Function (Tanh)

The mathematical representation of the Tanh function is described as Equation (2):
$$\mathrm{Tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2}$$
This function cannot be implemented directly in hardware using HDL. Therefore, we approximate it piecewise []. We divide the interval [0, +∞) into five sub-intervals: [0, 1], (1, 2], (2, 3], (3, 4], and (4, +∞). Table 2 lists the heuristic functions used to fit the Tanh function on each sub-interval. The fitting quality on each sub-interval is shown in Figure 5, with the error kept within an acceptable range. With this piecewise fit, the Tanh function can be implemented using only adders and multipliers.
Table 2. Fitting function at different intervals.
Figure 5. Fitting results of these functions. (a) Numerical interval [0, 1]. (b) Numerical interval (1, 2]. (c) Numerical interval (2, 3]. (d) Numerical interval (3, 4].
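The piecewise idea can be sketched as follows. The fitting functions of Table 2 are not reproduced here, so this example substitutes a straight chord through the end points of each sub-interval, which is a coarser fit than a tuned one. Saturating to 1 above 4 and handling negative inputs through the odd symmetry of Tanh are also assumptions of this sketch.

```python
import math

BREAKPOINTS = [0.0, 1.0, 2.0, 3.0, 4.0]

# (slope, intercept) of the chord through tanh at both ends of each sub-interval.
# These coefficients are illustrative stand-ins, not the entries of Table 2.
SEGMENTS = []
for a, b in zip(BREAKPOINTS[:-1], BREAKPOINTS[1:]):
    slope = (math.tanh(b) - math.tanh(a)) / (b - a)
    SEGMENTS.append((slope, math.tanh(a) - slope * a))


def tanh_piecewise(x):
    """Approximate tanh(x) with one multiply and one add per evaluation."""
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag > 4.0:                            # sub-interval (4, +inf): saturate
        return sign * 1.0
    index = max(math.ceil(mag) - 1, 0)       # [0,1] -> 0, (1,2] -> 1, (2,3] -> 2, (3,4] -> 3
    slope, intercept = SEGMENTS[index]
    return sign * (slope * mag + intercept)


# Largest deviation from math.tanh on a grid over [-6, 6].
max_error = max(abs(tanh_piecewise(k / 100.0) - math.tanh(k / 100.0))
                for k in range(-600, 601))
```

The chord fit is deliberately simple; a better-tuned function per sub-interval would reduce `max_error` at the cost of extra multipliers.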

Sigmoid Function

The mathematical representation of the Sigmoid function is described as Equation (3):
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{3}$$
Like the Tanh function, the Sigmoid function cannot be implemented directly in hardware using HDL. However, the Sigmoid function can be obtained from the Tanh function [] through the transformation in Equation (4):
$$\mathrm{Sigmoid}(x) = \frac{\mathrm{Tanh}\left(\frac{x}{2}\right) + 1}{2} \tag{4}$$
This transformation can be implemented easily in hardware by adding a shift operation and an adder to the Tanh function.
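The identity between Sigmoid and Tanh can be checked numerically, and its integer form shows where the shift and the adder come in. The sketch below assumes a fixed-point format with 8 fractional bits and uses `math.tanh` as a stand-in for the hardware Tanh unit; both choices, and the name `sigmoid_fixed`, are illustrative assumptions.

```python
import math

FRAC_BITS = 8                  # hypothetical number of fractional bits
ONE = 1 << FRAC_BITS           # fixed-point representation of 1.0


def sigmoid_from_tanh(x):
    """Floating-point form: Sigmoid(x) = (Tanh(x / 2) + 1) / 2."""
    return (math.tanh(x / 2.0) + 1.0) / 2.0


def sigmoid_fixed(x_fixed):
    """Integer form of the same identity on fixed-point values."""
    half_x = x_fixed >> 1                          # x / 2 as an arithmetic right shift
    t = round(math.tanh(half_x / ONE) * ONE)       # stand-in for the Tanh unit
    return (t + ONE) >> 1                          # adder, then shift for the final / 2


# Compare both forms against the direct definition 1 / (1 + e^(-x)).
float_error = max(abs(sigmoid_from_tanh(x) - 1.0 / (1.0 + math.exp(-x)))
                  for x in (-4.0, -1.0, 0.0, 1.0, 4.0))
fixed_error = abs(sigmoid_fixed(2 * ONE) / ONE - 1.0 / (1.0 + math.exp(-2.0)))
```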

3.2. Decision Tree (DT)

Figure 6 illustrates the implementation of a DT with multiple inputs, a depth of four, and a maximum of eight nodes on each layer. The DT consists of a Memory Unit (ROM), a Finite State Machine (FSM), eight compare units, and a dispatcher. The memory unit stores the node parameters obtained from PC training. The FSM determines which node to evaluate next based on the output of the current comparison. Each compare unit serves as a decision node. The dispatcher distributes the inputs to the nodes.
Figure 6. DT algorithm implementation. (a) The sample decision tree. (b) Block diagram of DT. (c) Scheduling of the DT processing.
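A behavioral sketch of this traversal is given below. It models the ROM as a list of node entries and the FSM as a loop that follows one comparison per step. The node encoding, the "less than or equal goes left" convention, and the sample tree are assumptions for illustration and may differ from the layout used in the MLoF Verilog sources.

```python
# Each ROM entry is either a compare node (feature_index, threshold, left, right)
# or a leaf ("leaf", label). This encoding is hypothetical.
TREE_ROM = [
    (0, 0.5, 1, 2),
    (1, 0.3, 3, 4),
    ("leaf", "C"),
    ("leaf", "A"),
    ("leaf", "B"),
]


def dt_predict(features, rom, max_depth=4):
    """Walk the tree from the root; each loop iteration models one FSM step."""
    node = 0
    for _ in range(max_depth + 1):
        entry = rom[node]
        if entry[0] == "leaf":
            return entry[1]
        feature_index, threshold, left, right = entry
        # Compare unit: choose the next node from the comparison result.
        node = left if features[feature_index] <= threshold else right
    raise ValueError("tree is deeper than max_depth")
```

Bounding the loop by `max_depth` mirrors the fixed depth of the hardware tree, so the latency of one inference is known in advance.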

3.3. The k-Nearest Neighbors (k-NN)

3.3.1. Overall Structure of k-NN

The k-NN method is used to classify the samples based on their distances. In this case, we use the squared Euclidean distance as defined in Equation (5). The structure of k-NN is demonstrated in Figure 7 with an example of eight inputs and a k-value of 6. It consists of a Memory Unit (Mem), a Finite State Machine (FSM), a subtractor, a multiplier, a buffer, an adder, and a Sorting Network and Label Finder (SNLF) module.
$$d = \sum_{i=1}^{8} \left(x_i - y_i\right)^2 \tag{5}$$
Figure 7. The k-NN algorithm implementation. (a) Block diagram of k-NN. (b) Scheduling of the k-NN processing.
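The datapath for Equation (5) can be sketched as a serial accumulation in which each step uses the subtractor, the multiplier, and the adder once. The function name and the sample vectors are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance accumulated one feature at a time."""
    acc = 0
    for xi, yi in zip(x, y):
        diff = xi - yi        # subtractor
        acc += diff * diff    # multiplier, then adder into the accumulator
    return acc


# Hypothetical input sample and stored training sample.
d = squared_distance([1, 2, 3], [1, 0, 5])
```

Skipping the square root is harmless for ranking neighbors because the square root is monotonic, and it avoids a costly operation in hardware.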

3.3.2. Structure of Sorting Network and Label Finder Module

The Sorting Network and Label Finder (SNLF) is a key module that sorts the distances and then outputs the classification or prediction result. It balances pipelined and parallel execution with a ping-pong operation. As shown in Figure 8, this module consists of three parts: a MUX, comparators, and cache registers. The MUX controls which registers each distance d_i is sent to in different clock cycles. The comparators compare each new d_i with the stored values and update the cache registers accordingly. There are 12 cache registers (Ox and Ex) for storing d_i values, as shown in Figure 7b. Specifically, the Ox registers store the six smallest d_i values seen in ‘odd’ clock cycles, and the Ex registers store the six smallest d_i values seen in ‘even’ clock cycles. This ping-pong cache lets the SNLF rank the d_i values across clock cycles, so that the six smallest can be identified. Initially, every register is set to the maximum value. In the jth clock period (j = 3, 5, …), the incoming d_i is compared with the value in each Ox register. If d_i is larger than every Ox value, it is dropped in the next period. Otherwise, d_i is inserted into the Ox registers and the largest of the six Ox values is dropped. Similarly, the Ex values are updated in the (j + 1)th period (j + 1 = 4, 6, …). These cycles repeat until the distances between the input features and all 600 training samples have been calculated. Finally, all 12 registers are compared to obtain the six smallest values, and the result is decided by a vote among the labels of these six values.
Figure 8. Block diagram of SNLF.
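The ping-pong selection can be modeled in a few lines. The sketch below keeps two banks of k registers, sends alternate distances to each bank, and merges the banks at the end before voting. Pairing each distance with its label, using zero-based cycle parity, and breaking vote ties by first occurrence are assumptions of this sketch, and it expects at least one input sample.

```python
from collections import Counter


def insert_smallest(bank, item):
    """Keep the bank's smallest entries: replace the largest one if `item` is smaller."""
    worst = max(range(len(bank)), key=lambda i: bank[i][0])
    if item[0] < bank[worst][0]:
        bank[worst] = item


def snlf_classify(distances_with_labels, k=6):
    """Return the majority label among the k smallest (distance, label) pairs."""
    max_entry = (float("inf"), None)     # registers start at the maximum value
    odd_bank = [max_entry] * k           # models the Ox registers
    even_bank = [max_entry] * k          # models the Ex registers
    for cycle, item in enumerate(distances_with_labels):
        insert_smallest(odd_bank if cycle % 2 else even_bank, item)
    # Final comparison over all 2k registers, then the label vote.
    merged = sorted(odd_bank + even_bank, key=lambda entry: entry[0])[:k]
    votes = Counter(label for _, label in merged if label is not None)
    return votes.most_common(1)[0][0]


# Hypothetical stream of 14 distances arriving in descending order.
stream = [(d, "a" if d in (1, 2, 3, 5) else "b") for d in range(14, 0, -1)]
label = snlf_classify(stream)
```

Each bank only ever sees half of the stream, so the overall k smallest distances are guaranteed to survive in the union of the two banks, which is why the final merge over all registers is sufficient.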

3.4. Support Vector Machine (SVM)

The SVM (with eight inputs) is composed of Memory Units (Mem), the Finite State Machine (FSM), multipliers, adders, and multiplexers. The pre-trained support vectors are stored in the memory unit. The FSM controls the order of output data and the running process. The FSM controlling process is mainly used in multi-class classification, where the support vector and bias are updated recursively within the structure shown in Figure 9. Multipliers and adders complete the support vector calculation in Equation (6), and the multiplexer is used for the sign function.
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{8} S_i \cdot \mathrm{Feature}_i + B\right) \tag{6}$$
Figure 9. SVM algorithm implementation. (a) Block diagram of SVM. (b) Scheduling of the SVM processing.
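A minimal sketch of the decision function in Equation (6) follows. The binary case is a weighted sum followed by a sign test. For the multi-class case, the text only states that the support vector and bias are updated recursively, so the one-model-per-class loop with a highest-score rule shown here is an assumed scheme, not necessarily the one used in MLoF. Mapping a sum of exactly zero to the positive class is also an assumption.

```python
def svm_score(weights, bias, features):
    """Weighted sum S . Feature + B, accumulated one term at a time."""
    acc = bias
    for s, f in zip(weights, features):
        acc += s * f          # multiplier and adder
    return acc


def svm_binary(weights, bias, features):
    """Sign function: the multiplexer selects +1 or -1 from the sign of the sum."""
    return 1 if svm_score(weights, bias, features) >= 0 else -1


def svm_multiclass(class_params, features):
    """class_params holds (label, weights, bias) tuples, one linear model per class."""
    best_label, best_score = None, float("-inf")
    for label, weights, bias in class_params:    # FSM loads one (S, B) set per pass
        score = svm_score(weights, bias, features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Reusing one multiply-add datapath for every class trades latency for area, which fits the resource limits of low-cost devices.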

4. Comparison of Development Platforms

The specifications of 10 different FPGA platforms from seven producers are thoroughly analyzed. The key features of these FPGA cores are listed in Table 3. It is worth noting that the Intel MAX 10 and Xilinx Artix-7 FPGAs have the richest LUTs and DSPs, which benefits the parallel implementation of multi-input machine learning algorithms. On-chip Random Access Memory (RAM) resources are also important: without enough RAM, a large amount of pre-trained data must be pre-fetched from memory into the cache, and the limited LUT resources cannot serve as buffers to cache data during the calculation. Pango PGL12G, Lattice MachXO2, and Anlogic EG4S20 all have limited RAM capacities, although the Anlogic EG4S20 has an internal SDRAM module that satisfies the need for additional caching. In addition, static power consumption is a critical metric for endpoint platforms, and the Anlogic FPGAs, Intel Cyclone 10LP, and Microchip M2S010 perform well in this regard. Moreover, the two Lattice FPGA platforms consume the least static power, which is quite competitive for endpoint implementations. Finally, both the Anlogic EF2M45 and the Microchip M2S010 are equipped with an internal Cortex-M3 core, which significantly improves their general performance in driving external devices and communication.
Table 3. The resource of 10 different FPGAs.
On-board resources, external interfaces, and prices are the three main discriminative FPGA features that the developers pay most attention to. Therefore, in Table 4, we present the features of 10 FPGAs from seven producers.
Table 4. The specification of 10 different FPGA boards.
All seven producers develop their own Electronic Design Automation (EDA) software, and Lattice even provides different EDA software for different devices. In addition, the final resource consumption is determined by the synthesis tool. Most of the seven producers use either their own synthesis tools or Synplify [], but the latter requires a separate license. Table 5 summarizes the relevant information for the seven producers. It is worth noting that Lattice’s iCEcube2 and Lattice Diamond are used for ICE40UP5 and MachXO2 development, respectively, so these two devices cannot share the same EDA.
Table 5. The specification of EDA software.

5. Experimental Analysis and Result

To evaluate the performance of these IPs, we select six typical IoT endpoint datasets for different parameter combinations and tests. As shown in Table 6, the datasets cover binary classification, multiclass classification, and regression. The Gutter Oil dataset proposed by VeriMake Innovation Lab aims to detect gutter oils [], and contains six input oil features: the pH value, refractive index, peroxide value, conductivity, pH value difference under different temperatures, and conductivity value difference under different temperatures. This dataset can be used for both dichotomous and polytomous classification. The Smart Grid dataset for research on electrical grid stability is from Karlsruher Institut für Technologie, Germany [,]. It is a dichotomous dataset with 13 input features used to determine whether the grid is stable under different loads. The Wine Quality dataset is proposed by the University of Minho [] for classifying wine quality. It is a polytomous dataset with 11 input dimensions (e.g., humidity, light, etc.), rating wines on a scale of 0 to 10. The Rain dataset by the Bureau of Meteorology, Australia, is based on data from different weather stations for recording and forecasting the weather []. It is a regression dataset that uses eight input parameters, such as wind, humidity, and light intensity, to predict the probability of rain. Power Consumption is an open-source dataset created by the University of California, Irvine, which tracks the total energy consumption of various devices within households [].
Table 6. Dataset specifications.
We use a desktop PC with a 2.59 GHz Core i7 processor to train various models on the six datasets and export the best parameters with the best scores obtained during training. For binary and multiclass classification, the scores represent the classification accuracy. For regression, the scores represent R² []. Then, these trained parameters are fed to our machine learning IPs and implemented on 10 different FPGA boards using the EDAs of the seven producers. Each EDA is configured to operate in the balanced mode with identical constraints. As shown in Table 5, some of the EDAs, such as those of Gowin and Pango, come with their own integrated synthesis tools. However, as Synplify requires a separate license, only the self-developed synthesis tools (GowinSynthesis, ADS) are used in this paper for analyzing FPGA implementations. The analysis of FPGA implementations is not limited to computing performance, but encompasses all aspects of the hardware. While Power Latency Production (PLP) [] is a common metric for evaluating the results of FPGA implementations, it does not consider cost, which is a critical factor in IoT endpoint device development. As a result, we introduce the Cost Power Latency Production (CPLP) as an additional metric for evaluating the results.
In addition, we implement the same machine learning algorithms with the same parameters on the Nvidia Jetson Nano 2, the Raspberry Pi 3B+, and the STM32L476 Nucleo [], allowing for more comprehensive comparisons with the implementations on different FPGAs.
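The two metrics can be illustrated with a short sketch. It assumes, as the names suggest, that PLP is the plain product of power and latency and that CPLP additionally multiplies by cost, with lower values being better. The units and the two sample platforms are hypothetical and are not measurements from this paper.

```python
def plp(power_mw, latency_us):
    """Power Latency Production: power multiplied by latency (assumed definition)."""
    return power_mw * latency_us


def cplp(cost_usd, power_mw, latency_us):
    """Cost Power Latency Production: cost multiplied by PLP (assumed definition)."""
    return cost_usd * plp(power_mw, latency_us)


# Hypothetical platforms: (cost in USD, power in mW, latency in microseconds).
platforms = {
    "board_a": (10.0, 20.0, 5.0),
    "board_b": (50.0, 100.0, 1.0),
}

# Rank platforms from best (smallest CPLP) to worst.
ranking = sorted(platforms, key=lambda name: cplp(*platforms[name]))
```

In this made-up example the faster board still ranks second, because its higher cost and power outweigh the latency advantage once all three factors are multiplied.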

5.1. ANN

5.1.1. ANN Parameter Analysis

We intend to find the best user-defined ANN parameters in six datasets, including the number of hidden layers, neurons within each layer, and activation functions. Different combinations of these parameters are used to train our ANN model on the desktop PC and their corresponding results are shown in Appendix A, Table A1. There are only minor differences among all the combinations. We chose to apply parameters with the best score from the software to our hardware implementation. The hyperparameter values associated with the best scores for these datasets processed using the ANN algorithm are shown in Table 7.
Table 7. The highest-scoring parameters of ANN obtained in different datasets.

5.1.2. Implementation and Analysis of ANN Hardware

Based on the ANN architectures in Table 7, we use the corresponding EDA (with the Balanced Optimization Mode in Synthesis Settings) to implement ANN on 10 different FPGA boards. The results are summarized in Appendix A, Table A2. In terms of computing performance (latency), Intel MAX10M50DAF outperforms the others in five out of six datasets, while PGL12G outperforms the competition in the Rain task. The latency differences among the 10 FPGAs are all at the millisecond level and are almost negligible. In the comprehensive comparison, Lattice’s ICE40UP5 takes first place in most application scenarios (five out of six datasets) owing to its extremely low power consumption and cost-effectiveness. The exception is the Wine Quality task, where the ANN could not be implemented on the ICE40UP5 due to resource constraints; the best-performing device on that task is the Lattice MachXO2. The FPGA deployment results with the best comprehensive performance under each task are shown in Table 8.
Table 8. ANN deployment results on the best CPLP-performing FPGAs.

5.2. DT

5.2.1. Analysis of DT Parameters

For PC simulations, 12 different combinations of the maximum depth and the maximum number of leaf nodes are chosen. The results are shown in Appendix A, Table A3. Different DT structures produce nearly identical results. Increasing the maximum depth or the maximum number of leaf nodes does not significantly improve the score. Here, we choose the best results from the various combinations for hardware deployment, and the parameters with the best scores for the six datasets are shown in Table 9.
Table 9. The highest-scoring parameters of DT obtained in different datasets.

5.2.2. Implementation and Analysis of DT Hardware

Based on the DT architectures in Appendix A, Table A4, we use the appropriate EDA (with the Balanced Optimization Mode in Synthesis Settings) to implement DT on 10 different FPGA boards. In terms of computing performance, Intel MAX10M50DAF outperforms the competition in all six datasets. In terms of comprehensive performance, Lattice’s ICE40UP5 again takes first place in most application scenarios owing to its extremely low power consumption and cost-effectiveness. The FPGA DT deployment results with the best comprehensive performance under each task are shown in Table 10.
Table 10. DT deployment results on the best CPLP-performing FPGAs.

5.3. k-NN

5.3.1. Analysis of k-NN Parameters

In our k-NN model, the parameter k is user-defined. We experiment with various k values when training our model on the PC, and the results are shown in Appendix A, Table A5. Increasing k has no significant positive effect on the score; on the contrary, it may decrease it. We choose the k value with the best score for hardware deployment. The hyperparameter values associated with the best scores for these datasets processed using the k-NN algorithm are shown in Table 11.
Table 11. The highest-scoring parameters of k-NN obtained in different datasets.

5.3.2. Implementation and Analysis of k-NN

According to the k parameters analyzed in Section 5.3.1, we implement our model on 10 FPGA boards (with the Balanced Optimization Mode in Synthesis Settings). The corresponding results are shown in Appendix A, Table A6. Gowin’s GW2A has the best computing performance in all of the task scenarios. By relying on extremely low power consumption and cost-effectiveness, Lattice’s ICE40UP5 achieves the best comprehensive performance across all datasets.
Additionally, two things are worth noting. First, Anlogic’s EF2M45 and Lattice’s MachXO2 are unable to deploy k-NN in multiple task scenarios due to resource constraints. Second, Pango’s PGL12G is also unable to deploy k-NN, because its synthesis tool fails to correctly recognize the current k-NN design and therefore ignores the critical path. This does not occur when using alternative development tools. The FPGA k-NN deployment results with the best comprehensive performance under each task are shown in Table 12.
Table 12. The k-NN deployment results on the best CPLP-performing FPGAs.

5.4. SVM

In the experiment, the linear SVM is chosen for training, and the results are shown in Table 13. Due to the functional similarity between the linear SVM and the ANN, their simulation scores are very similar. The SVM deployment results on 10 FPGA boards with the best comprehensive performance under each task are shown in Table 14. The remaining implementation results are provided in Appendix A, Table A7.
Table 13. The highest scores of SVM obtained in different datasets.
Table 14. SVM deployment results on the best CPLP-performing FPGAs.

5.5. Comparisons with Embedded Platforms

To provide a more accurate assessment of our implementation, we also compare MLoF with three representative embedded platforms, namely Nvidia Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo. The specifications of each platform are listed in Table 15. Jetson Nano is powered by a Cortex-A57 core running at 1.43 GHz and a 128-core Nvidia Maxwell-based GPU [], while Raspberry Pi features a Cortex-A53 core running at 1.2 GHz. STM32L476 Nucleo is a typical IoT development platform with a Cortex-M4 core running at 80 MHz []. Compared with Table 4, the prices of these three representative embedded development platforms are similar to those of the FPGA boards, so the platforms can be fairly compared on the other metrics. Given their comparable cost, FPGAs can be considered competitive substitutes for these typical platforms.
Table 15. The specification of three representative embedded development platforms.
Based on the earlier desktop PC simulation results and the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves shown in Figure 10, we select the model with the highest score in each of the six task scenarios for deployment on the embedded platforms []. The corresponding deployment models and architectures for each task are listed in Table 16. PyCuda is used for GPU parallel acceleration with fixed weight parameters on Jetson Nano. We use the same Python code for the implementation on the Raspberry Pi.
Figure 10. Comparison of the receiver operating characteristic curve and precision-recall curve for the six typical IoT endpoint datasets with different algorithms. (a) Receiver Operating Characteristic curve for Gutter Oil binary classification; (b) Receiver Operating Characteristic curve for Smart Grid binary classification; (c) Receiver Operating Characteristic curve for Gutter Oil multiclass classification; (d) Receiver Operating Characteristic curve for Smart Grid multiclass classification; (e) Precision-Recall curve for Gutter Oil binary classification; (f) Precision-Recall curve for Smart Grid binary classification; (g) Precision-Recall curve for Gutter Oil multiclass classification; (h) Precision-Recall curve for Smart Grid multiclass classification.
Table 16. The highest-scoring model obtained in different datasets.
Table 17 compares the performance of our FPGA implementation and the three embedded platform implementations. Jetson Nano takes the lead in terms of computing performance, while Nucleo consumes the lowest power. Compared to typical IoT endpoint platforms, however, the FPGA reduces power consumption by an average of 891% and improves performance by an average of 9 times. Moreover, FPGAs outperform all other platforms in terms of Energy Efficiency (PLP) and Cost Efficiency (CPLP).
Table 17. Breakdown of platform implemented results.
To demonstrate the benefits of FPGA implementation in IoT endpoint scenarios, we compare embedded and FPGA platforms using six datasets in terms of performance (latency), power consumption, PLP, and CPLP, as shown in Figure 11 with the ordinate axis in logarithmic scale. Jetson Nano exceeds the others in performance, but the second-best FPGA is not far behind: it is only 38% lower on average, 100% ahead of Raspberry Pi, and 2300% ahead of Nucleo. In terms of power consumption, Nucleo is quite competitive as a low-power MCU with 102 mW on average, 30 mW lower than the FPGA. These two platforms are far ahead of Jetson Nano and Raspberry Pi. Even on the logarithmic ordinate axis, it can be clearly seen that FPGAs achieve significantly lower PLP and CPLP than the other platforms. Small PLP and CPLP values are critical for IoT endpoint development and implementation, as response time, power consumption, and cost are all critical factors in IoT endpoint tasks. Furthermore, the FPGA PLP is 17× better than the average for embedded platforms and the FPGA CPLP is 25× better than the average for embedded platforms.
Figure 11. Comparison of performance (latency), power consumption, PLP, and CPLP for the six typical IoT endpoint datasets with different algorithms when implemented on FPGA and embedded platforms. (a) Comparison of performance (latency). (b) Comparison of power consumption. (c) Comparison of PLP. (d) Comparison of CPLP.

6. Conclusions

This paper introduced Machine Learning on FPGA (MLoF), a series of ML hardware accelerator IP cores for IoT endpoint devices that offer high performance, low cost, and low power consumption.
MLoF performs inference on FPGAs using the optimal parameters obtained from training on a PC. It implements four typical machine learning algorithms (ANN, DT, k-NN, and SVM) in Verilog HDL on 10 FPGA development boards from seven different manufacturers. We compared and analyzed the MLoF deployment results on six typical IoT datasets in terms of LUT usage, power, latency, cost, PLP, and CPLP. We also analyzed the synthesis results of different EDA tools for the same hardware design. Finally, we compared the best FPGA deployment results with typical IoT endpoint platforms (Jetson Nano, Raspberry, STM32L476). The results indicate that the FPGA PLP outperforms the IoT platforms by an average of 17×, owing to the FPGA's superior parallelism. Meanwhile, the FPGA CPLP is 25× better than that of the IoT platforms. To our knowledge, this is the first paper to combine hardware deployment, platform comparison, and deployment result analysis. It is also the first open-source set of FPGA IP cores for machine learning algorithms, and it has been verified on low-cost FPGA platforms.
MLoF still has room for further improvement: 1. Its adaptability could be enhanced so that more complex algorithms (k-NN with k > 16) can also be deployed on low-cost FPGAs with few resources, such as MachXO2; 2. More user-configurable parameters could be added, including more ML algorithms, larger data bit widths, and more hyperparameters; 3. Usability could be improved by providing a script file or a user interface that helps users generate the desired ML algorithm IP core more easily. These shortcomings indicate the direction of our future work.
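Item 3 above suggests a script that helps users generate an IP core. One small piece of such a flow, sketched here in Python, is converting floating-point parameters trained on a PC into fixed-point words that a Verilog design could load, for example with `$readmemh`. The number format MLoF actually uses is not described in this section, so the 16-bit word width, the 8 fractional bits, and all function names below are assumptions for illustration only.

```python
def to_fixed_point(value: float, frac_bits: int = 8, width: int = 16) -> int:
    """Quantize a real number to signed fixed point, saturating at the range limits."""
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, scaled))


def to_hex_word(q: int, width: int = 16) -> str:
    """Format a signed integer as a two's-complement hex word of the given width."""
    digits = (width + 3) // 4
    return format(q & ((1 << width) - 1), f"0{digits}x")


def export_weights(weights, frac_bits: int = 8, width: int = 16):
    """Return one hex word per parameter, in the order given."""
    return [to_hex_word(to_fixed_point(w, frac_bits, width), width) for w in weights]


if __name__ == "__main__":
    # Hypothetical trained parameters; the last one is out of range and saturates.
    trained = [0.5, -1.0, 1.25, 1000.0]
    for line in export_weights(trained):
        print(line)
```

Saturating rather than wrapping on overflow is a design choice of this sketch: a clipped weight degrades accuracy gradually, whereas a wrapped one changes sign.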

Author Contributions

Conceptualization, M.L. and R.C.; investigation, M.L. and R.C.; algorithm proposed, R.C.; hardware architecture, R.C.; data curation, R.C., T.W. and Y.Z.; validation, M.L. and R.C.; resources, M.L. and R.C.; supervision, M.L.; visualization, R.C., T.W. and Y.Z.; writing—original draft, M.L., R.C. and T.W.; writing—review and editing, M.L., R.C., T.W. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Funding of Agriculture Science and Technology in Jiangsu Province, under grant CX(21)3121.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The source code and data are available online at https://github.com/verimake-team/MLonFPGA.

Acknowledgments

The authors would like to thank Jiangsu Academy of Agricultural Sciences for all the support, and would also like to acknowledge the technical support from VeriMake Innovation Lab.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ANN: Artificial Neural Network
DT: Decision Tree
FSM: Finite State Machine
FPGA: Field Programmable Gate Array
GOP: Giga Operations per Second
HDL: Hardware Description Language
IoT: Internet of Things
IP: Intellectual Property
k-NN: K-Nearest Neighbor
LUT: Look-up Table
MAC: Multiplying Accumulator
MLoF: Machine Learning on FPGA
MCU: Microcontroller Unit
ML: Machine Learning
RAM: Random Access Memory
ReLU: Rectified Linear Unit
SoC: System on Chip
SVM: Support Vector Machine

Appendix A

Table A1. The score for different ANN architectures and activation functions.
Columns 1.4 to 4.8 give the score for each ANN architecture.

| Activation Function | Dataset | 1.4 | 1.8 | 2.4 | 2.8 | 3.4 | 3.8 | 4.4 | 4.8 |
|---|---|---|---|---|---|---|---|---|---|
| ReLU | Gutter Oil | 95.76 | 96.08 | 95.97 | 96.61 | 95.55 | 96.92 | 96.71 | 96.82 |
| | Smart Grid | 98.17 | 98.25 | 98.41 | 97.92 | 98.3 | 97.83 | 98.16 | 98.41 |
| | Gutter Oil | 95.28 | 96.55 | 96.46 | 96.28 | 96.28 | 96.97 | 93.1 | 96.19 |
| | Wine Quality | 70.84 | 71.58 | 63.01 | 71.45 | 70.22 | 71.36 | 70.7 | 72.29 |
| | Rain | 78.85 | 85.18 | 85.06 | 83.66 | 78.85 | 85.36 | 85.55 | 85.83 |
| | Power Consumption | 0.9976 | 0.9967 | 0.9894 | 0.997 | 0.9973 | 0.9964 | 0.9973 | 0.9973 |
| Tanh | Gutter Oil | 96.19 | 96.4 | 95.87 | 96.4 | 96.92 | 95.87 | 96.5 | 96.29 |
| | Smart Grid | 98.25 | 98.52 | 98.68 | 98.3 | 98.73 | 98.41 | 98.32 | 98.16 |
| | Gutter Oil | 95.73 | 96.46 | 96.37 | 96.82 | 96.37 | 96.91 | 96.53 | 96.55 |
| | Wine Quality | 71.85 | 70.44 | 71.76 | 72.95 | 72.02 | 72.29 | 72.46 | 72.94 |
| | Rain | 78.85 | 83.62 | 78.85 | 78.85 | 78.85 | 78.85 | 78.84 | 78.85 |
| | Power Consumption | 0.9948 | 0.995 | 0.9959 | 0.9965 | 0.9945 | 0.9956 | 0.9904 | 0.9958 |
| Sigmoid | Gutter Oil | 96.29 | 96.4 | 96.19 | 96.4 | 97.14 | 97.14 | 97.03 | 96.92 |
| | Smart Grid | 98.07 | 98.07 | 98.52 | 98.03 | 98.64 | 98.26 | 97.87 | 97.56 |
| | Gutter Oil | 95.55 | 95.5 | 96.27 | 96.73 | 96.1 | 97.09 | 96.83 | 97 |
| | Wine Quality | 72.07 | 72.64 | 72.86 | 72.95 | 72.15 | 72.81 | 73 | 73.01 |
| | Rain | 78.85 | 85.14 | 85.12 | 85.17 | 78.85 | 85.46 | 78.85 | 84.37 |
| | Power Consumption | 0.9953 | 0.9966 | 0.9951 | 0.9979 | 0.9954 | 0.9956 | 0.9958 | 0.9964 |
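Table A2. ANN deployment results on the low-cost FPGAs.

| Dataset | Device | LUTs | DSPs | Latency | Power | PLP | CPLP |
|---|---|---|---|---|---|---|---|
| Gutter Oil | EF2M45 | 1445 | 15 | 11.76 | 492.48 | 5789.10 | 149,937.75 |
| | EG4S20 | 1165 | 25 | 11.40 | 371.82 | 4238.32 | 211,492.13 |
| | GW2A | 1833 | 12 | 11.45 | 451.47 | 5167.12 | 873,243.27 |
| | Cyclone10LP10CL | 1795 | 15 | 10.55 | 271.38 | 2863.87 | 142,907.27 |
| | MAX10M50DAF | 1400 | 25 | 10.17 | 312.53 | 3177.18 | 270,060.30 |
| | ICE40UP5 | 1898 | 8 | 11.12 | 140 | 1556.10 | 59,131.80 |
| | MachXO2 | 2717 | / | 11.48 | 190.98 | 2192.07 | 65,542.85 |
| | M2S010 | 1398 | 22 | 11.63 | 292.47 | 3402.01 | 203,780.46 |
| | PGL12G | 1780 | 7 | 11.65 | 585.22 | 6820.15 | 354,648 |
| | Artix-7 | 916 | 10 | 10.75 | 589 | 6331.75 | 505,906.83 |
| Smart Grid | EF2M45 | 1371 | 15 | 15.59 | 494.18 | 7704.24 | 199,539.69 |
| | EG4S20 | 1091 | 25 | 11.40 | 381.85 | 4353.48 | 217,238.81 |
| | GW2A | 1779 | 12 | 12.30 | 419.11 | 5155.09 | 871,210.19 |
| | Cyclone10LP10CL | 1530 | 15 | 10.62 | 267.11 | 2837.24 | 141,578.40 |
| | MAX10M50DAF | 1335 | 21 | 9.58 | 311.08 | 2978.90 | 253,206.68 |
| | ICE40UP5 | 1855 | 8 | 11.12 | 140 | 1556.10 | 59,131.80 |
| | MachXO2 | 2696 | / | 11.48 | 191 | 2192.30 | 65,549.71 |
| | M2S010 | 1263 | 22 | 11.79 | 300.44 | 3542.49 | 212,195.03 |
| | PGL12G | 1116 | 10 | 11.61 | 586 | 6802.87 | 353,749.45 |
| | Artix-7 | 919 | 10 | 10.90 | 588 | 6409.20 | 512,095.08 |
| Gutter Oil | EF2M45 | 2651 | 15 | 12.44 | 542.82 | 6752.71 | 174,895.08 |
| | EG4S20 | 1766 | 29 | 11.13 | 481.85 | 5360.60 | 267,494.11 |
| | GW2A | 3235 | 8 | 12.32 | 579.51 | 7136.70 | 1,206,102.74 |
| | Cyclone10LP10CL | 3069 | 15 | 10.36 | 272 | 2818.46 | 140,641.35 |
| | MAX10M50DAF | 2198 | 25 | 9.69 | 317 | 3072.05 | 261,124 |
| | ICE40UP5 | 3887 | 8 | 11 | 140 | 1540 | 58,520 |
| | MachXO2 | 5763 | / | 12.05 | 191 | 2301.17 | 68,804.92 |
| | M2S010 | 2022 | 22 | 12 | 305 | 3658.78 | 219,160.92 |
| | PGL12G | 2546 | 7 | 11.97 | 585 | 7002.45 | 364,127.40 |
| | Artix-7 | 2269 | 14 | 11 | 591 | 6501 | 519,429.90 |
| Wine Quality | EF2M45 | 3139 | 15 | 16.37 | 622.83 | 10,195.78 | 264,070.60 |
| | EG4S20 | 2231 | 29 | 12.31 | 491.84 | 6053.59 | 302,074.21 |
| | GW2A | 3696 | 16 | 14.27 | 682.89 | 9741.40 | 1,646,296.15 |
| | Cyclone10LP10CL | 3815 | 15 | 11.80 | 279 | 3290.81 | 164,211.17 |
| | MAX10M50DAF | 2895 | 33 | 9.66 | 322 | 3109.88 | 264,339.46 |
| | ICE40UP5 | / | / | / | / | / | / |
| | MachXO2 | 6114 | / | 12.73 | 192 | 2444.54 | 73,091.87 |
| | M2S010 | 2560 | 22 | 13.99 | 302.21 | 4226.41 | 253,161.77 |
| | PGL12G | 2929 | 7 | 15.61 | 586 | 9146.87 | 475,637.45 |
| | Artix-7 | 2320 | 14 | 12.47 | 588 | 7332.36 | 585,855.56 |
| Rain | EF2M45 | 1426 | 15 | 11.84 | 542.85 | 6428.45 | 166,496.94 |
| | EG4S20 | 1354 | 16 | 10.77 | 431.85 | 4650.61 | 232,065.65 |
| | GW2A | 3245 | 8 | 11.45 | 579.73 | 6634.96 | 1,121,308.93 |
| | Cyclone10LP10CL | 2263 | 15 | 9.39 | 273 | 2562.38 | 127,862.66 |
| | MAX10M50DAF | 2292 | 16 | 8.12 | 313 | 2542.50 | 216,112.42 |
| | ICE40UP5 | 3553 | 8 | 8.73 | 135 | 1178.69 | 44,790.03 |
| | MachXO2 | 4392 | / | 9.14 | 189 | 1728.03 | 51,668.01 |
| | M2S010 | 2196 | 16 | 9.12 | 311 | 2836.63 | 169,914.20 |
| | PGL12G | 2055 | 7 | 4.72 | 584 | 2754.73 | 143,245.86 |
| | Artix-7 | 1794 | 8 | 9.25 | 595 | 5503.75 | 439,749.63 |
| Power Consumption | EF2M45 | 2548 | 15 | 16.61 | 555.51 | 9227.56 | 238,993.80 |
| | EG4S20 | 1614 | 29 | 14.61 | 454.12 | 6632.39 | 330,956.43 |
| | GW2A | 3245 | 8 | 14.45 | 561.69 | 8117 | 1,371,772.43 |
| | Cyclone10LP10CL | 2741 | 15 | 12.48 | 293 | 3657.23 | 182,495.58 |
| | MAX10M50DAF | 1832 | 33 | 11.67 | 344 | 4014.14 | 341,201.56 |
| | ICE40UP5 | 3653 | 8 | 11.98 | 140 | 1676.50 | 63,707 |
| | MachXO2 | 5621 | / | 12.80 | 209 | 2675.41 | 79,994.73 |
| | M2S010 | 2323 | 22 | 12.21 | 322 | 3931.30 | 235,484.75 |
| | PGL12G | 2298 | 10 | 15.89 | 586 | 9309.78 | 484,108.66 |
| | Artix-7 | 1110 | 14 | 12.12 | 588 | 7126.56 | 569,412.14 |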
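Table A3. The score for different DT architectures.

| Number of Depths | Dataset | 8 leaf nodes | 16 leaf nodes | 32 leaf nodes | 64 leaf nodes |
|---|---|---|---|---|---|
| 4 | Gutter Oil | 96.09 | 96.09 | / | / |
| | Smart Grid | 97.98 | 97.98 | / | / |
| | Gutter Oil | 96.46 | 96.46 | / | / |
| | Wine Quality | 81.98 | 83.69 | / | / |
| | Rain | 0.8481 | 0.848 | / | / |
| | Power Consumption | 0.98 | 0.992 | / | / |
| 5 | Gutter Oil | 96.09 | 96.18 | 96.18 | / |
| | Smart Grid | 97.98 | 97.98 | 97.98 | / |
| | Gutter Oil | 96.46 | 96.63 | 96.7 | / |
| | Wine Quality | 81.98 | 83.21 | 83.51 | / |
| | Rain | 0.8481 | 0.8451 | 0.8404 | / |
| | Power Consumption | 0.98 | 0.9928 | 0.9963 | / |
| 6 | Gutter Oil | 96.09 | 96.18 | 96.18 | 96.18 |
| | Smart Grid | 97.98 | 97.98 | 97.98 | 97.98 |
| | Gutter Oil | 96.46 | 96.63 | 96.85 | 96.85 |
| | Wine Quality | 81.98 | 82.5 | 83.51 | 83.25 |
| | Rain | 0.8481 | 0.8447 | 0.8366 | 0.839 |
| | Power Consumption | 0.98 | 0.9928 | 0.9966 | 0.9976 |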
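Table A4. DT deployment results on the low-cost FPGAs.

| Dataset | Device | LUTs | DSPs | Latency | Power | PLP | CPLP |
|---|---|---|---|---|---|---|---|
| Gutter Oil | EF2M45 | 681 | / | 12.28 | 416.85 | 5120.52 | 132,621.57 |
| | EG4S20 | 661 | / | 11.52 | 325.63 | 3751.54 | 187,201.70 |
| | GW2A | 417 | 0 | 8.62 | 409.26 | 3528.20 | 596,265.12 |
| | Cyclone10LP10CL | 360 | 0 | 6.33 | 263 | 1664 | 83,033.65 |
| | MAX10M50DAF | 363 | 0 | 5.70 | 302 | 1722 | 146,370.34 |
| | ICE40UP5 | 451 | 0 | 7.92 | 133.90 | 1060.76 | 40,309.02 |
| | MachXO2 | 494 | / | 8.86 | 184.71 | 1635.83 | 48,911.23 |
| | M2S010 | 396 | 0 | 9.87 | 279 | 2752.34 | 164,864.87 |
| | PGL12G | 271 | 0 | 9.77 | 548 | 5356.15 | 278,519.90 |
| | Artix-7 | 264 | 0 | 9.40 | 587 | 5517.80 | 440,872.22 |
| Smart Grid | EF2M45 | 272 | 0 | 7.95 | 387.38 | 3081.21 | 79,803.41 |
| | EG4S20 | 272 | 0 | 7.62 | 281.08 | 2142.35 | 106,903.45 |
| | GW2A | 313 | 0 | 7.86 | 400.69 | 3150.21 | 532,385.33 |
| | Cyclone10LP10CL | 284 | 0 | 5.93 | 261 | 1546.95 | 77,192.66 |
| | MAX10M50DAF | 287 | 0 | 5.38 | 301 | 1620.28 | 137,724.06 |
| | ICE40UP5 | 285 | 0 | 5.87 | 122 | 716.14 | 27,213.32 |
| | MachXO2 | 285 | / | 5.95 | 184 | 1095.17 | 32,745.52 |
| | M2S010 | 291 | 0 | 6.38 | 293 | 1868.75 | 111,938.36 |
| | PGL12G | 307 | 0 | 6.33 | 543 | 3436.65 | 178,705.64 |
| | Artix-7 | 188 | 0 | 8.02 | 588 | 4715.76 | 376,789.22 |
| Gutter Oil | EF2M45 | 1412 | 0 | 18.90 | 446.36 | 8436.73 | 218,511.20 |
| | EG4S20 | 1463 | 0 | 17.63 | 374.87 | 6608.87 | 329,782.61 |
| | GW2A | 693 | 0 | 10.17 | 437.50 | 4450.69 | 752,166.19 |
| | Cyclone10LP10CL | 636 | 0 | 8.06 | 264 | 2128.10 | 106,192.39 |
| | MAX10M50DAF | 648 | 0 | 7.11 | 303 | 2153.72 | 183,066.54 |
| | ICE40UP5 | 715 | 0 | 7.92 | 135 | 1069.47 | 40,639.86 |
| | MachXO2 | 774 | / | 8.86 | 187 | 1656.07 | 49,516.55 |
| | M2S010 | 685 | 0 | 9.35 | 316 | 2955.86 | 177,056.25 |
| | PGL12G | 679 | 0 | 12.82 | 553 | 7088.35 | 368,594.41 |
| | Artix-7 | 655 | 0 | 10.20 | 590 | 6018 | 480,838.20 |
| Wine Quality | EF2M45 | 266 | 0 | 8.26 | 387.38 | 3199.73 | 82,873.11 |
| | EG4S20 | 266 | 0 | 8.39 | 286.44 | 2403.78 | 119,948.59 |
| | GW2A | 311 | 0 | 7.94 | 401.59 | 3187.39 | 538,668.59 |
| | Cyclone10LP10CL | 284 | 0 | 5.77 | 261.42 | 1507.09 | 75,203.61 |
| | MAX10M50DAF | 287 | 0 | 5.54 | 301.81 | 1670.82 | 142,019.71 |
| | ICE40UP5 | 305 | 0 | 5.60 | 121 | 677.6 | 25,748.8 |
| | MachXO2 | 304 | / | 5.54 | 179 | 991.66 | 29,650.63 |
| | M2S010 | 301 | 0 | 6.72 | 291.47 | 1958.41 | 117,308.58 |
| | PGL12G | 316 | 0 | 7.22 | 443.83 | 3204.90 | 166,654.61 |
| | Artix-7 | 177 | 0 | 8.20 | 488 | 4001.60 | 319,727.84 |
| Rain | EF2M45 | 269 | 0 | 7.65 | 382.18 | 2925.24 | 75,763.62 |
| | EG4S20 | 269 | 0 | 7.79 | 281.84 | 2196.36 | 109,598.15 |
| | GW2A | 315 | 0 | 7.94 | 400.68 | 3180.23 | 537,458.69 |
| | Cyclone10LP10CL | 280 | 0 | 5.77 | 261.29 | 1508.69 | 75,283.55 |
| | MAX10M50DAF | 283 | 0 | 5.09 | 301.38 | 1533.12 | 130,315.21 |
| | ICE40UP5 | 303 | 0 | 5.80 | 121 | 701.80 | 26,668.40 |
| | MachXO2 | 301 | / | 5.94 | 174 | 1033.21 | 30,893.04 |
| | M2S010 | 306 | 0 | 6.57 | 289.28 | 1900.28 | 113,826.79 |
| | PGL12G | 312 | 0 | 7.23 | 443.86 | 3210 | 166,919.77 |
| | Artix-7 | 178 | 0 | 7.42 | 487 | 3613.54 | 288,721.85 |
| Power Consumption | EF2M45 | 1437 | 0 | 19.76 | 436.51 | 8625.48 | 223,399.86 |
| | EG4S20 | 1424 | 0 | 17.51 | 365.56 | 6399.56 | 319,338.21 |
| | GW2A | 679 | 0 | 10.24 | 439.19 | 4495.09 | 759,670.07 |
| | Cyclone10LP10CL | 655 | 0 | 8.52 | 264.80 | 2256.10 | 112,579.19 |
| | MAX10M50DAF | 665 | 0 | 6.87 | 303.44 | 2083.72 | 177,116.41 |
| | ICE40UP5 | 765 | 0 | 9.65 | 128 | 1235.46 | 46,947.33 |
| | MachXO2 | 783 | / | 9.96 | 194 | 1932.82 | 57,791.38 |
| | M2S010 | 709 | 0 | 10.47 | 313.76 | 3286.36 | 196,853.21 |
| | PGL12G | 668 | 0 | 15.50 | 454.35 | 7044.24 | 366,300.60 |
| | Artix-7 | 716 | 0 | 8.35 | 492 | 4108.20 | 328,245.18 |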
Table A5. The score for different k-NN architectures.

| Dataset | k = 2 | k = 4 | k = 8 | k = 16 |
|---|---|---|---|---|
| Gutter Oil | 99.36 | 99.09 | 98.73 | 98.46 |
| Smart Grid | 75.78 | 77.3 | 78.38 | 79.39 |
| Gutter Oil | 99.36 | 99.07 | 98.73 | 98.46 |
| Wine Quality | 77.93 | 77.41 | 81.23 | 74.9 |
| Rain | 0.8366 | 0.8466 | 0.8494 | 0.8527 |
| Power Consumption | 0.996 | 0.9954 | 0.9944 | 0.9915 |
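Table A6. The k-NN deployment results on the low-cost FPGAs.

| Dataset | Device | LUTs | DSPs | Latency | Power | PLP | CPLP |
|---|---|---|---|---|---|---|---|
| Gutter Oil | EF2M45 | 1251 | 15 | 17.26 | 446.36 | 7703.32 | 199,515.87 |
| | EG4S20 | 942 | 18 | 16.08 | 315.81 | 5077.64 | 253,374.31 |
| | GW2A | 991 | 6 | 5.18 | 397.25 | 2059.34 | 348,029.14 |
| | Cyclone10LP10CL | 984 | 6 | 10.63 | 282.49 | 3001.74 | 149,786.76 |
| | MAX10M50DAF | 985 | 6 | 9.33 | 335.91 | 3135.05 | 266,479.08 |
| | ICE40UP5 | 979 | 6 | 7.81 | 134 | 1046.54 | 39,768.52 |
| | MachXO2 | 1062 | / | 8.84 | 185.47 | 1638.63 | 48,994.96 |
| | M2S010 | 990 | 6 | 13.32 | 325.57 | 4337.50 | 259,816.40 |
| | PGL12G | / | / | / | / | / | / |
| | Artix-7 | 592 | 18 | 11.92 | 597 | 7116.24 | 568,587.58 |
| Smart Grid | EF2M45 | / | / | / | / | / | / |
| | EG4S20 | 10,757 | 29 | 18.07 | 1371.32 | 24,783.83 | 1,236,713.13 |
| | GW2A | 5708 | 13 | 7.84 | 888.16 | 6960.47 | 1,176,319.55 |
| | Cyclone10LP10CL | 4810 | 13 | 11.89 | 304.67 | 3621.31 | 180,703.25 |
| | MAX10M50DAF | 4847 | 13 | 9.90 | 369.19 | 3653.14 | 310,516.48 |
| | ICE40UP5 | 4949 | 8 | 9.32 | 135.79 | 1265.58 | 48,092.09 |
| | MachXO2 | / | / | / | / | / | / |
| | M2S010 | 5078 | 13 | 13.41 | 327.23 | 4386.45 | 262,748.42 |
| | PGL12G | / | / | / | / | / | / |
| | Artix-7 | 3231 | 39 | 12.05 | 594 | 7157.70 | 571,900.23 |
| Gutter Oil | EF2M45 | 1361 | 15 | 16.92 | 452.82 | 7663.01 | 198,471.83 |
| | EG4S20 | 1058 | 18 | 16.62 | 331.42 | 5507.27 | 274,812.90 |
| | GW2A | 1047 | 6 | 5.18 | 406.45 | 2107.04 | 356,089.22 |
| | Cyclone10LP10CL | 1076 | 6 | 10.63 | 282.79 | 3005.77 | 149,988.17 |
| | MAX10M50DAF | 1088 | 6 | 8.83 | 335.66 | 2964.55 | 251,986.68 |
| | ICE40UP5 | 1077 | 6 | 7.81 | 135 | 1054.35 | 40,065.30 |
| | MachXO2 | 1241 | / | 7.62 | 185.47 | 1414.02 | 42,279.30 |
| | M2S010 | 1117 | 6 | 12.94 | 325.72 | 4213.81 | 252,407.44 |
| | PGL12G | / | / | / | / | / | / |
| | Artix-7 | 673 | 18 | 11.57 | 597 | 6907.29 | 551,892.47 |
| Wine Quality | EF2M45 | / | / | / | / | / | / |
| | EG4S20 | 4435 | 29 | 18.96 | 672.69 | 12,752.20 | 636,334.94 |
| | GW2A | 3415 | 11 | 6.46 | 578.87 | 3739.51 | 631,977.72 |
| | Cyclone10LP10CL | 3073 | 11 | 11.10 | 194.79 | 2162.56 | 107,911.67 |
| | MAX10M50DAF | 3067 | 11 | 9.27 | 354 | 3280.87 | 278,874.12 |
| | ICE40UP5 | 3848 | 8 | 9.13 | 135 | 1232.42 | 46,831.77 |
| | MachXO2 | 4348 | / | 7.18 | 186 | 1335.67 | 39,936.41 |
| | M2S010 | 3568 | 11 | 13.35 | 235.65 | 3145.95 | 188,442.66 |
| | PGL12G | / | / | / | / | / | / |
| | Artix-7 | 2037 | 33 | 12.10 | 511 | 6183.10 | 494,029.69 |
| Rain | EF2M45 | / | / | / | / | / | / |
| | EG4S20 | 11,767 | 24 | 18 | 1491.21 | 26,843.27 | 1,339,479.23 |
| | GW2A | 6297 | 8 | 7.76 | 886.53 | 6882.98 | 1,163,223.64 |
| | Cyclone10LP10CL | 5397 | 8 | 11.89 | 305.54 | 3633.79 | 181,325.98 |
| | MAX10M50DAF | 5412 | 8 | 10.09 | 362.04 | 3651.17 | 310,349.74 |
| | ICE40UP5 | / | / | / | / | / | / |
| | MachXO2 | / | / | / | / | / | / |
| | M2S010 | 6218 | 8 | 12.94 | 471.37 | 6097.62 | 365,247.23 |
| | PGL12G | / | / | / | / | / | / |
| | Artix-7 | 3754 | 24 | 11.50 | 509 | 5853.50 | 467,694.65 |
| Power Consumption | EF2M45 | 1735 | 15 | 17.75 | 485.83 | 8625.01 | 223,387.78 |
| | EG4S20 | 1071 | 21 | 15.30 | 365.83 | 5598.36 | 279,358.05 |
| | GW2A | 1095 | 7 | 5.18 | 321.68 | 1667.57 | 281,819.93 |
| | Cyclone10LP10CL | 1160 | 7 | 9.07 | 288.61 | 2617.40 | 130,608.46 |
| | MAX10M50DAF | 1173 | 7 | 10.54 | 339.08 | 3573.90 | 303,781.77 |
| | ICE40UP5 | 1159 | 7 | 9.36 | 127 | 1188.21 | 45,152.06 |
| | MachXO2 | 1864 | / | 8.81 | 195.09 | 1719.12 | 51,401.82 |
| | M2S010 | 1232 | 7 | 11.86 | 212.88 | 2524.48 | 151,216.63 |
| | PGL12G | / | / | / | / | / | / |
| | Artix-7 | 734 | 21 | 11.95 | 534 | 6381.30 | 509,865.87 |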
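Table A7. SVM deployment results on the low-cost FPGAs.

| Dataset | Device | LUTs | DSPs | Latency | Power | PLP | CPLP |
|---|---|---|---|---|---|---|---|
| Gutter Oil | EF2M45 | 829 | 8 | 13.81 | 426.67 | 5890.66 | 152,568.13 |
| | EG4S20 | 853 | 8 | 13.45 | 365.11 | 4909.28 | 244,973.20 |
| | GW2A | 671 | 8 | 6.51 | 329.78 | 2145.52 | 362,592.23 |
| | Cyclone10LP10CL | 740 | 8 | 10.59 | 274.36 | 2906.30 | 145,024.14 |
| | MAX10M50DAF | 741 | 8 | 10.06 | 342.58 | 3447.04 | 292,998.40 |
| | ICE40UP5 | 765 | 8 | 15.56 | 151 | 2349.41 | 89,277.54 |
| | MachXO2 | 2663 | / | 15.57 | 217 | 3378.91 | 101,029.32 |
| | M2S010 | 766 | 8 | 12.66 | 315.13 | 3988.89 | 238,934.52 |
| | PGL12G | 663 | 8 | 11.73 | 589.16 | 6910.26 | 359,333.40 |
| | Artix-7 | 416 | 8 | 9.12 | 590 | 5380.80 | 429,925.92 |
| Smart Grid | EF2M45 | 950 | 8 | 13.69 | 456.22 | 6245.71 | 161,763.80 |
| | EG4S20 | 1000 | 8 | 14.01 | 355.15 | 4976.29 | 248,316.96 |
| | GW2A | 890 | 8 | 6.75 | 361.03 | 2437.66 | 411,964.72 |
| | Cyclone10LP10CL | 963 | 8 | 11.33 | 307.44 | 3484.22 | 173,862.45 |
| | MAX10M50DAF | 964 | 8 | 10.46 | 364.08 | 3809.00 | 323,765.42 |
| | ICE40UP5 | 955 | 8 | 15.90 | 154.98 | 2464.17 | 93,638.31 |
| | MachXO2 | 3752 | / | 15.21 | 216 | 3284.71 | 98,212.89 |
| | M2S010 | 975 | 8 | 13.48 | 325.74 | 4390.70 | 263,003.13 |
| | PGL12G | 775 | 8 | 11.22 | 587.97 | 6596.44 | 343,014.64 |
| | Artix-7 | 482 | 8 | 8.78 | 589 | 5171.42 | 413,196.46 |
| Gutter Oil | EF2M45 | 829 | 8 | 13.81 | 426.67 | 5890.66 | 152,568.13 |
| | EG4S20 | 853 | 8 | 13.45 | 365.11 | 4909.28 | 244,973.20 |
| | GW2A | 671 | 8 | 6.51 | 329.78 | 2145.52 | 362,592.23 |
| | Cyclone10LP10CL | 740 | 8 | 10.59 | 274.36 | 2906.30 | 145,024.14 |
| | MAX10M50DAF | 741 | 8 | 10.06 | 362.58 | 3648.28 | 310,103.80 |
| | ICE40UP5 | 765 | 8 | 14.90 | 152.99 | 2279.20 | 86,609.61 |
| | MachXO2 | 2667 | / | 15.19 | 217.11 | 3297.02 | 98,580.82 |
| | M2S010 | 766 | 8 | 12.07 | 319.07 | 3851.48 | 230,703.77 |
| | PGL12G | 663 | 8 | 11.73 | 569.16 | 6675.68 | 347,135.24 |
| | Artix-7 | 416 | 0 | 9.12 | 487 | 4441.44 | 354,871.06 |
| Wine Quality | EF2M45 | 918 | 8 | 13.52 | 444.32 | 6005.85 | 155,551.42 |
| | EG4S20 | 994 | 8 | 13.72 | 355.15 | 4873.65 | 243,195.38 |
| | GW2A | 858 | 8 | 6.68 | 354.76 | 2369.78 | 400,493.40 |
| | Cyclone10LP10CL | 932 | 8 | 10.96 | 276.47 | 3030.39 | 151,216.34 |
| | MAX10M50DAF | 933 | 8 | 10.52 | 362.87 | 3817.76 | 324,509.20 |
| | ICE40UP5 | 976 | 8 | 15.56 | 155.14 | 2413.89 | 91,727.65 |
| | MachXO2 | 3070 | / | 15.91 | 217.11 | 3454.24 | 103,281.66 |
| | M2S010 | 966 | 8 | 12.44 | 320.80 | 3990.75 | 239,046.04 |
| | PGL12G | 743 | 8 | 12.10 | 569.60 | 6891.02 | 358,333.08 |
| | Artix-7 | 465 | 8 | 9.20 | 490 | 4508.00 | 360,189.20 |
| Rain | EF2M45 | 523 | 8 | 13.27 | 400.72 | 5317.57 | 137,725.00 |
| | EG4S20 | 533 | 8 | 12.95 | 312.13 | 4041.45 | 201,668.17 |
| | GW2A | 541 | 8 | 5.91 | 316.57 | 1869.98 | 316,027.45 |
| | Cyclone10LP10CL | 502 | 8 | 9.83 | 275.69 | 2710.86 | 135,271.90 |
| | MAX10M50DAF | 503 | 8 | 9.95 | 320.90 | 3193.92 | 271,483.00 |
| | ICE40UP5 | 527 | 8 | 13.51 | 147.76 | 1996.77 | 75,877.43 |
| | MachXO2 | 864 | | 15.21 | 217.43 | 3307.55 | 98,895.60 |
| | M2S010 | 570 | 8 | 11.38 | 184.51 | 2100.12 | 125,797.09 |
| | PGL12G | 540 | 8 | 12.50 | 567.52 | 7095.14 | 368,947.02 |
| | Artix-7 | 344 | 8 | 8.92 | 490 | 4370.80 | 349,226.92 |
| Power Consumption | EF2M45 | 523 | 8 | 13.85 | 397.72 | 5509.19 | 142,688.01 |
| | EG4S20 | 533 | 8 | 12.66 | 315.82 | 3997.32 | 199,466.32 |
| | GW2A | 534 | 8 | 5.91 | 314.48 | 1857.63 | 313,939.04 |
| | Cyclone10LP10CL | 504 | 8 | 10.02 | 275.91 | 2765.72 | 138,009.52 |
| | MAX10M50DAF | 507 | 8 | 9.59 | 321.22 | 3081.14 | 261,897.09 |
| | ICE40UP5 | 516 | 8 | 13.47 | 148 | 1994.00 | 75,772.15 |
| | MachXO2 | 851 | / | 15.21 | 216.43 | 3292.33 | 98,440.76 |
| | M2S010 | 519 | 8 | 12.53 | 193.70 | 2427.30 | 145,395.56 |
| | PGL12G | 531 | 8 | 10.86 | 467.43 | 5078.16 | 264,064.30 |
| | Artix-7 | 342 | 8 | 8.66 | 490 | 4243.40 | 339,047.66 |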

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
