# Energy-Efficient Architecture for CNNs Inference on Heterogeneous FPGA

## Abstract


## 1. Introduction

## 2. Background and Motivations

## 3. The Proposed SIMD CNN Accelerator

The proposed architecture is designed to process $T_M$ ifmaps in parallel and to produce $T_N$ ofmaps contemporaneously. Thus, the generic K × K CONV layer, receiving M ifmaps and producing N ofmaps, is completed within $ns = (M/T_M) \times (N/T_N)$ computational steps performed as follows. The SIMD buffer receives input data through an AXI-Stream (AXIS) [38] and, after a latency depending on the number of columns W in the ifmaps, it prepares $T_M$ K × K convolution windows for the subsequent SIMD Convolution Engine (CE). The latter is designed to perform double MAC operations, enabling a further intra-feature-map parallelism based on a new efficient processing strategy. In the meantime, the Store Kernels module reads the $K \times K \times T_M \times T_N$ kernel coefficients from the external memory through AXI-Full transactions [38] and delivers them to the SIMD CE, which performs the $T_M \times T_N$ planned convolutions in parallel. The Accumulate module then accumulates the obtained results by exploiting a local memory buffer and finally outputs the intermediate ofmaps. The finite state machine (FSM) orchestrates the activities of all modules, taking into account the current layer information provided by the PS through an AXI-Lite interface [38] and the intermediate steps already performed; for this purpose, the bidirectional CTRL bus is used. As soon as all the intermediate ofmaps are accumulated, the ReLU & Quantization module is activated; the quantized ofmaps are optionally sub-sampled by the Pooling module and then outputted. Input data and kernel coefficients are 8-bit unsigned and signed fixed-point numbers, respectively. To efficiently exploit the SIMD paradigm, two couples of 8-bit values belonging to two different ifmaps (i.e., $ifp_t$ and $ifp_{t+1}$, with t = 0, 2, …, M − 2) are accommodated within one 32-bit word, ensuring that two adjacent elements of $ifp_t$ are interleaved with two adjacent elements of $ifp_{t+1}$. Data packed in this way are stored in the external memory in raster order. It is worth pointing out that the proposed architecture outputs ofmaps already packed as described above; therefore, no specific data re-adjustment is required between consecutive convolutional layers. A slightly different strategy is used to store the K × K convolution kernels, which are instead packed within 64-bit words, ensuring that the coefficients occupying homologous positions within eight distinct kernels are packed in the same word and transferred from the external memory at the same time.

#### 3.1. Architecture of the SIMD Buffer

The SIMD Buffer is designed to manage the $T_M$ ifmaps received as input and to furnish the data needed to compute two adjacent values of distinct $T_N$ ofmaps. To this end, a window consisting of $K \times (K+1)$ values must be patched over each ifmap to accommodate two adjacent convolution windows. This requires an unconventional buffer stage design. The SIMD reconfigurable buffer depicted in Figure 3 uses $T_M$ instances of the internal buffer IBuff, each consisting of K − 1 FIFOs and $K\times \left(nr+1\right)+nr+2$ registers, with $nr=\frac{K-1}{2}$ being the radius of the convolution window. Each IBuff internally splits the incoming 16-bit data into two pieces of 8-bit data that feed two different pipes: the former consisting of the nr + 1 registers $R_1, \ldots, R_{nr+1}$, and the other composed of the nr + 2 registers $R_{nr+2}, \ldots, R_{2nr+2}$ and $R_{aux}$. Such an additional register is required to correctly pair incoming values when nr is odd. To better explain why this is necessary, let us consider the example of Figure 4a, which shows the case in which the generic IBuff receives a 4 × 8 ifmap and arranges 3 × 3 convolution windows. It is important to note that, due to the zero padding, the first incoming pair of adjacent values $A_1$ and $B_1$ does not appear in the correct relative position for parallel operation in the two highlighted convolution windows. In fact, $A_1$ in the brown window corresponds to the padding zero value in the blue one, whereas $B_1$ in the brown window corresponds to $A_2$ in the blue one. To guarantee that the incoming data will be multiplied by the correct kernel coefficients, they must be properly recoupled before reaching FIFO$_1$. This is done through the five registers $R_1$, $R_2$, $R_3$, $R_4$ and $R_{aux}$, as shown in the timing diagram illustrated in Figure 4b. It is easy to verify that, when nr is even, incoming data are already correctly paired; in this case, the register $R_{aux}$ has no effect. The data-path then goes on through the subsequent FIFOs and registers that furnish data depending on nr, as summarized in Figure 4c, where the symbol '&' is used to indicate the concatenation of two 8-bit registers.
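The net effect of the buffer, i.e., delivering two horizontally adjacent K × K windows per step over the zero-padded ifmap, can be modelled functionally as follows. This is a behavioural sketch, not the register-level data-path; `adjacent_windows` is a hypothetical helper name.

```python
import numpy as np

def adjacent_windows(ifmap, K):
    """Yield pairs of horizontally adjacent K x K windows over a
    zero-padded ifmap, mimicking what each IBuff furnishes per step
    (functional model only, not the FIFO/register implementation)."""
    nr = (K - 1) // 2            # radius of the convolution window
    padded = np.pad(ifmap, nr)   # zero padding, as in Figure 4a
    H, W = ifmap.shape
    for h in range(H):
        for w in range(0, W, 2):  # two output columns per step
            yield (padded[h:h + K, w:w + K],
                   padded[h:h + K, w + 1:w + 1 + K])
```

For the 4 × 8 ifmap of the example with K = 3, the model produces 16 window pairs, and the two windows of each pair overlap by K − 1 columns, which is exactly the property the recoupling registers must preserve.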

#### 3.2. Design of the SIMD CE

Let us denote with A = $ifp_t(h,w)$ and B = $ifp_t(h,w+1)$ two adjacent packed unsigned elements uploaded from the generic ifmap. With C being the generic signed kernel coefficient, for the above-mentioned purpose, the two independent products A × C and B × C have to be computed in parallel. As schematized in Figure 5, the inputs A and B are re-arranged within the b-bit input Y of a DSP, interposing eight zero bits between them and zeroing the remaining MSBs of Y to guarantee that the operand A is always treated as an unsigned value. Conversely, the d-bit operand Z is used to input the sign-extended 8-bit coefficient C. When the latter is negative, the DSP applies the two's complement notation to the overall result instead of to the two separate products, thus necessitating an increment by one of the product A × C to compensate for the introduced error. Due to the different data arrangement used, the designs of double MAC (DMAC) engines presented in [16,36,37] address this issue through logic resources external to the DSPs that perform the multiplications. This approach negatively affects the computational time, since it breaks the chain of DSPs cascaded along dedicated fast routing resources. In the SIMD CE proposed here, as shown in Figure 5, the products A × C and B × C are accommodated within the (b + d)-bit output of the multiplier, occupying the (b + d − 16) MSBs and the 16 LSBs, respectively. Due to this, the correction is done by adding one auxiliary u-bit operand X using the accumulator internal to the same DSP slice that performs the multiplication. To increment by one the product A × C while leaving B × C unchanged when C is negative, X must be set to $2^{16}$, thus asserting only its 17th bit; conversely, when C is positive, X must be set to zero. In this way, the cascaded DSPs used to perform the DMACs can complete their operations without encountering breaks along their dedicated fast chain.
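The arithmetic behind this single-multiplier double MAC can be checked with a short integer model. It is a behavioural sketch, not the DSP configuration itself; the model applies the correction when the low product is negative, which for nonzero activations coincides with the C < 0 condition used in the text.

```python
def dmac(a, b, c):
    """Behavioural model of the double-MAC trick of Figure 5:
    two 8-bit products from one signed multiplication.
    a, b: unsigned 8-bit activations; c: signed 8-bit coefficient."""
    assert 0 <= a < 256 and 0 <= b < 256 and -128 <= c < 128
    y = (a << 16) | b          # eight zero bits interposed between a and b
    p = y * c                  # single (signed) multiplication: a*c*2^16 + b*c
    if b * c < 0:              # the low product borrows one from the high one;
        p += 1 << 16           # assert bit 17 of X to restore a*c
    hi = p >> 16               # a * c, recovered from the MSBs
    lo = p & 0xFFFF            # b * c, 16-bit two's complement in the LSBs
    lo_signed = lo - (1 << 16) if lo & 0x8000 else lo
    return hi, lo_signed
```

Running the model over the full operand ranges confirms that both products are always recovered exactly, which is why no correction logic external to the DSP chain is needed.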

As illustrated in Figure 6, the SIMD CE consists of $T_N$ Processing Elements (PEs) that compute distinct $T_N$ ofmaps in parallel by processing the $K \times K \times T_M$ pairs of 8-bit data transferred by the SIMD Buffer and the kernel coefficients provided by the Store Kernels module. The SEPARATE module routes the data streamed by the SIMD Buffer to the PEs, whereas the Generate Out Stream module arranges the results as ruled by the AXI4-Stream protocol. As can be seen in Figure 6, the generic PE consists of $N_{DMAC}$ DMACs, each responsible for processing $\frac{K\times K\times {T}_{M}}{{N}_{DMAC}}$ b-bit data through as many DSP slices configured to perform SIMD operations. Each DSP receives one packed b-bit operand and one kernel coefficient C as inputs and computes the two parallel 16-bit products A × C and B × C. To perform the subsequent accumulations correctly, each SIMD result is re-arranged over u bits by the INSERT GUARD BITS module. The latter sign-extends the 16-bit product B × C to $\frac{u}{2}$ bits and left-shifts the 16-bit product A × C by $\left(\frac{u}{2}-16\right)$ positions. In this way, $\left(\frac{u}{2}-16\right)$ guard bits are introduced between the two independent products, thus allowing up to ${2}^{\left(\frac{u}{2}-16\right)}$ accumulations to be performed in SIMD fashion. The $\frac{K\times K\times {T}_{M}}{{N}_{DMAC}}$ u-bit data obtained in this way are then dispatched to the subsequent $\frac{K\times K\times {T}_{M}}{2\times {N}_{DMAC}}$ DSPs configured as accumulators. Further cascaded DSP slices then accumulate the results produced in parallel by the DMACs involved in the generic PE. If u < 64, the two adjacent $\frac{u}{2}$-bit packed values outputted by the generic PE are separately sign-extended to 32 bits, re-arranged within one 64-bit word, and streamed out with the results coming from the other PEs.
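The guard-bit arrangement can be verified with a small integer model, here assuming u = 48 so that u/2 − 16 = 8 guard bits allow up to 2⁸ accumulations per lane. This is a sketch: hardware keeps the word in u bits, and the inter-lane borrow is resolved only when the two halves are finally split and sign-extended.

```python
HALF = 24  # u/2 for u = 48, i.e., 24 - 16 = 8 guard bits per lane

def pack_simd(ac, bc):
    """Arrange one DMAC result over u bits: A*C shifted into the upper
    half, B*C sign-extended into the lower half (integer model)."""
    return (ac << HALF) + bc

def unpack_simd(word):
    """Split an accumulated u-bit word back into its two signed lanes."""
    lo = word & ((1 << HALF) - 1)
    if lo >= 1 << (HALF - 1):
        lo -= 1 << HALF          # lower lane is two's complement
    hi = (word - lo) >> HALF     # removing lo also undoes any borrow
    return hi, lo
```

Summing several packed words and unpacking once at the end yields the exact per-lane sums, provided each sum stays within the 24-bit signed range guaranteed by the guard bits.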

The ReLU & Quantization module then processes the $T_N$ ofmaps in parallel, as established by the chosen rectified activation function. The quantized results are finally streamed out towards either the external memory or the Pooling module, which can perform the downsampling by applying either Max Pooling, Average Pooling, or Stochastic spatial sampling; this choice can be dynamically modified via software. The Pooling module produces the first valid result after FS + 1 clock cycles and then furnishes a new output every clock cycle until two consecutive rows of the received ifmap are processed. During the subsequent FS cycles, the circuit simply waits for the next downsampling window. Then, a new output value is produced at each clock cycle until two further rows have been processed, and so on.
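The three run-time-selectable pooling modes can be sketched functionally as follows. The reading of "stochastic spatial sampling" as picking one random value per window is an assumption, as is the truncating integer average; the function name is hypothetical.

```python
import numpy as np

def pool(ifmap, fs, mode="max", rng=None):
    """Downsample an ifmap with a non-overlapping fs x fs window.
    'mode' mirrors the run-time choice exposed by the Pooling module."""
    H, W = ifmap.shape
    out = np.empty((H // fs, W // fs), dtype=ifmap.dtype)
    rng = rng or np.random.default_rng(0)
    for i in range(H // fs):
        for j in range(W // fs):
            win = ifmap[i * fs:(i + 1) * fs, j * fs:(j + 1) * fs]
            if mode == "max":
                out[i, j] = win.max()
            elif mode == "avg":
                out[i, j] = win.mean()   # truncated to the output dtype
            else:                        # stochastic spatial sampling
                out[i, j] = rng.choice(win.ravel())
    return out
```

Changing `mode` between layers corresponds to the software-driven reconfiguration described above, with no change to the datapath model.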

#### 3.3. Implementation of the Fully Connected Layers

## 4. Implementation of the Proposed CNN Accelerator on Heterogeneous FPGAs

- the software running on the PS uses the port M_GP0 to configure the DMAs and the CDMA IP cores through the AXI4-Lite protocol. Each module receives an appropriate task to transfer a certain amount of data from/to a specific area within the external DDR memory. The port M_GP0 is also used to configure the custom accelerator, by setting the stride, the number and the size of ifmaps and ofmaps for each layer of the accelerated CNN, as well as the type of pooling to be applied, and finally to start its operations;
- the AXI-Streams coming out from the four DMAs are synchronized by the AXIS-Combiner within a single data stream; contemporaneously, the CDMA transfers the kernel coefficients related to the current convolutional layer from the DDR to the Store Kernels module;
- the combined stream is purposely split by the AXIS-Broadcaster$_0$ into $T_M$ separate streams to sustain the parallelism level on buffered ifmaps delivered to the custom accelerator;
- the output stream produced by the custom accelerator is then separated into four 32-bit streams by the AXIS-Broadcaster$_1$ and moved to the external DDR by the DMAs, thus properly preparing the ifmaps for the next convolutional layer;
- the software routine run by the PS finally performs the FC and softmax layers.

The first implementation adopts $T_M$ = 8 and $T_N$ = 2. In this case, the total memory bandwidth requirement is 2.9 GB/s, which is well below the 4.16 GB/s supported by the DDR memory controller [40]. The second implementation exploits the wider XC7Z045 device; its higher resource count allows the parallelism level to be increased to $T_N$ = 8. In this case, the 5.2 GB/s maximum memory bandwidth limits the maximum clock frequency to 167 MHz.

The achievable computational capabilities depend on the parallelism levels ($T_M$ and $T_N$) and the convolution kernel size (K). Figure 9a,b plot the results related to the XC7Z020 and XC7Z045 devices, respectively. To become familiar with the information contained in the diagrams, let us examine the leftmost portion of Figure 9a. There, several kernel sizes (K = 3, 5, 7, 9, 11) are considered, with $T_M$ = 1. Obviously, the wider the convolution kernel, the higher the number of DSPs used by a single PE. This means that the maximum number of ofmaps computed in parallel (Max$T_N$) by the CE is limited by the amount of available resources. Referring to the cases in which $T_M$ = 1, the number of DSPs used by each PE ranges from 15 to 183 for K varying from 3 to 11, while up to 14 ofmaps can be processed in parallel. Of course, the larger XC7Z045 device also allows convolution kernels wider than 11 to be used in the PE. Figure 9 clearly shows that a theoretical speed-up SU = $T_M \times T_N$, with respect to the case in which $T_M$ = $T_N$ = 1, can generally be obtained with various configurations. As an example, SU = 16 can be achieved for different values of $T_M$ and $T_N$ (e.g., $T_M$ = 8 and $T_N$ = 2, or $T_M$ = 16 and $T_N$ = 1, and so on). Each of the implementable solutions offers its own benefit, depending on the actually exploitable parallelism, which is bounded by the limited capability of the DMAs and HP ports. Indeed, when AXI transactions wider than 64 bits are required, they are performed over more than one clock cycle; in such cases, the actual speed-up is consequently reduced with respect to the above-mentioned theoretical level.

Configurations with $T_M$ or $T_N$ above 16 are influenced by such an effect. As an example, the configuration ($T_M$ = 17, $T_N$ = 3, K = 3) shows an actual speed-up of 24 against the theoretical SU = $T_M \times T_N$ = 51. Due to this limitation, several possible configurations that can be accommodated in the XC7Z045 device do not actually benefit from the increased parallelism. However, they can be efficiently exploited in high-performance Ultrascale™ devices; in such cases, frame rates of up to 55 fps can be achieved for the inference of VGG-16.

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Sze, V.; Chen, H.; Yang, T.J.; Emer, J. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE **2017**, 105, 2295–2329.
- Ranjan, R.; Patel, V.M.; Chellappa, R. HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 41, 121–135.
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Reading Text in the Wild with Convolutional Neural Networks. Int. J. Comput. Vis. **2016**, 116, 1–20.
- Zhang, Y.; Chan, W.; Jaitly, N. Very deep convolutional networks for end-to-end speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017.
- Wang, X.; Zhang, W.; Wu, X.; Xiao, L.; Qian, Y.; Fang, Z. Real-time vehicle type classification with deep convolutional neural networks. J. Real Time Image Process. **2019**, 16, 5–14.
- Du, L.; Du, Y.; Li, Y.; Su, J.; Kuan, Y.; Liu, C.; Chang, M.F. A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things. IEEE Trans. Circ. Syst. **2018**, 65, 198–208.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014. Available online: https://arxiv.org/abs/1409.1556 (accessed on 14 November 2019).
- Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the Devil in the Details: Delving Deep into Convolutional Nets. Available online: https://arxiv.org/pdf/1405.3531.pdf (accessed on 14 November 2019).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–779.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Judd, P.; Albericio, J.; Hetherington, T.; Aamodt, T.; Jerger, N.E.; Urtasun, R.; Moshovos, A. Proteus: Exploiting precision variability in deep neural networks. Parallel Comput. **2018**, 73, 40–51.
- Gysel, P.; Motamedi, M.; Ghiasi, S. Hardware-Oriented Approximation of Convolutional Neural Networks. 2016. Available online: https://arxiv.org/abs/1604.03168 (accessed on 14 November 2019).
- Wu, S.; Li, G.; Chen, F.; Shi, L. Training and Inference with Integers in Deep Neural Networks. 2018. Available online: https://arxiv.org/abs/1802.04680 (accessed on 14 November 2019).
- Rodriguez, A.; Segal, E.; Meiri, E.; Fomenko, E.; Kim, Y.J.; Shen, H.; Ziv, B. Lower Numerical Precision Deep Learning Inference and Training. Available online: https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training (accessed on 14 November 2019).
- Horowitz, M. Computing's energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
- Lian, X.; Liu, Z.; Song, Z.; Dai, J.; Zhou, W.; Ji, X. High-Performance FPGA-Based CNN Accelerator with Block-Floating-Point Arithmetic. IEEE Trans. VLSI Syst. **2019**, 27, 1874–1885.
- Chen, Q.; Fu, X.; Song, W.; Cheng, K.; Lu, Z.; Zhang, C.; Li, L. An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks. Electronics **2019**, 8, 371.
- Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized Compression for Implementing Convolutional Neural Networks on FPGA. Electronics **2019**, 8, 295.
- Tlelo-Cuautle, E.; Rangel-Magdaleno, J.; Gerardo de la Fraga, L. Engineering Applications of FPGAs; Springer: Basel, Switzerland, 2016.
- Pano-Azucena, A.D.; Tlelo-Cuautle, E.; Tan, S.X.D.; Ovilla-Martinez, B.; Gerardo de la Fraga, L. FPGA-Based Implementation of a Multilayer Perceptron Suitable for Chaotic Time Series Prediction. Technologies **2018**, 6, 90.
- Zynq-7000 SoC Technical Reference Manual, UG585 (v1.12.2). 1 July 2018. Available online: www.xilinx.com (accessed on 14 November 2019).
- Zynq Ultrascale+ Device Technical Reference Manual, UG1085 (v1.8). 3 August 2018. Available online: www.xilinx.com (accessed on 14 November 2019).
- Stratix 10 GX/SX Device Overview. Available online: www.intel.com (accessed on 14 November 2019).
- Arria 5/10 SoC FPGAs. Available online: www.intel.com (accessed on 14 November 2019).
- HajiRassouliha, A.; Taberner, A.J.; Nash, M.P.; Nielsen, P.M.F. Suitability of recent hardware accelerators (DSPs, FPGAs, and GPUs) for computer vision and image processing algorithms. Signal Process. Image Commun. **2018**, 68, 101–119.
- Venieris, S.I.; Bouganis, C.S. fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. **2018**, 30, 326–342.
- Chen, X.; Yu, Z. A Flexible and Energy-Efficient Convolutional Neural Network Acceleration with Dedicated ISA and Accelerator. IEEE Trans. VLSI Syst. **2018**, 26, 1408–1412.
- Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA. IEEE Trans. VLSI Syst. **2018**, 26, 1354–1367.
- Aimar, A.; Mostafa, H.; Calabrese, E.; Rios-Navarro, A.; Tapiador-Morales, R.; Lungu, I.A.; Milde, M.B.; Corradi, F.; Linares-Barranco, A.; Liu, S.C.; et al. NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps. IEEE Trans. Neural Netw. Learn. Syst. **2018**, 30, 644–656.
- Guo, K.; Sui, L.; Qiu, J.; Yu, J.; Wang, J.; Yao, S.; Han, S.; Wang, Y.; Yang, H. Angel-Eye: A Complete Design Flow for Mapping CNN onto Embedded FPGA. IEEE Trans. CAD Integr. Circuits Syst. **2018**, 37, 35–47.
- Li, G.; Li, F.; Zhao, T.; Cheng, J. Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, Dresden, Germany, 19–23 March 2018; pp. 1163–1166.
- Jin, X.; Xu, C.; Feng, J.; Wei, Y.; Xiong, J.; Yan, S. Deep learning with S-shaped rectified linear activation units. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Cong, J.; Xiao, B. Minimizing computation in convolutional neural networks. In Proceedings of the 24th International Conference on Artificial Neural Networks, Hamburg, Germany, 15–19 September 2014; pp. 281–290.
- Meloni, P.; Capotondi, A.; Deriu, G.; Brian, M.; Conti, F.; Rossi, D.; Raffo, L.; Benini, L. NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs. ACM Trans. Reconfig. Technol. Syst. **2018**, 11, 18.
- Spagnolo, F.; Perri, S.; Frustaci, F.; Corsonello, P. Designing Fast Convolution Engines for Deep Learning Applications. In Proceedings of the 25th IEEE International Conference on Electronics, Circuits and Systems, Bordeaux, France, 9–12 December 2018.
- Kouris, A.; Venieris, S.I.; Bouganis, C.S. Cascade^CNN: Pushing the Performance Limits of Quantisation in Convolutional Neural Networks. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications, Dublin, Ireland, 27–31 August 2018.
- Lee, S.; Kim, D.; Nguyen, D.; Lee, J. Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs. IEEE Trans. CAD Integr. Circuits Syst. **2018**, 38, 888–897.
- AMBA AXI4, AXI4-Lite, and AXI4-Stream Protocol Assertion User Guide. Available online: www.infocenter.arm.com (accessed on 14 November 2019).
- Cortex-A9 NEON Media Processing Engine Technical Reference Manual, rev. r3p0. Available online: www.infocenter.arm.com (accessed on 14 November 2019).
- Xilinx Zynq-7000 External Memory Interfaces. Available online: https://www.xilinx.com/products/technology/memory.html#externalMemory (accessed on 14 November 2019).

**Figure 4.** Dispatching two adjacent windows by the proposed buffer architecture: (**a**) example on a 4 × 8 input; (**b**) the strategy used to recouple input data; (**c**) data provided by the registers depending on nr.

**Figure 9.** Analysis of the computational capabilities achievable within: (**a**) the XC7Z020 device; (**b**) the XC7Z045 device.

| Work | Design (Device) | Freq. [MHz] | Gops | DSPs | LUTs | FFs | BRAMs [Mb] | DE (Gops/DSPs) | CONVs Time [ms] | FCs + Softmax Time [ms] | PEff (Gops/W) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| New | ES ^{1} (XC7Z020 [18]) | 150 | 95.5 | 220 (100%) | 13,455 (25.3%) | 19,129 (18%) | 3.44 (70%) | 0.434 | 376.3 | 48 | 38.5 |
| [30] | SA ^{2} (XC7Z020 [18]) | 150 | 84.3 | 190 (86.3%) | 29,867 (56%) | 35,489 (33%) | 3 (61%) | 0.443 | 364 | NP ^{4} | 24.1 |
| [26] | ES (XC7Z020 [18]) | 125 | 48.53 | 220 (100%) | NA ^{3} | NA | NA | 0.22 | 633 | NP | 27.7 |
| New | ES (XC7Z045 [18]) | 167 | 425.32 | 880 (97.8%) | 30,161 (13.8%) | 47,832 (10.9%) | 12.9 (67.5%) | 0.48 | 84.5 | 48 | 135 |
| [27] | SA (XC7Z045 [18]) | 150 | 36.8 | 197 (21.8%) | 18,578 (8.5%) | 8049 (1.84%) | 0.773 (4%) | 0.186 | 1639.3 | NP | 79.7 |
| [30] | SA (XC7Z045 [18]) | 214 | 137 | 780 (86.6%) | 182,616 (84%) | 127,653 (29%) | 17.08 (87%) | 0.175 | 224.6 | NP | 14.2 |
| [26] | ES (XC7Z045 [18]) | 125 | 155.81 | 855 (95%) | NA | NA | NA | 0.182 | 249.5 | NP | 38.8 |
| [31] | SA (XC7Z045 [18]) | 150 | 374.98 | 900 (100%) | 113,672 (52%) | 240,640 (55%) | 19.16 (100%) | 0.416 | 82.03 | NP | NA |
| [34] | ES (XC7Z045 [18]) | 140 | 169 | 864 (96%) | 88,154 (35.1%) | 61,250 (14.1%) | 11.25 (58.7%) | 0.195 | 181.8 | 72.6 | 16.9 |
| [29] | ES (XC7Z100 [18]) | 60 | 17.2 | 128 (6.3%) | 229,000 (83%) | 107,000 (19%) | 13.6 (51.1%) | 0.134 | 2269 * | * included | 27.4 |
| [28] | ES (GX1150 [21]) | 200 | 715.9 | 1518 (100%) | 141,312 (32%) | NA | 43.6 (82%) | 0.47 | 43.2 * | * included | NA |

^{1} Embedded system; ^{2} standalone accelerator; ^{3} not available; ^{4} not performed.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Spagnolo, F.; Perri, S.; Frustaci, F.; Corsonello, P.
Energy-Efficient Architecture for CNNs Inference on Heterogeneous FPGA. *J. Low Power Electron. Appl.* **2020**, *10*, 1.
https://doi.org/10.3390/jlpea10010001
