# Early-Stage Neural Network Hardware Performance Analysis

## Abstract


## 1. Introduction

#### 1.1. Contribution

#### 1.2. Related Work

## 2. Method

#### 2.1. The Impact of Quantization on Hardware Implementation

#### 2.2. Data Path

#### 2.3. Communication

#### 2.4. Local Memory

#### 2.5. Roofline Analysis

#### Roofline Analysis Examples

## 3. Results

#### 3.1. Experimental Methodology

#### 3.2. System-Level Design Methodology

#### 3.3. Evaluation of Eyeriss Architecture

## 4. Discussion

#### 4.1. Conclusions

#### 4.2. Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| ASIC | Application-Specific Integrated Circuit |
| CAD | Computer-Aided Design |
| CNN | Convolutional Neural Network |
| DDR | Double Data Rate (Memory) |
| DL | Deep Learning |
| EDA | Electronic Design Automation |
| FLOPS | Floating point Operations |
| FMA | Fused Multiply-Add |
| FPGA | Field Programmable Gate Array |
| GOPS | Giga Operations |
| HDL | Hardware Description Language |
| IC | Integrated Circuit |
| IP | Intellectual Property |
| MAC | Multiply Accumulate |
| NN | Neural Network |
| OPS | Operations |
| PE | Processing Engine |
| RAM | Random Access Memory |
| SoC | System on a Chip |
| SRAM | Static Random Access Memory |
| TOPS | Tera Operations |
| TSMC | Taiwan Semiconductor Manufacturing Company |
| VLSI | Very Large-Scale Integration |

## References

1. Qi, W.; Su, H.; Aliverti, A. A Smartphone-Based Adaptive Recognition and Real-Time Monitoring System for Human Activities. IEEE Trans. Hum.-Mach. Syst. **2020**, 50, 414–423.
2. Su, H.; Hu, Y.; Karimi, H.R.; Knoll, A.C.; Ferrigno, G.; Momi, E.D. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Netw. **2020**, 131, 291–299.
3. Su, H.; Qi, W.; Yang, C.; Sandoval, J.; Ferrigno, G.; Momi, E.D. Deep Neural Network Approach in Robot Tool Dynamics Identification for Bilateral Teleoperation. IEEE Robot. Autom. Lett. **2020**, 5, 2943–2949.
4. Su, H.; Qi, W.; Hu, Y.; Karimi, H.R.; Ferrigno, G.; De Momi, E. An Incremental Learning Framework for Human-like Redundancy Optimization of Anthropomorphic Manipulators. IEEE Trans. Ind. Inform. **2020**.
5. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
6. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10726–10734.
7. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
8. Ridnik, T.; Lawen, H.; Noy, A.; Friedman, I. TResNet: High Performance GPU-Dedicated Architecture. arXiv **2020**, arXiv:2003.13630.
9. Gysel, P.; Motamedi, M.; Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv **2016**, arXiv:1604.03168.
10. Yang, J.; Shen, X.; Xing, J.; Tian, X.; Li, H.; Deng, B.; Huang, J.; Hua, X.S. Quantization Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
11. Jin, Q.; Yang, L.; Liao, Z. Towards Efficient Training for Neural Network Quantization. arXiv **2019**, arXiv:1912.10207.
12. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020.
13. Zhao, X.; Wang, Y.; Cai, X.; Liu, C.; Zhang, L. Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020.
14. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.A.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12.
15. Raihan, M.A.; Goli, N.; Aamodt, T.M. Modeling deep learning accelerator enabled GPUs. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, WI, USA, 24–26 March 2019; pp. 79–92.
16. Chen, Y.H.; Yang, T.J.; Emer, J.; Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Top. Circuits Syst. **2019**, 9, 292–308.
17. Jiao, Y.; Han, L.; Jin, R.; Su, Y.J.; Ho, C.; Yin, L.; Li, Y.; Chen, L.; Chen, Z.; Liu, L.; et al. A 12 nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS. In Proceedings of the IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 136–140.
18. Abts, D.; Ross, J.; Sparling, J.; Wong-VanHaren, M.; Baker, M.; Hawkins, T.; Bell, A.; Thompson, J.; Kahsai, T.; Kimmell, G.; et al. Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 20 May–3 June 2020; pp. 145–158.
19. Jouppi, N.P.; Yoon, D.H.; Kurian, G.; Li, S.; Patil, N.; Laudon, J.; Young, C.; Patterson, D. A domain-specific supercomputer for training deep neural networks. Commun. ACM **2020**, 63, 67–78.
20. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits **2017**, 52, 127–138.
21. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016; pp. 243–254.
22. Rivas-Gomez, S.; Pena, A.J.; Moloney, D.; Laure, E.; Markidis, S. Exploring the Vision Processing Unit as Co-Processor for Inference. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018.
23. Reddi, V.J.; Cheng, C.; Kanter, D.; Mattson, P.; Schmuelling, G.; Wu, C.J.; Anderson, B.; Breughe, M.; Charlebois, M.; Chou, W.; et al. MLPerf Inference Benchmark. arXiv **2019**, arXiv:1911.02549.
24. Baskin, C.; Liss, N.; Zheltonozhskii, E.; Bronstein, A.M.; Mendelson, A. Streaming architecture for large-scale quantized neural networks on an FPGA-based dataflow platform. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018; pp. 162–169.
25. Ankit, A.; Hajj, I.E.; Chalamalasetti, S.R.; Ndu, G.; Foltin, M.; Williams, R.S.; Faraboschi, P.; Hwu, W.W.; Strachan, J.P.; Roy, K.; et al. PUMA: A Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’19, Providence, RI, USA, 13–17 April 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 715–731.
26. Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA’17, Monterey, CA, USA, 22–24 February 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 65–74.
27. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. How to Evaluate Deep Neural Network Processors: TOPS/W (Alone) Considered Harmful. IEEE Solid-State Circuits Mag. **2020**, 12, 28–41.
28. Lee, J.; Won, T.; Lee, T.K.; Lee, H.; Gu, G.; Hong, K. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network. arXiv **2020**, arXiv:2001.06268.
29. Baskin, C.; Schwartz, E.; Zheltonozhskii, E.; Liss, N.; Giryes, R.; Bronstein, A.M.; Mendelson, A. UNIQ: Uniform Noise Injection for Non-Uniform Quantization of Neural Networks. arXiv **2018**, arXiv:1804.10969.
30. Williams, S.; Waterman, A.; Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM **2009**, 52, 65–76.
31. McMahon, F.H. The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range; Technical Report; Lawrence Livermore National Lab.: Livermore, CA, USA, 1986.
32. Wang, L.; Zhan, J.; Gao, W.; Jiang, Z.; Ren, R.; He, X.; Luo, C.; Lu, G.; Li, J. BOPS, Not FLOPS! A New Metric and Roofline Performance Model For Datacenter Computing. arXiv **2018**, arXiv:1801.09212.
33. Parashar, A.; Raina, P.; Shao, Y.S.; Chen, Y.H.; Ying, V.A.; Mukkara, A.; Venkatesan, R.; Khailany, B.; Keckler, S.W.; Emer, J. Timeloop: A systematic approach to dnn accelerator evaluation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, WI, USA, 24–26 March 2019; pp. 304–315.
34. Wu, Y.N.; Sze, V. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Westminster, CO, USA, 4–7 November 2019.
35. Mishra, A.; Nurvitadhi, E.; Cook, J.J.; Marr, D. WRPN: Wide Reduced-Precision Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
36. Jiang, Z.; Li, J.; Zhan, J. The Pitfall of Evaluating Performance on Emerging AI Accelerators. arXiv **2019**, arXiv:1911.02987.
37. Shafiee, A.; Nag, A.; Muralimanohar, N.; Balasubramonian, R.; Strachan, J.P.; Hu, M.; Williams, R.S.; Srikumar, V. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA’16, Seoul, Korea, 18–22 June 2016; pp. 14–26.
38. Morcel, R.; Hajj, H.; Saghir, M.A.R.; Akkary, H.; Artail, H.; Khanna, R.; Keshavamurthy, A. FeatherNet: An Accelerated Convolutional Neural Network Design for Resource-constrained FPGAs. ACM Trans. Reconfigurable Technol. Syst. (TRETS) **2019**, 12, 6:1–6:27.
39. Wang, E.; Davis, J.J.; Cheung, P.Y.; Constantinides, G.A. LUTNet: Rethinking Inference in FPGA Soft Logic. arXiv **2019**, arXiv:1904.00938.
40. Baskin, C.; Chmiel, B.; Zheltonozhskii, E.; Banner, R.; Bronstein, A.M.; Mendelson, A. CAT: Compression-Aware Training for bandwidth reduction. arXiv **2019**, arXiv:1909.11481.
41. Chmiel, B.; Baskin, C.; Banner, R.; Zheltonozhskii, E.; Yermolin, Y.; Karbachevsky, A.; Bronstein, A.M.; Mendelson, A. Feature Map Transform Coding for Energy-Efficient CNN Inference. arXiv **2019**, arXiv:1905.10830.
42. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) **2015**, 115, 211–252.
43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
44. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
45. Gong, R.; Liu, X.; Jiang, S.; Li, T.; Hu, P.; Lin, J.; Yu, F.; Yan, J. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
46. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv **2016**, arXiv:1605.07146.
47. Cavigelli, L.; Rutishauser, G.; Benini, L. EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. **2019**, 9, 723–734.

**Figure 1.** Area vs. bitwidth for a $3\times 3$ PE with a single input and output channel. All of the weights and activations use the same bitwidth and the accumulator width is four bits larger, which is enough to store the result. The quadratic fit is $A = 12.39b^{2} + 86.07b - 14.02$ with a goodness of fit $R^{2} = 0.9999877$, where $A$ is the area and $b$ is the bitwidth of the PE.
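The quadratic fit in the Figure 1 caption can be evaluated directly. The sketch below is only an illustration: the function names are hypothetical, and it assumes the fitted area is expressed in µm² (the caption does not state the unit), so that 1 mm² corresponds to $10^{6}$ area units.

```python
import math


def pe_area(bitwidth: float) -> float:
    """Quadratic fit from the Figure 1 caption: area of a 3x3 PE vs. bitwidth.

    Assumption: the result is in um^2 (the caption does not give the unit).
    """
    return 12.39 * bitwidth**2 + 86.07 * bitwidth - 14.02


def pes_per_area(bitwidth: float, silicon_um2: float = 1e6) -> int:
    """Whole PEs that fit into the given silicon area (default: 1 mm^2)."""
    return math.floor(silicon_um2 / pe_area(bitwidth))


for b in (8, 16, 32):
    print(f"{b:>2}-bit PE: area ~ {pe_area(b):9.1f}, PEs per mm^2 ~ {pes_per_area(b)}")
```

Under that unit assumption, the estimates for 8-bit and 16-bit PEs land close to the PE counts reported in Table 2, while the 32-bit estimate is somewhat higher than the tabulated value, so the fit should be read as an approximation rather than an exact area model.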

**Figure 2.** Our $3\times 3$ kernel 8-bit processing engine (PE) layout using TSMC 28 nm technology. The carry-save adder can fit 12-bit numbers, which is large enough to store the output of the convolution.

**Figure 3.** Area vs. BOPS for a $3\times 3$ PE with a single input and output channel and variable bitwidth. The linear fit is $A = 1.694B + 153.46$ with a goodness of fit $R^{2} = 0.998$, where $A$ is the area and $B$ is BOPS.
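The linear fit in the Figure 3 caption maps a BOPS count to an area estimate. The sketch below pairs it with a BOPS helper; that helper is an assumption, since the caption does not restate the BOPS definition. It uses the approximation $mnk^{2}\left(b_{a}b_{w} + b_{a} + b_{w} + \log_{2}(nk^{2})\right)$ from the UNIQ paper by Baskin et al. listed in the references. All function names are hypothetical.

```python
import math


def conv_bops(n: int, m: int, k: int, b_w: int, b_a: int) -> float:
    """Approximate BOPS of a k x k convolution with n input and m output channels.

    Assumption: BOPS definition taken from the UNIQ paper, not from this caption.
    b_w and b_a are the weight and activation bitwidths.
    """
    return m * n * k**2 * (b_a * b_w + b_a + b_w + math.log2(n * k**2))


def pe_area_from_bops(bops: float) -> float:
    """Linear fit from the Figure 3 caption: A = 1.694 * B + 153.46."""
    return 1.694 * bops + 153.46


# A 3x3 PE with one input channel, one output channel and 8-bit operands.
bops_8bit = conv_bops(n=1, m=1, k=3, b_w=8, b_a=8)
print(f"BOPS ~ {bops_8bit:.1f}, estimated area ~ {pe_area_from_bops(bops_8bit):.1f}")
```

For the 8-bit single-channel case this gives an area estimate in the same range as the quadratic bitwidth fit of Figure 1, which is what one would expect if both fits describe the same PE.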

**Figure 4.** Area vs. BOPS for a $3\times 3$ PE with variable input ($n$) and output ($m$) feature dimensions, and variable bitwidth. Weights and activations use the same bitwidth and the accumulator width is set to $\log_{2}(9m) \cdot b_{w} \cdot b_{a}$.

**Figure 6.** SRAM area as a function of memory bits. The data was taken from Synopsys 28 nm Educational Design Kit SRAM specifications. (**a**) Single-port RAM area ($A$) vs. amount of data bits ($B$). The linear fit is $A = 2.94B + 3065$ with a goodness of fit $R^{2} = 0.986$. (**b**) Dual-port RAM area ($A$) vs. amount of data bits ($B$). The linear fit is $A = 4.16B + 3535$ with a goodness of fit $R^{2} = 0.916$.
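The two linear fits in the Figure 6 caption can be turned into a small area estimator for local memory. This is a minimal sketch with a hypothetical function name; the area unit is not stated in the caption, and the 8 KiB buffer size is only an illustrative input.

```python
def sram_area(data_bits: float, dual_port: bool = False) -> float:
    """SRAM area from the linear fits in the Figure 6 caption.

    Single-port: A = 2.94 * B + 3065; dual-port: A = 4.16 * B + 3535,
    where B is the number of data bits. The area unit is not stated there.
    """
    if dual_port:
        return 4.16 * data_bits + 3535
    return 2.94 * data_bits + 3065


# Illustrative input: an 8 KiB buffer (8 * 1024 bytes * 8 bits).
buffer_bits = 8 * 1024 * 8
single = sram_area(buffer_bits)
dual = sram_area(buffer_bits, dual_port=True)
print(f"single-port ~ {single:.0f}, dual-port ~ {dual:.0f}, ratio ~ {dual / single:.2f}")
```

The constant terms matter mostly for small memories; for large buffers the per-bit slopes dominate, and the dual-port estimate approaches roughly 1.4 times the single-port one.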

**Figure 7.** Roofline example. In the case of $\mathrm{App}1$, memory bandwidth prevents the program from achieving its expected performance. In the case of $\mathrm{App}2$, the same happens due to limited computational resources. Finally, $\mathrm{App}3$ represents a program that could achieve its maximum performance on a given system.
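The three cases in the Figure 7 caption follow from the standard roofline rule: attainable performance is the smaller of the compute peak and the product of memory bandwidth and operational intensity. The sketch below illustrates that rule; the function names and all numeric values are hypothetical and are not taken from the figure.

```python
def attainable_performance(peak_ops_per_s: float,
                           bandwidth_bits_per_s: float,
                           intensity_ops_per_bit: float) -> float:
    """Roofline bound: min(compute roof, memory roof)."""
    return min(peak_ops_per_s, bandwidth_bits_per_s * intensity_ops_per_bit)


def limiting_resource(peak_ops_per_s: float,
                      bandwidth_bits_per_s: float,
                      intensity_ops_per_bit: float) -> str:
    """Classify a workload as memory-bound or compute-bound.

    The ridge point is the intensity at which both roofs meet.
    """
    ridge_point = peak_ops_per_s / bandwidth_bits_per_s
    return "memory" if intensity_ops_per_bit < ridge_point else "compute"


# Hypothetical machine: 100 OPS/s peak, 10 bit/s of memory bandwidth.
PEAK, BANDWIDTH = 100.0, 10.0
for intensity in (2.0, 10.0, 50.0):
    perf = attainable_performance(PEAK, BANDWIDTH, intensity)
    print(f"intensity {intensity:5.1f} OPS/bit -> {perf:6.1f} OPS/s "
          f"({limiting_resource(PEAK, BANDWIDTH, intensity)}-bound)")
```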

**Figure 8.** OPS roofline: $3\times 3$ kernel, input and output have 256 features of $14\times 14$ pixels, 1 mm$^{2}$ accelerator with an 800-MHz frequency, and a DDR of $2.4$ GHz with 64-bit data bus.

**Figure 9.** OPS roofline: $3\times 3$ kernel, input and output have 64 features of $56\times 56$ pixels, 6 mm$^{2}$ accelerator with a 100-MHz frequency, and a DDR of $2.4$ GHz with 64-bit data bus.

**Figure 12.** Area ($A$) vs. BOPS ($B$) for a systolic array of $3\times 3$ PEs with variable input ($n$) and output ($m$) feature dimensions, and variable bitwidth. Weights and activations use the same bitwidth and the accumulator width is set to $\log_{2}(9m) \cdot b_{w} \cdot b_{a}$.

**Figure 13.** ResNet-18 roofline analysis for all layers. Red dots are the performance required by the layer, and green dots are the equivalent performance using partial-sum computation. The blue curves connect points corresponding to the same layer and they are only displayed for convenience.

**Figure 14.** VGG-16 on Eyeriss [20] hardware. Red dots are the performance required by the layer, and green dots are the equivalent performance using partial-sum computation. The blue curves connect points corresponding to the same layer and they are only displayed for convenience.

**Table 1.** Key characteristics of 32-bit floating-point and 32-bit fixed-point multiplier designs. The fixed-point multiplier uses approximately eight times less area, gates, and power than the floating-point one.

| Multiplier | Gates | Cells | Area [$\mu\mathrm{m}^{2}$] | Internal Power [mW] | Switching Power [mW] | Leakage Power [mW] | Dynamic Power [mW] |
|---|---|---|---|---|---|---|---|
| Floating-Point | 40,090 | 17,175 | 11,786 | 2.76 | 1.31 | 0.43 | 10.53 |
| Fixed-Point | 5065 | 1726 | 1489 | 0.49 | 0.32 | 0.04 | 1.053 |
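The "approximately eight times" statement in the Table 1 caption can be checked column by column. The sketch below simply divides the floating-point entries by the fixed-point ones; the dictionary keys are hypothetical labels for the table columns.

```python
# Values copied from Table 1; keys are illustrative column labels.
floating_point = {
    "gates": 40090, "cells": 17175, "area_um2": 11786,
    "internal_mw": 2.76, "switching_mw": 1.31,
    "leakage_mw": 0.43, "dynamic_mw": 10.53,
}
fixed_point = {
    "gates": 5065, "cells": 1726, "area_um2": 1489,
    "internal_mw": 0.49, "switching_mw": 0.32,
    "leakage_mw": 0.04, "dynamic_mw": 1.053,
}

# Ratio > 1 means the floating-point multiplier needs more of that resource.
ratios = {key: floating_point[key] / fixed_point[key] for key in floating_point}

for key, ratio in ratios.items():
    print(f"{key:>13}: floating-point / fixed-point = {ratio:5.2f}")
```

Gate count and area come out just under a factor of eight, while the individual power components vary more widely around that figure.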

**Table 2.** Number of PEs with different bitwidths on 1 $\mathrm{mm}^{2}$ of silicon. Each PE can perform $3\times 3$ kernel multiplications.

| | 32-Bit Float | 32-Bit Fixed | 16-Bit Quant. | 8-Bit Quant. |
|---|---|---|---|---|
| PEs | 9 | 60 | 220 | 683 |

**Table 3.** The amount of computation (OPS/s) provided by the accelerators and memory throughput (OPS/bit) required by the 11th layer of ResNet-18.

| | 32-Bit Float | 32-Bit Fixed | 16-Bit Quant. | 8-Bit Quant. |
|---|---|---|---|---|
| GOPS/s | 72.00 | 392.0 | 1568 | 5408 |
| OPS/bit | 5.82 | 5.82 | 11.63 | 23.26 |
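The Table 3 entries can be placed on a roofline numerically. The sketch below rests on several assumptions that are not stated in the table itself: that Table 3 refers to the configuration of Figure 8 (DDR of 2.4 GHz with a 64-bit data bus), that this memory delivers one bus-wide transfer per cycle (153.6 Gbit/s), that the GOPS/s row is the compute roof, and that the OPS/bit row is the operational intensity of the layer. All names are hypothetical.

```python
# Memory system from the Figure 8 caption: 2.4 GHz DDR with a 64-bit data bus.
# Assumption: one bus-wide transfer per cycle of the stated 2.4 GHz rate.
DDR_RATE_HZ = 2.4e9
BUS_WIDTH_BITS = 64
BANDWIDTH_BITS_PER_S = DDR_RATE_HZ * BUS_WIDTH_BITS  # 153.6 Gbit/s

# Table 3: (compute provided in OPS/s, operational intensity in OPS/bit).
TABLE_3 = {
    "32-bit float": (72.0e9, 5.82),
    "32-bit fixed": (392.0e9, 5.82),
    "16-bit quant.": (1568e9, 11.63),
    "8-bit quant.": (5408e9, 23.26),
}


def roofline(peak_ops_per_s: float, intensity_ops_per_bit: float):
    """Return (attainable OPS/s, limiting resource) for one design point."""
    memory_roof = BANDWIDTH_BITS_PER_S * intensity_ops_per_bit
    if memory_roof < peak_ops_per_s:
        return memory_roof, "memory"
    return peak_ops_per_s, "compute"


results = {name: roofline(*row) for name, row in TABLE_3.items()}
for name, (ops, limit) in results.items():
    print(f"{name:>13}: {ops / 1e9:7.1f} GOPS/s ({limit}-bound)")
```

Under these assumptions the 32-bit and 16-bit design points stay below the memory roof, whereas the 8-bit design point provides more compute than the memory system can feed and becomes memory-bound.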

**Table 4.** The amount of computation (OPS/s) provided by the accelerators and memory throughput (OPS/bit) required by the second layer of ResNet-18.

| | 32-Bit Float | 32-Bit Fixed | 16-Bit Quant. | 8-Bit Quant. | 4-Bit Quant. |
|---|---|---|---|---|---|
| GOPS/s | 49.00 | 324.0 | 1296 | 3969 | 11,236 |
| OPS/bit | 9.16 | 9.16 | 18.32 | 36.64 | 73.27 |

| Design Stage | Tool or Language |
|---|---|
| Language | Verilog HDL |
| Logic Simulation | ModelSim 19.1 |
| Synthesis | Synopsys Design Compiler 2017.09-SP3 |
| Place and route | Cadence Innovus 2019.11 |

**Table 6.** Achievable performance of VGG-16 on Eyeriss hardware as seen from the roofline analysis. The latency is the amount of time the execution units need to calculate the data in that layer.

| Layer | Latency [ms] | Latency from Roofline [ms] |
|---|---|---|
| conv1-1 | 7.7 | 158.9 (+1963.6%) |
| conv1-2 | 165.2 | 191.4 (+15.9%) |
| conv2-1 | 82.6 | 117.3 (+42%) |
| conv2-2 | 165.2 | 165.2 |
| conv3-1 | 82.6 | 82.6 |
| conv3-2 | 165.2 | 165.2 |
| conv3-3 | 165.2 | 165.2 |
| conv4-1 | 82.6 | 84.2 |
| conv4-2 | 165.2 | 165.2 |
| conv4-3 | 165.2 | 165.2 |
| conv5-1 | 41.3 | 120.9 (+192.7%) |
| conv5-2 | 41.3 | 120.9 (+192.7%) |
| conv5-3 | 41.3 | 120.9 (+192.7%) |
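The percentages in Table 6 are the relative increase of the roofline latency over the execution-unit latency. The sketch below recomputes them from the tabulated values and also sums both columns; the variable and function names are hypothetical, and the totals are a derived illustration rather than figures reported in the table.

```python
# (layer, execution-unit latency [ms], latency from roofline [ms]) from Table 6.
TABLE_6 = [
    ("conv1-1", 7.7, 158.9), ("conv1-2", 165.2, 191.4),
    ("conv2-1", 82.6, 117.3), ("conv2-2", 165.2, 165.2),
    ("conv3-1", 82.6, 82.6), ("conv3-2", 165.2, 165.2),
    ("conv3-3", 165.2, 165.2), ("conv4-1", 82.6, 84.2),
    ("conv4-2", 165.2, 165.2), ("conv4-3", 165.2, 165.2),
    ("conv5-1", 41.3, 120.9), ("conv5-2", 41.3, 120.9),
    ("conv5-3", 41.3, 120.9),
]


def overhead_percent(compute_ms: float, roofline_ms: float) -> float:
    """Relative latency increase of the roofline estimate, in percent."""
    return (roofline_ms - compute_ms) / compute_ms * 100.0


for layer, compute_ms, roofline_ms in TABLE_6:
    print(f"{layer}: {compute_ms:6.1f} ms -> {roofline_ms:6.1f} ms "
          f"(+{overhead_percent(compute_ms, roofline_ms):.1f}%)")

total_compute_ms = sum(row[1] for row in TABLE_6)
total_roofline_ms = sum(row[2] for row in TABLE_6)
print(f"total: {total_compute_ms:.1f} ms -> {total_roofline_ms:.1f} ms")
```

Layers whose two latencies coincide are limited by the execution units alone; the first layer and the conv5 layers show the largest gap between the two estimates.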

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Karbachevsky, A.; Baskin, C.; Zheltonozhskii, E.; Yermolin, Y.; Gabbay, F.; Bronstein, A.M.; Mendelson, A.
Early-Stage Neural Network Hardware Performance Analysis. *Sustainability* **2021**, *13*, 717.
https://doi.org/10.3390/su13020717
