# Optimized Compression for Implementing Convolutional Neural Networks on FPGA

## Abstract


## 1. Introduction

## 2. Motivation for Compressing CNNs

## 3. Model Compression

#### 3.1. Reversed-Pruning and Peak-Pruning

#### 3.2. Data Quantization

#### 3.3. Efficient Storage

## 4. Hardware Implementation

#### 4.1. Overall Architecture

#### 4.2. Hardware-PE Architecture

- (1) The data transmission scheme was optimized. In previous research, the feature maps were transmitted by treating each map as a unit [26]. As shown in Figure 10, assume the input to a network layer consists of three feature maps, stored in memory in order. The traditional scheme reads out the first map line by line, then the second and third maps. However, the calculation module can only begin outputting results once all the data have been fed in, which causes a large output delay and wastes computing resources. We therefore optimized the transmission to operate at pixel granularity: the first pixel of each map is transmitted first, then the second pixel of each map, and so on. This allows the module to compute and output part of the results while data are still being transmitted, so both the output delay and the waste of computing resources are significantly reduced.
- (2) Non-zero detection circuits hierarchically broadcast only the non-zero input data to each PE, exploiting the sparsity of the input activations, as shown in Figure 11. A multiplication occurs only when the input activation is non-zero, which greatly reduces the computing resources required.
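The two optimizations above can be modeled in software. The following Python sketch (an illustration of the idea, not the actual RTL; all names are my own) reorders three map-major feature maps into pixel-major order and drops zero activations before they would reach a multiplier:

```python
import numpy as np

def pixel_major_stream(feature_maps):
    """Interleave C feature maps pixel by pixel instead of map by map.

    feature_maps: array of shape (C, H, W), stored map-major in memory.
    Yields the first pixel of every map, then the second pixel, etc.,
    so a downstream PE can start accumulating across channels immediately
    instead of waiting for whole maps to arrive.
    """
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)
    for p in range(h * w):          # pixel index within a map
        for ch in range(c):         # broadcast this pixel of every map
            yield flat[ch, p]

def nonzero_broadcast(stream):
    """Model of the non-zero detection circuit: skip zero activations
    so multiplications occur only for non-zero inputs."""
    for a in stream:
        if a != 0:
            yield a

maps = np.array([[[0, 1], [2, 0]],
                 [[3, 0], [0, 4]],
                 [[0, 5], [6, 0]]])
order = list(pixel_major_stream(maps))    # [0, 3, 0, 1, 0, 5, 2, 0, 6, 0, 4, 0]
active = list(nonzero_broadcast(order))   # [3, 1, 5, 2, 6, 4]
```

With three 2 × 2 maps, half the streamed values here are zero, so the non-zero filter halves the multiplier workload in this toy case.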

- (1) Decoding circuits were customized to recover the original sparse weight matrix from the CCOO-format non-zero weights and their indices. Thanks to the efficient storage approach we proposed during model compression, our decoding circuits, unlike Han’s method, require neither complex logic designs nor extra calculations.
- (2) The convolver performs the window convolution operation, which is essentially multiplication. As illustrated in Figure 12, the kernel size was 2 × 2, and the traditional convolution window slides row by row. When we place the pooling layers between the convolutional layers and ReLU, the convolution window instead slides according to the size of the pooling window, which reduces clock cycles and avoids storing temporarily unused results. A further advantage is that this pipelining reduces the data cached after convolution by a factor of 4 without affecting the final result.
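The pooling-ordered window slide in (2) can be sketched as a behavioral model. This is a minimal Python illustration under assumed parameters (stride-1 convolution, stride-2 2 × 2 max-pooling), not the hardware convolver itself:

```python
import numpy as np

def conv_then_pool_fused(img, kernel, pool=2):
    """Slide the convolution window in pooling order: the four conv
    results belonging to one 2x2 pooling window are produced back to
    back and reduced to their max immediately, so only 1 of every 4
    convolution outputs is ever cached."""
    kh, kw = kernel.shape
    oh = img.shape[0] - kh + 1          # conv output height (stride 1)
    ow = img.shape[1] - kw + 1          # conv output width  (stride 1)
    out = np.empty((oh // pool, ow // pool))
    for pr in range(0, oh, pool):       # walk pooling windows, not rows
        for pc in range(0, ow, pool):
            best = -np.inf
            for dr in range(pool):      # 4 conv positions per pool window
                for dc in range(pool):
                    r, c = pr + dr, pc + dc
                    v = np.sum(img[r:r + kh, c:c + kw] * kernel)
                    best = max(best, v)  # pool at once: cache 1 value, not 4
            out[pr // pool, pc // pool] = best
    return out
```

Row-major sliding would have to buffer two full rows of convolution results before the first pooled value is ready; the pooling-ordered slide emits each pooled value as soon as its four convolutions finish.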

- (1) The adder sums all results from the convolver with the bias from the input buffer and, if needed, intermediate data from the output buffer.
- (2) Max-pooling applies a 2 × 2 window to the input feature map and outputs the maximum value in each window.
- (3) ReLU is a non-linear operator that is especially suitable for hardware implementation.
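The adder–pooling–ReLU chain above can be summarized in a small behavioral model (the function signature and buffer names are illustrative, not from the paper; pooling is placed before ReLU as in the pipeline described in Section 4.2):

```python
import numpy as np

def pe_postprocess(conv_sum, bias, intermediate=None):
    """Post-convolution chain of a PE: adder -> 2x2 max-pool -> ReLU.

    conv_sum:     accumulated convolver outputs, shape (H, W), H and W even
    bias:         bias value from the input buffer
    intermediate: optional partial sums read back from the output buffer
    """
    s = conv_sum + bias                         # adder: convolver results + bias
    if intermediate is not None:
        s = s + intermediate                    # accumulate intermediate data
    h, w = s.shape
    pooled = s.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # 2x2 max-pool
    return np.maximum(pooled, 0)                # ReLU non-linearity
```

Because max-pooling commutes with the monotonic ReLU, applying ReLU after pooling yields the same result as the conventional conv–ReLU–pool order while processing one quarter as many values.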

## 5. Performance Analysis

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature **2015**, 521, 436.
- Du, C.; Gao, S. Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network. IEEE Access **2017**, 5, 15750–15761.
- Liu, Y.; Wu, Q.; Tang, L.; Shi, H. Gaze-assisted multi-stream deep neural network for action recognition. IEEE Access **2017**, 5, 19432–19441.
- Erhan, D.; Szegedy, C.; Toshev, A.; Anguelov, D. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2147–2154.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Jiang, W.; Chen, Y.; Jin, H.; Zheng, R.; Chi, Y. A novel GPU-based efficient approach for convolutional neural networks with small filters. J. Signal Process. Syst. **2017**, 86, 313–325.
- Chen, T.; Du, Z.; Sun, N.; Wang, J.; Wu, C.; Chen, Y.; Temam, O. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM Sigplan Not. **2014**, 49, 269–284.
- Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 609–622.
- Liu, D.; Chen, T.; Liu, S.; Zhou, J.; Zhou, S.; Teman, O.; Feng, X.; Zhou, X.; Chen, Y. Pudiannao: A polyvalent machine learning accelerator. In Proceedings of the ACM SIGARCH Computer Architecture News, New York, NY, USA, 14–18 March 2015; pp. 369–381.
- Farabet, C.; Poulet, C.; Han, J.Y.; LeCun, Y. Cnp: An fpga-based processor for convolutional networks. In Proceedings of the FPL 2009 International Conference on Field Programmable Logic and Applications, Prague, Czech Republic, 31 August–2 September 2009; pp. 32–37.
- Farabet, C.; Martini, B.; Corda, B.; Akselrod, P.; Culurciello, E.; LeCun, Y. Neuflow: A runtime reconfigurable dataflow processor for vision. In Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Colorado Springs, CO, USA, 20–25 June 2011; pp. 109–116.
- Meng, F.; Wang, X.; Shao, F.; Wang, D.; Hua, X. Energy-Efficient Gabor Kernels in Neural Networks with Genetic Algorithm Training Method. Electronics **2019**, 8, 105.
- Liu, Z.; Chow, P.; Xu, J.; Jiang, J.; Dou, Y.; Zhou, J. A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs. Electronics **2019**, 8, 65.
- Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, Canada, 7–12 December 2015; pp. 1135–1143.
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv **2015**, arXiv:1510.00149.
- Han, S.; Kang, J.; Mao, H.; Hu, Y.; Li, X.; Li, Y.; Xie, D.; Luo, H.; Yao, S.; Wang, Y. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 75–84.
- Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, South Korea, 18–22 June 2016; pp. 243–254.
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
- LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal brain damage. In Advances in Neural Information Processing Systems; Morgan-Kaufmann: San Francisco, CA, USA, 1990; pp. 598–605.
- Hassibi, B.; Stork, D.G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems; Morgan-Kaufmann: San Francisco, CA, USA, 1993; pp. 164–171.
- Gysel, P.; Motamedi, M.; Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv **2016**, arXiv:1604.03168.
- Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. **2017**, 18, 6869–6898.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
- Farabet, C.; Martini, B.; Akselrod, P.; Talay, S.; LeCun, Y.; Culurciello, E. Hardware accelerated convolutional neural networks for synthetic vision systems. ISCAS **2010**, 2010, 257–260.
- Gokhale, V.; Jin, J.; Dundar, A.; Martini, B.; Culurciello, E. A 240 g-ops/s mobile coprocessor for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 682–687.
- Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
- Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35.

**Figure 1.** Network architecture of AlexNet. CONV: convolutional layer; FC: fully connected layer; POOL: pooling layer.

**Figure 5.** The sparsities of weight (W) and input activation (IA) and the reduction of multiplication and accumulation (MAC).

**Figure 8.** The overall architecture. FPGA: field programmable gate array; DDR: double data rate synchronous dynamic random-access memory; PS: processing system; PL: programmable logic; PE: processing element; DMA: direct memory access.

Order | AlexNet | Reversed-Pruning | Peak-Pruning |
---|---|---|---|
1st | CONV1 | FC3 | FC1 |
2nd | CONV2 | FC2 | FC2 |
3rd | CONV3 | FC1 | FC3 |
4th | CONV4 | CONV5 | CONV3 |
5th | CONV5 | CONV4 | CONV4 |
6th | FC1 | CONV3 | CONV5 |
7th | FC2 | CONV2 | CONV2 |
8th | FC3 | CONV1 | CONV1 |

Layer | Weight | Han’s Pruning Sparsity | Reversed-Pruning Sparsity | Reversed Accuracy | Peak-Pruning Sparsity | Peak Accuracy |
---|---|---|---|---|---|---|
CONV1 | 35 K | 0.84 | 0.65 | 57.14% | 0.65 | 57.18% |
CONV2 | 307 K | 0.38 | 0.28 | 57.16% | 0.32 | 57.18% |
CONV3 | 885 K | 0.35 | 0.28 | 57.08% | 0.30 | 57.09% |
CONV4 | 663 K | 0.37 | 0.28 | 57.11% | 0.31 | 57.08% |
CONV5 | 442 K | 0.37 | 0.23 | 57.13% | 0.41 | 57.15% |
FC1 | 38 M | 0.09 | 0.08 | 57.16% | 0.05 | 57.67% |
FC2 | 17 M | 0.09 | 0.05 | 57.63% | 0.06 | 57.23% |
FC3 | 4 M | 0.25 | 0.08 | 57.99% | 0.27 | 57.07% |
Total | 61 M (57.13%) | 6.81 M (9×) | 4.85 M (13×) | 57.14% | 4.77 M (13×) | 57.18% |

Models | Original (FP32) | Pruned (FP32) | INT16 Quantization | INT8 Quantization |
---|---|---|---|---|
LeNet-5 | 99.06% | 99.10% | 98.83% | 98.31% |
AlexNet | 57.13% | 57.14% | 57.05% | 55.99% |

**Table 4.** Comparison of different sparse matrix storage formats. COO: coordinate; CSR: compressed sparse row; CSC: compressed sparse column; CCOO: compressed coordinate.

Storage Formats | Arrays | Numbers |
---|---|---|
COO | 3: Non-zero values; Row indices; Column indices | 3a |
CSR | 3: Non-zero values; Row indices; Column offsets | 2a + n + 1 |
CSC | 3: Non-zero values; Row offsets; Column indices | 2a + n + 1 |
CCOO (ours) | 2: Non-zero values; Row indices + Column indices | 2a |
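Table 4’s count of 2a numbers for CCOO can be reproduced with a toy encoder/decoder. The bit-packing of row and column indices into one integer below is an assumed scheme for illustration; the excerpt specifies only that the two index arrays are merged into one:

```python
import numpy as np

def ccoo_encode(mat, col_bits=8):
    """CCOO-style encoding: one array of non-zero values plus one array
    packing (row, col) into a single integer, i.e. 2a stored numbers for
    a non-zeros, versus COO's 3a. Assumes the column index fits in
    col_bits bits (an illustrative choice, not the paper's layout)."""
    rows, cols = np.nonzero(mat)
    values = mat[rows, cols]
    packed = (rows << col_bits) | cols      # row in high bits, col in low bits
    return values, packed

def ccoo_decode(values, packed, shape, col_bits=8):
    """Recover the original sparse matrix from the two CCOO arrays."""
    mat = np.zeros(shape, dtype=values.dtype)
    rows = packed >> col_bits
    cols = packed & ((1 << col_bits) - 1)
    mat[rows, cols] = values
    return mat
```

Because decoding is a shift and a mask per non-zero weight, a hardware decoder along these lines needs no offset arithmetic of the kind CSR/CSC column or row offsets require.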

 | Original AlexNet | Han’s Pruning + Quantization [16] | Reversed-Pruning + Quantization | Peak-Pruning + Quantization |
---|---|---|---|---|
Convolutional Layers | 8 MB | - | 1.2 MB | 1.5 MB |
Fully Connected Layers | 236 MB | - | 7.8 MB | 7.2 MB |
Total | 244 MB | 9.0 MB | 9.0 MB | 8.7 MB |
Compressibility | 1× | 27× | 27× | 28× |

Resource | Utilization | Available | Utilization % |
---|---|---|---|
LUT | 101,953 | 230,400 | 44.25 |
LUTRAM | 4790 | 101,760 | 4.71 |
FF | 127,577 | 460,800 | 27.69 |
BRAM | 198.50 | 312 | 63.62 |
URAM | 80 | 96 | 83.33 |
DSP | 696 | 1728 | 40.28 |
BUFG | 12 | 544 | 2.21 |

**Table 7.** Evaluation results on the central processing unit (CPU), graphics processing unit (GPU), and our accelerator.

Platform | CPU | GPU | FPGA |
---|---|---|---|
Vendor | Intel i7-6700 | NVIDIA GTX 1080 Ti | Xilinx ZCU104 |
Technology | 14 nm | 16 nm | 16 nm |
Power (W) | 65 | 250 | 17.67 |
Latency (ms) | 834.69 (CONV) 926.26 (Overall) | 5.11 (CONV) 6.15 (Overall) | 4.58 (CONV) 102.76 (Overall) |
Speedup | 1.0 (CONV) 1.0 (Overall) | 163.3 (CONV) 150.6 (Overall) | 182.3 (CONV) 9.1 (Overall) |
Throughput (GOP/s) | 1.59 (CONV) 1.56 (Overall) | 260.27 (CONV) 235.77 (Overall) | 290.40 (CONV) 14.11 (Overall) |
Energy efficiency (GOP/s/W) | 0.02 (CONV) 0.02 (Overall) | 1.04 (CONV) 0.94 (Overall) | 16.44 (CONV) 0.80 (Overall) |
Ratio | 1.0 (CONV) 1.0 (Overall) | 52.0 (CONV) 47.0 (Overall) | 822.0 (CONV) 40.0 (Overall) |

 | CVPRW2014 [27] | FPGA2015 [28] | FPGA2016 [29] | Ours |
---|---|---|---|---|
Platform | Zynq XC7Z045 | Virtex7 VX485T | Zynq XC7Z045 | Zynq XCZU7EV |
Frequency (MHz) | 150 | 100 | 150 | 300 |
Quantization Strategy | 16-bit fixed | 32-bit float | 16-bit fixed | 8-bit int |
Throughput (GOP/s) | 23.18 | 61.62 | 187.80 (CONV) 136.97 (Overall) | 290.40 (CONV) 14.11 (Overall) |
Power (W) | 8 | 18.61 | 9.63 | 17.67 |
Energy efficiency (GOP/s/W) | 2.90 | 3.31 | 19.50 (CONV) 14.22 (Overall) | 16.44 (CONV) 0.80 (Overall) |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W.
Optimized Compression for Implementing Convolutional Neural Networks on FPGA. *Electronics* **2019**, *8*, 295.
https://doi.org/10.3390/electronics8030295

**AMA Style**

Zhang M, Li L, Wang H, Liu Y, Qin H, Zhao W.
Optimized Compression for Implementing Convolutional Neural Networks on FPGA. *Electronics*. 2019; 8(3):295.
https://doi.org/10.3390/electronics8030295

**Chicago/Turabian Style**

Zhang, Min, Linpeng Li, Hai Wang, Yan Liu, Hongbo Qin, and Wei Zhao.
2019. "Optimized Compression for Implementing Convolutional Neural Networks on FPGA" *Electronics* 8, no. 3: 295.
https://doi.org/10.3390/electronics8030295