# DSCU: Accelerating CNN Inference in FPGAs with Dual Sizes of Compute Unit

## Abstract


## 1. Introduction

- We propose DSCU, which selects the best combination of compute units (CUs) through dynamic programming to address redundant computation, a common problem in accelerating CNN inference.
- We present a DSCU-based CNN accelerator design for the Xilinx Zynq ZU3EG.
- We conduct a comprehensive evaluation of DSCU over multiple CONV layers and FPGA-customized networks.

## 2. Background and Motivation

#### 2.1. Convolutional Neural Network

Algorithm 1: Function CONV
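The body of Algorithm 1 is not reproduced in this version. As a generic reference for what Function CONV computes, a standard convolution loop nest (stride 1, no padding; our own sketch, not the paper's exact pseudocode) looks like:

```python
def conv(ifm, weights):
    """Plain loop-nest convolution (stride 1, no padding) over
    ifm[in_ch][y][x] and weights[out_ch][in_ch][ky][kx].  A generic
    textbook reference, not the paper's exact Algorithm 1.
    """
    out_ch, in_ch = len(weights), len(ifm)
    k = len(weights[0][0])
    oh = len(ifm[0]) - k + 1        # output height
    ow = len(ifm[0][0]) - k + 1     # output width
    ofm = [[[0.0] * ow for _ in range(oh)] for _ in range(out_ch)]
    for oc in range(out_ch):            # output channels
        for ic in range(in_ch):         # input channels
            for y in range(oh):         # output rows
                for x in range(ow):     # output columns
                    for ky in range(k):     # kernel rows
                        for kx in range(k): # kernel columns
                            ofm[oc][y][x] += (weights[oc][ic][ky][kx]
                                              * ifm[ic][y + ky][x + kx])
    return ofm
```

The loop-tiling accelerators discussed in Section 2.3 reorder and tile exactly these six loops to fit the on-chip buffers of a CU.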

#### 2.2. High-Level Synthesis of FPGAs

#### 2.3. Related Work on Loop-Tiling CNN Accelerator

#### 2.4. Motivation

## 3. The DSCU Approach

#### 3.1. Workflow of DSCU

#### 3.2. Architecture of DSCU

#### 3.3. Design Details of Accelerated CNN Inference

Algorithm 2: Computing a single CONV3×3 layer

Algorithm 3: Function CU-CONV3×3

#### 3.4. The Latency Model of Basic Unit in DSCU

#### 3.5. Task Scheduling with Dynamic Programming for a Single Layer

**Definition 1.**

**Definition 2.**

**Definition 3.**

**Definition 4.**

**Definition 5.**
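The single-layer scheduling can be illustrated with a deliberately simplified one-dimensional sketch (all names and the 1-D simplification are ours; the paper's dynamic program runs over the full tiling space defined above): cover one feature-map dimension with tiles processed by the two CU sizes so that total latency is minimized.

```python
from functools import lru_cache

def schedule_layer(width, tile_sizes, tile_latency):
    """1-D sketch of single-layer scheduling: cover `width` pixels
    of a feature-map dimension with tiles of the available CU sizes,
    minimising total latency.  Tiles may overhang the edge, which is
    exactly the redundant computation DSCU tries to reduce.
    """
    @lru_cache(maxsize=None)
    def best(remaining):
        if remaining <= 0:          # dimension fully covered
            return 0.0
        return min(tile_latency[s] + best(remaining - s)
                   for s in tile_sizes)
    return best(width)

# Two hypothetical CU widths (20 and 8 pixels) with assumed latencies:
print(schedule_layer(36, (20, 8), {20: 100.0, 8: 30.0}))  # -> 150.0
```

In this toy instance, five 8-pixel tiles (covering 40 pixels, 4 of them redundant) beat any combination that uses the 20-pixel tile, showing how the best mix depends on both tile latency and overhang.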

#### 3.6. Generation of a CNN Solution by Voting

- Firstly, the optimal solution for every layer is obtained by the single-layer scheduling; assume this yields k distinct solutions.
- Secondly, the k solutions are voted on: each layer casts one vote for the ith solution it uses, $i\in [1,k]$.
- Finally, the solution with the highest number of votes is selected as the CNN's final solution.
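The three steps above reduce to a majority vote over the per-layer optima. A minimal sketch (function and variable names are ours, not the paper's):

```python
from collections import Counter

def vote_cnn_solution(layer_solutions):
    """Choose one CU combination for the whole network by majority
    vote over the per-layer optimal solutions.

    layer_solutions: one hashable solution descriptor per CONV layer,
    as produced by the single-layer dynamic-programming scheduler.
    """
    votes = Counter(layer_solutions)       # one vote per layer
    return votes.most_common(1)[0][0]      # highest vote count wins

# Three layers prefer solution "A", one prefers "B" -> "A" is adopted.
print(vote_cnn_solution(["A", "A", "B", "A"]))  # -> A
```

Because every layer then runs on the same CU configuration, the FPGA is synthesized once instead of being reconfigured per layer.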

Algorithm 4: Voting from multiple layers

## 4. Results

#### 4.1. Experimental Setup

**Evaluation methodology:** DSCU was deployed on the FPGA of the Ultra96 V2, and a test program based on the Xilinx PYNQ framework ran on its ARM processor. The program performed the following steps: it loaded the data into DDR3 memory, invoked the PYNQ interface of the FPGA to map the DSCU hardware's physical address into memory, enabled DSCU and started a timer at the same moment, and stopped the timer when DSCU hardware execution ended. The elapsed time was taken as the latency of DSCU. We evaluated DSCU from two aspects, namely, inference latency and redundant computation rate, and conducted the following experiments:

- On one hand, we ran CNNs on DSCU and compared it with other accelerators for an overall evaluation, in order to verify that DSCU completes CNN inference faster.
- On the other hand, we focused on the effect of DSCU on the redundant computation problem, testing customized CNNs and customized layers with different input feature maps.

#### 4.2. Overall Performance

#### 4.3. Observed Experiments with Redundant Computation
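The redundant computation rates reported in these experiments can be reproduced under our assumed definition: the fraction of computed output pixels that fall outside the real feature map when it is covered by fixed-size CU tiles.

```python
import math

def redundancy_rate(h, w, th, tw):
    """Fraction of computed output pixels that are wasted padding
    when an h x w feature map is covered by th x tw CU tiles.
    This is our assumed definition of the paper's 'redundant
    computation rate', given here for illustration only.
    """
    computed = math.ceil(h / th) * th * math.ceil(w / tw) * tw
    return (computed - h * w) / computed

# A 20 x 10 map on a single-size 16 x 16 CU wastes most of the work:
print(f"{redundancy_rate(20, 10, 16, 16):.1%}")  # -> 60.9%
```

Shrinking the tile to 8 × 8 eliminates the overhang for this shape entirely, which is the intuition behind offering two CU sizes.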

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Qiu, H.; Ma, Y.; Li, Z.; Liu, S.; Sun, J. BorderDet: Border Feature for Dense Object Detection. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 549–564.
2. Xu, B.; Lu, C.; Guo, Y.; Wang, J. Discriminative Multi-Modality Speech Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 14421–14430.
3. Peng, H.; Li, J.; Wang, S.; Wang, L.; Gong, Q.; Yang, R.; Li, B.; Yu, P.S.; He, L. Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification. IEEE Trans. Knowl. Data Eng. **2021**, 33, 2505–2519.
4. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.B.; Guadarrama, S.; Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 675–678.
5. Chen, T.; Du, Z.; Sun, N.; Wang, J.; Wu, C.; Chen, Y.; Temam, O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the Architectural Support for Programming Languages and Operating Systems, ASPLOS’14, Salt Lake City, UT, USA, 1–5 March 2014; pp. 269–284.
6. Xu, P.; Zhang, X.; Hao, C.; Zhao, Y.; Zhang, Y.; Wang, Y.; Li, C.; Guan, Z.; Chen, D.; Lin, Y. AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs. In Proceedings of the FPGA’20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 40–50.
7. Zhang, X.; Lu, H.; Hao, C.; Li, J.; Cheng, B.; Li, Y.; Rupnow, K.; Xiong, J.; Huang, T.; Shi, H.; et al. SkyNet: A hardware-efficient method for object detection and tracking on embedded systems. In Proceedings of the Conference on Machine Learning and Systems (MLSys), Austin, TX, USA, 2–4 March 2020.
8. Peemen, M.; Setio, A.A.A.; Mesman, B.; Corporaal, H. Memory-centric accelerator design for Convolutional Neural Networks. In Proceedings of the 2013 IEEE 31st International Conference on Computer Design (ICCD), Asheville, NC, USA, 6–9 October 2013; pp. 13–19.
9. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA’15, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
10. Hao, C.; Zhang, X.; Li, Y.; Huang, S.; Xiong, J.; Rupnow, K.; Hwu, W.; Chen, D. FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge. In Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2–6 June 2019; pp. 1–6.
11. Bao, Z.; Guo, J.; Li, X.; Zhang, W. MSCU: Accelerating CNN Inference with Multiple Sizes of Compute Unit on FPGAs. In Proceedings of the 2021 IEEE 14th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), Singapore, 20–23 December 2021; pp. 106–113.
12. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv **2017**, arXiv:1704.04861.
13. Cong, J.; Liu, B.; Neuendorffer, S.; Noguera, J.; Vissers, K.; Zhang, Z. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2011**, 30, 473–491.
14. Xilinx. Vitis High-Level Synthesis User Guide; Xilinx: San Jose, CA, USA, 2020.
15. Intel. Intel SoC FPGAs; Intel: Santa Clara, CA, USA, 2020.
16. Wang, J.; Yu, S.; Yue, J.; Yuan, Z.; Yuan, Z.; Yang, H.; Li, X.; Liu, Y. High PE Utilization CNN Accelerator with Channel Fusion Supporting Pattern-Compressed Sparse Neural Networks. In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 20–24 July 2020; pp. 1–6.
17. Tu, F.; Yin, S.; Ouyang, P.; Tang, S.; Liu, L.; Wei, S. Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2017**, 25, 2220–2233.
18. Wang, D.; Xu, K.; Jia, Q.; Ghiasi, S. ABM-SpConv: A Novel Approach to FPGA-Based Acceleration of Convolutional Neural Network Inference. In Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2–6 June 2019; pp. 1–6.
19. Knapheide, J.; Stabernack, B.; Kuhnke, M. A High Throughput MobileNetV2 FPGA Implementation Based on a Flexible Architecture for Depthwise Separable Convolution. In Proceedings of the 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 31 August–4 September 2020; pp. 277–283.
20. Shen, Y.; Ferdman, M.; Milder, P. Maximizing CNN Accelerator Efficiency Through Resource Partitioning. SIGARCH Comput. Archit. News **2017**, 45, 535–547.
21. Zhao, Y.; Gao, X.; Guo, X.; Liu, J.; Wang, E.; Mullins, R.; Cheung, P.Y.K.; Constantinides, G.; Xu, C.Z. Automatic Generation of Multi-Precision Multi-Arithmetic CNN Accelerators for FPGAs. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 45–53.
22. Kwadjo, D.T.; Mbongue, J.M.; Bobda, C. Exploring a Layer-based Pre-implemented Flow for Mapping CNN on FPGA. In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Portland, OR, USA, 17–21 June 2021; pp. 116–123.
23. Bao, Z.; Zhan, K.; Zhang, W.; Guo, J. LSFQ: A Low Precision Full Integer Quantization for High-Performance FPGA-Based CNN Acceleration. In Proceedings of the 2021 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS), Tokyo, Japan, 14–16 April 2021; pp. 1–6.
24. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
25. Xu, X.; Zhang, X.; Yu, B.; Hu, X.S.; Rowen, C.; Hu, J.; Shi, Y. DAC-SDC low power object detection challenge for UAV applications. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 43, 392–403.

**Figure 1.** Comparison of actual computation and theoretical computation on Skynet [7].

**Figure 2.** Three common situations in CNN accelerators [11].

**Figure 3.** Workflow of DSCU [11].

**Figure 4.** Architecture of DSCU [11].

**Figure 5.** The process of CONV-layer computing in DSCU [11].

**Figure 6.** The execution sequence of a CONV3×3 layer [11].

| Symbol | Description |
|---|---|
| $Tw$ | The width of the feature map for a CU |
| $Th$ | The height of the feature map for a CU |
| $Tin$ | The input channels of a CU (also the channels of the feature map) |
| $Tout$ | The output channels of a CU |
| $K$ | The kernel size of the computation layer |
| $Factor_{P}$ | The degree of parallelism in the channel dimension |
| $Factor_{Mac}$ | Computations that can be performed synchronously in a CONV layer |
| $Factor_{Pool}$ | Computations that can be performed synchronously in a MAXPOOL layer |
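With these symbols, one plausible first-order estimate of a CU's CONV latency (our reconstruction for illustration; the paper's exact model in Section 3.4 may differ) is the number of multiply–accumulate operations in one CU invocation divided by the per-cycle MAC throughput:

$Cycles_{CONV} \approx \dfrac{Tin \cdot Tout \cdot Tw \cdot Th \cdot K^{2}}{Factor_{Mac}}$

A larger CU amortizes control overhead over more useful work, while a smaller CU wastes fewer cycles on edge tiles; the dynamic program in Section 3.5 trades these off.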

| Resource Utilization | DSP | BRAM | LUT | FF |
|---|---|---|---|---|
| DSCU | 88% (317/360) | 49% (106/216) | 66% (46,675/70,560) | 36% (50,154/141,120) |

**Table 3.** Performance comparison between DSCU and existing accelerators [11].

| | ICCD2013 [8] | FPGA2015 [9] | DAC2019 [10] | DSCU |
|---|---|---|---|---|
| Precision | fixed point | 32-bit float | weight: 11 bits, activation: 9 bits | weight: 8 bits, activation: 8 bits |
| Frequency | 150 MHz | 100 MHz | 215 MHz | 300 MHz |
| Platform | Virtex6 VLX240T | Virtex7 VX485T | Zynq ZU3EG | Zynq ZU3EG |
| FPGA capacity | 37,680 slices, 768 DSP | 75,900 slices, 2800 DSP | 8800 slices, 360 DSP | 8800 slices, 360 DSP |
| CNN | – | Alexnet | Skynet | Ultranet [23] |
| Model size | 2.74 GMAC | 1.33 GFLOP | 0.46 GMAC | 0.20 GMAC |
| Performance | 17.0 GOPs | 61.62 GOPs | 23.15 GOPs | 29.59 GOPs |
| Performance density | $4.5\times {10}^{-4}$ GOPs/slice | $8.12\times {10}^{-4}$ GOPs/slice | $2.63\times {10}^{-3}$ GOPs/slice | $3.36\times {10}^{-3}$ GOPs/slice |

**Table 4.** The CU configurations for a single-CU accelerator and DSCU [11].

| Accelerator Configuration (Tin, Tout, Tw, Th) | Single CU | DSCU (CU1) | DSCU (CU2) |
|---|---|---|---|
| MNIST-Lenet | 16 × 16 × 16 × 16 | 16 × 16 × 8 × 8 | 16 × 16 × 8 × 8 |
| DJI-UAV-Skynet | 16 × 16 × 40 × 40 | 16 × 16 × 20 × 20 | 16 × 16 × 20 × 20 |
| DJI-UAV-Ultranet | 16 × 16 × 40 × 40 | 16 × 16 × 20 × 20 | 16 × 16 × 20 × 20 |

| Accelerator | Ultranet (Single CU) | Ultranet (DSCU) | Skynet (Single CU) | Skynet (DSCU) | Lenet (Single CU) | Lenet (DSCU) |
|---|---|---|---|---|---|---|
| Latency (ms) | 456 | 291 | 972 | 664 | 3.54 | 2.8 |
| Speedup | ×1 | ×1.57 | ×1 | ×1.46 | ×1 | ×1.26 |
| Redundant computation rate | 70.5% | 40.5% | 45.3% | 20.0% | 8.3% | 8.3% |
| *Resource usage* | | | | | | |
| FFs | 38,891 | 50,154 | 41,560 | 53,321 | 23,387 | 29,873 |
| LUTs | 35,296 | 46,675 | 36,102 | 49,821 | 19,782 | 28,165 |
| DSPs | 231 | 317 | 261 | 359 | 99 | 201 |
| BRAMs | 92 | 106 | 85 | 127 | 35 | 52 |

| No. | Input Size | Layer Configuration | Output Size |
|---|---|---|---|
| 1 | 32 × 20 × 10 | CONV3×3(32,64) ↓ Relu(64,64) ↓ MaxPooling(64,64) | 64 × 10 × 5 |
| 2 | 32 × 104 × 104 | CONV3×3(32,64) ↓ Relu(64,64) ↓ MaxPooling(64,64) | 64 × 52 × 52 |
| 3 | 32 × 208 × 208 | CONV3×3(32,64) ↓ Relu(64,64) ↓ MaxPooling(64,64) | 64 × 104 × 104 |
| 4 | 32 × 416 × 416 | CONV3×3(32,64) ↓ Relu(64,64) ↓ MaxPooling(64,64) | 64 × 208 × 208 |
| 5 | 32 × 160 × 80 | CONV3×3(32,64) ↓ Relu(64,64) ↓ MaxPooling(64,64) | 64 × 80 × 40 |
| 6 | 32 × 320 × 160 | CONV3×3(32,64) ↓ Relu(64,64) ↓ MaxPooling(64,64) | 64 × 160 × 80 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bao, Z.; Guo, J.; Zhang, W.; Dang, H.
DSCU: Accelerating CNN Inference in FPGAs with Dual Sizes of Compute Unit. *J. Low Power Electron. Appl.* **2022**, *12*, 11.
https://doi.org/10.3390/jlpea12010011
