# RISC-V Virtual Platform-Based Convolutional Neural Network Accelerator Implemented in SystemC

## Abstract

## 1. Introduction

## 2. Related Work

## 3. RISC-V VP Based CNN DLA

#### 3.1. SystemC-Based RISC-V VP

#### 3.2. CNN DLA Overview

#### 3.3. GFSR Register Set in DLA

#### 3.4. Data Loader Module and Buffer

#### 3.5. CPIPE and APIPE Module

#### 3.6. DNN Applications on the RISC-V DLA System

#### 3.7. Extension Issues

## 4. Verification and Analysis with Experiments

#### 4.1. Darknet Running on DLA with RISC-V VP

#### 4.2. Buffer Effects in DLA System

#### 4.3. Architecture Parallelism

#### 4.4. Quantization Effect

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

CNN | Convolutional Neural Network |

ESL | Electronic System Level |

DLA | Deep Learning Accelerator |

DNN | Deep Neural Network |

AI | Artificial Intelligence |

IoT | Internet of Things |

ISA | Instruction Set Architecture |

VP | Virtual Platform |

RTL | Register Transfer Level |

VHDL | VHSIC Hardware Description Language |

RNN | Recurrent Neural Network |

FPGA | Field-Programmable Gate Array |

GP | General-Purpose Processor |

YOLO | You Only Look Once |

TLM | Transaction-Level Modeling |

GFSR | Global Function Set Register |

CPIPE | Convolution PIPE |

APIPE | Activation-Pooling PIPE |

PE | Processing Element |

## References

**Figure 3.** The memory-mapped DLA registers and their access from a software application on the RISC-V VP platform.

**Figure 6.** Experimental results from the RISC-V VP DLA system and from the original Darknet library for layers 0, 1, and 13 of the Darknet Yolo tiny v3 neural network model: convolution of a 416 × 416 image with 16 filters (**a**, **b**), max pooling of 16 images of 416 × 416 (**c**, **d**), and convolution of a 13 × 13 image with 1024 filters (**e**, **f**).

**Figure 7.** Pixel-by-pixel comparison between the RISC-V VP DLA system and the original Darknet model for layers 0, 1, and 13 of the Darknet Yolo tiny v3 neural network model: convolution of a 416 × 416 image with 16 filters (**a**), max pooling of 16 images of 416 × 416 (**b**), and convolution of a 13 × 13 image with 1024 filters (**c**). Blue points represent pixels from the original Darknet, and orange points represent pixels from the RISC-V VP DLA system.

**Figure 8.** Access frequency of each buffer in the developed DLA as the buffer size changes from 1 MB to 8 MB, for each CNN configuration.

**Figure 9.** Amount of data transferred between memory and each buffer as the buffer size changes from 1 MB to 8 MB, for each CNN configuration.

**Figure 10.** Simulated execution time of each DLA submodule for CNN runs with 256, 512, and 1024 inputs and 32, 128, and 245 outputs, using 13 × 13 images and 1 × 1 filters.

**Figure 11.** Simulated execution time of each submodule for the convolution and max pooling operations of three NN workloads at each precision level, with 1 MB and 8 MB buffers: (**a**) input3/output16, (**b**) input9/output8, (**c**) input15/output4.

**Figure 12.** Ratio of the actual data transferred to the amount of data required for the NN operations at each precision level, with 1 MB and 8 MB buffers: (**a**) input3/output (4, 8, 16) and (**b**) input9/output (4, 8, 16).

Approach | Works | Features |
---|---|---|
RTL with FPGA | Y. Chen [13]; D. Shin [21]; Flex [22]; T. Fujii [23] | coarse-grain reconfigurable; both CNN and RNN; reconfigurable IP; dynamic reconfigurable |
Coprocessor with GP or RISC-V | V. Gokhale [25]; N. Wu [30]; Z. Li [31]; R. Porter [32]; G. Zhang [33] | interface with general processor and external memory; core-interconnected accelerator; support custom instruction for convolution; extension for in-pipeline hardware; customized for Yolo |
Framework | Timeloop [26]; ScaleSim [27]; MAESTRO [28]; LAMBDA [29] | inference performance and energy; cycle-accurate and analytical model for scaling; various data form for generating statistics; support modeling for communication and memory sub-system |
ESL with SystemC | Y. Lee [38]; S. Kim [39]; S. Lim (This Paper) | raised abstraction level; cycle-accurate-based; support RISC-V VP and software interface |

Module | Component | Delay (ns) |
---|---|---|
Memory | Delay per byte | 5 |
Loader Module | Router | 11 |
Loader Module | Data | 43 |
Loader Module | Requester | 10 |
CPIPE | con2CPIPE | 555 |
CPIPE | 4 PEs | 1114 |
CPIPE | CPIPEDone | 67 |
APIPE | con2APIPE | 208 |
APIPE | 4 PEs | 399 |
APIPE | APIPEDone | 100 |
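As a rough reading of the delay table above, the per-module figures can be summed to compare the two pipelines. The sketch below assumes that the three stages listed for each module run strictly back to back for one operation and that the memory delay scales linearly with the byte count; the table itself states neither. The dictionary layout and the function names are illustrative and do not come from the DLA model.

```python
# Delay annotations in nanoseconds, copied from the table above.
DELAYS_NS = {
    "loader": {"router": 11, "data": 43, "requester": 10},
    "cpipe": {"con2CPIPE": 555, "4 PEs": 1114, "CPIPEDone": 67},
    "apipe": {"con2APIPE": 208, "4 PEs": 399, "APIPEDone": 100},
}
MEMORY_DELAY_NS_PER_BYTE = 5


def module_total_ns(module: str) -> int:
    """Sum the stage delays of one module (assumption: sequential stages)."""
    return sum(DELAYS_NS[module].values())


def memory_transfer_ns(num_bytes: int) -> int:
    """Memory delay for a transfer (assumption: linear in the byte count)."""
    return num_bytes * MEMORY_DELAY_NS_PER_BYTE


if __name__ == "__main__":
    for name in DELAYS_NS:
        print(f"{name}: {module_total_ns(name)} ns")
    print(f"1 KiB transfer: {memory_transfer_ns(1024)} ns")
```

Under these assumptions the CPIPE stages total 1736 ns against 707 ns for APIPE and 64 ns for the loader, so a single CPIPE pass costs roughly 2.5 times a single APIPE pass.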

Wall-clock simulation time for various NN workloads:

(# of inputs, # of outputs) | (3, 4) | (3, 8) | (3, 16) | (9, 4) | (9, 8) | (9, 16) | (15, 4) | (15, 8) | (15, 16) |
---|---|---|---|---|---|---|---|---|---|
Wall-clock time (s) | 7.62 | 14.08 | 27.03 | 18.22 | 34.7 | 65.98 | 29.2 | 55.6 | 107.91 |
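The wall-clock figures above can be checked for how simulation time scales with workload size. The following sketch only re-derives ratios from the tabulated values; the helper names are illustrative, and normalizing by the product of inputs and outputs is an assumption about what drives the cost, not a claim made in the table.

```python
# Wall-clock simulation times in seconds, keyed by (inputs, outputs),
# copied from the table above.
WALL_CLOCK_S = {
    (3, 4): 7.62, (3, 8): 14.08, (3, 16): 27.03,
    (9, 4): 18.22, (9, 8): 34.7, (9, 16): 65.98,
    (15, 4): 29.2, (15, 8): 55.6, (15, 16): 107.91,
}


def seconds_per_pair(inputs: int, outputs: int) -> float:
    """Wall-clock time divided by the number of (input, output) pairs."""
    return WALL_CLOCK_S[(inputs, outputs)] / (inputs * outputs)


def output_doubling_ratio(inputs: int, outputs: int) -> float:
    """Slowdown observed when the output count doubles at fixed inputs."""
    return WALL_CLOCK_S[(inputs, 2 * outputs)] / WALL_CLOCK_S[(inputs, outputs)]


if __name__ == "__main__":
    for (n_in, n_out) in sorted(WALL_CLOCK_S):
        print(f"({n_in}, {n_out}): {seconds_per_pair(n_in, n_out):.3f} s per pair")
    for n_in in (3, 9, 15):
        for n_out in (4, 8):
            ratio = output_doubling_ratio(n_in, n_out)
            print(f"inputs={n_in}, outputs {n_out}->{2 * n_out}: x{ratio:.2f}")
```

For the tabulated workloads, doubling the output count multiplies the wall-clock time by about 1.85 to 1.94, and the cost per input/output pair stays between roughly 0.45 s and 0.64 s, which is consistent with near-linear scaling plus a fixed overhead that weighs more on the smallest workloads.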

