# Model Parallelism Optimization for CNN FPGA Accelerator


## Abstract


## 1. Introduction

- (1) We decoupled the CNN network structure using group convolution and a new channel shuffle process to replace the original convolution and channel shuffle techniques. This loosened the connections between feature maps and yielded high efficiency and low memory usage on each device.
- (2) We designed a parallel FPGA accelerator for the classic CNN model ShuffleNet using model parallelism, and optimized it with several parallel strategies in the heterogeneous parallel programming framework OpenCL. The accelerator can leverage multiple devices to speed up inference and relieve resource constraints on individual devices.
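As a minimal NumPy sketch of the group-convolution idea behind contribution (1) (names and shapes are illustrative, not the paper's implementation): each group of output channels reads only its own slice of input channels, which is exactly the loosening of inter-feature-map connections that lets the feature maps be split across devices.

```python
import numpy as np

def group_conv1x1(x, w, groups):
    """Pointwise (1x1) group convolution sketch.

    x: (in_channels, h, w) input feature maps
    w: (out_channels, in_channels // groups) 1x1 kernels
    With groups > 1, each group of output channels depends only on
    its own slice of input channels, so groups can run independently.
    """
    in_c, h, wd = x.shape
    out_c = w.shape[0]
    ig, og = in_c // groups, out_c // groups
    y = np.zeros((out_c, h, wd))
    for g in range(groups):
        xs = x[g * ig:(g + 1) * ig]        # this group's input channels
        ws = w[g * og:(g + 1) * og]        # this group's kernels
        # Weighted sum over the group's input channels at every pixel.
        y[g * og:(g + 1) * og] = np.tensordot(ws, xs, axes=([1], [0]))
    return y
```

With `groups = 1` this reduces to an ordinary pointwise convolution; larger `groups` cut both the per-group compute and the cross-channel data each device must hold.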

## 2. Background and Related Works

#### 2.1. ShuffleNet

#### 2.2. OpenCL for FPGA

#### 2.3. Related Work

## 3. Approach

#### 3.1. Convolution

#### 3.2. Inside Shuffle

**Algorithm 1.** Inside-Shuffle.

Input:
- data: the output from the upper layer
- I_input: stores the portion of data needed by each device, I_input ⊂ data
- N: a parameter for the internal shuffle of the data

Output:
- result: the reordered data provided to the next layer

1. Create local variable I_input
2. Read data from Channel into I_input
3. for i = 0, 1, …, Channel/(N × Device_Number) do
4. for j = 0, 1, …, N do
5. for k = 0, 1, …, input_size do
6. Convert the one-dimensional array I_input into a three-dimensional array (i, j, k)
7. Transpose(i, j) — Transpose(·) exchanges the i and j data
8. Store result to the channel in the original order
9. end for
10. end for
11. end for
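The loop nest above amounts to viewing each device's flat buffer as a three-dimensional array and swapping its first two axes. A hypothetical standalone NumPy sketch (names mirror Algorithm 1; vectorized reshape/transpose replaces the explicit loops):

```python
import numpy as np

def inside_shuffle(flat, channels, n, device_number, input_size):
    """Reorder one device's flat buffer as in Algorithm 1.

    flat holds channels / device_number feature-map channels of
    input_size values each. Steps 3-8 correspond to viewing the
    buffer as (i, j, k) = (groups, N, input_size) and exchanging
    the i and j axes.
    """
    groups = channels // (n * device_number)
    x = flat.reshape(groups, n, input_size)   # step 6: 1-D -> 3-D view
    x = x.transpose(1, 0, 2)                  # step 7: Transpose(i, j)
    return x.reshape(-1)                      # step 8: back to flat order
```

Because only a reshape and an axis swap are involved, the shuffle touches each element once and needs no data from the other devices' buffers.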

#### 3.3. Other Layers

## 4. Optimizations

#### 4.1. Pipelined Computation

#### 4.2. Channel Communication

#### 4.3. Aggregate Read

#### 4.4. Kernel Vectorization

#### 4.5. Overall Architecture

## 5. Experiment

#### 5.1. Experimental Setup

The experiments used an Intel® Xeon® Gold 6128 CPU as the OpenCL host and DMA for data transmission between devices. The board support package (BSP) was a layer between the motherboard hardware and the driver in the operating system. It provided a functional package that let the upper-layer drivers access the hardware device registers and run efficiently on the hardware motherboard.

#### 5.2. Accuracy

#### 5.3. Performance

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| CPU | Central processing unit |
| FPGA | Field-programmable gate array |
| GPU | Graphics processing unit |
| DDR2 | Double data rate 2 |
| ASIC | Application-specific integrated circuit |
| HLS | High-level synthesis |
| DRAM | Dynamic random-access memory |
| MFLOPs | Million floating-point operations per second |
| FC | Fully connected layer |
| CL | Convolutional layer |
| HDL | Hardware description language |
| VHDL | Very-high-speed integrated circuit hardware description language |
| SDK | Software development kit |
| DMA | Direct memory access |


ShuffleNet architecture (the G columns give the output channels for G groups):

| Layer | Output Size | KSize | Stride | Repeat | G = 1 | G = 2 | G = 3 | G = 4 | G = 8 |
|---|---|---|---|---|---|---|---|---|---|
| Image | 224 × 224 | | | | 3 | 3 | 3 | 3 | 3 |
| CL | 112 × 112 | 3 × 3 | 2 | 1 | 24 | 24 | 24 | 24 | 24 |
| MaxPool | 56 × 56 | 3 × 3 | 2 | | | | | | |
| Stage2 | 28 × 28 | | 2 | 1 | 144 | 200 | 240 | 272 | 384 |
| | 28 × 28 | | 1 | 3 | 144 | 200 | 240 | 272 | 384 |
| Stage3 | 14 × 14 | | 2 | 1 | 288 | 400 | 480 | 544 | 768 |
| | 14 × 14 | | 1 | 7 | 288 | 400 | 480 | 544 | 768 |
| Stage4 | 7 × 7 | | 2 | 1 | 576 | 800 | 960 | 1088 | 1536 |
| | 7 × 7 | | 1 | 3 | 576 | 800 | 960 | 1088 | 1536 |
| GlobalPool | 1 × 1 | 7 × 7 | | | | | | | |
| FC | | | | | 1000 | 1000 | 1000 | 1000 | 1000 |

| CNN Model | Channel Shuffle | Group | Top-1 Accuracy (MNIST) | Top-1 Accuracy (CIFAR-10) |
|---|---|---|---|---|
| ShuffleNet | Original Shuffle | 2 | 98.90% | 84.44% |
| ShuffleNet | Inside Shuffle | 2 | 99.00% | 84.75% |
| ShuffleNet | Original Shuffle | 3 | 98.70% | 83.52% |
| ShuffleNet | Inside Shuffle | 3 | 98.83% | 84.11% |

| Device | Version | Channel Opt. | Other Opts. | Number | Time (ms) | Power (W) |
|---|---|---|---|---|---|---|
| Inspur F10A | ShuffleNet | X | X | 1 | 1045.198 | 23.5 |
| Inspur F10A | ShuffleNet | X | √ | 1 | 329.869 | 23.4 |
| Inspur F10A | ShuffleNet | X | √ | 2 | 228.358 | 20.5 |
| Inspur F10A | I-ShuffleNet | √ | √ | 1 | 186.635 | 23.5 |
| Inspur F10A | I-ShuffleNet | √ | √ | 2 | 130.521 | 21.1 |
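As a quick sanity check on the timing numbers above, the end-to-end speedups over the unoptimized single-device baseline work out as follows (a simple illustrative calculation, not part of the paper):

```python
# Times (ms) taken from the performance table above.
baseline_ms = 1045.198  # ShuffleNet, no optimizations, 1 device

optimized_ms = {
    "ShuffleNet + other opts, 1 device": 329.869,
    "ShuffleNet + other opts, 2 devices": 228.358,
    "I-ShuffleNet, all opts, 1 device": 186.635,
    "I-ShuffleNet, all opts, 2 devices": 130.521,
}

# Speedup = baseline time / optimized time.
speedups = {name: baseline_ms / t for name, t in optimized_ms.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

The fully optimized two-device I-ShuffleNet configuration comes out at roughly 8x over the unoptimized single-device baseline.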

| Number of FPGA Devices | ALUTs | FFs | RAMs | DSPs |
|---|---|---|---|---|
| 1 | 139,651 (16%) | 224,468 (13%) | 1707 (63%) | 137 (9%) |
| 2 | 148,715 (17%) | 249,448 (15%) | 1265 (47%) | 143 (9%) |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite


Wang, J.; Tong, W.; Zhi, X.
Model Parallelism Optimization for CNN FPGA Accelerator. *Algorithms* **2023**, *16*, 110.
https://doi.org/10.3390/a16020110
