# Low-Power FPGA Implementation of Convolution Neural Network Accelerator for Pulse Waveform Classification


## Abstract


## 1. Introduction

## 2. System Design

#### 2.1. System Design Flow

#### 2.2. Data Collection and Preprocessing

#### 2.3. Algorithm Design: Structure of CNN

#### 2.4. Hardware System Design

**Algorithm 1.** Algorithm of the CONV layer.

```
for j in range(n):                  # loop 1: over output channels
    for i in range(m):              # loop 1.1: over input channels
        load input[i][:]
        load weights[j][i][3]
        conv_out[j][i][:] = CONV(input[i][:], weights[j][i])
        store conv_out[j][i][:]
for j in range(n):                  # loop 2
    for k in range(l):              # loop 2.1
        load conv_out[j][:][k]
        accu_out[j][k] = ACCU(conv_out[j][:][k])
        accu_out[j][k] += bias[j]
for j in range(n):                  # loop 3
    for k in range(l):              # loop 3.1: ReLU
        relu_out[j][k] = accu_out[j][k] > 0 ? accu_out[j][k] : 0
    for k in range(l/2):            # loop 3.2: 2:1 max pooling
        maxp_out[j][k] = max(relu_out[j][2*k], relu_out[j][2*k+1])
```
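To make the three loop phases of Algorithm 1 concrete, here is a minimal runnable Python model of the same dataflow: per-channel 1-D convolution, cross-channel accumulation with bias, then ReLU and 2:1 max pooling. The sizes (m, n, l), the 3-tap kernel, and zero padding are assumptions taken from the pseudocode, not the paper's actual layer dimensions.

```python
def conv1d_same(x, w):
    """1-D convolution with a 3-tap kernel and zero padding
    (output length equals input length)."""
    pad = [0.0] + list(x) + [0.0]
    return [sum(pad[k + t] * w[t] for t in range(3)) for k in range(len(x))]

def conv_layer(inputs, weights, bias):
    m = len(inputs)      # number of input channels
    n = len(weights)     # number of output channels
    l = len(inputs[0])   # samples per channel (assumed even for pooling)

    # Loop 1: one convolution per (output channel, input channel) pair.
    conv_out = [[conv1d_same(inputs[i], weights[j][i]) for i in range(m)]
                for j in range(n)]

    # Loop 2: accumulate across input channels, then add the bias.
    accu_out = [[sum(conv_out[j][i][k] for i in range(m)) + bias[j]
                 for k in range(l)] for j in range(n)]

    # Loop 3: ReLU followed by 2:1 max pooling.
    relu_out = [[v if v > 0 else 0.0 for v in row] for row in accu_out]
    return [[max(row[2 * k], row[2 * k + 1]) for k in range(l // 2)]
            for row in relu_out]

# Tiny example: 1 input channel, 1 output channel, 4 samples, and an
# identity kernel [0, 1, 0], so the convolution passes the input through.
out = conv_layer([[1.0, 2.0, 3.0, 4.0]], [[[0.0, 1.0, 0.0]]], [0.0])
# → [[2.0, 4.0]] after 2:1 max pooling
```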

#### 2.4.1. System Architecture Design

#### 2.4.2. Computation Modules Design

#### 2.4.3. Control Modules Design

## 3. Optimization Methods and Results

#### 3.1. Network Model Design and Parameter Reduction

- Downsample the data. Downsampling reduces the amount of computation significantly while classification accuracy remains high.
- Use more CONV layers to reduce the FC layer’s weights. The more CONV layers there are, the shorter the FC layer’s input tensor and the fewer its parameters.
- Modify the CONV layers’ structure to reduce their parameters. Reducing the ratio of output channels to input channels effectively reduces the parameter count of the convolution layers.
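The channel-ratio effect can be illustrated with a back-of-the-envelope parameter count for the layer configurations in Table 1. The 3-tap kernel width is assumed from Algorithm 1; the standard parameter formulas (weights plus biases) for 1-D CONV and FC layers are used.

```python
def conv_params(c_in, c_out, k=3):
    # A 1-D CONV layer holds c_out * c_in * k weights plus c_out biases.
    return c_out * c_in * k + c_out

def fc_params(f_in, f_out):
    # An FC layer holds f_out * f_in weights plus f_out biases.
    return f_out * f_in + f_out

# CONV layers 1-5 of configurations No. 1 and No. 6 from Table 1:
cfg1 = [(1, 16), (16, 32), (32, 32), (32, 32), (32, 32)]
cfg6 = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]

p1 = sum(conv_params(a, b) for a, b in cfg1)  # 10,944 parameters
p6 = sum(conv_params(a, b) for a, b in cfg6)  # 2,108 parameters
# Lowering the output-to-input channel ratio in the early layers cuts the
# CONV parameter count by roughly a factor of five in this comparison.
```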

#### 3.2. Hardware System Optimization

#### 3.2.1. Memory Access Optimization Method 1: Continuous Read Mode

#### 3.2.2. Memory Access Optimization Method 2: Task Pipelining

#### 3.2.3. Memory Access Optimization Method 3: Use BRAM

#### 3.2.4. Memory Access Optimization Results

#### 3.3. Comparison with Related Studies

## 4. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Wang, N.; Yu, Y.; Huang, D.; Xu, B.; Liu, J.; Li, T.; Xue, L.; Shan, Z.; Chen, Y.; Wang, J. Pulse diagnosis signals analysis of fatty liver disease and cirrhosis patients by using machine learning. Sci. World J. **2015**, 2015.
2. Charbonnier, S.; Galichet, S.; Mauris, G.; Siché, J.P. Statistical and fuzzy models of ambulatory systolic blood pressure for hypertension diagnosis. IEEE Trans. Instrum. Meas. **2000**, 49, 998–1003.
3. He, D.; Wang, L.; Fan, X.; Yao, Y.; Geng, N.; Sun, Y.; Xu, L.; Qian, W. A new mathematical model of wrist pulse waveforms characterizes patients with cardiovascular disease—A pilot study. Med. Eng. Phys. **2017**, 48, 142–149.
4. Gomes Ribeiro Moura, N.; Sá Ferreira, A. Pulse waveform analysis of Chinese pulse images and its association with disability in hypertension. J. Acupunct. Meridian Stud. **2016**, 9, 93–98.
5. Zhang, Z.; Zhang, Y.; Yao, L.; Song, H.; Kos, A. A sensor-based wrist pulse signal processing and lung cancer recognition. J. Biomed. Inform. **2018**, 79, 107–116.
6. Fei, Z. Contemporary Sphygmology in Traditional Chinese Medicine; People’s Medical Publishing House: Beijing, China, 2003.
7. Hu, X.; Zhu, H.; Xu, J.; Xu, D.; Dong, J. Wrist pulse signals analysis based on deep convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2014), Honolulu, HI, USA, 21–24 May 2014.
8. Wang, Y.-Y.L.; Hsu, T.-L.; Jan, M.-Y.; Wang, W.-K. Theory and applications of the harmonic analysis of arterial pressure pulse wave. J. Med. Biol. Eng. **2010**, 30, 125–131.
9. Lu, G.; Jiang, Z.; Ye, L.; Huang, Y. Pulse feature extraction based on improved Gaussian model. In Proceedings of the 2014 International Conference on Medical Biometrics (ICMB 2014), Shenzhen, China, 30 May–1 June 2014; pp. 90–94.
10. Tang, A.C.Y.; Chung, J.W.Y.; Wong, T.K.S. Digitalizing traditional Chinese medicine pulse diagnosis with artificial neural network. Telemed. e-Health **2012**, 18, 446–453.
11. Xu, L.S.; Meng, M.Q.H.; Wang, K.Q. Pulse image recognition using fuzzy neural network. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France, 22–26 August 2007; Volume 36, pp. 3148–3151.
12. Chen, Y.; Zhang, L.; Zhang, D.; Zhang, D. Wrist pulse signal diagnosis using modified Gaussian models and fuzzy C-means classification. Med. Eng. Phys. **2009**, 31, 1283–1289.
13. Shu, J.J.; Sun, Y. Developing classification indices for Chinese pulse diagnosis. Complement. Ther. Med. **2007**, 15, 190–198.
14. Liu, Y.H.; Yang, Q.H.; Shi, H.F. Pulse feature analysis and extraction based on pulse mechanism analysis. In Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering (CSIE 2009), Los Angeles, CA, USA, 31 March–2 April 2009; Volume 7, pp. 53–56.
15. Hudoba, G. Vascular health diagnosis by pulse wave analysis. In Proceedings of SAMI 2010—8th International Symposium on Applied Machine Intelligence and Informatics, Herlany, Slovakia, 28–30 January 2010; pp. 89–91.
16. Sareen, M.; Abhinav, A.; Prakash, P.; Anand, S. Wavelet decomposition and feature extraction from pulse signals of the radial artery. In Proceedings of the 2008 International Conference on Advanced Computer Theory and Engineering, Phuket, Thailand, 20–22 December 2008; pp. 551–555.
17. Zhang, P.Y.; Wang, H.Y. A framework for automatic time-domain characteristic parameters extraction of human pulse signals. EURASIP J. Adv. Signal Process. **2008**, 2008.
18. Joshi, A.; Chandran, S.; Jayaraman, V.K.; Kulkarni, B.D. Arterial pulse system modern methods for traditional Indian. In Proceedings of the 2007 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France, 22–26 August 2007; pp. 608–611.
19. Li, J.; Cao, Y.; Liu, Q.; Jiao, Q. Determination of urinary L-citrulline by enzymatic method. Chin. J. Anal. Chem. **2006**, 34, 379–381.
20. Wang, K.; Wang, L.; Wang, D.; Xu, L. SVM classification for discriminating cardiovascular disease patients from non-cardiovascular disease controls using pulse waveform variability analysis. Lect. Notes Comput. Sci. **2005**, 109–119.
21. Wang, H.; Cheng, Y. A quantitative system for pulse diagnosis in traditional Chinese medicine. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Shanghai, China, 17–18 January 2006; Volume 7, pp. 5676–5679.
22. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21 February 2016; pp. 26–35.
23. Ma, Y.; Suda, N.; Cao, Y.; Seo, J.S.; Vrudhula, S. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA. In Proceedings of FPL 2016—26th International Conference on Field-Programmable Logic and Applications, Lausanne, Switzerland, 29 August–2 September 2016.
24. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of FPGA 2017—The 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54.
25. Zhang, C. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of FPGA 2015—The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
26. Li, S.; Sun, K.; Luo, Y.; Yadav, N.; Choi, K. Novel CNN-based AP2D-Net accelerator: An area and power efficient solution for real-time applications on mobile FPGA. Electronics **2020**, 9, 832.
27. Gong, L.; Wang, C.; Li, X.; Chen, H.; Zhou, X. MALOC: A fully pipelined FPGA accelerator for convolutional neural networks with all layers mapped on chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2018**, 37, 2601–2612.
28. Zhang, C.; Wu, D.; Sun, J.; Sun, G.; Luo, G.; Cong, J. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design (ISLPED 2016), San Francisco, CA, USA, 8–10 August 2016; pp. 326–331.
29. Di Cecco, R.; Lacey, G.; Vasiljevic, J.; Chow, P.; Taylor, G.; Areibi, S. Caffeinated FPGAs: FPGA framework for convolutional neural networks. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT 2016), Xi’an, China, 7–9 December 2016; pp. 265–268.
30. Guo, K.; Sui, L.; Qiu, J.; Yu, J.; Wang, J.; Yao, S.; Han, S.; Wang, Y.; Yang, H. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2018**, 37, 35–47.
31. Geng, T.; Wang, T.; Sanaullah, A.; Yang, C.; Patel, R.; Herbordt, M. A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 27–31 August 2018; pp. 394–398.
32. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv **2016**, arXiv:1602.02830.
33. Chen, C.; Li, Z.; Zhang, Y.; Zhang, S.; Hou, J.; Zhang, H. A 3D wrist pulse signal acquisition system for width information of pulse wave. Sensors **2020**, 20, 11.

**Figure 1.** Four typical types of pulse waveforms. (**a**) Ping pulse; (**b**) Xian pulse; (**c**) Hua pulse; (**d**) Se pulse.

**Figure 2.** The system design flow includes data collection and preprocessing, algorithm design, and hardware design and optimization.

**Figure 3.** Data collection and preprocessing. (**a**) roughly select a segment; (**b**) remove high-frequency noise; (**c**) remove baseline wander; (**d**) select an appropriate area as a piece of data; (**e**) normalize, label, and store the data.

**Figure 4.** CNN model and convolution (CONV) layer. (**a**) a CNN model including 2 CONV layers and 2 fully connected (FC) layers; (**b**) the 3 operation types of the CONV layer; (**c**) the complete calculation procedure of the CONV layer.

**Figure 5.** The CNN accelerator system architecture includes 3 types of computation module, the system controller (SYS_CTRL), the memory controller (MEM_CTRL), and the Double-Data-Rate SDRAM controller (DDR_CTRL). The red arrows indicate the direction of data flow.

**Figure 6.** Convolution kernel (CK) design. (**a**) CK’s timing diagram. (**b**) CK’s register-transfer-level (RTL) circuit design.

**Figure 7.** Control modules design. (**a**) State transition diagram of the system controller (SYS_CTRL) module. (**b**) State transition diagram of the DDR controller (DDR_CTRL) module.

**Figure 8.** CNN network model design and parameter reduction. (**a**) The effect of the number of CONV layers on the CNN model. (**b**) The effect of the ratio of output channels to input channels. (**c**) The effect of the number of sample points in the input tensor. (**d**) The CNN model with relatively high accuracy, a small required buffer size, and few parameters. (**e**) The loss and accuracy curves during training. (**f**) Visualization of FC1’s output tensor.

**Figure 9.** Memory access module design and optimization. (**a**) Write the CONV output to DDR. (**b**) Read the CONV output from DDR. (**c**) Usual solution to discrete read addresses. (**d**) Timing diagram of the usual solution. (**e**) Optimized solution to discrete read addresses. (**f**) Timing diagram of the optimized solution.

**Figure 11.** Optimized system architecture. Two blocks of BRAM forming a ping-pong structure simplify the memory access process and reduce unnecessary latency.

| No. | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layers 6–9 |
|---|---|---|---|---|---|---|
| 1 | (1–16) ^{1} | (16–32) | (32–32) | (32–32) | (32–32) | (32–32), (32–32), (128–100) ^{2}, (100–4) |
| 2 | (1–8) | (8–32) | (32–32) | (32–32) | (32–32) | |
| 3 | (1–8) | (8–16) | (16–32) | (32–32) | (32–32) | |
| 4 | (1–4) | (4–16) | (16–32) | (32–32) | (32–32) | |
| 5 | (1–4) | (4–8) | (8–16) | (16–32) | (32–32) | |
| 6 | (1–2) | (2–4) | (4–8) | (8–16) | (16–32) | |

^{1} (a–b): a and b are the numbers of input and output channels.

^{2} (a–b): a and b are the numbers of input and output features.

| Layer | Solution 1/Cycles | Solution 2/Cycles | Solution 3/Cycles | Solution 4/Cycles |
|---|---|---|---|---|
| CONV1 | 33,908 | 10,320 | 4027 | 3411 |
| CONV2 | 77,650 | 14,309 | 6069 | 4710 |
| CONV3 | 136,134 | 21,171 | 9884 | 6648 |
| CONV4 | 267,820 | 35,892 | 19,037 | 11,292 |
| CONV5 | 535,148 | 67,012 | 37,794 | 20,868 |
| CONV6 | 533,049 | 60,047 | 30,894 | 20,484 |
| CONV7 | 298,832 | 30,694 | 14,344 | 10,812 |
| FC1 | 76,335 | 9981 | 4614 | 4254 |
| FC2 | 4184 | 530 | 303 | 238 |
| Total | 1,963,060 | 249,956 | 126,966 | 82,717 |

| Component (Total) | Solution 1 | Solution 2 | Solution 3 | Solution 4-1 | Solution 4-2 |
|---|---|---|---|---|---|
| Clock (MHz) | 100 | 100 | 100 | 100 | 170 |
| BRAMs (36 Kb) | 26.30% | 35.93% | 68.52% | 39.63% | 39.63% |
| DSPs | 61.25% | 61.25% | 61.25% | 61.25% | 61.25% |
| LUT (63,400) | 37.63% | 37.80% | 39.92% | 29.76% | 29.83% |
| LUTRAM (19,000) | 12.35% | 12.36% | 12.39% | 7.42% | 7.43% |
| Flip-flop (F/F) (126,800) | 28.30% | 28.66% | 28.25% | 23.72% | 23.75% |
| Latency (ms) | 19.631 | 2.499 | 1.270 | 0.827 | 0.487 |
| Power (W) | 1.630 | 1.638 | 1.645 | 0.714 | 1.089 |
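The latency rows follow directly from the total cycle counts in the previous table and the clock frequency, i.e., latency = cycles / f_clk. A quick check in Python (cycle totals copied from the table above):

```python
def latency_ms(cycles, clock_mhz):
    # Convert a cycle count at a given clock frequency to milliseconds.
    return cycles / (clock_mhz * 1e6) * 1e3

# Total cycles per solution, from the cycle-count table:
totals = {1: 1_963_060, 2: 249_956, 3: 126_966, 4: 82_717}

lat1 = latency_ms(totals[1], 100)   # Solution 1 at 100 MHz ≈ 19.631 ms
lat41 = latency_ms(totals[4], 100)  # Solution 4-1 at 100 MHz ≈ 0.827 ms
lat42 = latency_ms(totals[4], 170)  # Solution 4-2 at 170 MHz ≈ 0.487 ms
```

Raising the clock from 100 MHz to 170 MHz accounts for the entire latency gap between Solutions 4-1 and 4-2, at the cost of the higher power shown in the table.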

| | [27] | [30] | [31] | [26] | Our Work |
|---|---|---|---|---|---|
| CNN Model | AlexNet | VGG16 | VGG16 | AP2D-Net | Self-designed |
| Platform | Virtex-7 VX690T | Zynq XC7Z020 | Virtex-7 VX690T | Ultra96 | Artix XC7A100T |
| Clock (MHz) | 150 | 214 | 150 | 300 | 100 |
| BRAMs (36 Kb) | 2192 | 85.5 | 1220 | 162 | 53.5 |
| DSPs | 2980 | 190 | 2160 | 287 | 147 |
| Flip-flop (F/F) | 281.8 K | 35.5 K | - | 94.3 K | 30.08 K |
| Latency (ms) | - | 364 | 106.6 | 32.8 | 0.827 |
| Power (W) | 31.2 | - | 35 | 5.59 | 0.714 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chen, C.; Li, Z.; Zhang, Y.; Zhang, S.; Hou, J.; Zhang, H. Low-Power FPGA Implementation of Convolution Neural Network Accelerator for Pulse Waveform Classification. *Algorithms* **2020**, *13*, 213.
https://doi.org/10.3390/a13090213
