# AHEAD: Automatic Holistic Energy-Aware Design Methodology for MLP Neural Network Hardware Generation in Proactive BMI Edge Devices


## Abstract


## 1. Introduction

- A novel holistic design methodology that, for the first time, bridges the gap between BMI developers and hardware developers by automatically generating energy-aware MLP hardware from trained MLP parameters and golden datasets.
- Energy-aware MLP hardware generation for proactive BMI control with automatic nonuniform fixed-point bit-width identification.
- A fully automatic flow that frees domain experts on both sides from the iterative, tedious, labor-intensive, and error-prone tasks of floating-to-fixed-point conversion and low-power hardware design.
- A design methodology that is independent of machine-learning tools and programming languages.
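To make the floating-to-fixed-point conversion concrete, the following is a minimal illustrative sketch (not the authors' exact BWID algorithm) of simulation-based bit-width selection: each signal is quantized to a candidate signed fixed-point format, and the smallest total bit-width that still passes an accuracy check against a golden dataset is kept. The function names, the fixed integer/fraction split, and the greedy search strategy are all assumptions for illustration.

```python
import numpy as np

def to_fixed(x, total_bits, frac_bits):
    """Quantize a float array to signed fixed-point with the given
    total and fractional bit-widths, saturating on overflow."""
    scale = 2 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale  # back to float for accuracy evaluation

def min_bit_width(signal, accuracy_ok, max_bits=32, frac_ratio=0.5):
    """Greedy search: the smallest total bit-width whose quantized
    signal still passes the accuracy check on the golden data."""
    for bits in range(2, max_bits + 1):
        frac = int(bits * frac_ratio)
        if accuracy_ok(to_fixed(signal, bits, frac)):
            return bits
    return max_bits
```

Because each signal node is searched independently, this naturally yields the nonuniform per-signal bit-widths that make fixed-point hardware cheaper than a uniform word length.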

## 2. Background of Proactive BMI Control

#### 2.1. Plan4Act System Architecture

C1 denotes the action of opening the door and turning on the light in the toilet, while C2 represents the action of opening the door and turning on the light on the terrace. The action sequence AB may lead to either of the following actions, C1 or C2, since both share the common path AB. The goal of proactive control is to predict C1 while acquiring the neuronal data of either action A or the action sequence AB, and to avoid the false case ABC2. This differs from reactive control, whereby the actions ABC1 are simply executed in sequence; here, the actions are translated into a complex proactive BMI control problem: the BMI needs to acquire, analyze, and identify sequence-predicting neuronal activity in the brain while the tasks are executed, and mathematical models based on the interaction of neuronal activity and plasticity mechanisms are developed to understand this sequence-predicting neural activity and to provide the algorithmic basis of the neural signal decoder design.

#### 2.2. MLP Network of the Neural Decoder

## 3. The AHEAD Methodology

#### 3.1. AHEAD—System Overview

#### 3.2. Stage 1: Automatic Bit-Width Identification

#### 3.3. Stage 2: Configurable High-Performance Low-Power MLP Microarchitecture

## 4. Detailed Realization of the AHEAD Methodology

#### 4.1. MLP Hardware Generation

#### 4.2. Automatic Test Bench Generation (ATBG)

#### 4.3. Bit-Width Identification (BWID)

## 5. Experimental Results

#### 5.1. BMI Recalibration Case 1

#### 5.2. BMI Recalibration Case 2

## 6. Discussion

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References


**Figure 2.** Proactive brain–machine interface (BMI) system architecture. FPGA—field-programmable gate array.

**Figure 7.** Hardware implementation of the PWL-approximated activation functions for the hyperbolic tangent and sigmoid. (**a**) Microarchitecture of the piecewise-linear (PWL) approximation function. AGU—address generation unit. (**b**) Output comparison of the 16-segment PWL sigmoid and the original sigmoid function.

| Model Parameters | Case 1 | Case 2 |
|---|---|---|
| Number of layers | 3 | 4 |
| Neurons in input layer | 800 | 768 |
| Neurons in hidden layer 1 | 20 (Sigmoid) | 48 (Sigmoid) |
| Neurons in hidden layer 2 | - | 20 (Sigmoid) |
| Neurons in output layer | 2 (Sigmoid) | 2 (Sigmoid) |
| Total signal nodes for BWID | 18 | 24 |
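The two decoder topologies in the table can be reproduced with a minimal inference-only MLP sketch. The class name, the random placeholder weights, and the weight scale are illustrative assumptions; the real decoders use the trained parameters exported by the BMI developers, and every layer is sigmoid-activated as the table states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLP:
    """Minimal inference-only MLP with sigmoid activations on every
    layer. Weights here are random placeholders, not the trained
    decoder parameters."""
    def __init__(self, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.standard_normal((m, n)) * 0.1
                        for m, n in zip(layer_sizes, layer_sizes[1:])]
        self.biases = [np.zeros(n) for n in layer_sizes[1:]]

    def forward(self, x):
        for w, b in zip(self.weights, self.biases):
            x = sigmoid(x @ w + b)
        return x

case1 = MLP([800, 20, 2])      # Case 1: 3 layers (800-20-2)
case2 = MLP([768, 48, 20, 2])  # Case 2: 4 layers (768-48-20-2)
```

Each weight matrix, bias vector, and activation output is one of the signal nodes that BWID must assign a fixed-point bit-width to, which is why Case 2's deeper network has more nodes (24 vs. 18).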

| Stage | Step | Execution Time ^{1} |
|---|---|---|
| BWID | BWS-IBW | 00:15 |
| BWID | BWS-FBW | 11:45 |
| BWID | BWO-FBW | 10:10 |
| Microarchitecture synthesis | Energy-efficient hardware generation | 01:20 |

^{1} Times are in mm:ss format. BWS—bit-width selection; BWO—bit-width optimization.

| Metrics | Type | FP32 | FP16 | Fixed-Point |
|---|---|---|---|---|
| Accuracy | Loss of accuracy | 0% | 0% | 0% ^{1} |
| Performance | Max frequency | 103.7 MHz | 105.6 MHz | 106.2 MHz |
| Performance | Max throughput | 30K | 31K | 125K |
| Performance | Max latency | 32.86 µs | 32.28 µs | 8.01 µs |
| Power | Dynamic power | 408 mW | 246 mW | 88 mW |
| Area | Slice LUTs (utilization %) | 13,702 (25.76%) | 6955 (13.07%) | 2759 (5.19%) |
| Area | Slice registers (utilization %) | 15,543 (14.61%) | 9059 (8.51%) | 1610 (1.51%) |
| Area | DSP48E1s (utilization %) | 103 (46.82%) | 82 (37.27%) | 21 (9.55%) |
| Area | BRAMs (utilization %) | 42 (30%) | 21.5 (15.36%) | 8.5 (6.07%) |

^{1} The resultant average bit-width is 7.47 bits. LUTs—look-up tables; BRAMs—block RAMs.

| Stage | Step | Execution Time ^{1} |
|---|---|---|
| BWID | BWS-IBW | 00:17 |
| BWID | BWS-FBW | 09:57 |
| BWID | BWO-FBW | 20:40 |
| Microarchitecture synthesis | Energy-efficient hardware generation | 01:23 |

^{1} Times are in mm:ss format.

| Metrics | Type | FP32 | FP16 | Fixed-Point |
|---|---|---|---|---|
| Accuracy | Loss of accuracy | 0% | 0% | 0% ^{1} |
| Performance | Max frequency | 104.5 MHz | 108.8 MHz | 114.6 MHz |
| Performance | Max throughput | 29K | 30K | 124K |
| Performance | Max latency | 34.69 µs | 33.36 µs | 8.05 µs |
| Power | Dynamic power | 1319 mW | 604 mW | 221 mW |
| Area | Slice LUTs (utilization %) | 37,375 (47.55%) | 19,801 (25.19%) | 12,050 (15.33%) |
| Area | Slice registers (utilization %) | 41,118 (26.16%) | 22,134 (14.08%) | 5347 (3.4%) |
| Area | DSP48E1s (utilization %) | 243 (60.75%) | 194 (48.5%) | 1 (0.25%) |
| Area | BRAMs (utilization %) | 119 (44.91%) | 60 (22.64%) | 21 (7.92%) |

^{1} The resultant average bit-width is 6.95 bits.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Huang, N.-S.; Chen, Y.-C.; Larsen, J.C.; Manoonpong, P.
AHEAD: Automatic Holistic Energy-Aware Design Methodology for MLP Neural Network Hardware Generation in Proactive BMI Edge Devices. *Energies* **2020**, *13*, 2180.
https://doi.org/10.3390/en13092180
