Hardware/Software Co-Design Optimization for Training Recurrent Neural Networks at the Edge
Abstract
1. Introduction
- A novel hardware integration that bridges the gap between the FPTT algorithm and its practical deployment by designing a scalable training routine tailored for edge learning;
- An optimized digital architecture that leverages the open-source Chipyard framework to deliver a customizable hardware platform featuring efficient matrix multiplication accelerators and multicore RISC-V processors;
- A comprehensive evaluation of the system’s efficacy through experiments on the sequential MNIST (S-MNIST) and Google Speech Commands (GSCv2) datasets, achieving 8.2-fold memory savings with only a 20% increase in latency;
- A scalable FPGA demonstrator to benchmark the trade-offs in resource utilization and performance for edge learning.
2. Background
2.1. Forward Propagation Through Time
2.2. Hardware–Software Co-Design Framework
3. Related Work
3.1. Hardware Acceleration of RNN Training
3.2. Edge Learning
3.3. Commercial Edge AI Hardware
3.4. Summary
4. Hardware and Software Co-Design
4.1. Computation Design
Algorithm 1: Partitioned sequence training by FPTT (with improved subscripts).

Algorithm 2: Derivation of the gradient by Back Propagation Through Time (BPTT).
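For context, the core of the partitioned FPTT update can be summarized as follows. This is a simplified sketch following [5,19], not the exact objective of [5], which additionally includes a gradient-correction term that we omit here for brevity. The sequence of length T is split into K partitions; after processing partition k, the gradient of the partition loss is obtained by BPTT within that partition only (Algorithm 2), and the weights are updated against a proximal penalty that pulls them toward a running average of the weights:

```latex
% Simplified per-partition FPTT update (sketch; see [5] for the exact objective).
% \ell_k : loss of partition k, gradient computed by BPTT within the partition only.
\begin{aligned}
W_{k+1}       &= W_k - \eta \,\nabla_W \Big[\, \ell_k(W_k) + \tfrac{\alpha}{2}\,\lVert W_k - \bar{W}_k \rVert^2 \Big],\\
\bar{W}_{k+1} &= \tfrac{1}{2}\big(\bar{W}_k + W_{k+1}\big) \quad (\text{plus a gradient-correction term in the full algorithm}),
\end{aligned}
\qquad k = 1, \dots, K.
```

Because only the hidden states of the current partition must be retained for BPTT, the stored state scales with T/K rather than T, which is the source of the memory savings reported in Section 6.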
4.2. System Architecture
4.3. Optimizations and Explorations
4.4. Programming
- Gemmini is compatible with high-level, mid-level, and low-level programming interfaces, so users can choose the degree of control they need.
- Careful distribution of the parallelizable workload across cores is key to efficient acceleration. In the RISC-V ISA, a hart (hardware thread) represents an independent processing unit, i.e., a core. Each hart in the system is assigned a unique hart ID, so software threads can be pinned to the intended hart by matching its hart ID. On top of that, a barrier is used for thread synchronization, as illustrated in the sketch after this list.
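The following bare-metal C sketch is our assumption of what this hart-based work distribution can look like; it is not taken from the actual code base, and names such as NUM_HARTS, barrier(), and parallel_rows() are illustrative. It assumes a machine-mode, bare-metal environment in which each hart can read its ID from the mhartid CSR, processes an interleaved slice of a parallelizable loop, and then waits at a counting barrier built from GCC atomic built-ins.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_HARTS 4  /* assumed number of RISC-V cores in the SoC */

/* Sense-reversing counting barrier built from GCC atomic built-ins. */
static volatile uint32_t barrier_count = 0;
static volatile uint32_t barrier_sense = 0;

static void barrier(void) {
    uint32_t local_sense = !barrier_sense;            /* sense expected after release */
    if (__sync_add_and_fetch(&barrier_count, 1) == NUM_HARTS) {
        barrier_count = 0;                            /* last hart resets the counter... */
        __sync_synchronize();
        barrier_sense = local_sense;                  /* ...and releases the others */
    } else {
        while (barrier_sense != local_sense)          /* spin until released */
            ;
    }
}

/* Read this core's hart ID from the mhartid CSR (machine mode, bare metal). */
static inline size_t hart_id(void) {
    size_t id;
    asm volatile("csrr %0, mhartid" : "=r"(id));
    return id;
}

/* Each hart handles an interleaved slice of the rows of a parallelizable
 * operation (here a placeholder element-wise scaling), then synchronizes. */
void parallel_rows(float *out, const float *in, float s, size_t rows, size_t cols) {
    size_t id = hart_id();
    for (size_t r = id; r < rows; r += NUM_HARTS)
        for (size_t c = 0; c < cols; c++)
            out[r * cols + c] = s * in[r * cols + c];
    barrier();  /* no hart proceeds until every hart has finished its slice */
}
```

Interleaving rows by hart ID keeps the slices balanced without any shared work queue, and the barrier guarantees that all partial results are visible before the next training step begins.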
4.5. HW/SW Co-Design Flow
- As a starting point, we use the default combination of Rocket and Gemmini for the hardware and migrate the software from the x86 to the RISC-V Instruction Set Architecture (ISA). The adaptations cover RISC-V/Gemmini-specific C libraries and binary toolsets. In this step, we execute the code on the ISA simulator Spike to debug functionality efficiently, and then obtain performance profiles through FPGA-accelerated simulation with FireSim [24]. Note that FireSim is a framework for accelerated simulation rather than conventional FPGA prototyping, as it uses software peripheral models to ensure deterministic execution.
- Next, we compress the Gemmini architecture, including switching from FP32 to BF16 arithmetic. The key software modification is BF16 emulation on the Rocket core, which does not natively support BF16 (a conversion sketch is given after this list). Since Spike does not support BF16 either, we verify the correctness of the BF16 code function by function in the slower RTL simulator. Then, as before, we use FireSim to test and profile the entire workload.
- Last, using the compressed Gemmini architecture, we scale up the full system with multiple RISC-V cores and a larger systolic array in Gemmini to inspect the efficiency of the multicore system. To that end, we add parallel execution and synchronization mechanisms to the C code. FireSim is again used in this step.
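The BF16 emulation mentioned above can be realized with a small set of conversion helpers. The sketch below shows one common way to do it; it is our assumption of what such emulation can look like, not the authors' exact routines, and NaN handling is omitted for brevity. BF16 values are stored as the upper 16 bits of an IEEE-754 single, widened to FP32 for arithmetic on the Rocket core, and narrowed back with round-to-nearest-even.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;  /* BF16 stored as the upper 16 bits of an FP32 */

/* Widen BF16 to FP32: place the 16 stored bits in the high half. */
static inline float bf16_to_f32(bf16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow FP32 to BF16, rounding the discarded mantissa bits to nearest-even. */
static inline bf16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);  /* ties to even */
    return (bf16_t)((bits + rounding) >> 16);
}

/* Example: a BF16 multiply-accumulate carried out in FP32 on the Rocket core. */
static inline bf16_t bf16_mac(bf16_t acc, bf16_t a, bf16_t b) {
    float r = bf16_to_f32(acc) + bf16_to_f32(a) * bf16_to_f32(b);
    return f32_to_bf16(r);
}
```

Because BF16 shares the FP32 exponent width, the conversions reduce to bit shifts plus a rounding add, which keeps the emulation overhead on the scalar core small.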
5. Experimental Set-Up
6. Results
6.1. Accuracy on GSCv2 Dataset
6.2. Performance Estimation on FPGA for the GSCv2 Application
6.3. Effect of Compression
6.4. Comparison to the Cloud
6.5. Trade-Off in Design Points
6.6. Degree of Partition
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| BF16 | Brain Float 16 |
| BPTT | Back Propagation Through Time |
| FPTT | Forward Propagation Through Time |
| FPGA | Field-Programmable Gate Array |
| GRU | Gated Recurrent Unit |
| LSTM | Long Short-Term Memory |
| MAC | Multiply–Accumulate |
| MNIST | Modified National Institute of Standards and Technology |
| OS | Output Stationary |
| PE | Processing Element |
| RNN | Recurrent Neural Network |
| RTL | Register-Transfer Level |
| WS | Weight Stationary |
References
1. Zhang, Y.; Gomony, M.D.; Corporaal, H.; Corradi, F. A Scalable Hardware Architecture for Efficient Learning of Recurrent Neural Networks at the Edge. In Proceedings of the 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC), Tangier, Morocco, 6–9 October 2024; pp. 1–4.
2. Lalapura, V.S.; Amudha, J.; Satheesh, H.S. Recurrent neural networks for edge intelligence: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–38.
3. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; DTIC Document; DTIC: Fort Belvoir, VA, USA, 1985.
4. Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560.
5. Kag, A.; Saligrama, V. Training Recurrent Neural Networks via Forward Propagation Through Time. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Breckenridge, CO, USA, 2021; Volume 139, pp. 5189–5200.
6. Williams, R.J.; Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1989, 1, 270–280.
7. Menick, J.; Elsen, E.; Evci, U.; Osindero, S.; Simonyan, K.; Graves, A. A practical sparse approximation for real time recurrent learning. arXiv 2020, arXiv:2006.07232.
8. Gruslys, A.; Munos, R.; Danihelka, I.; Lanctot, M.; Graves, A. Memory-efficient backpropagation through time. Adv. Neural Inf. Process. Syst. 2016, 29, 4132–4140.
9. Amid, A.; Biancolin, D.; Gonzalez, A.; Grubb, D.; Karandikar, S.; Liew, H.; Magyar, A.; Mao, H.; Ou, A.; Pemberton, N.; et al. Chipyard: Integrated design, simulation, and implementation framework for custom SoCs. IEEE Micro 2020, 40, 10–21.
10. Bachrach, J.; Vo, H.; Richards, B.; Lee, Y.; Waterman, A.; Avižienis, R.; Wawrzynek, J.; Asanović, K. Chisel: Constructing hardware in a Scala embedded language. In Proceedings of the 49th Annual Design Automation Conference, San Francisco, CA, USA, 3–7 June 2012; pp. 1216–1225.
11. Cho, H.; Lee, J.; Lee, J. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 1725–1738.
12. Li, S.; Wu, C.; Li, H.; Li, B.; Wang, Y.; Qiu, Q. FPGA Acceleration of Recurrent Neural Network Based Language Model. In Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, Vancouver, BC, Canada, 2–6 May 2015; pp. 111–118.
13. Lin, J.; Zhu, L.; Chen, W.M.; Wang, W.C.; Gan, C.; Han, S. On-Device Training Under 256KB Memory. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 22941–22954.
14. Ren, H.; Anicic, D.; Runkler, T.A. TinyOL: TinyML with online-learning on microcontrollers. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; pp. 1–8.
15. Ravaglia, L.; Rusci, M.; Nadalini, D.; Capotondi, A.; Conti, F.; Benini, L. A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 789–802.
16. Kukreja, N.; Shilova, A.; Beaumont, O.; Huckelheim, J.; Ferrier, N.; Hovland, P.; Gorman, G. Training on the Edge: The why and the how. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 899–903.
17. Yuan, G.; Ma, X.; Niu, W.; Li, Z.; Kong, Z.; Liu, N.; Gong, Y.; Zhan, Z.; He, C.; Jin, Q.; et al. MEST: Accurate and fast memory-economic sparse training framework on the edge. Adv. Neural Inf. Process. Syst. 2021, 34, 20838–20850.
18. van der Burgt, A. AI in the Wild: Robust Evaluation and Optimized Fine-Tuning of Machine Learning Algorithms Deployed on the Edge. Essay, University of Twente, 2023. Available online: http://essay.utwente.nl/95066/1/Burgt_MA_EEMCS.pdf (accessed on 7 March 2025).
19. Yin, B.; Corradi, F.; Bohté, S.M. Accurate online training of dynamical spiking neural networks through Forward Propagation Through Time. Nat. Mach. Intell. 2023, 5, 518–527.
20. Asanovic, K.; Avizienis, R.; Bachrach, J.; Beamer, S.; Biancolin, D.; Celio, C.; Cook, H.; Dabbelt, D.; Hauser, J.; Izraelevitz, A.; et al. The Rocket Chip Generator; Tech. Rep. UCB/EECS-2016; Electrical Engineering and Computer Sciences, University of California at Berkeley: Berkeley, CA, USA, 2016; Volume 4, pp. 2–6.
21. Genc, H.; Kim, S.; Amid, A.; Haj-Ali, A.; Iyer, V.; Prakash, P.; Zhao, J.; Grubb, D.; Liew, H.; Mao, H.; et al. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; pp. 769–774.
22. Kalamkar, D.D.; Mudigere, D.; Mellempudi, N.; Das, D.; Banerjee, K.; Avancha, S.; Vooturi, D.T.; Jammalamadaka, N.; Huang, J.; Yuen, H.; et al. A Study of BFLOAT16 for Deep Learning Training. arXiv 2019, arXiv:1905.12322.
23. Gookyi, D.A.N.; Lee, E.; Kim, K.; Jang, S.J.; Lee, S.S. Deep Learning Accelerators’ Configuration Space Exploration Effect on Performance and Resource Utilization: A Gemmini Case Study. Sensors 2023, 23, 2380.
24. Karandikar, S.; Mao, H.; Kim, D.; Biancolin, D.; Amid, A.; Lee, D.; Pemberton, N.; Amaro, E.; Schmidt, C.; Chopra, A.; et al. FireSim: FPGA-accelerated Cycle-exact Scale-out System Simulation in the Public Cloud. In Proceedings of the 45th Annual International Symposium on Computer Architecture, Los Angeles, CA, USA, 2–6 June 2018; pp. 29–42.
25. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv 2019, arXiv:1804.03209.
26. Le, Q.V.; Jaitly, N.; Hinton, G.E. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv 2015, arXiv:1504.00941.
27. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
| Paper | Network | Task | Hardware |
|---|---|---|---|
| Lin et al. [13] | MobileNetV2-w0.35, ProxylessNAS-w0.3, MCUNet (5 FPS) | ImageNet, Visual Wake Words | STM32F746: Cortex-M7, 320 KB SRAM, 1 MB Flash |
| Ren et al. [14] | Autoencoder NN | Fan vibration modes | Arduino Nano 33 BLE Sense: Cortex-M4, 256 KB SRAM, 1 MB Flash |
| Ravaglia et al. [15] | MobileNet-V1 | Core50 | VEGA: 10-core RISC-V processor, 64 MB SRAM |
| Kukreja et al. [16] | ResNet-18 to ResNet-152 | Image classification | ODROID XU4 board: 4 A15 cores, 4 A7 cores, Mali-T628 MP6 GPU, 2 GB LPDDR3 RAM |
| Yuan et al. [17] | ResNet-32 | CIFAR-100 | Samsung Galaxy S20 smartphone: Qualcomm Adreno 650 mobile GPU |
| Product | Processor(s) | SRAM | Flash/DRAM | Price (2025) |
|---|---|---|---|---|
| Sony Spresense Main Board (https://developer.sony.com/spresense/product-specifications#secondary-menu-d, accessed on 7 March 2025) | 6-core Arm Cortex-M4F (156 MHz) | 1.5 MB | 8 MB/- | ∼$65 (https://shop-us.framos.com/Spresense-Main-Board-p112340655, accessed on 7 March 2025) |
| Arduino Portenta H7 (https://store.arduino.cc/products/portenta-h7, accessed on 7 March 2025) | Arm Cortex-M7 (480 MHz), Arm Cortex-M4 (240 MHz), Chrom-ART graphics accelerator | 1 MB | 16 MB/8 MB | ∼$99 (https://store.arduino.cc/products/portenta-h7, accessed on 7 March 2025) |
| Greenwaves GAP8 (https://greenwaves-technologies.com/wp-content/uploads/2021/04/Product-Brief-GAP8-V1_9.pdf, accessed on 7 March 2025) | 8-core 64-bit RISC-V cluster (175 MHz), CNN accelerator | 580 KB | 64 MB/8 MB | ∼$60 (https://greenwaves-technologies.com/product/gapmod_module/, accessed on 7 March 2025) |
| Greenwaves GAP9 (https://greenwaves-technologies.com/wp-content/uploads/2022/06/Product-Brief-GAP9-Sensors-General-V1_14.pdf, accessed on 7 March 2025) | 9-core 64-bit RISC-V cluster (370 MHz), cooperative AI accelerator | 1.6 MB | -/- | ∼$90 (https://greenwaves-technologies.com/gap9-store/, accessed on 7 March 2025) |
| SiPEED MAix Go (https://wiki.sipeed.com/soft/maixpy/en/develop_kit_board/maix_go.html, accessed on 7 March 2025) | 2-core 64-bit RISC-V (400 MHz), CNN accelerator | 8 MB | 16 MB/- | ∼$40 (https://www.waveshare.com/maix-go-aiot-developer-kit.htm, accessed on 7 March 2025) |
| NXP i.MX 8ULP SOM (https://www.ezurio.com/system-on-module/nxp-imx8/nitrogen8ulp-som, accessed on 7 March 2025) | 2-core Arm Cortex-A35 (800 MHz), Arm Cortex-M33 (216 MHz), Tensilica HiFi 4 DSP (475 MHz), Fusion DSP (200 MHz) | 896 KB | -/- | ∼$114 (https://nl.mouser.com/ProductDetail/Ezurio/N8ULP_SOM_2r16e_i?qs=mELouGlnn3csyA7i8SLrfg%3D%3D&utm_source=octopart&utm_medium=aggregator&utm_campaign=239-8ULPSOM2R16EI&utm_content=Ezurio, accessed on 7 March 2025) |
| Division | Parameter | Default | Compressed |
|---|---|---|---|
| Arithmetic | inputType | FP32 | BF16 |
| | spatialArrayOutputType | | |
| | accType | | |
| Data Scaling | mvin_scale_args.mul_t | | |
| | mvin_scale_acc_args.mul_t | | |
| | acc_scale_args.mul_t | | |
| Systolic Array | dataflow | WS, OS | WS |
| Scratchpad | sp_capacity | 256 KB | 128 KB |
| Accumulator | acc_capacity | 64 KB | 32 KB |
| Division | Parameter | Value Set |
|---|---|---|
| Gemmini | tileRows, tileColumns | 1 |
| | meshRows, meshColumns | 4, 8 |
| RISC-V core | number | 1, 2, 4, 8 |
| Division | Name | Data Type |
|---|---|---|
| Dimensions | dim_I | size_t |
| | dim_J | size_t |
| | dim_K | size_t |
| Matrix Address | A | const elem_t * |
| | B | const elem_t * |
| | D | const void * |
| | C | void * |
| Matrix Stride | stride_A | size_t |
| | stride_B | size_t |
| | stride_D | size_t |
| | stride_C | size_t |
| Scaling Factor | A_scale_factor | scale_t |
| | B_scale_factor | scale_t |
| | D_scale_factor | scale_acc_t |
| | scale | acc_scale_t |
| Activation | act | int |
| Transpose | transpose_A | bool |
| | transpose_B | bool |
| Precision | full_C | bool |
| | low_D | bool |
| Bias option | repeating_bias | bool |
| Dataflow | tiled_matmul_type | enum tiled_matmul_type_t |
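To make the parameter list above concrete, the following plain-C reference is a sketch of our understanding of the operation that the mid-level matrix-multiplication call requests from Gemmini, under the assumption that it computes C = act(scale · (A·B + D)) with per-operand input scale factors, per-operand row strides, and a repeating_bias option that reuses a single bias row for every output row. The function name is ours, the transpose and output-precision flags (transpose_A/B, full_C, low_D) are omitted for brevity, and the integer act parameter is reduced to a ReLU on/off switch.

```c
#include <stddef.h>
#include <stdbool.h>

/* CPU reference for the operation requested through the parameters in the
 * table above (assumed semantics: C = act(scale * (A*B + D))).
 * On the real system, elem_t is FP32 or BF16 and acc_t is the accumulator type. */
typedef float elem_t;
typedef float acc_t;

void matmul_reference(size_t dim_I, size_t dim_J, size_t dim_K,
                      const elem_t *A, const elem_t *B, const acc_t *D, elem_t *C,
                      size_t stride_A, size_t stride_B, size_t stride_D, size_t stride_C,
                      float A_scale_factor, float B_scale_factor, float D_scale_factor,
                      float scale, bool relu_act, bool repeating_bias) {
    for (size_t i = 0; i < dim_I; i++) {
        for (size_t j = 0; j < dim_J; j++) {
            /* Optional bias D; repeating_bias reuses row 0 for every output row. */
            acc_t acc = 0;
            if (D != NULL) {
                size_t bias_row = repeating_bias ? 0 : i;
                acc = D_scale_factor * D[bias_row * stride_D + j];
            }
            for (size_t k = 0; k < dim_K; k++)
                acc += (A_scale_factor * A[i * stride_A + k]) *
                       (B_scale_factor * B[k * stride_B + j]);
            acc *= scale;                      /* output scaling before narrowing */
            if (relu_act && acc < 0) acc = 0;  /* 'act' selects e.g. ReLU */
            C[i * stride_C + j] = (elem_t)acc;
        }
    }
}
```

The remaining parameter, tiled_matmul_type, selects the execution mode of the real call (e.g., weight-stationary or output-stationary dataflow on Gemmini, or a CPU fallback).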
| Type | Forward Pass | Backward Pass | Parameter Update |
|---|---|---|---|
| Add/Sub | 52,684,800 | 131,263,600 | 105,705,600 |
| Mul/Div | 52,787,200 | 145,514,000 | 66,066,000 |
| Module | | LUT | FF | BRAM | DSP |
|---|---|---|---|---|---|
| Gemmini | 4 × 4 | 42,387 | 26,251 | 40 | 120 |
| | 8 × 8 | 80,002 | 40,587 | 40 | 120 |
| Single RISC-V tile | | 28,108 | 13,303 | 8 | 15 |
| Power (W) | | 4 × 4-1 | 4 × 4-2 | 4 × 4-4 | 4 × 4-8 | 8 × 8-1 | 8 × 8-2 | 8 × 8-4 | 8 × 8-8 |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic | clocks | 0.499 | 0.453 | 0.408 | 0.423 | 0.454 | 0.458 | 0.412 | 0.426 |
| | signals | 0.107 | 0.127 | 0.137 | 0.189 | 0.142 | 0.150 | 0.170 | 0.216 |
| | logic | 0.206 | 0.279 | 0.283 | 0.334 | 0.301 | 0.308 | 0.321 | 0.368 |
| | BRAM | 0.097 | 0.098 | 0.108 | 0.113 | 0.097 | 0.099 | 0.106 | 0.114 |
| | DSP | 0.008 | 0.008 | 0.010 | 0.013 | 0.008 | 0.008 | 0.010 | 0.013 |
| | PLL | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 |
| | MMCM | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 |
| | I/O | 1.325 | 1.358 | 1.327 | 1.353 | 1.330 | 1.325 | 1.330 | 1.330 |
| Static | | 2.514 | 2.515 | 2.516 | 2.519 | 2.515 | 2.515 | 2.516 | 2.519 |
| Total | | 5.422 | 5.501 | 5.452 | 5.606 | 5.509 | 5.526 | 5.527 | 5.648 |