Article

Hardware/Software Co-Design Optimization for Training Recurrent Neural Networks at the Edge †

1
Computer Architecture and Operating System (CAOS), Technical University of Munich (TUM), Bildungcampus 2, 74076 Heilbronn, Germany
2
Electronic Systems, Eindhoven University of Technology (TU/e), Flux, Groene Loper 19, 5612 AP Eindhoven, The Netherlands
*
Authors to whom correspondence should be addressed.
This paper is an extended version of our paper: Zhang, Y.; Gomony, M.D.; Corporaal, H.; Corradi, F. A Scalable Hardware Architecture for Efficient Learning of Recurrent Neural Networks at the Edge. In Proceedings of the 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC), Tangier, Morocco, 6–9 October 2024.
J. Low Power Electron. Appl. 2025, 15(1), 15; https://doi.org/10.3390/jlpea15010015
Submission received: 24 January 2025 / Revised: 28 February 2025 / Accepted: 6 March 2025 / Published: 11 March 2025

Abstract

Edge devices execute pre-trained Artificial Intelligence (AI) models optimized on large Graphics Processing Units (GPUs); however, they frequently require fine-tuning when deployed in the real world. This fine-tuning, referred to as edge learning, is essential for personalized tasks such as speech and gesture recognition, which often necessitate the use of recurrent neural networks (RNNs). However, training RNNs on edge devices presents major challenges due to limited memory and computing resources. In this study, we propose a system for RNN training through sequence partitioning using the Forward Propagation Through Time (FPTT) training method, thereby enabling edge learning. Our optimized hardware/software co-design for FPTT represents a novel contribution in this domain. This research demonstrates the viability of FPTT for fine-tuning real-world applications by implementing a complete computational framework for training Long Short-Term Memory (LSTM) networks utilizing FPTT. Moreover, this work incorporates the optimization and exploration of a scalable digital hardware architecture using an open-source hardware-design framework, named Chipyard, and its implementation on a Field-Programmable Gate Array (FPGA) for cycle-accurate verification. The empirical results demonstrate that partitioned training on the proposed architecture enables an 8.2-fold reduction in memory usage with only a 20% increase in latency for small-batch sequential MNIST (S-MNIST) compared to traditional non-partitioned training.

1. Introduction

The standard methodology for implementing artificial intelligence on edge devices entails training models within the cloud environment on large GPU systems, followed by the deployment of their optimized variants on embedded devices. This approach is used for virtually all applications, such as speech recognition, video processing, surveillance and biomedical signal analysis [1,2]. However, it is very often the case that real-world data differ from training data, thus necessitating additional fine-tuning of the on-device model to address distinct user characteristics, such as individual accents or patient-specific biomedical signals. Thus, edge learning plays a crucial role in the robust functioning of personalized applications that manage dynamic data, owing to its advantages in energy efficiency, reduced latency and enhanced privacy when compared to cloud-based retraining and redeployment.
Training RNNs for real-world applications at the edge is a real challenge, as most approaches rely on the standard Backpropagation Through Time (BPTT) algorithm. Although powerful, BPTT is a decades-old method, initially conceptualized by Rumelhart et al. in the late 1980s [3] and later refined by Paul Werbos [4]. At the time of its development, neural networks were not deep and memory efficiency was not a priority. However, modern edge applications demand training approaches tailored to resource-constrained systems. The main limitation of BPTT is that it needs to store all network states during backtracking, resulting in a significant memory footprint that is particularly prohibitive for embedded systems handling long sequences.
Kag et al. in [5] propose a novel training algorithm for RNNs termed Forward Propagation Through Time (FPTT). Unlike conventional methods based on BPTT, FPTT updates RNN parameters by optimizing an instantaneous, time-dependent risk function at each time step. This risk function is a dynamically evolving regularized loss influenced by previously observed losses and enables gradient computation with significantly reduced memory requirements.
In a situation in which training RNNs on-device at the edge is required, FPTT is a viable solution as it reduces memory footprint when compared to traditional BPTT methodologies. However, the FPTT algorithm depends on the execution of a complex dataflow and needs high precision operations, thereby posing challenges to current edge hardware architectures, which lack optimization for the efficient acceleration of FPTT.
In this paper, we address this problem by designing an optimized digital hardware architecture. Our architecture overcomes the challenges in sequential learning tasks by exploiting FPTT-based training to enable efficient RNN training directly on resource constrained devices and without the need to transmit data to the cloud. Specifically, the contributions of this work are as follows:
  • A novel hardware integration that bridges the gap between the FPTT algorithm and its practical deployment by designing a scalable training routine tailored for edge learning;
  • An optimized digital architecture that leverages an open-source Chipyard framework to deliver a customizable hardware platform featuring efficient matrix multiplication accelerators and multicore RISC-V processors;
  • A comprehensive evaluation of the system’s efficacy is demonstrated through experiments using sequential MNIST (S-MNIST) and Google Speech Commands (GSCv2) datasets, achieving 8.2-fold memory savings with only a 20% increase in latency;
  • A scalable FPGA demonstrator to benchmark the trade-offs in resource utilization and performance for edge learning.
The remainder of this paper is structured as follows. Section 2 presents the FPTT training algorithm and the framework for co-designing algorithms and architecture. Section 3 includes related work on RNN hardware acceleration, edge learning and AI hardware. Section 4 details the training process, the system architecture and its optimizations and explorations. Section 5 explains the experimental setup. Section 6 analyzes latency and memory footprint, highlighting the performance of the architecture. This work represents the first scalable digital acceleration architecture for FPTT training of RNNs, advancing efficient edge learning.

2. Background

Training RNNs efficiently on embedded devices has been a significant focus of recent research. This is because the traditional BPTT algorithm, while effective, incurs large memory consumption, particularly with long temporal sequences.
To address these limitations, several algorithms have been proposed since 1989, when Real-Time Recurrent Learning (RTRL) [6] was introduced, computing gradients in an online fashion and eliminating the need to store past activations. However, RTRL is a computationally complex algorithm that scales very poorly with the number of network parameters, making it impractical for large-scale applications.
Recent advances aim to balance memory efficiency and computational feasibility. The Sparse n-step Approximation (SnAp) presented by Jacob Menick et al. in [7] modifies RTRL by tracking the influence of parameters over a limited number of time steps, reducing computational demands and maintaining some online learning capabilities. Similarly, another algorithm, named Memory-Efficient Backpropagation Through Time (ME-BPTT), was introduced by Gruslys et al. in [8] with the aim of reducing memory usage while training RNNs. This was achieved by balancing the storage and (re)computation of intermediate network states through dynamic programming techniques. Although this approach effectively decreases memory consumption, it still necessitates additional forward computations to recompute some states during backpropagation, leading to increased computational overhead and a complex dataflow.
In contrast, the FPTT algorithm updates the RNN parameters by optimizing an instantaneous, time-dependent risk function at each time step, allowing for efficient gradient computation with low memory requirements. FPTT by design directly supports online learning and is well suited for resource-constrained environments. However, FPTT requires high-precision updates of the RNN parameters and imposes high computational demands, which current edge hardware architectures do not support.
Although modern algorithms like ME-BPTT and SnAp partially address the limitations of traditional BPTT by trading increased computation for reduced memory usage, FPTT is preferred because it enables online updates with minimal memory requirements and support for small batch sizes, making it particularly suitable for customized embedded hardware accelerators and real-world streaming applications.

2.1. Forward Propagation Through Time

We take advantage of the Forward Propagation Through Time (FPTT) algorithm [5], which updates the RNN parameters by optimizing an instantaneous risk function that evolves dynamically based on previous losses. This approach enables convergence to a stationary solution of the empirical RNN objective while addressing long-range dependencies and improving memory efficiency by breaking optimization tasks into smaller steps. For further details, see [5].
In summary, the FPTT online formulation interleaves forward and backward passes with parameter updates at each time step, dividing the input sequence of $T$ steps into $K$ parts. Backtracking of the computational graph in the FPTT algorithm is limited to the network states generated within a part rather than the complete input sequence, as shown in Figure 1 for the case $K = T$, that is, backtracking over only the current time step. Furthermore, the partition factor $K$ supports coarser granularities by splitting the sequence into $K$ parts. In addition to the traditional loss $l^{(t)}$, it is useful to incorporate the regularization term $R^{(t)}$ to ensure stabilization and convergence, with $\alpha$ being a hyperparameter that regulates $R^{(t)}$. $R^{(t)}$ penalizes the deviation of the network parameter $\theta$ from its running average during training, where $t$ represents the time step and $\lambda^{(t)}$ the running estimate of the time-varying risk function, as defined in [5].
$L^{(t)} = l^{(t)} + R^{(t)}$  (1)
$R^{(t)} = \frac{\alpha}{2} \left\| \theta - \bar{\theta}^{(t)} - \frac{1}{2\alpha} \lambda^{(t)} \right\|^2$  (2)
Figure 1 illustrates the difference between BPTT and FPTT in updating the network parameters at time step $t+1$. BPTT backtracks over all network states generated so far, while FPTT uses only the states of the current sequence part. This means that the states of the former parts can be released from memory, resulting in a smaller memory footprint during training. Note that backtracking in space through all layers is still required; however, FPTT alleviates the need for backtracking in time, as this can be broken down into $K$ parts. This makes the FPTT algorithm particularly attractive for RNNs, where backtracking in time is required by design.
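To make the regularizer concrete, the following C sketch (our own illustration with hypothetical names, not the reference implementation of [5]) evaluates $R^{(t)}$ of Equation (2) and its gradient, which reappears as Equation (15) in Section 4.1, over a flattened parameter vector.

#include <stddef.h>

/* R(t) = (alpha/2) * || theta - theta_bar - lambda/(2*alpha) ||^2, Eq. (2) */
static float fptt_regularizer(const float *theta, const float *theta_bar,
                              const float *lambda, float alpha, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = theta[i] - theta_bar[i] - lambda[i] / (2.0f * alpha);
        acc += d * d;
    }
    return 0.5f * alpha * acc;
}

/* dR/dtheta = alpha * (theta - theta_bar) - lambda / 2, cf. Eq. (15) */
static void fptt_regularizer_grad(float *grad, const float *theta,
                                  const float *theta_bar, const float *lambda,
                                  float alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        grad[i] = alpha * (theta[i] - theta_bar[i]) - 0.5f * lambda[i];
}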

2.2. Hardware–Software Co-Design Framework

To explore hardware architectures that can accelerate FPTT computation and their trade-offs, we opted for an Application-Specific Instruction-Set Processor (ASIP) architecture for FPTT computation, as it efficiently reuses matrix multiplication blocks to balance area savings and performance, unlike ASICs, which are less suited for complex, mixed operations. To do so, we use the Chipyard framework [9]. This framework enables us to design and evaluate a full-system solution. It comprises a collection of infrastructure that integrates various tools for a highly automated system-on-chip development flow, covering hardware design, RTL simulation, FPGA prototyping, VLSI flow and software development. For hardware design, the framework provides a series of hardware IPs that users can combine to construct a system at will, including RISC-V-based CPUs and domain-specific architectures such as Gemmini. These hardware IPs are usually called generators, as they are implemented in Chisel [10], a Scala-based hardware-construction language that generates low-level Verilog designed to map to FPGAs or a standard ASIC synthesis flow. The open-source tools available within the Chipyard framework enable users to construct, customize and explore the trade-offs of a complete working system using open-source hardware IPs and domain-specific architectures.

3. Related Work

There are many companies and research institutions interested in RNN training acceleration, edge learning and leading AI hardware products; here, we highlight the most recent developments in these areas.

3.1. Hardware Acceleration of RNN Training

RNN inference acceleration is mostly performed on CPUs, GPUs or FPGAs. At the same time, GPUs still dominate large-batch RNN training due to their abundant parallel computing units, together with software techniques that exploit GPUs for higher efficiency and utilization; for example, weight pruning and low-precision computation are widely applied.
On the other hand, Cho et al. [11] point out that in some situations where a small batch size is preferred, e.g., partial retraining for personalization, GPU performance suffers from low utilization. Therefore, they propose a hybrid architecture called FARNN that combines a GPU and an FPGA. Specifically, GPU-inefficient tasks are off-loaded to dedicated computing units in the FPGA. The evaluation results indicate that FARNN outperforms the P100 GPU platform for RNN training by up to 4.2× with small batch sizes and long input sequences.
Li et al. [12] present an FPGA implementation framework for the acceleration of RNN language model (RNNLM) training. At the architectural level, they improve the parallelism of the RNN training scheme and reduce the requirement for computing resources to improve computing efficiency. The hardware implementation aims primarily to reduce the data communication load. A multithread-based computation engine is utilized, which can successfully mask the long memory latency and reuse frequently accessed data.

3.2. Edge Learning

We investigate five works on edge neural network training, in which various algorithmic innovations are proposed to fit the model and its training process on resource-limited devices. They are summarized in Table 1. In this relatively new domain, most works focus on the acceleration of Convolutional Neural Network (CNN)-based models, as they tend to require a lower memory footprint during the training process.
Lin et al. [13] propose an algorithm-system co-design framework to make on-device training possible with only 256 KB of memory, that is, fine-tuning a pre-trained model. Quantization-aware scaling is used to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, they propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The integration of the algorithm into their Tiny Training Engine enables a 1000-fold reduction in memory footprint compared to learning in PyTorch v1.9 and TensorFlow while matching accuracy.
Ren et al. [14] based their work on a tiny MCU. They propose a novel system called TinyOL (TinyML with Online Learning), which enables incremental on-device training of an autoencoder model on streaming data by fine-tuning the last fully connected layer.
Ravaglia et al. [15] introduce a HW/SW platform for end-to-end continual learning based on a 10-core FP32-enabled parallel ultra-low-power (PULP) processor. They leverage quantization of the frozen stage of the model and Latent Replays (LRs) to reduce the memory cost with minimal impact on accuracy. In particular, 8-bit compression of the LR memory proves to be almost lossless (−0.26% with 3000 LRs) compared to the full-precision baseline implementation. On the 22 nm prototype of the VEGA platform, the proposed solution performs on average 65× faster than a low-power STM32L4 microcontroller while being 37× more energy efficient.
Checkpointing (or re-forwarding) means that a subset of activations is stored during the forward pass and the rest is discarded. The discarded data can be recovered by rerunning the forward propagation from the last available “checkpoint”. Kukreja et al. [16] show that with customized checkpointing, the memory footprint can be kept within 2 GB for ResNet training (up to 152 layers, batch sizes up to 8, image sizes up to 500).
Yuan et al. [17] propose Memory-Economic Sparse Training (MEST), targeting accurate and fast execution on edge devices. The key idea of MEST is to optimize the training of deep learning models on edge devices using three key components: (i) reducing memory usage through sparse training techniques; (ii) improving accuracy with elastic mutation and a soft memory bound; and (iii) enhancing training efficiency by dynamically adjusting the sparsity and prioritizing important training data.
However, none of the aforementioned edge learning devices enable on-device training of recurrent networks, as BPTT requires extensive on-chip memory, leaving cloud-based retraining as the only viable option for fine-tuning or adaptation.

3.3. Commercial Edge AI Hardware

Edge AI hardware covers a wide range of budgets for various requirements. Inspired by van der Burgt [18], and to understand the reasonable benefits of Edge AI devices, we list a few commercial products in Table 2; these typically combine multiple CPUs with a domain-specific accelerator. Domain-specific accelerators usually include a memory hierarchy with on-chip (SRAM) and external (Flash/DRAM) memories.

3.4. Summary

The deployment of neural network training on resource-constrained platforms remains a significant challenge, which requires careful consideration of both hardware and software. Two primary obstacles need to be addressed before we can achieve acceptable latency despite the heavy computational demands on limited processing resources: memory inefficiencies and suboptimal data flow. So far, the state of the art has focused on five key areas of effort to tackle these two main challenges:
  • Identify the computational bottlenecks and customize the hardware for them. Cho et al. [11] identify the GPU-inefficient tasks and offload them to dedicated computing units in an FPGA. Li et al. [12] implement the hardware to reduce the data communication load.
  • Fully exploit the parallelism in the acceleration computation. Li et al. [12] improve the parallelism of the training scheme at the architectural level.
  • Skip the unnecessary computations. Lin et al. [13] skip the gradient computation of less important layers and sub-tensors, and Ren et al. [14] only fine-tune the last fully connected layer.
  • Trade more computations for less memory. Latent re-play is used by Ravaglia et al. [15] and Re-forwarding (Checkpointing) is used by Kukreja et al. [16].
  • Use compact data types to save memory. Ravaglia et al. [15] use 8-bit compressed latent replay memory.
In our methodology, detailed in the next section, we draw inspiration from these five key strategies and detail the software-hardware methods used for exploring FPGA architectures for training RNNs.

4. Hardware and Software Co-Design

Our methodology for the design space exploration of the hardware architecture is structured as follows: we begin by bridging the gap between the theoretical foundations of FPTT and its practical implementation in edge learning, presenting a scalable FPTT-based training routine in Section 4.1. Then, in Section 4.2, we introduce a reference architecture template, developed using a Chipyard-based hardware platform. Section 4.3 focuses on optimizations and explorations, Section 4.4 describes how the computation is programmed onto the architecture, and Section 4.5 outlines the HW/SW co-design flow we used to evaluate the complete system’s performance.

4.1. Computation Design

FPTT-based training utilizes the partitioning of input sequences into K segments. In comparison to conventional one-step training methodologies that employ BPTT, this approach necessitates increased computational resources while simultaneously reducing memory consumption. As illustrated in Figure 2, this method is in contrast to the traditional one-step processing approach to update the weights of recurrent neural networks.
FPTT-based training for sequential applications is presented in Algorithm 1.
Algorithm 1 Partitioned sequence training by FPTT
  • Input: learning rate $\eta$, regulator $\alpha$
  • Input: training data $B = \{x_i, y_i\}_{i=1}^{N}$, sequence length $T$
  • Initialize: partition factor $K \in [1, T]$, stride $st = T/K$
  • Initialize: randomize $\theta$ (the weight $W$ or the bias $b$) as $\theta^{(1)}$
  • Initialize: running average $\bar{\theta}^{(1)} = \theta^{(1)}$
  • Initialize: running estimate $\lambda^{(1)} = 0$
  • Shorthand: Network Parameters $NP = \{\theta, \lambda, \bar{\theta}\}$
  • Shorthand: Network States as $NS$
  • for $p = 1$ to $K$ do
  •     $part$ = $p$-th stride of $\{x_i\}_{i=1}^{N}$
  •     $NS[(p-1) \cdot st + 1 : p \cdot st]$ = FW($\theta^{(p)}$, $part$, $NS_{(p-1) \cdot st}$)
  •     $\frac{\partial L^{(p \cdot st)}}{\partial \theta^{(p)}}$ = BW($NP^{(p)}$, $\{y_i\}_{i=1}^{N}$, $NS[(p-1) \cdot st + 1 : p \cdot st]$)
  •     $NP^{(p+1)}$ = PU($\frac{\partial L^{(p \cdot st)}}{\partial \theta^{(p)}}$, $NP^{(p)}$)
  • end for
  • Return: $\theta^{(K+1)}$, $Loss(\hat{y}^{(T)}, y^{(T)})$
Notation 1.
Suppose that the training batch $B$ has $N$ samples, i.e., $B = \{x_i, y_i\}_{i=1}^{N}$, each expressed as a sequence of $T$ time steps. To perform partitioned training, the sequence of $T$ time steps is divided into $K$ parts, resulting in a stride of $st = T/K$ time steps. Processing all $K$ parts requires $K$ iterations and updates the network parameters $K$ times. During iteration $p$, the network uses the $p$-th sequence part of $st$ steps for the forward pass (FW). Following this, the backward pass (BW) operates on the $st$ generated network states to calculate the gradients used to update the network parameters. The value of $K$ lies within the range of integers in $[1, T]$; divisions with a remainder are excluded. The extreme values of $K$, either 1 or $T$, imply no partition or the finest partition, respectively.
As depicted in Figure 2, the number of forward passes, backward passes and parameter updates increases by a factor of $K$ in partitioned training compared with one-go training. To be specific, each of the $K$ forward and backward passes is applied to $T/K + 1$ network states instead of $T + 1$. Nevertheless, each parameter update still requires the same amount of computation as in one-go training.
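As an illustration of Algorithm 1, the schematic C loop below (hypothetical types and helper functions, not the actual implementation) shows how the $K$ iterations interleave the forward pass, backward pass and parameter update while keeping only the states of the current part live in memory.

/* Schematic of Algorithm 1; the types and helpers declared here are
 * hypothetical placeholders for the application's own data structures. */
typedef struct Params Params;   /* NP = { theta, lambda, theta_bar }     */
typedef struct Batch  Batch;    /* training data { x_i, y_i }, i = 1..N  */
typedef struct State  State;    /* network states NS                     */
typedef struct Grads  Grads;

extern State *alloc_states(int count);
extern void   free_states(State *ns);
extern void   forward_pass(Params *np, const Batch *b, int t0, int st, State *ns);
extern Grads *backward_pass(const Params *np, const Batch *b, const State *ns, int st);
extern void   parameter_update(Params *np, const Grads *g);   /* Eqs. (16)-(18)  */
extern void   carry_last_state(State *ns, int st);            /* ns[0] <- ns[st] */

void fptt_train_sequence(Params *np, const Batch *batch, int T, int K)
{
    int st = T / K;                     /* stride: time steps per part           */
    State *ns = alloc_states(st + 1);   /* st new states plus the carried-in one */

    for (int p = 0; p < K; p++) {
        forward_pass(np, batch, p * st, st, ns);        /* FW on the p-th part    */
        Grads *g = backward_pass(np, batch, ns, st);    /* BW over st states only */
        parameter_update(np, g);                        /* PU                     */
        carry_last_state(ns, st);   /* older states of this part can be released  */
    }
    free_states(ns);
}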
In this subsection, we consider the example of K being identical to T to explain forward pass, backward pass and parameter update, as in that scenario, the number of parameter updates is the same as the number of time steps so that they can use the same subscript t.
Forward Pass (FW). It uses the network parameters, input sequence part and the existing network states to generate new network states based on the network topology. To that end, FPTT particularly benefits from the gating structure inherent in recurrent models such as LSTMs and GRUs [19]. Regarding terminal prediction problems, i.e., sequence classification, it is common to append a classifier layer for dimensional matching and inference. Therefore, in our case, we employ a simple network structure with one LSTM layer followed by a fully connected layer as part of the benchmark in Section 5.
We illustrate the forward pass at a time step $t$. For simplicity in computation, we concatenate the weights of the input connection and the recurrent connection for each gate of the LSTM layer, as in Equation (3). In addition, we merge their biases, as in Equation (4).
$W_{*}^{(t)} = [W_{*x}^{(t)}, W_{*h}^{(t)}] \quad (* = i, f, g, o)$  (3)
$b_{*}^{(t)} = b_{*x}^{(t)} + b_{*h}^{(t)} \quad (* = i, f, g, o)$  (4)
To match the size of the merged weights, $x^{(t)}$, namely the input sequence at $t$, is concatenated with $h^{(t-1)}$, i.e., the LSTM hidden state at the previous time step $t-1$.
$x_c^{(t)} = [x^{(t)}, h^{(t-1)}]$  (5)
Afterward, we calculate the input gate $i^{(t)}$, forget gate $f^{(t)}$, candidate cell state $g^{(t)}$ and output gate $o^{(t)}$ at time step $t$ by matrix multiplication $[\cdot]$, addition and activation, as shown in Equations (6)–(9).
$i^{(t)} = \mathrm{sigmoid}(i\_input^{(t)}); \quad i\_input^{(t)} = W_i^{(t)} \cdot x_c^{(t)} + b_i^{(t)}$  (6)
$f^{(t)} = \mathrm{sigmoid}(f\_input^{(t)}); \quad f\_input^{(t)} = W_f^{(t)} \cdot x_c^{(t)} + b_f^{(t)}$  (7)
$g^{(t)} = \tanh(g\_input^{(t)}); \quad g\_input^{(t)} = W_g^{(t)} \cdot x_c^{(t)} + b_g^{(t)}$  (8)
$o^{(t)} = \mathrm{sigmoid}(o\_input^{(t)}); \quad o\_input^{(t)} = W_o^{(t)} \cdot x_c^{(t)} + b_o^{(t)}$  (9)
Then, by element-wise multiplication $[*]$ of the previously calculated gate states, together with the cell state of the last time step $s^{(t-1)}$, we obtain the current cell state $s^{(t)}$, as in Equation (10).
$s^{(t)} = g^{(t)} * i^{(t)} + s^{(t-1)} * f^{(t)}$  (10)
Furthermore, we obtain the hidden state of the LSTM layer $h_{l1}^{(t)}$ in Equation (11) by applying the hyperbolic tangent activation to $s^{(t)}$ and multiplying it element-wise with the output gate $o^{(t)}$.
$h_{l1}^{(t)} = \tanh(s^{(t)}) * o^{(t)}$  (11)
Similarly, we derive the hidden state $h_{l2}^{(t)}$ and the output $\hat{y}^{(t)}$ of the fully connected layer in Equations (12) and (13).
$h_{l2}^{(t)} = W_{l2}^{(t)} \cdot h_{l1}^{(t)} + b_{l2}^{(t)}$  (12)
$\hat{y}^{(t)} = \mathrm{softmax}(h_{l2}^{(t)})$  (13)
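For concreteness, the following C sketch (our own illustration; matvec, sigmoid_v, tanh_v and softmax_v are hypothetical helper routines) implements Equations (5)–(13) for a single time step and a single sample, with input size I, hidden size H and C output classes.

#include <math.h>
#include <string.h>

/* Hypothetical helpers: y = W*x + b (W is rows x cols, row-major), plus
 * in-place element-wise sigmoid/tanh and a softmax over a vector.      */
extern void matvec(float *y, const float *W, const float *x, const float *b,
                   int rows, int cols);
extern void sigmoid_v(float *v, int n);
extern void tanh_v(float *v, int n);
extern void softmax_v(float *out, const float *in, int n);

/* One forward time step, Eqs. (5)-(13); h and s are updated in place. */
void lstm_forward_step(const float *x, float *h, float *s,
                       const float *Wi, const float *bi,
                       const float *Wf, const float *bf,
                       const float *Wg, const float *bg,
                       const float *Wo, const float *bo,
                       const float *Wl2, const float *bl2,
                       float *y_hat, int I, int H, int C)
{
    float xc[I + H], ig[H], fg[H], gg[H], og[H], hl2[C];

    /* Eq. (5): concatenate the input with the previous hidden state */
    memcpy(xc, x, (size_t)I * sizeof(float));
    memcpy(xc + I, h, (size_t)H * sizeof(float));

    /* Eqs. (6)-(9): gate pre-activations and activations */
    matvec(ig, Wi, xc, bi, H, I + H);  sigmoid_v(ig, H);
    matvec(fg, Wf, xc, bf, H, I + H);  sigmoid_v(fg, H);
    matvec(gg, Wg, xc, bg, H, I + H);  tanh_v(gg, H);
    matvec(og, Wo, xc, bo, H, I + H);  sigmoid_v(og, H);

    /* Eqs. (10)-(11): element-wise cell and hidden state updates */
    for (int k = 0; k < H; k++) {
        s[k] = gg[k] * ig[k] + s[k] * fg[k];
        h[k] = tanhf(s[k]) * og[k];
    }

    /* Eqs. (12)-(13): fully connected classifier and softmax */
    matvec(hl2, Wl2, h, bl2, C, H);
    softmax_v(y_hat, hl2, C);
}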
Backward Pass (BW). This process uses the difference between the network states generated in the forward pass and the label, or expected output, to calculate the modification to be applied to the network parameters, i.e., the gradients, with the aim of improving the performance of the network on the training set. Gradient calculation becomes twofold in FPTT-based training: as in Equation (14), $L^{(t)}$ consists of the loss $l^{(t)}$ and the regularization term $R^{(t)}$, and their derivatives are computed separately.
$\frac{\partial L^{(t)}}{\partial \theta^{(t)}} = \frac{\partial (l^{(t)} + R^{(t)})}{\partial \theta^{(t)}} = \frac{\partial l^{(t)}}{\partial \theta^{(t)}} + \frac{\partial R^{(t)}}{\partial \theta^{(t)}}$  (14)
The regularization term $R^{(t)}$ is a function local to the network parameter and independent of the network topology, as shown in Equation (2); therefore, $\frac{\partial R^{(t)}}{\partial \theta^{(t)}}$ can be determined directly by differentiation.
The running average $\bar{\theta}^{(t)}$ and the running estimate $\lambda^{(t)}$ are considered constants with respect to $\theta^{(t)}$ in the alternative optimization method [5]; then, in Equation (15), we have $\frac{\partial R^{(t)}}{\partial \theta^{(t)}}$.
$\frac{\partial R^{(t)}}{\partial \theta^{(t)}} = \alpha (\theta^{(t)} - \bar{\theta}^{(t)}) - \frac{1}{2} \lambda^{(t)}$  (15)
$l^{(t)}$ is network-dependent, and $\frac{\partial l^{(t)}}{\partial \theta^{(t)}}$ can be derived by Back Propagation Through Time (BPTT), as listed in Algorithm 2.
Algorithm 2 Derivation of $\frac{\partial l^{(t)}}{\partial \theta^{(t)}}$ by Back Propagation Through Time (BPTT)
  • Input: the time step from which to propagate backward: $t$
  • Input: expected network output at time step $t$: $y^{(t)}$
  • Shorthand: $dNS = \frac{\partial l^{(t)}}{\partial NS^{(t)}}$ ($NS = \{i\_input, f\_input, g\_input, o\_input, i, f, g, o, s, h_{l1}, h_{l2}\}$)
  • Shorthand: $dW_{*} = \frac{\partial l^{(t)}}{\partial W_{*}^{(t)}}$; $db_{*} = \frac{\partial l^{(t)}}{\partial b_{*}^{(t)}}$ ($* = \{i, f, g, o, l2\}$)
  • $dh_{l2} = \hat{y}^{(t)} - y^{(t)}$
  • $dW_{l2} = dh_{l2} \cdot h_{l1}^{(t)}$
  • $db_{l2} = \mathrm{squeeze}(dh_{l2})$
  • $dh_{l1} = dh_{l2} \cdot W_{l2}^{(t)}$
  • $ds = dh_{l1} * o^{(t)} * \tanh'(s^{(t)})$
  • $do = dh_{l1} * \tanh(s^{(t)})$
  • $do\_input = do * \mathrm{sigmoid}'(o\_input^{(t)})$
  • $dW_{o} = do\_input \cdot x_c^{(t)}$
  • $db_{o} = \mathrm{squeeze}(do\_input)$
  • $dg = ds * i^{(t)}$
  • $di = ds * g^{(t)}$
  • $df = ds * s^{(t-1)}$
  • $dg\_input = dg * \tanh'(g\_input^{(t)})$
  • $di\_input = di * \mathrm{sigmoid}'(i\_input^{(t)})$
  • $df\_input = df * \mathrm{sigmoid}'(f\_input^{(t)})$
  • $dW_{*} = d*\_input \cdot x_c^{(t)}$ ($* = i, f, g$)
  • $db_{*} = \mathrm{squeeze}(d*\_input)$ ($* = i, f, g$)
  • Return: $\frac{\partial l^{(t)}}{\partial \theta^{(t)}} = \{dW_{*}, db_{*}\}$ ($* = i, f, g, o, l2$)
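The element-wise portion of Algorithm 2 can be sketched in C as below (our own illustration, not the authors' code); the activation derivatives are expressed through the gate values saved from the forward pass, e.g., $\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$. The weight and bias gradients then follow from outer products of these terms with $x_c^{(t)}$.

#include <math.h>

/* Element-wise part of Algorithm 2 for one time step.  The gate values
 * i_g, f_g, g_g, o_g and the cell states s, s_prev are those saved in
 * the forward pass; dh_l1 is the gradient arriving from the classifier. */
void lstm_backward_elementwise(const float *dh_l1,
                               const float *i_g, const float *f_g,
                               const float *g_g, const float *o_g,
                               const float *s, const float *s_prev,
                               float *di_in, float *df_in,
                               float *dg_in, float *do_in, int H)
{
    for (int k = 0; k < H; k++) {
        float ts  = tanhf(s[k]);
        float ds  = dh_l1[k] * o_g[k] * (1.0f - ts * ts);   /* tanh'(s)    */
        float d_o = dh_l1[k] * ts;
        float d_g = ds * i_g[k];
        float d_i = ds * g_g[k];
        float d_f = ds * s_prev[k];
        do_in[k] = d_o * o_g[k] * (1.0f - o_g[k]);          /* sigmoid'    */
        dg_in[k] = d_g * (1.0f - g_g[k] * g_g[k]);          /* tanh'       */
        di_in[k] = d_i * i_g[k] * (1.0f - i_g[k]);          /* sigmoid'    */
        df_in[k] = d_f * f_g[k] * (1.0f - f_g[k]);          /* sigmoid'    */
    }
}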
Parameter Update (PU). With the derived gradients $\frac{\partial l^{(t)}}{\partial \theta^{(t)}}$ and $\frac{\partial R^{(t)}}{\partial \theta^{(t)}}$, we can update the network parameter $\theta$, the running estimate $\lambda$ and the running average $\bar{\theta}$ in three sequential steps using an optimizer, e.g., gradient descent, as shown in Equations (16)–(18).
$\theta^{(t+1)} = \theta^{(t)} - \eta \left( \frac{\partial l^{(t)}}{\partial \theta^{(t)}} + \frac{\partial R^{(t)}}{\partial \theta^{(t)}} \right)$  (16)
$\lambda^{(t+1)} = \lambda^{(t)} - \alpha (\theta^{(t+1)} - \bar{\theta}^{(t)})$  (17)
$\bar{\theta}^{(t+1)} = \frac{1}{2} (\bar{\theta}^{(t)} + \theta^{(t+1)}) - \frac{1}{2\alpha} \lambda^{(t+1)}$  (18)
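A minimal C sketch of Equations (16)–(18) with plain gradient descent, operating on flattened parameter arrays (hypothetical names, our own illustration):

#include <stddef.h>

/* Eqs. (16)-(18): update the parameter, the running estimate lambda and
 * the running average theta_bar, in this order.                         */
void fptt_parameter_update(float *theta, float *theta_bar, float *lambda,
                           const float *grad_l, const float *grad_r,
                           float eta, float alpha, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        theta[k]    -= eta * (grad_l[k] + grad_r[k]);           /* Eq. (16) */
        lambda[k]   -= alpha * (theta[k] - theta_bar[k]);       /* Eq. (17) */
        theta_bar[k] = 0.5f * (theta_bar[k] + theta[k])
                     - lambda[k] / (2.0f * alpha);              /* Eq. (18) */
    }
}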
In a C-based software implementation of the forward, backward and update steps on an Intel CPU, matrix multiplication accounts for approximately 90% of the total execution time. Therefore, reducing latency depends heavily on accelerating the matrix multiplications.

4.2. System Architecture

Using the Chipyard framework, we propose a heterogeneous embedded system of multiple efficient RISC-V cores (Rocket [20]) and a matrix multiplication accelerator (Gemmini [21]), as in Figure 3. Other functionalities of the system are organized by memory bus, periphery bus and control bus.
The RISC-V tile, integrated into the system by Tile Link, contains one core and private L1 caches. The number of tiles in the system determines the speedup of the parallelizable applications.
The Gemmini Accelerator, coupled to the RISC-V core by custom instructions, is essentially based on a systolic array of processing tiles with registers in between, as shown in Figure 4. Each tile is also an array of Processing Elements (PEs), which can be configured to two dataflow modes, i.e., Weight Stationary (WS) and Output Stationary (OS), to execute the multiply–accumulate (MAC) between matrices.

4.3. Optimizations and Explorations

We propose optimizations in Gemmini that compress its area, or resource utilization on the FPGA, targeting fast prototyping while maintaining the performance of the training process, as given in Table 3.
BF16 is globally applied in arithmetic and data scaling, as Kalamkar et al. [22] reported that deep learning training using BF16 tensors achieves the same state-of-the-art results across domains as FP32 tensors in the same number of iterations and without changes to hyperparameters. Arithmetic units reside in the PEs for input, output and accumulation, while data scaling is applied when moving data to/from the scratchpad and accumulator. As a result, the memory space of both the scratchpad and the accumulator can be halved. In addition, only the WS dataflow is instantiated, instead of both WS and OS, to save space, since WS is reported to offer a speedup of 3× relative to OS [23].
In addition, to investigate the trade-off between the system budget and application speedup, we explore the size of the systolic array and the number of RISC-V cores, as in Table 4. tileRows and tileColumns define the number of PEs inside a tile of combinational logic; they are usually set to 1, leading to only one PE per tile to avoid a long combinational path.
meshRows and meshColumns determine the number of tiles in the systolic array, or mesh. They must be equal and a power of 2, starting from 4, due to limited software support. Therefore, 4 × 4 and 8 × 8 meshes, which fit on most FPGA boards, are explored in the design space together with a limited number of RISC-V cores.

4.4. Programming

We carefully choose the programming method best suited to the system architecture and manage the efficient mapping of the proposed computation onto the hardware structure.
Gemmini
  • Gemmini is compatible with high-level, mid-level and low-level programming for users needing different degrees of control.
High-level programming uses ONNX (Open Neural Network Exchange (https://onnx.ai/), accessed on 7 March 2025), which is an open format built to represent machine learning models. Most deep learning frameworks, like PyTorch and TensorFlow, can export models in ONNX format. To be specific, the Chipyard team forked ONNX Runtime, the mapper that distributes the ONNX workload on specific hardware, to enable ONNX models to run on Gemmini. Mid-level programming uses a hand-tuned C library of kernels. The library covers most functions in the deep learning domain, such as matrix multiplication and max-pooling. It is also critical to ensure that the C header file describing the architecture configuration matches the real hardware structure. Low-level programming means programming in assembly instructions, which is how the mid-level kernels are designed and performance is tuned, rather than an efficient method of application development.
We opt for mid-level programming because the ONNX Runtime for high-level programming does not support Brain Float 16. To utilize Gemmini for acceleration by C programming, it is important to understand the interface of the tiled_matmul_auto function in the Gemmini C library. The function wraps custom RISC-V instructions extended to employ the Gemmini architecture, and it automatically tiles the source matrices, performs the calculation and then combines the results. Table 5 lists the arguments.
tiled_matmul_auto implements the matrix operation defined in Equation (19). Matrix A is multiplied by B. The result of the multiplication is added by D, and then a non-linear activation function σ is applied to the sum. Moreover, the final result is assigned to matrix C. In the context of neural networks, A and B are input activation and weight, D is the bias, σ is the activation function and C is the output.
$C = \mathrm{activation}(A \cdot B + D)$  (19)
dim_I and dim_J are the row dimension and the column dimension of the matrix C. dim_K is the remaining dimension shared by A and B. Arguments A, B, C and D define the source address of the array that stores the corresponding matrix. The four following arguments, i.e., stride_* (* = A, B, D, C), represent the column dimension of the corresponding matrix. The four scaling factors make it possible to apply data scaling to the target matrices; for example, to average the matrix C by the batch size, we can set the argument scale to $\frac{1}{batch\_size}$. With regard to the activation function, values 0 to 4 stand for no activation, ReLU, layer normalization, IGELU and softmax, respectively. On top of that, we can choose whether to transpose the matrices A and B, and whether to use full or low precision for matrices C and D. If the size of D is identical to a single row of C, repeating_bias should be set to repeat the bias on all rows of C. Lastly, two dataflows are available, i.e., Weight Stationary (WS) and Output Stationary (OS).
Using the Gemmini C library, we are able to map all types of matrix multiplications, with or without transposition, on the Gemmini architecture efficiently.
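As an example, the classifier layer of Equation (12), with activations stored row-major as a [batch × in_dim] matrix (i.e., the transposed arrangement of Equation (12)), could be mapped onto Gemmini roughly as follows. This is a hedged sketch: the argument list only mirrors the description in Table 5, and the exact order, types and available constants must be checked against the gemmini.h header of the Gemmini release in use.

#include <stdbool.h>
#include <stddef.h>
#include "include/gemmini.h"   /* assumed to provide elem_t, acc_t, WS and tiled_matmul_auto */

/* Sketch: h_l2 = h_l1 * W_l2 + b_l2 on Gemmini, with h_l1 as a
 * [batch x in_dim] activation matrix and W_l2 as [in_dim x out_dim]. */
void fc_layer_on_gemmini(const elem_t *h_l1, const elem_t *W_l2,
                         const acc_t *b_l2, elem_t *h_l2,
                         size_t batch, size_t in_dim, size_t out_dim)
{
    tiled_matmul_auto(
        batch, out_dim, in_dim,            /* dim_I, dim_J, dim_K                     */
        h_l1, W_l2, b_l2, h_l2,            /* A = activations, B = weights, D = bias, C = output */
        in_dim, out_dim, out_dim, out_dim, /* stride_A, stride_B, stride_D, stride_C  */
        1.0f, 1.0f, 1.0f, 1.0f,            /* scaling factors (identity)              */
        0,                                  /* activation: 0 = none                    */
        false, false,                       /* transpose_A, transpose_B                */
        false, false,                       /* full_C, low_D                           */
        true,                               /* repeating_bias: broadcast b_l2 per row  */
        WS);                                /* Weight-Stationary dataflow              */
}

For the transposed multiplications that appear in the backward pass, the transpose_A and transpose_B flags listed in Table 5 are used instead of reshaping the matrices in software.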
Multiple Rocket
  • Careful distribution of the parallelizable workload over multiple cores leads to efficient acceleration. In the context of the RISC-V ISA, a Hart (Hardware Thread) represents an independent processing unit, or core. Each core or Hart in the system is assigned a unique HartID, and we can allocate different software threads to the desired Hart by matching the HartID. On top of that, a barrier is used for thread synchronization.
In such a way, we can effectively distribute element-wise operations between matrices or vectors, e.g., Hadamard Product, over RISC-V cores in the system to obtain significant acceleration.
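This pattern can be sketched as follows (illustrative only; barrier() and NUM_HARTS stand in for whatever synchronization primitive and core count the bare-metal runtime provides): each hart computes a contiguous slice of a Hadamard product, and all harts synchronize before the result is consumed.

#define NUM_HARTS 8            /* assumed core count for this sketch        */

extern void barrier(void);     /* placeholder for the runtime's primitive   */

/* Each hart computes a contiguous slice of c = a (*) b (element-wise). */
void hadamard_parallel(float *c, const float *a, const float *b,
                       int n, int hart_id)
{
    int chunk = (n + NUM_HARTS - 1) / NUM_HARTS;     /* ceil(n / harts)     */
    int begin = hart_id * chunk;
    int end   = begin + chunk < n ? begin + chunk : n;

    for (int k = begin; k < end; k++)
        c[k] = a[k] * b[k];

    barrier();                 /* all harts meet before the result is used  */
}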

4.5. HW/SW Co-Design Flow

The HW/SW co-design flow describes, at a high level, how we coordinate hardware customization and software adaptation to arrive at an RNN-efficient and scalable acceleration solution through a three-step methodology.
  • As a starting point, we use the default combination of Rocket and Gemmini for the hardware and perform the Instruction Set Architecture (ISA) migration of the software from x86 to RISC-V. Adaptations cover RISC-V/Gemmini-specific C libraries and binary toolsets. In this step, we execute the code on the ISA simulator, i.e., Spike, to efficiently debug the functionality. Then, we obtain the performance profile by FPGA-accelerated simulation, i.e., FireSim [24]. Note that FireSim is a framework for accelerated simulation only rather than conventional FPGA prototyping, as it uses software peripheral models to ensure deterministic execution.
  • Next, we perform architecture compression on Gemmini, including switching from FP32 to BF16 calculations. To that end, the key software modification is BF16 emulation on the Rocket core, which does not natively support BF16 (a minimal emulation sketch is given after this list). Furthermore, since Spike does not support BF16, we have to guarantee the correctness of the BF16 code using the slow RTL simulator, checking the individual functions one by one. Then, similarly, we use FireSim to test and profile the entire workload.
  • Last, using the compressed Gemmini architecture, we further scale up the full system by utilizing more than one RISC-V core to inspect the efficiency of the multicore system, as well as a larger systolic array in Gemmini. To that end, we add the mechanism of parallel execution and synchronization to the C code. FireSim is also applied to this step.
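A minimal sketch of such an emulation (our own illustration, not the actual modification) treats a BF16 value as the upper 16 bits of an IEEE-754 single-precision word, widening to FP32 for arithmetic and narrowing the result with round-to-nearest-even.

#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;       /* BF16 stored as the top 16 bits of an FP32 word */

static bf16_t fp32_to_bf16(float f)           /* round-to-nearest-even (NaN handling omitted) */
{
    uint32_t u; memcpy(&u, &f, sizeof u);
    uint32_t rounding = 0x7FFFu + ((u >> 16) & 1u);
    return (bf16_t)((u + rounding) >> 16);
}

static float bf16_to_fp32(bf16_t b)           /* widen by zero-padding the mantissa */
{
    uint32_t u = (uint32_t)b << 16;
    float f; memcpy(&f, &u, sizeof f);
    return f;
}

/* BF16 arithmetic is then done by widening, computing in FP32 and
 * narrowing the result, e.g. a multiply-accumulate:                   */
static bf16_t bf16_madd(bf16_t a, bf16_t b, bf16_t acc)
{
    return fp32_to_bf16(bf16_to_fp32(a) * bf16_to_fp32(b) + bf16_to_fp32(acc));
}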
In general, FireSim serves as the core of our method, as it reports the resource usage of each design point as the hardware cost, together with a cycle-accurate performance profile.

5. Experimental Set-Up

Validation. To validate the trainability of FPTT for fine-tuning, we first use one real-life application, the expanded Google Speech Commands (GSCv2) dataset [25]. It provides 35 keyword classes plus an ‘unknown’ category, covering 36,923 training and 11,005 test instances. Each audio sample is subjected to a spectral decomposition using Mel-frequency cepstral coefficient (MFCC) analysis. This process utilizes 40 second-order bandpass filters arranged logarithmically along the Mel scale (20 Hz–4 kHz), followed by standard deviation–based normalization. After processing, each audio sample is represented as a sequence of 100 time steps. To evaluate speaker adaptation, we isolate the samples of the target speaker who appears most frequently in the main dataset. That speaker is identified by the tag ‘c50f55b8’, with 94 recordings in the train set and 222 recordings in the test set. We first pre-train our model with samples from the main dataset, i.e., from all speakers but excluding the target speaker. Then, we fine-tune the pre-trained model using the 94 samples and test it using the 222 samples from the target speaker.
Benchmarks. For performance profiling of the architecture running on the FPGA, we use a more lightweight sequential MNIST dataset (S-MNIST) [26,27] for efficient system trade-off and FPTT algorithm analysis. The reason is that the clock frequency of large designs on the FPGA is up to tens of megahertz, which leads to the slow execution of complex applications. In S-MNIST, each sample has a sequence length T of 784 time steps. In terms of the network, we use a two-layer structure as mentioned in Section 4.1, which includes an LSTM with 128 cells and a fully connected classifier with 10 cells. To mimic a basic edge learning scenario, the model is initially pre-trained on the S-MNIST training set on the cluster and then fine-tuned using 4 samples from the test set on our proposed architecture. The partition factor K is set to {1, 2, 7, 14, 28, 56}, in which large K values are avoided for trainability [5].
Baselines. To check the efficiency of the proposed edge system, we also run the benchmarks on server-level machines, i.e., NVIDIA L4 and V100 GPUs, which are freely available on Google Colab.
Simulation. The FPGA-accelerated simulation framework FireSim targets an AMD/Xilinx UltraScale+ VCU118 to accommodate the system-scaling exploration. The frequency set for synthesis and implementation is 10 MHz, because the loose timing requirement enables fitting larger design points. Nevertheless, we assume that the frequency of a future silicon implementation would be around 500 MHz, in line with the commercial Edge AI products in Table 2.

6. Results

We present and analyze the results from five aspects. First, we demonstrate the feasibility of the FPTT algorithm for small-batch fine-tuning. Then, the upsides and downsides of compressing Gemmini are investigated in terms of resource utilization and benchmark application performance. We further compare the performance of our dedicated architecture with data center-level GPUs. Moreover, how the performance scales with resources is studied, and last, we look into the benefits and drawbacks introduced by partitioning on the benchmark.

6.1. Accuracy on GSCv2 Dataset

In Figure 5, we evaluate the FPTT in speaker personalization using the GSCv2 dataset. This example illustrates the effectiveness of the FPGA system in adapting to personalized speech with limited personal data while preserving network accuracy.
Training has two phases. The pre-training phase (epochs 0–9) utilized audio samples from all speakers except the target speaker, mimicking pre-training in the cloud without access to the final user’s voice. During pre-training, the model reached 95% training accuracy and 85% test accuracy on average over all test speakers (i.e., the main test). While achieving 90% accuracy on the evaluation of the target speaker (at epoch 7), the performance of the model showed considerable variability, probably due to the absence of the target speaker’s data in the training set. In the subsequent fine-tuning phase (epochs 9–12), the model was refined using 94 samples from the target speaker, simulating deployment on the FPGA hardware platform and interaction with this user. In this phase, the error rate was significantly reduced to below 1%, and the network demonstrated rapid convergence in both accuracy and loss during whole-network fine-tuning while maintaining accurate and robust speech-recognition performance, suggesting the preservation of pre-trained features concurrently with adaptation to speaker-specific acoustic patterns.

6.2. Performance Estimation on FPGA for the GSCv2 Application

To assess the computational efficiency of our system, we measured the total number of floating-point operations (BF16) required per training sample. Table 6 summarizes the breakdown of these operations for the forward pass, backward pass and parameter update stages. Note that these counts are for the GSCv2 dataset with partition factor K = 100.
Given an assumed operating frequency of 500 MHz (Section 5) and an execution time of approximately 2.14 s per batch, the real performance of this application is:
$\text{Real GFLOPS}_{\mathrm{GSCv2}} = \frac{554\ \mathrm{MFLOPs\ (BF16)}}{2.14\ \mathrm{s}} \approx 259\ \mathrm{MFLOPS\ (BF16)}$
To evaluate the theoretical peak performance of our architecture, we consider the compute throughput of the Gemmini systolic array and the RISC-V cores. The systolic array in our design is an 8 × 8 matrix multiplication unit (MMU) operating at 500 MHz. Each Processing Element (PE) performs a multiply–accumulate (MAC) operation in BF16 per two clock cycles. Thus, the total peak performance of the Gemmini accelerator is:
$\text{Peak GFLOPS}_{\mathrm{Gemmini}} = 8 \times 8 \times 0.5\ \mathrm{MAC/cycle} \times 2\ \mathrm{FLOPs/MAC} \times 500 \times 10^6 = 32\ \mathrm{GFLOPS}$
In addition to the systolic array, our system integrates 8 RISC-V cores, each capable of performing a floating-point (BF16) operation every two clock cycles. Given the same 500 MHz operating frequency, the theoretical peak performance of the RISC-V subsystem is:
$\text{Peak GFLOPS}_{\mathrm{RISC\text{-}V}} = 8 \times 0.5 \times 500 \times 10^6 = 2\ \mathrm{GFLOPS}$
Thus, the total theoretical peak performance of our architecture, combining both the Gemmini MMU and the RISC-V cores, is:
$\text{Total Peak GFLOPS} = 32 + 2 = 34\ \mathrm{GFLOPS}$
However, a real application can never reach peak performance due to its complex dataflow; our implementation achieves approximately 259 MFLOPS (BF16), which is only a fraction of the theoretical peak. This highlights that the application requires a large amount of data movement and a very complex dataflow.

6.3. Effect of Compression

Figure 6 compares the default and compressed Gemmini, i.e., with globally applied BF16, halved memory and removal of the OS dataflow. It reveals that the compressed Gemmini leads to a 50% reduction in the required number of LUTs, BRAMs and DSPs, while the number of FFs also decreases by roughly a quarter. Furthermore, the degraded precision has a minor impact on the loss function at the end of the training routine for all benchmarks of different K.
Figure 7 compares the trend of the loss function for the two precisions on the benchmark with the largest K, i.e., the most frequent network parameter updates when learning a sequence, which is expected to be the most precision-sensitive case. BF16 and FP32 are shown to result in a similar cross-entropy loss at each step, so BF16 provides sufficient trainability, in line with the statement by Kalamkar et al. [22].
Furthermore, Figure 8 shows the profiling of the cumulative latency of all three types of matrix multiplication in the training process, where types 1, 2 and 3 are shorthand for the different transposition options, i.e., none, on the first matrix and on the second matrix, respectively. We find that the application of BF16 enables a slight latency decrease.
Therefore, replacing FP32 with BF16 in the training process results in significant savings in resource utilization and a minor latency decrease, at a marginal cost in accuracy.

6.4. Comparison to the Cloud

We run the training process benchmark of six different K values on Chipyard-based platforms and two GPUs. Thus, the visualized latency profiling comprises six groups of ten bars, combined with the memory footprint of each benchmark, as depicted in Figure 9.
For any benchmark of K in Figure 9, the latency on the two data center-level GPU platforms does not outperform the proposed architecture with a specification greater than or equal to a 4 × 4 Gemmini and two CPUs (4 × 4-2), let alone the transmission time between the edge and the cloud. This shows the advantage of the dedicated embedded system over the GPU in the efficiency of edge learning, i.e., small-scale fine-tuning. The likely reasons involve higher overheads in communication and synchronization between the CPU and off-chip GPUs on the server for small matrices, which also make it hard to exploit most of the GPU’s parallelism. Further investigation into GPU performance analysis and latency breakdown will be carried out in future work.

6.5. Trade-Off in Design Points

In Figure 9, each benchmark of K is also executed on eight design points of the proposed architecture. We observe an increase in speed for each set with a larger systolic array or more CPU cores in the design, but the rate of increase is less than linear because of the limited amount of parallelizable computation in the benchmark.
Moreover, assuming the same number of RISC-V cores in the system, an 8 × 8 Gemmini only slightly reduces the total latency compared to a 4 × 4 mesh, even though the larger size significantly speeds up the matrix multiplications, as in Figure 10. Thus, a 4 × 4 Gemmini already brings the matrix multiplication latency down to a minor proportion of the total benchmark latency, and the parallelizable functions become the bottleneck, with the addition of RISC-V cores being more impactful on latency than a larger Gemmini. For example, 4 × 4-2 shows a larger latency decrease than 8 × 8-1 on top of 4 × 4-1.
With regard to the physical implementation on the FPGA, Figure 11 lists the resource utilization of the eight design points, and the design layouts of three typical sizes on the target FPGA are shown in Figure 12b,c. The trade-off between benchmark speedup and resource increase can help balance system performance and design budgets. We find that scaling from 4 × 4-1 to 8 × 8-8 requires a 1.4–3.8-fold increase in the various resources.
Furthermore, Table 7 lists the utilization of the main modules as a complement to Figure 11, allowing the estimation of further system scaling. It also shows that the 4 × 4 and 8 × 8 Gemmini utilize the same number of BRAMs and DSPs, because these two resources are not used by the scalable mesh of processing elements; instead, they are consumed by the addressing logic.
Table 8 lists the rough power estimates of all design points obtained from Vivado. More precise power estimation would rely on the physical design in a silicon flow at a specific technology node.

6.6. Degree of Partition

For the latency of the benchmark of different K values on the same architecture, we notice that increasing K, i.e., finer partition over a sequence, leads to a slight latency increase but significant memory saving, as shown by the red line in Figure 9.
Taking 4 × 4-2 as an example, compared to K-001, i.e., training in one go without partitioning, K-014 brings a factor of $\frac{9.85}{1.2} \approx 8.2\times$ reduction in memory at the cost of $\frac{2.05 - 1.71}{1.71} \approx 20\%$ more latency.
Different degrees of partitioning benefit from the multicore architecture to different extents, as we further profile the speedup of the benchmark for nine K values when increasing the number of RISC-V cores in Figure 13. K also represents the number of parameter updates in the training of a sequence, which are local and parallelizable computations. Thus, a benchmark with a larger K has a higher proportion of parallelizable computation and tends to speed up more when CPU cores are added to the system.
However, when K is relatively small, i.e., less than or equal to 56, the latency of parallelizable computations is not dominating but only takes up a fraction of the total latency. In such cases, the speedup rates of the benchmark are rather close and not necessarily proportional to the K value.

7. Conclusions and Future Work

Our software–hardware design methodology enabled the design of a digital architecture in FPGA that applies the FPTT algorithm for sequential learning using an open-source Chipyard-based platform with dedicated optimizations. We target embedded and edge systems with a focus on efficient computation, leveraging Gemmini compression for matrix multiplications and multiple RISC-V cores to balance speedup and resource utilization. The HW/SW codesign flow has been tested thanks to the FPGA-accelerated FireSim framework that enables comprehensive system profiling. Overall, our methodology proves to be a memory-efficient, hardware-optimized and scalable solution for training recurrent networks with minimal latency trade-offs, particularly suitable for personalization tasks in resource-constrained environments. Moreover, the proposed hardware/software co-design is inherently scalable, as it allows adjustments in the size of the systolic array and the number of RISC-V cores to fit smaller FPGA platforms. Using model compression techniques such as pruning, quantization, or dynamic reconfiguration, the same framework can be adapted to low power edge devices while maintaining efficient training capabilities on the device. Finally, while our implementation targets FPGA-based acceleration, the same hardware/software co-design principles can be applied to develop an ASIC implementation, which could further optimize power and area efficiency.
The experimental findings demonstrate three specific advantages of our architectures: (i) memory efficiency, (ii) trade-offs between latency and memory and (iii) adaptability in the real world. In terms of (i) memory efficiency, we lower the memory footprint for the S-MNIST benchmarks by more than eight times. For (ii) the trade-off between latency and memory, the system achieves a notable memory efficiency with just a slight 20% increase in latency, indicating its appropriateness for edge applications where responsiveness is crucial. Lastly, the system’s practical significance is highlighted by its capacity to rapidly converge in just three fine-tuning epochs for (iii) real-world applications by fine-tuning pre-trained models for speaker customization using the GSCv2 dataset.
Future research includes integration of the BF16 computation directly into the RISC-V instruction set for improved arithmetic efficiency, latency and energy. Additionally, exploring the silicon implementation of architecture would enable further optimizations in power consumption and scalability. Extending the framework to support other RNN architectures, such as Gated Recurrent Units (GRUs), and integrating adaptive precision techniques will further enhance its applicability to a broader range of real-world scenarios.

Author Contributions

Conceptualization, F.C. and Y.Z.; methodology, Y.Z., F.C., M.D.G. and H.C.; software, Y.Z.; validation, Y.Z. and B.Y.; formal analysis, writing—original draft preparation, Y.Z. and F.C.; supervision, M.D.G., H.C., C.T. and F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been funded by the Dutch Organization for Scientific Research (NWO) under grant number KICH1.ST04.22.021 for the project Self-Healing Neuromorphic Systems.

Data Availability Statement

Code and data are published at https://github.com/federicohyo/HwRnnFPTT, accessed on 7 March 2025.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
BF16    Brain Float 16
BPTT    Back Propagation Through Time
FPTT    Forward Propagation Through Time
FPGA    Field-Programmable Gate Array
GRU     Gated Recurrent Unit
LSTM    Long Short-Term Memory
MAC     Multiply–Accumulate
MNIST   Modified National Institute of Standards and Technology
OS      Output Stationary
PE      Processing Element
RNN     Recurrent Neural Network
RTL     Register-Transfer Level
WS      Weight Stationary

References

  1. Zhang, Y.; Gomony, M.D.; Corporaal, H.; Corradi, F. A Scalable Hardware Architecture for Efficient Learning of Recurrent Neural Networks at the Edge. In Proceedings of the 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC), Tangier, Morocco, 6–9 October 2024; pp. 1–4.
  2. Lalapura, V.S.; Amudha, J.; Satheesh, H.S. Recurrent neural networks for edge intelligence: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–38.
  3. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; Technical Report (DTIC Document); DTIC: Fort Belvoir, VA, USA, 1985.
  4. Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560.
  5. Kag, A.; Saligrama, V. Training Recurrent Neural Networks via Forward Propagation Through Time. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Breckenridge, CO, USA, 2021; Volume 139, pp. 5189–5200.
  6. Williams, R.J.; Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1989, 1, 270–280.
  7. Menick, J.; Elsen, E.; Evci, U.; Osindero, S.; Simonyan, K.; Graves, A. A practical sparse approximation for real time recurrent learning. arXiv 2020, arXiv:2006.07232.
  8. Gruslys, A.; Munos, R.; Danihelka, I.; Lanctot, M.; Graves, A. Memory-efficient backpropagation through time. Adv. Neural Inf. Process. Syst. 2016, 29, 4132–4140.
  9. Amid, A.; Biancolin, D.; Gonzalez, A.; Grubb, D.; Karandikar, S.; Liew, H.; Magyar, A.; Mao, H.; Ou, A.; Pemberton, N.; et al. Chipyard: Integrated design, simulation, and implementation framework for custom SoCs. IEEE Micro 2020, 40, 10–21.
  10. Bachrach, J.; Vo, H.; Richards, B.; Lee, Y.; Waterman, A.; Avižienis, R.; Wawrzynek, J.; Asanović, K. Chisel: Constructing hardware in a Scala embedded language. In Proceedings of the 49th Annual Design Automation Conference, San Francisco, CA, USA, 3–7 June 2012; pp. 1216–1225.
  11. Cho, H.; Lee, J.; Lee, J. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 1725–1738.
  12. Li, S.; Wu, C.; Li, H.; Li, B.; Wang, Y.; Qiu, Q. FPGA Acceleration of Recurrent Neural Network Based Language Model. In Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, Vancouver, BC, Canada, 2–6 May 2015; pp. 111–118.
  13. Lin, J.; Zhu, L.; Chen, W.M.; Wang, W.C.; Gan, C.; Han, S. On-Device Training Under 256KB Memory. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 22941–22954.
  14. Ren, H.; Anicic, D.; Runkler, T.A. TinyOL: TinyML with online-learning on microcontrollers. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; pp. 1–8.
  15. Ravaglia, L.; Rusci, M.; Nadalini, D.; Capotondi, A.; Conti, F.; Benini, L. A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 789–802.
  16. Kukreja, N.; Shilova, A.; Beaumont, O.; Huckelheim, J.; Ferrier, N.; Hovland, P.; Gorman, G. Training on the Edge: The why and the how. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 899–903.
  17. Yuan, G.; Ma, X.; Niu, W.; Li, Z.; Kong, Z.; Liu, N.; Gong, Y.; Zhan, Z.; He, C.; Jin, Q.; et al. MEST: Accurate and fast memory-economic sparse training framework on the edge. Adv. Neural Inf. Process. Syst. 2021, 34, 20838–20850.
  18. van der Burgt, A. AI in the Wild: Robust Evaluation and Optimized Fine-Tuning of Machine Learning Algorithms Deployed on the Edge. Essay, University of Twente, Enschede, The Netherlands, 2023. Available online: http://essay.utwente.nl/95066/1/Burgt_MA_EEMCS.pdf (accessed on 7 March 2025).
  19. Yin, B.; Corradi, F.; Bohté, S.M. Accurate online training of dynamical spiking neural networks through Forward Propagation Through Time. Nat. Mach. Intell. 2023, 5, 518–527.
  20. Asanovic, K.; Avizienis, R.; Bachrach, J.; Beamer, S.; Biancolin, D.; Celio, C.; Cook, H.; Dabbelt, D.; Hauser, J.; Izraelevitz, A.; et al. The Rocket Chip Generator; Tech. Rep. UCB/EECS-2016; Electrical Engineering and Computer Sciences, University of California at Berkeley: Berkeley, CA, USA, 2016; Volume 4, pp. 2–6.
  21. Genc, H.; Kim, S.; Amid, A.; Haj-Ali, A.; Iyer, V.; Prakash, P.; Zhao, J.; Grubb, D.; Liew, H.; Mao, H.; et al. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; pp. 769–774.
  22. Kalamkar, D.D.; Mudigere, D.; Mellempudi, N.; Das, D.; Banerjee, K.; Avancha, S.; Vooturi, D.T.; Jammalamadaka, N.; Huang, J.; Yuen, H.; et al. A Study of BFLOAT16 for Deep Learning Training. arXiv 2019, arXiv:1905.12322.
  23. Gookyi, D.A.N.; Lee, E.; Kim, K.; Jang, S.J.; Lee, S.S. Deep Learning Accelerators’ Configuration Space Exploration Effect on Performance and Resource Utilization: A Gemmini Case Study. Sensors 2023, 23, 2380.
  24. Karandikar, S.; Mao, H.; Kim, D.; Biancolin, D.; Amid, A.; Lee, D.; Pemberton, N.; Amaro, E.; Schmidt, C.; Chopra, A.; et al. FireSim: FPGA-accelerated Cycle-exact Scale-out System Simulation in the Public Cloud. In Proceedings of the 45th Annual International Symposium on Computer Architecture, Los Angeles, CA, USA, 2–6 June 2018; pp. 29–42.
  25. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv 2019, arXiv:1804.03209.
  26. Le, Q.V.; Jaitly, N.; Hinton, G.E. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv 2015, arXiv:1504.00941.
  27. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
Figure 1. Unrolling of the computation graph: BPTT vs. FPTT for a recurrent neural network. Input features x, hidden activations h, prediction ŷ, loss function l and target y are arranged from bottom to top, and time steps are arranged horizontally. The black arrows represent the forward pass, in which the current hidden activations influence the next ones in time. Red arrows indicate the backward computational graph. As depicted, BPTT requires storing the backward graph over all time steps of the input sequence (x(t-1), x(t), x(t+1)) to make a single weight update, while FPTT only requires storing the computational graph of the current time step (x(t+1)) and can perform a weight update at each time step.
Figure 2. Comparison of training processes: one-go processing vs. sequence partitioning. In one-go processing, all network states across the sequence must be stored, requiring T memory blocks to hold T states for the backward pass, along with simultaneous computation of forward passes, backward passes and parameter updates. In contrast, sequence partitioning processes subsequences of length stride, requiring only stride + 1 memory blocks to store the states. Once the backward pass is completed for a subsequence, the memory blocks are released for the next subsequence. This approach achieves training with just stride + 1 memory blocks, resulting in approximately K-fold memory savings compared to the one-go method.
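To make the caption's memory argument concrete, a small worked example is given below. Here T = 784 matches the S-MNIST sequence length used in the benchmarks (cf. the K-784 workload in Figure 7), while K = 16 partitions is an assumed value chosen purely for illustration; only the per-time-step state buffers sketched in the figure are counted.

```latex
% Worked example of the state-buffer count in Figure 2 (illustrative numbers).
% T = 784 time steps (S-MNIST); K = 16 partitions is an assumed example.
\begin{aligned}
\text{one-go:}\qquad      & T = 784 \ \text{state buffers},\\
\text{partitioned:}\qquad & \mathrm{stride} + 1 = \tfrac{T}{K} + 1 = \tfrac{784}{16} + 1 = 50 \ \text{state buffers},\\
\text{saving:}\qquad      & \tfrac{784}{50} \approx 15.7 \approx K .
\end{aligned}
```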
Figure 3. System architecture. The system is structured with all subsystems attached to the system bus based on the TileLink interface. Each RISC-V tile contains the CPU core, a Page Table Walker (PTW), private instruction and data caches, and an interface to the system bus. Gemmini, a Rocket Custom Coprocessor (RoCC) for matrix multiplication, is tightly coupled with a RISC-V tile through an Instruction Set Architecture (ISA) extension with RoCC commands and shared memory. The memory subsystem includes the level-2 cache and connects to the extended memory. The periphery subsystem functions as a switch to devices, while the control subsystem handles interrupts, boot, debugging, etc.
Figure 4. Architecture of Gemmini. Gemmini comprises a controller, a scratchpad, a matrix transposer, a systolic array and other peripheral functions. The implementation follows the access–execute decoupling principle, with separate load, store and execute instructions. The dependency-management unit (the Re-Order Buffer, ROB) resolves instruction dependencies and improves instruction-level parallelism. The DMA engine transfers data between main memory and the scratchpad/accumulator, and the local TLB collaborates with the PTW in the Rocket tile to maintain the address mapping. The scratchpad is composed of multiple SRAM banks and buffers the input and output data of the systolic array. The transposer can transpose a matrix before it enters the two-level systolic array, which is a mesh of tiles with registers inserted between tiles; inside each tile is a fully combinational array of Processing Elements (PEs). Each PE supports two dataflows, weight stationary and output stationary, which differ in whether the weight (one source matrix) or the accumulator (the bias) is preloaded into the PE register.
Figure 5. Pre-training and fine-tuning on the GSCv2 dataset. In the pre-training phase, the main train set (train_main) is used to train the network, while the main test set (test_main) and the test set of the target speaker (test_target) are evaluated after every epoch to demonstrate generalization. In the fine-tuning phase, the pre-trained network is further personalized using the train set of the target speaker (train_target), and the accuracy on the test set of the target speaker is collected every epoch to showcase personalization. K = 100 indicates that the network was fine-tuned at every time step. Learning curves are averaged over 5 runs.
Figure 6. Effect of Gemmini compression. Compressing Gemmini results in significant savings in the utilization of four main types of resources, i.e., Look-Up Tables (LUTs), Flip-Flops (FFs), Block Random Access Memories (BRAMs) and Digital Signal Processing units (DSPs). On the other hand, the loss function was slightly degraded due to the application of BF16.
Figure 7. Comparison of the loss function trend for the workload K-784.
Figure 8. Comparison of cumulative matrix multiplication latency by FP32 and BF16 Gemmini.
Figure 9. Latency and memory of FPTT benchmarks of six K values on ten architectures (all MS-C are normalized to 500 MHz).
Figure 10. Comparison of cumulative matrix multiplication latency by 4 × 4 and 8 × 8 Gemmini.
Figure 11. Resource utilization of eight design points.
Figure 12. FPGA implementation views of the empty device, and small (4 × 4-1), medium (4 × 4-4), and large (8 × 8-8) design points. Colors represent different components of the design: orange indicates fixed cells, cyan indicates placed cells, gray represents bundle nets, and purple delimits pblocks.
Figure 13. Speedup of the sum of parallelizable functions. Note that the K-002 curve lies below the K-001 curve.
Table 1. Related work of training on edge devices.
Paper | Network | Task | Hardware
Lin et al. [13] | MobileNetV2-w0.35, ProxylessNAS-w0.3, MCUNet (5 FPS) | ImageNet, Visual Wake Words | STM32F746 (Cortex-M7), 320 KB SRAM, 1 MB Flash
Ren et al. [14] | Autoencoder NN | Modes of vibration of a fan | Arduino Nano 33 BLE Sense (Cortex-M4), 256 KB SRAM, 1 MB Flash
Ravaglia et al. [15] | MobileNet-V1 | Core50 | VEGA (10-core RISC-V processor), 64 MB SRAM
Kukreja et al. [16] | ResNet-18 to ResNet-152 | Image classification | ODROID XU4 board (4 A15 cores, 4 A7 cores, Mali-T628 MP6 GPU), 2 GB LPDDR3 RAM
Yuan et al. [17] | ResNet-32 | CIFAR-100 | Samsung Galaxy S20 smartphone (Qualcomm Adreno 650 mobile GPU)
Table 2. Edge AI products.
Product | Processor(s) | SRAM | Flash/DRAM | Price (2025)
Sony Spresense Main Board (https://developer.sony.com/spresense/product-specifications#secondary-menu-d, accessed on 7 March 2025) | 6-core Arm-Cortex-M4F (156 MHz) | 1.5 MB | 8 MB/- | $65 (https://shop-us.framos.com/Spresense-Main-Board-p112340655, accessed on 7 March 2025)
Arduino Portenta H7 (https://store.arduino.cc/products/portenta-h7, accessed on 7 March 2025) | Arm-Cortex-M7 (480 MHz), Arm-Cortex-M4 (240 MHz), Chrom-ART Graphics Accelerator | 1 MB | 16 MB/8 MB | $99 (https://store.arduino.cc/products/portenta-h7, accessed on 7 March 2025)
Greenwaves GAP8 (https://greenwaves-technologies.com/wp-content/uploads/2021/04/Product-Brief-GAP8-V1_9.pdf, accessed on 7 March 2025) | 8-core 64-bit RISC-V cluster (175 MHz), CNN Accelerator | 580 KB | 64 MB/8 MB | $60 (https://greenwaves-technologies.com/product/gapmod_module/, accessed on 7 March 2025)
Greenwaves GAP9 (https://greenwaves-technologies.com/wp-content/uploads/2022/06/Product-Brief-GAP9-Sensors-General-V1_14.pdf, accessed on 7 March 2025) | 9-core 64-bit RISC-V cluster (370 MHz), Cooperative AI Accelerator | 1.6 MB | -/- | $90 (https://greenwaves-technologies.com/gap9-store/, accessed on 7 March 2025)
SiPEED MAix Go (https://wiki.sipeed.com/soft/maixpy/en/develop_kit_board/maix_go.html, accessed on 7 March 2025) | 2-core 64-bit RISC-V (400 MHz), CNN Accelerator | 8 MB | 16 MB/- | $40 (https://www.waveshare.com/maix-go-aiot-developer-kit.htm, accessed on 7 March 2025)
NXP i.MX 8ULP SOM (https://www.ezurio.com/system-on-module/nxp-imx8/nitrogen8ulp-som, accessed on 7 March 2025) | 2-core Arm-Cortex-A35 (800 MHz), Arm-Cortex-M33 (216 MHz), Tensilica HiFi 4 DSP (475 MHz), Fusion DSP (200 MHz) | 896 KB | -/- | $114 (https://nl.mouser.com/ProductDetail/Ezurio/N8ULP_SOM_2r16e_i?qs=mELouGlnn3csyA7i8SLrfg%3D%3D&utm_source=octopart&utm_medium=aggregator&utm_campaign=239-8ULPSOM2R16EI&utm_content=Ezurio, accessed on 7 March 2025)
Table 3. Gemmini architecture compression.
Division | Parameter | Default | Compressed
Arithmetic | inputType | FP32 | BF16
Arithmetic | spatialArrayOutputType | |
Arithmetic | accType | |
Data Scaling | mvin_scale_args.mul_t | |
Data Scaling | mvin_scale_acc_args.mul_t | |
Data Scaling | acc_scale_args.mul_t | |
Systolic Array | dataflow | WS, OS | WS
Scratchpad | sp_capacity | 256 KB | 128 KB
Accumulator | acc_capacity | 64 KB | 32 KB
Table 4. System scale.
Division | Parameter | Value Set
Gemmini | tileRows, tileColumns | 1
Gemmini | meshRows, meshColumns | 4, 8
RISC-V core | number | 1, 2, 4, 8
MS-C (MeshSize-Cores) distinguishes design points.
Table 5. Argument list of tiled_matmul_auto. The symbol ‘*’ refers to pointers.
Division | Name | Data Type
Dimensions | dim_I | size_t
Dimensions | dim_J | size_t
Dimensions | dim_K | size_t
Matrix Address | A | const elem_t *
Matrix Address | B | const elem_t *
Matrix Address | D | const void *
Matrix Address | C | void *
Matrix Stride | stride_A | size_t
Matrix Stride | stride_B | size_t
Matrix Stride | stride_D | size_t
Matrix Stride | stride_C | size_t
Scaling Factor | A_scale_factor | scale_t
Scaling Factor | B_scale_factor | scale_t
Scaling Factor | D_scale_factor | scale_acc_t
Scaling Factor | scale | acc_scale_t
Activation | act | int
Transpose | transpose_A | bool
Transpose | transpose_B | bool
Precision | full_C | bool
Precision | low_D | bool
Bias option | repeating_bias | bool
Dataflow | tiled_matmul_type | enum tiled_matmul_type_t
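As a usage illustration, the sketch below assembles these arguments into a single call. It is a minimal, hypothetical example: the matrix dimensions, buffer names and include path are assumptions, and the positional order simply follows the grouping in Table 5. Because the exact prototype of tiled_matmul_auto varies between Gemmini releases, the call should be checked against the gemmini.h actually in use.

```c
/* Minimal sketch: one LSTM-gate matrix multiplication offloaded to Gemmini.
 * Argument order follows the grouping in Table 5; verify against the
 * gemmini.h prototype of the Gemmini release in use before compiling.
 * All dimensions and buffers are illustrative assumptions. */
#include "include/gemmini.h"          /* path assumed; depends on the build setup */

#define DIM_I 4      /* batch size: rows of A and C              */
#define DIM_J 128    /* hidden units: columns of B and C         */
#define DIM_K 128    /* input features: columns of A, rows of B  */

static elem_t A[DIM_I][DIM_K];        /* input activations             */
static elem_t B[DIM_K][DIM_J];        /* weight matrix                 */
static acc_t  D[1][DIM_J];            /* bias row, repeated over rows  */
static elem_t C[DIM_I][DIM_J];        /* result C = A x B + D          */

void lstm_gate_matmul(void)
{
    tiled_matmul_auto(
        DIM_I, DIM_J, DIM_K,                      /* dimensions                     */
        (const elem_t *)A, (const elem_t *)B,     /* matrix addresses: A, B         */
        (const void *)D, (void *)C,               /*                   D, C         */
        DIM_K, DIM_J, DIM_J, DIM_J,               /* row strides of A, B, D, C      */
        MVIN_SCALE_IDENTITY, MVIN_SCALE_IDENTITY, /* A, B scale factors             */
        MVIN_SCALE_IDENTITY,                      /* D scale factor                 */
        ACC_SCALE_IDENTITY,                       /* output (accumulator) scale     */
        NO_ACTIVATION,                            /* act                            */
        false, false,                             /* transpose_A, transpose_B       */
        false, false,                             /* full_C, low_D                  */
        true,                                     /* repeating_bias: reuse bias row */
        WS);                                      /* weight-stationary dataflow     */
}
```

The weight-stationary dataflow matches the compressed configuration in Table 3; passing OS instead would select the output-stationary mode described in Figure 4.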
Table 6. BF16 operations for training a batch of 4 samples.
Type | Forward Pass | Backward Pass | Parameter Update
Add/Sub | 52,684,800 | 131,263,600 | 105,705,600
Mul/Div | 52,787,200 | 145,514,000 | 66,066,000
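Summing the three phases gives a rough per-batch total; this is a quantity derived here from the table, not one reported in the paper.

```latex
% Totals derived from Table 6 (batch of 4 samples)
\begin{aligned}
\text{Add/Sub:} \quad & 52{,}684{,}800 + 131{,}263{,}600 + 105{,}705{,}600 = 289{,}654{,}000,\\
\text{Mul/Div:} \quad & 52{,}787{,}200 + 145{,}514{,}000 + 66{,}066{,}000 = 264{,}367{,}200,\\
\text{Total:}   \quad & \approx 5.54\times 10^{8}\ \text{BF16 ops per batch}
                        \;\approx\; 1.39\times 10^{8}\ \text{per sample.}
\end{aligned}
```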
Table 7. Utilization of main modules.
Module | LUT | FF | BRAM | DSP
Gemmini 4 × 4 | 42,387 | 26,251 | 40 | 120
Gemmini 8 × 8 | 80,002 | 40,587 | 40 | 120
Single RISC-V tile | 28,108 | 13,303 | 8 | 15
Table 8. Estimated power consumption of eight design points by Vivado.
Power (W) | 4 × 4-1 | 4 × 4-2 | 4 × 4-4 | 4 × 4-8 | 8 × 8-1 | 8 × 8-2 | 8 × 8-4 | 8 × 8-8
Dynamic: clocks | 0.499 | 0.453 | 0.408 | 0.423 | 0.454 | 0.458 | 0.412 | 0.426
Dynamic: signals | 0.107 | 0.127 | 0.137 | 0.189 | 0.142 | 0.150 | 0.170 | 0.216
Dynamic: logic | 0.206 | 0.279 | 0.283 | 0.334 | 0.301 | 0.308 | 0.321 | 0.368
Dynamic: BRAM | 0.097 | 0.098 | 0.108 | 0.113 | 0.097 | 0.099 | 0.106 | 0.114
Dynamic: DSP | 0.008 | 0.008 | 0.010 | 0.013 | 0.008 | 0.008 | 0.010 | 0.013
Dynamic: PLL | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 | 0.357 | 0.357
Dynamic: MMCM | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 | 0.305 | 0.305
Dynamic: I/O | 1.325 | 1.358 | 1.327 | 1.353 | 1.330 | 1.325 | 1.330 | 1.330
Static | 2.514 | 2.515 | 2.516 | 2.519 | 2.515 | 2.515 | 2.516 | 2.519
Total | 5.422 | 5.501 | 5.452 | 5.606 | 5.509 | 5.526 | 5.527 | 5.648
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
