Article

2QGRU: Power-of-Two Quantization for Efficient FPGA-Based Gated Recurrent Unit Architectures

by Miguel Molina Fernandez 1,*, Shao Jie Hu Chen 1, Javier Mendez Gomez 2, Diego P. Morales Santos 1,*, Manuel Pegalajar Cuellar 3 and Marisa Lopez-Vallejo 4
1 Department of Electronics and Computer Technology, University of Granada, 18071 Granada, Spain
2 HTEC GmbH, 82008 Munich, Germany
3 Department of Computer Science and Artificial Intelligence, University of Granada, 18014 Granada, Spain
4 IPTC, ETSI Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(4), 722; https://doi.org/10.3390/electronics15040722
Submission received: 15 January 2026 / Revised: 31 January 2026 / Accepted: 4 February 2026 / Published: 7 February 2026

Abstract

This paper proposes a power-of-two-based quantization technique aimed at improving the hardware efficiency of artificial neural networks (ANNs) implemented on field-programmable gate arrays (FPGAs). The effectiveness of the proposed approach is validated using gated recurrent unit (GRU) models. The resulting architecture, referred to as 2QGRU, exploits parallelism, optimized operation scheduling, and fine-grained data bit-width management to achieve efficient hardware realization. Compared with state-of-the-art FPGA implementations based on sparsity compression, 2QGRU demonstrates superior performance in terms of resource utilization and power consumption, while eliminating the need for dedicated DSP blocks. Furthermore, area and power efficiency can be improved by trading latency for reduced hardware cost through an integrated implementation reduction strategy, enabling deployment on highly resource-constrained devices. Finally, the 2QGRU model is integrated into an automated ANN framework, allowing the proposed quantization and hardware optimization techniques to be readily extended to other ANN models and FPGA-based deployments.

1. Introduction

The presence of Artificial Intelligence (AI) in our daily lives is becoming increasingly prominent. The potential of AI is more than evident, with new applications such as language or world models being developed and improved every year. AI’s ability to learn and adapt to complex tasks seems to have no limit. However, the more complex a model is, the more operations and memory are required, increasing the final hardware requirement for its deployment.
One way to solve this problem is to create specialized structures to perform the operations of these models, a path taken by companies such as Nvidia [1]. Nvidia boasts that it has increased computing power for AI faster than Moore’s Law over the past 8 years. Despite Nvidia’s innovation in the design of its GPUs, there is one key trend to reaching this milestone: While Nvidia used an FP16-based structure in 2020, it was reduced to FP8 in 2022 and FP4 in 2024. By reducing the data bit-width, more data can be processed in parallel in the same structure, reducing the latency of the models.
However, this approach has two problems: First, not all AI frameworks are flexible concerning the format of their data. Second, reducing the accuracy of the data complicates the training process and can degrade the desired results. In this context, Microsoft is making a strong commitment to the quantization of large language models (LLMs) with its recent publication entitled “The Era of 1-bit LLMs” [2]. That publication demonstrates that the latency, model size, and power consumption of LLMs can be improved through extreme quantization, with their model matching full-precision baseline performance starting from the 3B parameter scale.
Therefore, researchers are focusing on solving both problems: how to train quantized models in software without significant performance degradation and how to efficiently deploy these models on specialized hardware architectures that enable substantial reductions in implementation cost. Quantization is useful not only for LLMs, but also for models used in edge or IoT applications, where area and power constraints are particularly stringent. Thus, this article proposes and validates a quantization technique for different Recurrent Neural Network (RNN) models. Furthermore, to solve the problem of data format flexibility, appropriate FPGA architectures have been developed. The main contributions of this work can be enumerated as follows:
  • Quantization: A power-of-two quantization and training technique is proposed to enable the deployment of highly quantized models, such as 2-bit GRUs, with minimal degradation in performance.
  • FPGA architectures: Leveraging the proposed quantization approach, FPGA-based architectures for ANN models are introduced as an efficient alternative for resource-constrained and low-power edge or IoT applications. These architectures share a common structure and scheduling scheme, allowing them to be flexibly combined in implementation.
  • Fine-grained configurability: The proposed FPGA architectures support configurable data bit-widths at both the weight and operation levels. Additionally, an implementation reduction strategy has been integrated to reduce area and power impact by increasing the number of execution cycles.
  • ANN framework: The framework presented in [3] has been extended to support the training of RNN models and the automated generation of their hardware architectures. This framework prioritizes portability, ensuring consistent results between the training phase and real-time execution.
  • Comparison: The proposed quantization technique is validated against the sparsity-based compression technique, which is currently the most widely adopted method for hardware reduction. The results show that carefully optimized quantization can achieve comparable performance while providing significant improvements in area and power efficiency. To the best of our knowledge, no existing framework combines automated, per-layer dual-weight quantization with hardware-aware optimization in such a streamlined and integrated manner.
The remainder of this paper is organized as follows: Section 2 provides background on GRUs, together with an introduction to the ANN framework in use. The proposed GRU model and quantization method are presented in Section 3. Section 4 describes the optimizations made for the hardware implementation. Section 5 presents two key enhancements to the resulting hardware architecture. Section 6 discusses the experimental evaluation, and finally, some conclusions are drawn in Section 7.

2. Background

2.1. Foundations

The idea behind GRUs [4] starts with understanding what an RNN is [5]. RNNs were created to improve the performance of the MLP [6] when dealing with sequential data or time series data, situations where retaining relevant information from previous time steps would help to understand the current time step. A good example of this approach working well is language, where context becomes essential: without the right context, the same piece of information could have many different meanings.
Based on the MLP equation, the first and simplest proposal to consider this context is to include feedback from the neuron’s output in the neuron’s computation. Thus, the neuron’s previous output $H_{t-1}$ is treated as a second input with its own set of weights $W_h$ and propagated to the current output $H_t$. The equations in Figure 1 highlight the difference between the MLP and the RNN, where $AF$ is the activation function, $X_t$ are the inputs at time $t$, and $B$ are the biases. To ease the explanation in Section 4, three colors are used to highlight similar operations with the same dimension ($dim$) between the equations:
  • In yellow, multiplications with size $dim(X) \times dim(H)$.
  • In green, multiplications with size $dim(H) \times dim(H)$.
  • In grey, the addition of the biases.
GRUs are based on the same principle as RNNs, but they were designed to include control of the feedback by introducing two gates: the reset ($R_t$) and the update ($Z_t$) gates. These gates interact with the previous and current outputs ($H_{t-1}$ and $H_t$) like a tap, deciding how much water to let through and the balance between hot and cold water. Thanks to these gates, the GRU can withhold relevant information or remove it when it is no longer needed. As a result, it improves its adaptability to slow or abrupt changes in input data.
The main equations of the GRU are defined in Figure 2, where the symbol $\odot$ is the Hadamard (elementwise) product operator, sigm corresponds to the sigmoid function, and tanh is the hyperbolic tangent function. It may be noted that both gates use the same structure as the RNN equation (see Figure 1), as does the candidate hidden state $\tilde{H}_t$. Again, three colors highlight similar operations between the equations, including the Hadamard product in purple.
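Since Figure 2 is not reproduced in this text, the following is a minimal floating-point sketch of one GRU time step in one common formulation, which is assumed (not confirmed) to match the figure. The parameter names (`W_xr`, `W_hr`, `B_r`, and so on) follow the nine weight and bias sets referred to later in the paper; the helper names `affine` and `gru_step` and the update convention $H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$ are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(x, w_x, h, w_h, b):
    """Compute x @ w_x + h @ w_h + b using plain Python lists.

    w_x has shape dim(X) x dim(H) and w_h has shape dim(H) x dim(H).
    """
    out = list(b)
    for j in range(len(b)):
        out[j] += sum(x[i] * w_x[i][j] for i in range(len(x)))
        out[j] += sum(h[i] * w_h[i][j] for i in range(len(h)))
    return out


def gru_step(x, h_prev, p):
    """One GRU time step (assumed convention, see the lead-in)."""
    # Reset and update gates share the structure of the vanilla RNN equation.
    r = [sigmoid(v) for v in affine(x, p["W_xr"], h_prev, p["W_hr"], p["B_r"])]
    z = [sigmoid(v) for v in affine(x, p["W_xz"], h_prev, p["W_hz"], p["B_z"])]
    # Hadamard product between the reset gate and the previous output.
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in affine(x, p["W_xh"], rh, p["W_hh"], p["B_h"])]
    # Blend previous output and candidate state with the update gate.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def zero_params(dim_x, dim_h):
    """Hypothetical helper: all-zero parameters, handy for sanity checks."""
    p = {}
    for gate in ("r", "z", "h"):
        p["W_x" + gate] = [[0.0] * dim_h for _ in range(dim_x)]
        p["W_h" + gate] = [[0.0] * dim_h for _ in range(dim_h)]
        p["B_" + gate] = [0.0] * dim_h
    return p
```

With all-zero parameters both gates evaluate to 0.5 and the candidate state to 0, so the step simply halves the previous output, which makes a convenient sanity check.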
This paper designs and develops a suitable architecture for GRU synthesis on FPGAs based on the equations in Figure 2. The main goals of this architecture are as follows:
  • To encode the dataflow with the minimum data bit-width in each component, minimizing the area of the design. Several techniques were developed for this purpose, including the quantization of the weights, the quantization of the inputs, and training with these modifications.
  • To compress the three main equations in Figure 2 into one architecture, maximizing the reuse of the hardware resources.
  • To minimize the required power consumption by removing unnecessary operations.
  • To make each neuron of each layer work independently and in parallel, maximizing the design robustness at the same time as reducing the system latency.
In Section 3 and Section 4 these objectives are translated into hardware design decisions, ending with the creation of the GRU model for the framework.

2.2. Framework

The work presented in this paper uses the automated framework for ANN implementation on FPGAs, as introduced in [3]. The main objectives of this framework are:
  • To support a flexible architecture that enables the implementation and combination of different ANN types.
  • To explore the trade-off between the accuracy, area, power, and throughput of ANN models’ architectures, with a focus on improving their efficiency during deployment.
  • To provide a hardware generation abstraction that requires minimal developer intervention and hardware-specific knowledge.
Most of the steps defined for the flow graph of the framework remain the same as in [3], as reflected in Figure 3. This paper has extended the functionalities of the framework to include GRUs and improve its flexibility, affecting mainly:
  • The hardware estimation (step 2 in Figure 3), which evaluates the feasibility of deploying a given network architecture on a target FPGA before training. This approach uses empirical relationships (see Section 6.2.2) and the equations provided throughout this article to estimate resource utilization and latency.
  • The training of the ANN (step 3) by incorporating the techniques explained in Section 3.1.
  • The hardware implementation (step 6) by adopting the hardware architectures presented in Section 4 and the new resource-aware features discussed in Section 5.
Finally, the framework has been rebuilt in PyTorch v1.13 to improve model development and training speed, enabling the use of architectures not only with FPGAs but also with PyTorch-compatible devices (step 7). The proposed framework serves as a bridge between PyTorch-based training and Vivado FPGA implementation, enforcing functional and bit-accurate equivalence while enabling automated VHDL generation and hardware-aware design exploration. For the sake of completeness, the full intended workflow of the framework is described in Appendix A.

3. Quantized Gated Recurrent Unit Model

3.1. Weight Quantization Using Power-of-Two

Due to their exponential growth, weights turn out to be one of the most resource-consuming parts of large ANN hardware implementations. This effect can be mitigated by linear quantization, which has been demonstrated in several works [7,8,9]. This quantization can be taken to the extreme, where weights are implemented with only 1–2 bits [2,3]. However, linear quantization does not reduce the resources needed for weight multiplication, and despite its success [2,3], training with extreme quantization is still a hard task for many applications.
Since multipliers are the second largest hardware resource consumer in this field, a solution focused on area reduction should consider making the weights better suited for their multiplication in hardware. With that in mind, one of the most attractive ideas is to limit the weights to powers of two: multipliers can then be replaced by shift operations, drastically reducing their cost in resources. Power-of-two quantization has previously been reported to preserve recurrent network functionality while reducing hardware complexity, both in high-level analysis [10] and in FPGA implementations for LSTMs [11]. Our approach aims to extend these principles to GRUs, investigating whether power-of-two quantization can preserve functionality while enabling efficient hardware implementation.
The study on the feasibility of this idea can be split into three questions: What are the typical values that GRU weights reach during training? How can the model be trained with these limitations on the weights? And what is the minimum set of weight values to consider to avoid a significant loss of accuracy?

3.1.1. Understanding the Model

First of all, it is desirable to have an intuition of what the distribution of weights of the vanilla GRU looks like after training. Through experimentation, it has been observed that the shape of the nine sets of weight values, the $W$s and $B$s in Figure 2, can be well approximated by normal distributions with zero mean and standard deviation $\sigma$. Furthermore, the value of $\sigma$ tends to remain less than one.
Based on these observations, it can be stated that:
  • If $2^{-n_\sigma}$ is defined as the closest power of two to $\sigma$, the range of possible weight values can be constrained to $[-2^{-n_\sigma}, 2^{-n_\sigma}]$.
  • The number of possible weight values, $N_{p2}$, is determined through experimental evaluation. The minimum representable value is then $2^{-n_\sigma - (N_{p2} - 1)}$.
Unlike conventional uniform quantization, where weights are mapped to evenly spaced fixed-point values over a predefined range, the proposed approach restricts weights to a discrete set of signed power-of-two values. Here, $n_\sigma$ controls the maximum admissible weight magnitude, adapting the quantization range to the observed weight distribution, while $N_{p2}$ governs the number of discrete weight levels, effectively reducing the required bit-width.
These two hyperparameters define the proposed training approach, from which most other architectural parameters are automatically derived (see Appendix A).
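As an illustration of how the two hyperparameters determine the admissible weight magnitudes, the sketch below derives $n_\sigma$ from a measured standard deviation and lists the resulting levels. The function names are hypothetical, and "closest power of two" is assumed here to mean closest in linear distance, so the decision boundary between $2^n$ and $2^{n+1}$ sits at their midpoint.

```python
import math


def closest_power_of_two_exponent(sigma):
    """Return n_sigma such that 2**(-n_sigma) is the power of two nearest to sigma.

    Assumes sigma > 0 and linear (midpoint) distance between powers of two.
    """
    n = math.floor(math.log2(sigma))      # 2**n <= sigma < 2**(n + 1)
    if sigma >= 1.5 * 2.0 ** n:           # past the midpoint: round up
        n += 1
    return -n


def weight_levels(n_sigma, n_p2):
    """Positive magnitudes allowed for the quantized weights, largest first."""
    return [2.0 ** -(n_sigma + k) for k in range(n_p2)]


# Example with a hypothetical sigma of 0.3 and three levels.
n_sigma = closest_power_of_two_exponent(0.3)
levels = weight_levels(n_sigma, 3)
```

For this example the largest level is $2^{-2}$ and the smallest is $2^{-4}$, matching the minimum representable value $2^{-n_\sigma - (N_{p2} - 1)}$. The full quantized set also contains the negated magnitudes and zero.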

3.1.2. Dual-Weight Training for Quantization

The starting point for the training is the dual-weight training technique used in [3], which is conceptually equivalent to the Straight-Through Estimator (STE) with an identity gradient approximation for the quantizer gradient [12]. It consists of having two different pairs of weights and biases:
  • The first pair, $\tilde{W}$ and $\tilde{B}$ (note the tilde notation), is used to store the learned values of the network, which are already quantized. This pair of weights is the one used in the feed-forward step in Figure 2 during the model training and deployment.
  • The second pair, $W$ and $B$, has the aim of accumulating all the updates given by the chosen back-propagation algorithm. They guide the updates of $\tilde{W}$ and $\tilde{B}$ during training, being removed later for the hardware implementation.
Since $\tilde{W}$ is used in the feed-forward step, the generated gradient after the back-propagation step can be noted as $\Delta\tilde{W}$. Therefore, to bridge the differences between the various back-propagation techniques, the final update step would normally be expressed as follows:
$$\tilde{W} = \tilde{W} - \Delta\tilde{W} \qquad (1)$$
In the dual-weight scheme, this is replaced by
$$W = W - \Delta\tilde{W} \qquad (2)$$
Having $W$, the proposal is to update $\tilde{W}$ to the closest power of two to the values of $W$. Considering consecutive powers, the $n$-th power of two will have an upper ($T_{up}$) and a lower ($T_{low}$) threshold defined as follows:
$$T_{up}(2^n) = \frac{2^n + 2^{n+1}}{2} = 3 \times 2^{n-1} \qquad (3)$$
$$T_{low}(2^n) = \frac{2^n + 2^{n-1}}{2} = 3 \times 2^{n-2} \qquad (4)$$
Therefore, $\tilde{W}$ would be updated with the appropriate value of $n$ that satisfies
$$\tilde{W} = Sign(W) \times 2^n \quad \text{if} \quad T_{low}(2^n) \le |W| < T_{up}(2^n) \qquad (5)$$
where the function $Sign$ returns the sign of the input. There are two considerations when using (5):
  • The maximum possible value of $\tilde{W}$ is defined as $2^{-n_\sigma}$.
  • Weights below the minimum possible value of $\tilde{W}$, $2^{-n_\sigma - N_{p2} + 1}$, will be null.
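The update rule can be sketched as follows. This is an illustrative reading of the midpoint thresholds and the two considerations above, not the framework's code: the names `quantize_pow2` and `dual_weight_step` are hypothetical, the shadow weights are plain Python floats, and "below the minimum possible value" is interpreted as falling under the lower threshold of the smallest level.

```python
def quantize_pow2(w, n_sigma, n_p2):
    """Map a shadow weight w to the nearest admissible signed power of two."""
    mag = abs(w)
    n_max = -n_sigma                  # exponent of the largest level
    n_min = -n_sigma - n_p2 + 1       # exponent of the smallest level
    # Below the lower threshold of the smallest level: the weight is null.
    if mag < 3.0 * 2.0 ** (n_min - 2):
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    # Walk the levels upwards and stop at the first upper threshold not reached.
    for n in range(n_min, n_max):
        if mag < 3.0 * 2.0 ** (n - 1):
            return sign * 2.0 ** n
    # Anything at or above the last threshold saturates at the largest level.
    return sign * 2.0 ** n_max


def dual_weight_step(shadow, delta, n_sigma, n_p2):
    """Apply the back-propagation update to the shadow weights, then re-quantize.

    `delta` is the update computed from the quantized weights in the forward pass.
    """
    new_shadow = [w - d for w, d in zip(shadow, delta)]
    new_quant = [quantize_pow2(w, n_sigma, n_p2) for w in new_shadow]
    return new_shadow, new_quant


# Example: two shadow weights receive an update and are re-quantized.
shadow, quant = dual_weight_step([0.10, -0.04], [-0.10, 0.01], n_sigma=2, n_p2=3)
```

Only the quantized list would be used in the forward pass and exported to hardware; the shadow list exists solely so that small updates can accumulate until a threshold is crossed.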

3.1.3. Example of Final Weights

Although the optimal quantization parameters may vary across applications, model configurations, and hardware resource constraints, this section aims to illustrate how the weight training would look by applying the described quantization to the TIMIT dataset [13], used later in Section 6. The proposed training method follows a conventional approach, with the key difference being the introduction of the explained hyperparameters, $n_\sigma$ and $N_{p2}$.
Based on an analysis of the weight distributions obtained from training a baseline ANN model on TIMIT, using the target architecture and model size adopted for comparison with prior work (123–1024–1024–62, GRU–GRU–MLP), the following observations were made:
  • $2^{-2}$ tends to be the closest power of two to the observed standard deviation, $\sigma$, leading to the selection of $n_\sigma = 2$.
  • Restricting the quantization set to the three highest candidates in the resulting range, $2^{-2}$, $2^{-3}$, and $2^{-4}$ (corresponding to $N_{p2} = 3$), allows the model to be trained successfully with less than 1% of PER deviation (see Section 6.2.1 for more details).
Then, using (3)–(5), the framework provides the possible values of $\tilde{W}$ as follows:
$$\tilde{W} = \begin{cases} Sign(W) \times 2^{-2} & \text{if } 2^{-4} \le \left|\frac{W}{3}\right| \\ Sign(W) \times 2^{-3} & \text{if } 2^{-5} \le \left|\frac{W}{3}\right| < 2^{-4} \\ Sign(W) \times 2^{-4} & \text{if } 2^{-6} \le \left|\frac{W}{3}\right| < 2^{-5} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$
Figure 4 shows an example of the expected $\tilde{W}$ and $W$ behavior throughout the training when these values are used.

3.2. Operation Simplification

To complement the reduction applied to the sets of weights, this paper explores the quantization of the operations performed by the GRU model. The aim is to make them all suitable for a target number of bits $F_b$. This $F_b$, defined per layer, is another hyperparameter of the proposed model. Quantization is applied at the layer level, allowing individual bit-width configurations for different layers of the network.

3.2.1. Multiplication Quantization and Bit Precision Discussion

Assume the following:
  • The results in Figure 2 are bounded within (0, 1) (in the case of the sigm) or within (−1, 1) (in the case of the tanh).
  • $\tilde{W}$ is bounded by $2^{-n_\sigma}$, which is a value less than one.
Then, the only operation whose result can exceed one is the addition. With inputs less than or equal to one, the results of the multiplications will always be less than or equal to their inputs. Therefore, the following statements assume that all operations have $F_b$ bits for their decimal representation. For convenience, the sign bit is also included in $F_b$. Working in fixed point, this means that the minimum representable value, $V_{min}$, would be
$$V_{min} = 2^{-(F_b - 1)} \qquad (7)$$
while the maximum representable value $V_{max}$ would be
$$V_{max} = 1 - V_{min} \qquad (8)$$
However, to preserve precision in multiplication operations, the output of a multiplication must have at least as many bits as the sum of the bit-widths of its inputs. Since the priority is to keep only the $F_b$ bits for the decimal representation, it was decided to quantize the result of each multiplication. Then, the quantization of the product $R_t \odot H_{t-1}$ can be expressed as follows:
$$Q_{val}(R_t \odot H_{t-1}) = \frac{\left\lfloor \left(R_t \odot H_{t-1} + \frac{V_{min}}{2}\right) \times 2^{F_b - 1} \right\rfloor}{2^{F_b - 1}} \qquad (9)$$
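A small numerical sketch may help to see what the quantization in (9) does. It assumes that the rounding is implemented as a floor after adding half a quantization step, which amounts to rounding to the nearest multiple of $V_{min}$; the function names are hypothetical.

```python
import math


def v_min(f_b):
    """Smallest representable step when F_b bits include the sign bit."""
    return 2.0 ** -(f_b - 1)


def v_max(f_b):
    """Largest representable value below one."""
    return 1.0 - v_min(f_b)


def q_val(x, f_b):
    """Round x to the nearest multiple of V_min (floor after adding half a step)."""
    scale = 2 ** (f_b - 1)
    return math.floor((x + v_min(f_b) / 2.0) * scale) / scale


# Example with F_b = 4: the exact product 0.75 * 0.625 = 0.46875 needs more
# fractional bits than are available, so it is rounded to a multiple of 0.125.
product = q_val(0.75 * 0.625, 4)
```

With $F_b = 4$ the step is 0.125, so the product above is stored as 0.5. The same function is applied elementwise when the operands are vectors.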
But what about matrix multiplications? When talking about matrices, two operations are implicit: first, the arrays are multiplied, and second, the results are added. If no quantization is performed before this sum, the internal multiplications usually provide a higher resolution than expected, differentiating the multiplication from its hardware counterpart. However, not using software matrices would drastically reduce training speed, which is undesirable.
To solve this trade-off, the nature of the multiplications involved was taken into account. These multiplications take as input one of the sets of weights, highlighted in yellow and green in Figure 2. Playing with the possible values of the weights, the vanilla RNN seen in Figure 1 can be redefined as follows:
$$AF\left(\frac{X_t (W_x \times 2^{n_\sigma}) + H_{t-1} (W_h \times 2^{n_\sigma}) + (B \times 2^{n_\sigma})}{2^{n_\sigma}}\right) \qquad (10)$$
During multiplication, (10) scales the values obtained by the weights by a factor of $2^{n_\sigma}$, which reduces the required additional decimal bits to $N_{p2} - 1$ without affecting the final results. How does this translate to the model?
  • As notation, let $\tilde{F}_b$ be defined as $F_b + N_{p2} - 1$.
  • Hadamard products will return $F_b$ bits for their decimal representation, while matrix multiplications will return $\tilde{F}_b$ bits. Then, the subsequent sum will also have $\tilde{F}_b$ decimal bits.
  • For both hardware and software sides, (9) is implemented for Hadamard products. Additionally, the activation function will integrate (9) to have as output $F_b$ decimal bits.
  • On the hardware side, the weights will already be saved multiplied by $2^{n_\sigma}$ to simplify the operations.
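To make the bit-width argument concrete, the sketch below emulates the weight products with integer shifts. It is an illustration under stated assumptions rather than the generated VHDL: inputs are integers with $F_b - 1$ fractional bits, each stored weight is a hypothetical `(sign, k)` pair meaning $\pm 2^{-k}$ after scaling by $2^{n_\sigma}$, and the accumulator carries $N_{p2} - 1$ extra fractional bits.

```python
import math


def encode_weight(w_quant, n_sigma):
    """Return (sign, k) such that w_quant * 2**n_sigma == sign * 2**-k.

    Zero weights are encoded as (0, 0). Assumes w_quant is a signed power of two.
    """
    if w_quant == 0:
        return (0, 0)
    sign = 1 if w_quant > 0 else -1
    k = -n_sigma - int(math.log2(abs(w_quant)))
    return (sign, k)


def shift_multiply(x_int, weight, n_p2):
    """Multiply by the scaled weight with a shift; the result gains n_p2 - 1 fractional bits."""
    sign, k = weight
    return sign * (x_int << (n_p2 - 1 - k))


def to_real(acc, f_b, n_sigma, n_p2):
    """Undo the fixed-point scaling and the 2**n_sigma weight scaling."""
    frac_bits = (f_b - 1) + (n_p2 - 1)
    return acc / 2 ** frac_bits / 2 ** n_sigma


# Example with F_b = 4, n_sigma = 2, N_p2 = 3.
f_b, n_sigma, n_p2 = 4, 2, 3
x_int = [4, -2, 6]                       # 0.5, -0.25, 0.75 in units of 1/8
weights = [encode_weight(w, n_sigma) for w in (0.25, -0.125, 0.0625)]
acc = sum(shift_multiply(x, w, n_p2) for x, w in zip(x_int, weights))
result = to_real(acc, f_b, n_sigma, n_p2)
```

Because the smallest scaled weight is $2^{-(N_{p2} - 1)}$, every shift is a left shift by a non-negative amount and no bits are lost, which is why $N_{p2} - 1$ extra decimal bits are sufficient for the sum.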
Now that the decimal representation has been discussed, it is time to look at the integer bit representation $I_b$. As seen at the beginning of this section, addition is the only operation that requires integer bits. Once training is finished, the weights and biases are fixed. Thus, if we consider the worst case for these specific weights, where all inputs are ones ($\mathbf{1}_X$ or $\mathbf{1}_H$) and all the weights have the same sign, $I_b$ would be
$$I_b = \log_2\left(\alpha \cdot \max\left(\mathbf{1}_X |W_{xr}| + \mathbf{1}_H |W_{hr}| + |B_r|,\; \mathbf{1}_X |W_{xz}| + \mathbf{1}_H |W_{hz}| + |B_z|,\; \mathbf{1}_X |W_{xh}| + \mathbf{1}_H |W_{hh}| + |B_h|\right)\right) \qquad (11)$$
In (11), $\alpha \in (0, 1]$ is an optional scaling coefficient used to reduce the conservativeness of the worst-case estimation. For example, $\alpha = 0.8$ corresponds to 80% of the theoretical maximum sum. Since $I_b$ is computed after network training, $\alpha$ can be selected by analyzing the impact on validation set accuracy. The value of $I_b$ can vary across layers of the ANN and even between neurons within the same layer, making it a configurable parameter at the layer level. Its value is automatically determined by the framework using (11).
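The worst-case estimate can be sketched for a single neuron as follows. The function name and data layout are hypothetical, and two details are assumptions that the text does not state: the logarithm is rounded up to obtain a whole number of bits, and sums that do not exceed one are taken to need no integer bits.

```python
import math


def integer_bits(gates, alpha=1.0):
    """Estimate the integer bits needed by the accumulator of one neuron.

    `gates` holds one (w_x, w_h, b) tuple per equation (reset, update, candidate),
    where w_x and w_h are the neuron's weight vectors and b is its bias.
    """
    worst = max(
        sum(abs(w) for w in w_x) + sum(abs(w) for w in w_h) + abs(b)
        for w_x, w_h, b in gates
    )
    scaled = alpha * worst
    if scaled <= 1.0:
        return 0                      # assumption: no integer bits needed
    return math.ceil(math.log2(scaled))  # assumption: round up to whole bits


# Example neuron with 8 inputs and 4 recurrent connections.
gates = [
    ([0.25] * 8, [0.25] * 4, 0.25),       # reset gate, worst-case sum 3.25
    ([0.0625] * 8, [0.125] * 4, 0.0),     # update gate, worst-case sum 1.0
    ([0.0625] * 4 + [0.0] * 4, [0.0625] * 4, 0.0),  # candidate, sum 0.5
]
bits_full = integer_bits(gates)
bits_relaxed = integer_bits(gates, alpha=0.6)
```

In this example the reset gate dominates, and relaxing the bound with $\alpha = 0.6$ removes one integer bit; whether that is acceptable would have to be checked on the validation set, as the text describes.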
Similarly, $F_b$ is a user-defined hyperparameter that sets the output bit-width for each neuron. The bit precision representation used for each operation of a layer $l$ can be derived from $n_\sigma$, $N_{p2}$, and $F_b$ and expressed as shown in Figure 5. In this figure, $T_b^l$ is the bit-width used in the addition operation, while $\tilde{T}_b^l$ is the number of bits in the addition output after removing the decimal bits below $2^{-F_b^l + 1}$. $F_{max}^l$ defines the decimal bit-widths after multiplication.

3.2.2. Activation Function Simplification

Previous work has used piece-wise linear or polynomial approximations to reduce the complexity of activation functions in their hardware implementations [14]. In the case of the GRU in Figure 2, two sigmoids and a tanh are involved in the process. To reduce their subsequent hardware requirements, they have been substituted by their equivalent hardsigmoid $sigm_H$ and hardtanh $tanh_H$. By integrating (9) as explained in Section 3.2.1, they will be defined as follows:
$$sigm_H(x) = \begin{cases} 0 & \text{if } x \le -2 \\ Q_{val}\left(\frac{x + 2}{4}\right) & \text{if } -2 < x < 2 \\ 1 & \text{if } 2 \le x \end{cases} \qquad (12)$$
$$tanh_H(x) = \begin{cases} -1 & \text{if } x \le -1 \\ Q_{val}(x) & \text{if } -1 < x < 1 \\ 1 & \text{if } 1 \le x \end{cases} \qquad (13)$$
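The two hard activations can be sketched in a few lines. The code assumes that the quantizer from (9) rounds to the nearest multiple of $2^{-(F_b - 1)}$ by flooring after adding half a step; the function names are hypothetical.

```python
import math


def q_val(x, f_b):
    """Round x to the nearest multiple of 2**-(f_b - 1) (assumed reading of (9))."""
    scale = 2 ** (f_b - 1)
    return math.floor(x * scale + 0.5) / scale


def hard_sigmoid(x, f_b):
    """Piece-wise linear sigmoid: 0 below -2, 1 above 2, (x + 2) / 4 in between."""
    if x <= -2:
        return 0.0
    if x >= 2:
        return 1.0
    return q_val((x + 2) / 4, f_b)


def hard_tanh(x, f_b):
    """Piece-wise linear tanh: clamps to [-1, 1] and quantizes the linear region."""
    if x <= -1:
        return -1.0
    if x >= 1:
        return 1.0
    return q_val(x, f_b)


# Example with F_b = 4 (quantization step 0.125).
gate = hard_sigmoid(0.0, 4)
state = hard_tanh(0.3, 4)
```

Both functions need only comparisons, an addition, and a shift in hardware, since the division by four in the linear region of the sigmoid is itself a power of two.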
The techniques presented in Section 3.1 and Section 3.2 complicate and slow down the training process. However, they ensure stability in the hardware implementation’s dataflow and enable seamless portability between the software model and its hardware architecture, without requiring any modifications. This is notably accomplished through the introduction of only three additional hyperparameters: $n_\sigma$, $N_{p2}$, and $F_b$.
In summary, Figure 6 reflects the adjustments described in this section to the GRU equations.

4. Hardware Architecture

In this section, the proposed hardware architecture is explained. As discussed before, the GRU model has been integrated into the framework shown in Figure 3. Concerning the hardware architecture, this framework seeks:
  • To have a common baseline architecture that accelerates and facilitates the integration of new ANN models.
  • To create ANN architectures compatible with each other, allowing ANNs to be built by mixing architectures along the layers.
  • To keep the robustness of the typical parallel processing of ANN by having independent hardware resources for each neuron.
  • To focus the field of application on edge devices by minimizing the power and area requirements of the architectures.
The next section describes how the overall architecture works at the layer level, which is common to all ANN models considered in the framework. Following that, the GRU neuron architecture is described in detail. Finally, the MLP and RNN architectures are presented to demonstrate how techniques employed in this GRU can be extended to other ANN models. These are not standard ANN architectures; therefore, the hardware implementations within the framework are collectively referred to as 2QANNs (such as 2QGRU, 2QRNN, and 2QMLP), highlighting their quantized and architecture-specific modifications.

4.1. Overall Architecture

The general structure of any ANN created with the framework can be seen in the blue area of Figure 7. It is composed of $L$ layers, each of which may have a different length ($n_1$, $n_2$, …, $n_L$). Five types of layers can be distinguished:
  • The input data layer, in charge of storing $X_t$ until the input layer processes it.
  • The 2QANN layers, including the input, the hidden, and the output neuron layers, which are in charge of processing the input data to calculate the optimal output result.
  • The output data layer, in charge of modifying the 2QANN output as required by the application.
Each layer consists of a set of neurons and an output shift register, called R1 for convenience. Their interface is simple: In each clock cycle, the shift registers pass one element of the data array to the next layer. This element is then processed in parallel by all the neurons of the following layer and stored in the adder accumulator of each neuron.

4.2. 2QGRU Architecture

The 2QGRU hardware architecture can be divided into two parts: the neuron design and its interface to the layer, and the system schedule.

4.2.1. 2QGRU Neuron Design

The 2QGRU neuron design arose from the idea of compressing the similarities between GRU equations into the minimal architecture capable of performing all their operations. After studying different possibilities, it was decided that splitting the design into two levels of operations per layer would facilitate the 2QGRU implementation:
  • Neuron Level: Most of the operations will be done at this level. Each neuron must compute its own intermediate results, including $R_t$, $Z_t$, and $\tilde{H}_t$. To do this, they must have access to $X_t$, $H_{t-1}$, and $R_t \odot H_{t-1}$, in addition to their weights and biases.
  • Layer Level: It is responsible for managing the interfaces and collecting the intermediate results of all neurons in the layer, which are kept in registers. It also calculates the final output $H_t$, and it performs all the elementwise products between intermediate results.
These assumptions led to the architecture of Figure 7, where $l$ indicates the layer index (from 1 to $L$). The green area corresponds to the neuron level and the purple area corresponds to the layer level. Its main components are:
  • Multipliers, which include M1, M2, and M3. However, M1 and M2 have been replaced by shift operations thanks to the proposed quantization (Section 3.2.1).
  • Adders, including S1, S2, and S3. Specifically, S2 is a cumulative adder, which stores the results of each clock cycle until all the inputs and previous outputs of the layer have been processed. When that occurs, the result is moved to the first register of the layer, named R2.
  • The activation function block (AF), capable of performing the operations seen in (12) and (13).
  • Registers, which are four in total. They store the intermediate/final results of the operations at the layer level.
Some multiplexers, the weights, and the biases are included too. The architecture is designed to have self-sufficient and independent neurons, increasing the robustness of the architecture. Thus, while $X_t$ and $H_{t-1}$ data are critical to the success of the layer result, the failure of a hidden neuron would have a minimal impact.
Although 2QGRU neurons may differ between layers, they maintain the same structure, timing, and bit precision as their neighbors in the same layer. This ensures the stability of the layer, reducing the complexity of the system Controller. It also allows all neurons to work in parallel to obtain their respective results, while at the same time performing elementwise products at the layer level with the intermediate results.

4.2.2. System Schedule

To understand how the equations in Figure 6 are translated into the architecture in Figure 7, Figure 8 shows the system schedule from the equation perspective. This table allows us to understand which components are responsible for each operation and how the intermediate results are kept between the different registers.
Using counters, flags, and various execution cycles, $T_X$, a Controller monitors all processes. In each execution cycle, one of the equations in Figure 6 is performed in parallel for all 2QGRU layers of the model. Most of the changes in each $T_X$ are managed by the multiplexers: weights, biases, and the register data depend on the equations being executed.
Regarding timings, for a layer $l$, each execution cycle requires as many clock cycles as the larger of $dim(H^{l-1})$ and $dim(H^l)$, plus one clock cycle to store the results. However, there is one exception: thanks to the decoupling of the two operation levels, the architecture allows the calculation of the final output, $H_t$, to be parallelized with the calculation of the following reset state, $R_{t+1}$, in the same execution cycle. This is possible because $R_{t+1}$ depends only on $H_t$, and $H_t$ will have all the required intermediate results available at the layer level. Therefore, if $R_{t+1}$ is delayed by one clock cycle, the first element of $H_t$ is available to start its computation. This reduces the latency of the layer from four to three execution cycles, which can be calculated as follows:
$$n_{max}^l = \max\left(dim(H^{l-1}), dim(H^l)\right) \qquad (14)$$
$$Latency^l = (n_{max}^l + 1) \times 3 + 1 \ \text{clock cycles} \qquad (15)$$
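The latency expression is simple enough to evaluate directly. The sketch below applies it to the two GRU layers of the 123–1024–1024–62 model mentioned in Section 3.1.3; the function name is hypothetical, and the per-layer figures are not combined into a network total because the text does not state here how the layers overlap in time.

```python
def layer_latency(dim_prev, dim_curr):
    """Clock cycles for one GRU layer: three execution cycles plus the one-cycle delay."""
    n_max = max(dim_prev, dim_curr)
    return (n_max + 1) * 3 + 1


# GRU layers of the 123-1024-1024-62 model: (input size, layer size).
gru_layers = [(123, 1024), (1024, 1024)]
latencies = [layer_latency(prev, curr) for prev, curr in gru_layers]
```

Both layers are bounded by the larger dimension (1024), so each takes the same number of clock cycles even though the first layer receives far fewer inputs.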
At the neuron level (green area in Figure 7), both matrix multiplications can be performed in parallel (M1 and M2). After that, their results are added (S1) and then added to the cumulative result of the neuron (S2). When both $X_t$ and $H_{t-1}$ have been processed, the result in S2 is stored in R2 until the end of the next execution cycle. It is important to note that S2 has $T_b^l$ bits, but it only carries $\tilde{T}_b^l$ bits to R2. The reason was explained in Section 3.2.1: the weights are multiplied by $2^{n_\sigma}$, and $N_{p2} - 1$ extra bits are used during the additions to improve the accuracy. However, when S2 finishes, this difference is truncated before being stored in R2. Later, at the layer level, the results are passed through the activation function block (AF), which recovers the selected decimal bits $F_b$.
In Figure 9, the system schedule is zoomed in to visualize what the blocks do in each clock cycle when calculating $H_{t+1}$ and $R_{t+2}$ in parallel at $T_7$. To simplify the nomenclature, in this schedule $dim(H^{l-1}) = n_1$ (input data size) and $dim(H^l) = n_2$ (input layer size). Let us also consider $n_2 > n_1$. A counter C1 is incremented every clock cycle to select the appropriate element at each register and weight set. In each clock cycle, an element of $R_{t+2}$ is computed at the neuron level, while another element of $H_{t+1}$ is computed at the layer level. When $n_1$ elements have been processed, the inputs of M1 are set to zero, waiting for the rest of the operations to be completed. At $n_2 + 1$ clock cycles, the final neuron results are moved to R2, and S2 is restarted with the bias value, allowing the next execution cycle to begin.

4.3. 2QMLP and 2QRNN Architectures

Although this research focuses on GRUs, both the MLP and the RNN share many similarities with GRUs (see Section 2.1). Furthermore, the MLP is widely used as the output layer of ANNs, including the RNN family, and it is required in the experiments. Their architectures are therefore included in Figure 10 as an extension of this work.
Taking as reference the 2QGRU architecture in Figure 7, the main differences are:
  • As no intermediate calculations are needed, most of the components at the layer level are removed, keeping just the output register $R_1$ and the activation function ($AF$). For the same reason, multiplexers are no longer needed, and the sets of weights are reduced to one for the 2QMLP and two for the 2QRNN.
  • In the case of the 2QMLP, the feedback branch from the output is removed, as well as the intermediate adder $S_1$.
  • The system schedule at the neuron level remains the same, with a single execution cycle. Thus, when these models are mixed, the 2QMLP and 2QRNN must wait for the 2QGRU to complete two additional execution cycles.
  • A clock cycle requires approximately the same time for the 2QRNN and the 2QGRU, whereas the reduction for the 2QMLP is more pronounced.

5. Resource-Aware Framework Features

Two additional features have been integrated into the developed framework to complement 2QANN architectures: a hardware implementation reduction strategy that enables tuning of the area–latency trade-off and a weight compression technique aimed at minimizing memory footprint.

5.1. Implementation Reduction Strategy

The architecture shown in Figure 7 maintains the rule of having independent hardware resources for each neuron in the network configuration. Although this increases the robustness of the design, some applications may prefer a more compact implementation to meet more stringent area requirements either due to the limited size of the target device or to enable the deployment of larger models.
For this reason, the framework provides the ability to reduce the number of implemented neurons by a factor $f_r$. The reduction factor $f_r$ is a natural number bounded by the layer size ($1 \le f_r \le n^{l}$), enabling a controllable trade-off between area and latency. The main considerations related to this reduction are:
  • The latency increases by a factor $f_r$, as each output will require $f_r$ sub-cycles to be calculated.
  • At the layer level, register $R_2$ is reduced by a factor $f_r$, and the neuron output computations are distributed across multiple sub-cycles. At the neuron level, all input values are accessed during each sub-cycle, ensuring correct accumulation of partial results into $R_2$.
  • In the baseline configuration, the register $R_3$ is read and written within the same cycle, since each value of $H_{t-1} \odot R_t$ is consumed only once and is not revisited. When a reduction factor $f_r > 1$ is enabled, $H_{t-1} \odot R_t$ must remain available across all sub-cycles, while $H_{t-1} \odot Z_t$ is computed in parallel. To support both operations, an additional register ($R_5$) is introduced to store the values of $H_{t-1} \odot Z_t$, thereby delaying the update of $R_3$ until the final sub-cycle. As a result, the direct connection between $M_3$ and $R_3$ is replaced by $M_3 \rightarrow R_5 \rightarrow R_3$ in the reduced configuration.
  • The Controller is modified to include an additional counter that controls the sub-cycles, and additional flags are incorporated.
  • Since the hardware is shared by different neurons, $I_b$ at the neuron level will be the maximum value calculated for those neurons. As a result, the achieved reduction in $S_2$ is less noticeable.
  • The addition of multiplexers of size $f_r$ between weights and neurons enables access to the target weights during each sub-cycle.
Importantly, this reduction strategy only affects the number of neurons instantiated in the hardware; it does not alter the learned parameters obtained during training or how they are stored. Thus, the model weights remain unchanged, and the model’s accuracy is preserved. For example, consider a layer with two neurons ($n^{l} = 2$) and a reduction factor $f_r = 2$, resulting in two sub-cycles. In this case, only a single neuron-processing instance is implemented in the hardware. During the first sub-cycle, this instance processes the inputs using the weights associated with the first neuron, and the resulting partial outputs are stored in $R_2$. In the second sub-cycle, the same hardware instance processes the inputs with the weights associated with the second neuron. While this computation is ongoing, the partial results stored in $R_2$ from the first sub-cycle are processed in parallel, ensuring that the operations remain consistent with the baseline configuration.
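The time-multiplexing idea can be illustrated in software with a minimal Python sketch. It is not the hardware implementation: it assumes that the number of physical neuron instances is the layer size divided by the reduction factor, rounded up, and that in each sub-cycle the instances serve a consecutive block of neurons. Both the mapping and the function names are hypothetical and are chosen only to show that the outputs match the baseline.

```python
import math
from typing import List


def baseline_layer(weights: List[List[int]], x: List[int]) -> List[int]:
    """Baseline: one dedicated accumulator per neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def reduced_layer(weights: List[List[int]], x: List[int], f_r: int) -> List[int]:
    """Same outputs with fewer physical instances, spread over f_r sub-cycles."""
    n = len(weights)
    if not 1 <= f_r <= n:
        raise ValueError("f_r must satisfy 1 <= f_r <= layer size")
    instances = math.ceil(n / f_r)      # assumed number of physical neurons
    outputs = [0] * n
    for sub_cycle in range(f_r):
        for inst in range(instances):
            neuron = sub_cycle * instances + inst   # assumed neuron mapping
            if neuron >= n:
                continue                # idle slot when f_r does not divide n
            # The shared instance reads all inputs again with this neuron's weights.
            outputs[neuron] = sum(w * xi for w, xi in zip(weights[neuron], x))
    return outputs


if __name__ == "__main__":
    w = [[1, 2], [3, 4], [5, 6]]
    x = [1, 1]
    print(baseline_layer(w, x))      # [3, 7, 11]
    print(reduced_layer(w, x, 2))    # [3, 7, 11]
```

Because the weights are only re-addressed and never modified, the reduced layer returns the same values as the baseline; the idle slot in the last sub-cycle appears when the reduction factor does not divide the layer size exactly.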
Unlike other frameworks, which only allow fixed values of $f_r$ or fixed network sizes, this framework has no such limitations. However, exact divisions of $n^{l} / f_r^{l}$ reduce idle time when scheduling components. For convenience, the compression ratio at each layer is defined as follows:
C r a t i o l = n l f r l / f r l

5.2. Weight Bit-Width Compression in Memory

In Section 3.1, it was discussed how to replace the multipliers with shift operations by limiting the weights to powers of two. However, binary encoding of the weight values can be inefficient when using shift operations. To save memory, the $N_{p2}$ possible weight values are compressed from $N_{p2} + 1$ bits to $W_{p2}$ bits, as defined in the examples in Table 1. On the left is the compression for $N_{p2} = 7$, $n_\sigma = 1$, and $S$ as the sign bit. On the right, the values seen in Section 3.1.3 are used. The final encoding tells the shift operators how many bits to shift or when to set their output to zero.
These weights are stored in BRAMs. The FPGA used in the experiments is the Xilinx xc7z100, which has 755 BRAMs with a capacity of 36 Kb each. Each BRAM can be configured with fixed data widths, including 2 K × 18 (or 16), 1 K × 36 (or 32), or 512 × 72 (or 64) [15]. Since the minimum address size is 512, two types of memory are created to minimize the area impact:
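The following Python sketch illustrates one possible compact encoding of this kind. It assumes a sign bit followed by a magnitude code in which 0 means a zero weight and code k selects the k-th shift amount; the actual code assignment of Table 1 may differ, and all names are hypothetical. Under this assumption the sketch gives 3 bits for three magnitudes and 2 bits for a single magnitude, in line with the configurations used later in the experiments, and 4 bits for seven magnitudes, which is an inference rather than a value taken from Table 1.

```python
from typing import Optional, Tuple


def compressed_width(n_p2: int) -> int:
    """Sign bit plus enough bits to index n_p2 magnitudes and the zero code."""
    return 1 + n_p2.bit_length()


def encode(sign_negative: bool, shift_index: Optional[int], n_p2: int) -> int:
    """Pack a weight into compressed_width(n_p2) bits.

    shift_index=None encodes a zero weight; otherwise 0 <= shift_index < n_p2.
    """
    if shift_index is not None and not 0 <= shift_index < n_p2:
        raise ValueError("shift_index out of range")
    magnitude_bits = n_p2.bit_length()
    magnitude_code = 0 if shift_index is None else shift_index + 1
    return (int(sign_negative) << magnitude_bits) | magnitude_code


def decode(word: int, n_p2: int) -> Tuple[bool, Optional[int]]:
    """Recover (sign_negative, shift_index) from a packed weight."""
    magnitude_bits = n_p2.bit_length()
    magnitude_code = word & ((1 << magnitude_bits) - 1)
    sign_negative = bool(word >> magnitude_bits)
    shift_index = None if magnitude_code == 0 else magnitude_code - 1
    return sign_negative, shift_index


if __name__ == "__main__":
    print(compressed_width(7), compressed_width(3), compressed_width(1))  # 4 3 2
    word = encode(True, 2, n_p2=3)
    print(word, decode(word, n_p2=3))  # 7 (True, 2)
```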
  • $BRAM_{type1}$: If the dimension of the input is greater than or equal to ⌊512/3⌋, the weights are kept in the combination whose address is closest to that dimension. Then, they are split by equations as shown at the top of Figure 11. In this case, only one third of the BRAMs are read per execution cycle, reducing the impact on power consumption of the design.
  • $BRAM_{type2}$: If the input dimension is less than ⌊512/3⌋, the weights are stored with an address of 512. Then, they are organized as shown at the bottom of Figure 11. In this case, one BRAM holds the set of weights for the three equations.
$f_w^{l}$ is defined as the maximum number of neurons whose respective sets of weights fit the selected BRAM type. To determine the appropriate BRAM type, weights are split by inputs and hidden sets because their dimensions may differ, allowing two types of BRAMs to coexist within the same layer. $f_w^{l}$ takes into account the number of bits in use by the layer, $W_{p2}^{l}$, as follows:
$$f_w^{l} = \left\lfloor \frac{36{,}864}{W_{p2}^{l} \times \mathrm{Addr\_size}} \right\rfloor \tag{17}$$
Therefore, the total number of BRAMs required for the input weights of a layer l, depending on the layer size and the type of BRAM, will be
$$BRAM_{type1,x}^{l} = \left\lceil \frac{\dim(X^{l})}{f_w^{l}} \right\rceil \times 3, \qquad BRAM_{type2,x}^{l} = \left\lceil \frac{\dim(X^{l})}{f_w^{l}} \right\rceil \tag{18}$$
Similarly, the total number of BRAMs required for the hidden weights can be calculated using (18) by substituting $\dim(H^{l})$ in place of $\dim(X^{l})$.
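The BRAM sizing rules can be sketched in a few lines of Python. Two assumptions are made here and are not taken from the text: the address depth chosen for type 1 is the smallest of 512, 1024, and 2048 that holds the dimension, and the quotients are rounded down for the neurons per BRAM and up for the BRAM count. The function names are illustrative.

```python
import math
from typing import Tuple

BRAM_BITS = 36_864               # 36 Kb per BRAM
ADDR_SIZES = (512, 1024, 2048)   # depths of the 512x72, 1Kx36, 2Kx18 modes
TYPE_THRESHOLD = 512 // 3        # 170


def select_bram(dim: int) -> Tuple[int, int]:
    """Return (bram_type, addr_size) for a weight set with `dim` inputs."""
    if dim < TYPE_THRESHOLD:
        return 2, 512            # one BRAM holds the three equations
    for addr in ADDR_SIZES:
        if dim <= addr:          # assumption: smallest depth that fits
            return 1, addr
    raise ValueError("dimension exceeds the depths considered in this sketch")


def neurons_per_bram(w_bits: int, addr_size: int) -> int:
    """Number of neurons whose weight sets fit in one BRAM (f_w)."""
    return BRAM_BITS // (w_bits * addr_size)


def bram_count(dim: int, w_bits: int) -> int:
    """BRAMs needed for one weight set (input or hidden) of a layer."""
    bram_type, addr = select_bram(dim)
    groups = math.ceil(dim / neurons_per_bram(w_bits, addr))
    return groups * 3 if bram_type == 1 else groups


if __name__ == "__main__":
    print(select_bram(128), select_bram(256))  # (2, 512) (1, 512)
    print(neurons_per_bram(2, 1024))           # 18
    print(bram_count(1024, 2))                 # 171
```

The same function covers the hidden weights by passing the hidden dimension instead of the input dimension.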

6. Experiments

6.1. Experimental Settings

After the successful implementation of the 2QGRU architecture, the next step was to validate it with a well-known benchmark. TIMIT [13] was found to be the most widely used dataset for initial Automatic Speech Recognition (ASR) research and RNN FPGA implementations. It consists of recordings of 630 speakers of eight dialects of American English, each reading 10 phonetically rich sentences.
The task performed on this dataset is to detect and classify the sequences of phonemes from these recordings. Different combinations of preprocessing methods have been proposed for phoneme classification tasks. The most common ones are [16]: preemphasis, windowing (with a size of 25 ms and a step of 10 ms between samples), feature extraction (such as Mel-scaled Filter-banks (FBANKs) [17] or Mel-Frequency Cepstral Coefficients (MFCCs) [18]), normalization, and quantization.
After preprocessing, each data sample is introduced into a classification model (the GRU model in this work). This model must provide its predicted phoneme among the 62 covered by TIMIT. A training criterion such as the Connectionist Temporal Classification (CTC) allows the model to understand the connection between an input data sequence and the expected output sequence [19]. Then, some techniques can be used in the decoding phase, such as beam search or a greedy decoder. Finally, the phoneme error rate (PER) can be used to evaluate the accuracy of the model [16]. It consists of the number of all phoneme errors (inserted, deleted, and changed phonemes) divided by the total number of phonemes.
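The last two steps of this pipeline, greedy decoding and PER evaluation, can be sketched in plain Python. The sketch assumes that the per-frame labels are the arg-max outputs of the model and that index 0 is the CTC blank; both are illustrative assumptions, and the function names are not taken from the framework.

```python
from typing import List, Sequence


def greedy_ctc_decode(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Collapse repeated labels, then drop blanks (standard greedy CTC rule)."""
    decoded: List[int] = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def edit_distance(reference: Sequence[int], hypothesis: Sequence[int]) -> int:
    """Minimum number of inserted, deleted, and changed phonemes."""
    prev_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            row.append(min(
                prev_row[j] + 1,                           # deletion
                row[j - 1] + 1,                            # insertion
                prev_row[j - 1] + (ref_item != hyp_item),  # substitution
            ))
        prev_row = row
    return prev_row[-1]


def phoneme_error_rate(reference: Sequence[int], hypothesis: Sequence[int]) -> float:
    """Phoneme errors divided by the number of reference phonemes."""
    if not reference:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(greedy_ctc_decode([0, 3, 3, 0, 3, 5, 5, 0]))       # [3, 3, 5]
    print(phoneme_error_rate([1, 2, 3, 4], [1, 3, 4, 5]))    # 0.5
```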

6.2. Results and Discussion

This approach was developed with two main tools: PyTorch for exploring and training the quantized models and Vivado Design Suite 2021.2 for architectural development, component integration, and implementation analysis. To enable a fair comparison in Section 6.3 with the work developed in Spartus [20], the proposed 2QGRU architecture was implemented on the same Xilinx Zynq 7100 FPGA (xc7z100ffg900-1) and evaluated at an operating frequency of 200 MHz. Post-place-and-route worst-case timing was verified under this target frequency to ensure sufficient timing margin. Synthesis was carried out using the Flow_AreaOptimized strategy, while implementation used the Area_ExploreWithRemap strategy. The reported metrics in the experiments are derived from Vivado post-place-and-route estimations.
The features extracted from the audio signal in this work consist of 40 FBANK coefficients, the log-energy term, and their first and second temporal derivatives, following [16], which gives 123 features per frame. The data were normalized to the range [0, 1] and quantized according to $F_b^{l=1}$. The network was optimized using the RMSprop algorithm. To improve generalization, regularization was introduced by injecting white noise into both the input data, with $\sigma = 10^{-1}$, and the weights, using $\sigma = 10^{-3}$. The training configuration used a learning rate of $10^{-5}$, a batch size of 32, a weight decay of $10^{-4}$, and 30 training epochs. Figure 12 includes a schema of the process using our model, from the input audio data to the final detected phonemes.
Unless stated otherwise, all results reported in Section 6 refer to a fixed three-layer model with shape 123–1024–1024–62: the first two layers are 2QGRUs and the final one is a 2QMLP.
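The noise-injection regularization can be illustrated with a minimal Python sketch. It assumes zero-mean Gaussian noise and uses only the standard library; the actual training code is written in PyTorch, and whether noisy inputs are clipped back to [0, 1] is not stated, so no clipping is applied here. The helper name is hypothetical.

```python
import random
from typing import List

INPUT_NOISE_SIGMA = 1e-1    # standard deviation used for the input data
WEIGHT_NOISE_SIGMA = 1e-3   # standard deviation used for the weights


def add_white_noise(values: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Return a copy of `values` with zero-mean Gaussian noise added."""
    return [v + rng.gauss(0.0, sigma) for v in values]


if __name__ == "__main__":
    rng = random.Random(0)                  # fixed seed for repeatability
    frame = [0.2, 0.5, 0.9]                 # toy normalized features
    weights = [0.25, -0.125, 0.0625]        # toy weight values
    print(add_white_noise(frame, INPUT_NOISE_SIGMA, rng))
    print(add_white_noise(weights, WEIGHT_NOISE_SIGMA, rng))
```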

6.2.1. Training Phoneme Error Rate (PER)

The first step is to validate the weight quantization technique explained in Section 3.1. After testing various combinations and as anticipated in Section 3.1.3, the quantization for TIMIT was reduced to three values ($2^{-2}$, $2^{-3}$, and $2^{-4}$), plus zero and the sign. Therefore, 3 bits were enough to encode these weights in the implementation, as explained in Section 5.2. Based on this combination, a more extreme quantization was studied: removing the highest-weight candidates and keeping only $2^{-4}$, plus zero and the sign. This combination not only reduces the implementation to 2 bits, but also converts the shift operations ($M_1$ and $M_2$) into multiplexers controlled by the weights, plus an XOR gate for the sign. As a result, this configuration is functionally equivalent to a ternary neural network [3].
Both combinations of weight candidates were tested for different levels of quantization at the operation level. As explained in Section 3.2.1, this is managed by the number of decimal bits, $F_b$, and Equations (9) and (10). The integer bit representation, $I_b$, is calculated with (11) at the neuron and layer level. Thus, $I_b$ at the layer level is the maximum $I_b$ result among the neurons of the layer.
Table 2 presents the phoneme error rate (PER) for the aforementioned model configuration (123–1024–1024–62). It shows how the choice of the weight bit-width, $W_{p2}^{l}$, and the neuron output bit-width, $F_b^{l}$, for each layer $l$ affects performance. For ease of comparison, $F_b^{l}$ is the same for all the layers. Priority is given to the reduction in $W_{p2}^{l}$, as it has a greater impact on the final implementation than the reduction in $F_b^{l}$.
The best result achieved by applying the equations in Figure 6 is located in the first row, using $W_{p2}^{l} = 3$ and $F_b^{l} = 8$. Below these values, the PER starts to increase at different rates depending on the value of $W_{p2}^{l}$. While $F_b^{l} \ge 6$, the model provides a deviation below 1%, which increases considerably for lower values. The configuration highlighted in row 6, with $W_{p2}^{l} = 2$ for the 2QGRU layers, $W_{p2}^{l} = 3$ for the output 2QMLP layer, and $F_b^{l} = 6$, offers the best trade-off between accuracy and implementation efficiency. For that reason, this configuration is selected as the baseline for subsequent evaluations.
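To make the shift-based arithmetic concrete, the following Python sketch accumulates products of fixed-point activations and power-of-two weights using only shifts, negations, and additions. It is an illustration, not the hardware datapath: activations are assumed to be integer codes with a given number of fractional bits, weights are given as a sign and an exponent k meaning a magnitude of 2 to the power of minus k, and the accumulator keeps extra low-order bits so that no product is truncated. The names and the scaling convention are assumptions.

```python
from typing import List, Optional, Tuple

MAX_EXP = 4  # smallest magnitude considered here: 2**-4

# A weight is (sign, exponent); exponent None encodes a zero weight.
Weight = Tuple[int, Optional[int]]


def shift_product(act_code: int, weight: Weight) -> int:
    """act_code * (sign * 2**-exponent), in units of 2**-(F_b + MAX_EXP)."""
    sign, exponent = weight
    if exponent is None:
        return 0                                  # zero weight: output forced to zero
    shifted = act_code << (MAX_EXP - exponent)    # left shift replaces the multiplier
    return -shifted if sign < 0 else shifted


def neuron_accumulate(act_codes: List[int], weights: List[Weight], bias_code: int = 0) -> int:
    """Accumulate all shifted products on top of the bias."""
    acc = bias_code
    for act_code, weight in zip(act_codes, weights):
        acc += shift_product(act_code, weight)
    return acc


if __name__ == "__main__":
    frac_bits = 6
    acts = [32, 16, 64]                        # 0.5, 0.25, 1.0 with 6 fractional bits
    weights = [(1, 2), (-1, 3), (1, None)]     # +2**-2, -2**-3, zero
    acc = neuron_accumulate(acts, weights)
    print(acc)                                 # 96
    print(acc / 2 ** (frac_bits + MAX_EXP))    # 0.09375
```

When the only non-zero magnitude is the smallest one, every shift amount is zero, so each product reduces to selecting the activation, its negation, or zero, which corresponds to the multiplexer-based ternary case described above.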

6.2.2. Hardware Implementation Results

After selecting the desired configuration, an exploration of the hardware implementation can be performed to obtain insight into its attributes. In addition to $W_{p2}^{l}$ and $F_b^{l}$, the size of the layer and the selected compression ratio are the main hyperparameters that affect the final results.
Regarding the size of the layer, Table 3 shows the attributes of a 2QGRU layer when it has the same number of inputs and outputs for different layer sizes, and therefore (18) yields the same value for both input and hidden weights. For layer sizes 64 and 128, type 2 BRAMs are implemented, while the rest use type 1 BRAMs (see Section 5.2). While the number of BRAMs can be precalculated with (18), it can be observed that the FF and LUT counts increase in a similar proportion to the size of the layer. This is the expected behavior when neurons are independent of each other, while the impact of the resulting interfaces and fanout remains low. The same applies to the power, latency, and throughput values, which roughly keep the same ratio. It is important to note that due to the reduced bit utilization in both weights and data results, DSPs are not necessary in the architecture. For large values of $F_b^{l}$, a DSP could start to be beneficial for the implementation of $M_3$ (see Section 4.2.1). If this option is chosen, it implies the addition of one DSP per layer, which would have a minimal impact on area.
How does the framework’s implementation reduction feature affect these results? Taking as reference the implementation of 1024 neurons, Table 4 evaluates the implementation reduction strategy explained in Section 5.1 across different compression ratios, $C_{ratio}$. This strategy enables a reduction in the number of physically implemented neurons without altering either the layer-level behavior of the default implementation or the parameter usage, which remains constant at 6.29 M, as in the baseline. However, the effectiveness of this approach is limited by two main factors: the static nature of the default layer implementation and the additional overhead introduced by multiplexers.
It can be observed that, up to a reduction ratio of 6, the effective reduction in LUTs and FFs is notable, halving the resources of the default implementation. However, above this value, the reduction slows down. With an initial latency of 15 μs, the design offers a valuable improvement for applications with strict area constraints where ultra-low latency is not essential. For instance, many sensors operate at or below 100 Hz, as in speech processing tasks such as TIMIT where phonemes are typically analyzed using a 25 ms window with a 10 ms step. In this context, the baseline implementation from Table 4 is approximately 650 times faster than the data acquisition rate, allowing significant headroom for design compression or clock frequency reduction, both of which contribute to lower power consumption.
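The headroom figure can be reproduced with a short Python sketch. It assumes a baseline latency of 3076 clock cycles, which is what the latency expression of Section 4.2 gives for a 1024-neuron layer, and it assumes that latency scales exactly with the reduction factor, as stated in Section 5.1. The function names are illustrative.

```python
def frames_headroom(clock_hz: int, frame_step_us: int, latency_cycles: int) -> int:
    """How many times the layer latency fits into one frame step."""
    cycles_per_frame = clock_hz * frame_step_us // 1_000_000
    return cycles_per_frame // latency_cycles


def min_clock_hz(latency_cycles: int, f_r: int, frame_step_us: int) -> float:
    """Lowest clock frequency that still finishes f_r sub-cycles within one frame step."""
    return latency_cycles * f_r * 1_000_000 / frame_step_us


if __name__ == "__main__":
    base_cycles = 3076       # assumed baseline latency of a 1024-neuron layer
    print(frames_headroom(200_000_000, 10_000, base_cycles))   # 650
    print(min_clock_hz(base_cycles, f_r=8, frame_step_us=10_000))  # 2460800.0
```

With a 10 ms step and a 200 MHz clock, the margin of about 650 can be spent on a larger reduction factor, on a lower clock frequency, or on a combination of both.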

6.3. Comparison to Related Work

Most of the related works have focused their research on the efficient management of zero weights [11,20,21]. This practice usually requires indices to organize and read the final weights of the model, including techniques such as making small weights tend to zero, pruning non-relevant weights, and applying weight and temporal sparsity to reduce the number of operations per processed input. As a result, these techniques reduce the memory utilization of the architecture and increase the throughput of the architecture. In contrast, our work focuses on high quantization, which, instead of increasing the throughput of the architecture, allows for a reduction in hardware resources and power.
Table 5 contains the main attributes of the most relevant architectures for Recurrent Neural Networks on an FPGA that have been validated on the TIMIT core test. These works are sorted by year of publication, and their results are based on the implementation of the models’ input layer. Although the attributes under study may vary, these works keep in common a layer size of 1024 neurons. From 2017 with ESE [22] to 2024 with Spartus [20], they have explored and proposed different techniques for the efficient management of zero weights. The trend over the years has been to improve the sparsity of the model and to increase the model’s quantization, thereby reducing the latency and power consumption. The works in Table 5 implement MACs for matrix–vector multiplications (MxV) and, except for ESE and E-LSTM [11], these works process the input data with a batch size of 1. The best results in most of the fields have recently been achieved by Spartus, which combines spatio-temporal sparsity with a new column-balanced targeted dropout (CBTD) that helps to have a balanced workload.
We have selected two implementations from Table 2 to be included in Table 5, which are the extremes between the different possibilities: the smallest model with a minor PER deviation (first row in Table 2) and the smallest model with a reasonable deviation (sixth row in Table 2). To the best of our knowledge, our work 2QGRU is the only approach in the literature that has achieved similar PER results on TIMIT by using extreme quantization instead of sparsity compression, standing out above all for its reduced use of bits. 2QGRU techniques result in models with 2–3 bits for their weights and up to 5–6 bits for the activation function. Compared to the works in Table 5, 2QGRU prioritizes and outperforms in the reduction of resource utilization and power consumption, being the only one that avoids the use of DSPs. In addition, 2QGRU stands out by prioritizing robustness: it implements one MAC per neuron and performs vector–scalar shift operations. The E-LSTM model [11] also uses weight quantization based on powers of two, achieving the lowest bit usage after 2QGRU in Table 5 (8 bits for the weights and operations). However, the E-LSTM model works with a batch size of 8, prioritizing the throughput of the model.
There are two additional differences to highlight:
  • Predictability: While sparsity compression depends heavily on the training process to achieve good latency and resource performance, our quantization technique depends only on the number of neurons and the bit-width used. Therefore, the final hardware requirements of the 2QGRU architecture are known without additional effort after selecting the model hyperparameters, enabling quick and early design space exploration (step 2 in Figure 3).
  • Resource utilization: Except for 2QGRU, none of the works in Table 5 could implement more than one layer on their respective FPGA due to their resource-consuming approaches. Notably, their reported results are limited to the input layer (123–153 × 1024 neurons), which is typically smaller than subsequent hidden layers (1024 × 1024 neurons), further emphasizing the scalability limitations of BRAM- and DSP-constrained approaches. In contrast, 2QGRU achieves the lowest BRAM usage and is entirely DSP-free.
EdgeDRNN is the only relevant work found in the literature that presents the results of a full model implementation on an FPGA [21]. EdgeDRNN focuses on edge applications, enabling large reductions in resources and power. However, EdgeDRNN does not validate its performance on TIMIT and uses a smaller model shape, which makes a fair comparison with the works in Table 5 difficult. Inspired by EdgeDRNN, Spartus also presents a special architecture for edge purposes called Edge-Spartus, though its optimizations are limited to the input layer.
These works are compared in Table 6 with the full three-layer model implementation of 2QGRU, using the reduced implementation of the 2QGRU model as an edge alternative (see Section 5.1). The reduced 2QGRU solution provides similar figures in FFs and LUTs to EdgeDRNN and Edge-Spartus, without using DSPs and without compromising accuracy, as parameters remain unchanged (see row 6). Furthermore, this reduced implementation gives the model the flexibility to adapt its features to the requirements of the final application, achieving higher throughput when needed.
A key distinction seen in Table 6 is that both EdgeDRNN and Edge-Spartus rely on external DRAM for weight storage, suitable for low-cost FPGAs with limited BRAM, but introducing memory bottlenecks constrained by bandwidth. While the XC7Z100 device is not ideal for edge applications due to its size, power consumption, and cost, the proposed 2QGRU architecture can be readily adapted for deployment on low-density FPGAs by sourcing weights from external DRAM. This adaptation requires two key changes:
  • Similarly to the approach in Section 5.1, transition from a parallel to a sequential execution model, implementing only one layer at a time to reduce the model’s bit bandwidth requirements.
  • Adjust the number of neurons in this single layer with (16) to match the available DRAM input bandwidth, thereby optimizing resource usage and eliminating the need for BRAMs.
This architectural strategy, executing layers sequentially and tailoring the implementation for minimal hardware overhead, enables the integration of larger models on resource-constrained devices. Under this configuration, the primary design constraint shifts from on-chip BRAM availability to throughput requirements. Consequently, external DRAM-based weight storage is fully compatible with the proposed approach and can be selectively employed based on the specific constraints of the target application. Similarly, efficient zero-weight management is also compatible, although its integration would require further research.

7. Conclusions

This paper has introduced 2QGRU, the first GRU hardware accelerator that exploits power-of-two weight quantization to outperform weight sparsity techniques in terms of resource and power consumption for RNN implementations in fixed-point hardware. Unlike previous works, 2QGRU replaces the use of matrix–vector multiplications with vector–scalar shift operations, avoiding the use of DSPs and simplifying the final design. This improvement is achieved by compact scheduling of the operations performed, which maximizes hardware resource reuse and parallelization, together with optimal data bit-width management. All proposed techniques are integrated into a custom automated framework that ensures compatibility and portability between different 2QANN models. The framework emphasizes automated hardware generation, exploration of implementation trade-offs, and high configurability, including a flexible reduced implementation strategy that enables the deployment of 2QGRU in edge and IoT environments.
Although this work focuses on phoneme recognition using the TIMIT dataset, the scalability of the proposed 2QGRU performance to larger datasets, deeper architectures, and more complex sequence modeling tasks remains an important direction for future research. Appendix A outlines the envisioned workflow for adapting the framework to new datasets, model configurations, and hardware constraints in order to systematically explore this scalability. In contrast, the hardware architecture already demonstrates clear scalability through low LUT, FF, and BRAM utilization and a fully DSP-free implementation, providing substantial headroom for scaling to larger models.

Author Contributions

Conceptualization, M.M.F. and M.L.-V.; methodology, M.M.F. and M.L.-V.; software, M.M.F. and S.J.H.C.; validation, M.M.F. and S.J.H.C.; formal analysis, M.M.F.; investigation, M.M.F. and M.L.-V.; resources, M.M.F.; data curation, M.M.F. and S.J.H.C.; writing—original draft preparation, M.M.F., S.J.H.C., J.M.G. and M.L.-V.; writing—review and editing, M.M.F., S.J.H.C., J.M.G., M.L.-V., D.P.M.S. and M.P.C.; visualization, M.M.F. and M.L.-V.; supervision, D.P.M.S. and M.P.C.; project administration, D.P.M.S. and M.P.C.; funding acquisition, D.P.M.S. and M.P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work has received funding from the Key Digital Technologies Joint Undertaking (JU) under grant agreement No. 101112268 (NEUROKIT2E). The JU receives support from the European Union’s Horizon Europe research and innovation program and France, the Netherlands, Austria, Italy, and Germany. Furthermore, on the national level, this work is supported by the German Federal Ministry of Education and Research (BMBF) under the sub-project with the funding number 16MEE0300.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/philipperemy/timit (accessed on 15 December 2025).

Conflicts of Interest

Author Javier Mendez Gomez was employed by the company HTEC GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
ANNs: Artificial Neural Networks
FPGAs: Field-Programmable Gate Arrays
GRUs: Gated Recurrent Units
RNNs: Recurrent Neural Networks
AI: Artificial Intelligence
LLMs: Large Language Models
MLP: Multilayer Perceptron
STE: Straight-Through Estimator
AF: Activation Function
PER: Phoneme Error Rate
2QANN: ANN Architectures with the Proposed Modifications
ASR: Automatic Speech Recognition
CTC: Connectionist Temporal Classification

Appendix A. Framework Workflow: From ANN Model to FPGA Implementation

This appendix describes the intended usage workflow of the proposed framework, from model exploration to hardware implementation. It is assumed that the user has already selected a target ANN model and FPGA platform, that the corresponding 2QANN architecture is available, and that the dataset of interest has been selected, preprocessed, and normalized. Under these assumptions, the workflow can then be organized into three main stages:
  • Stage 1: Baseline ANN model exploration and training.
    This stage is optional but strongly recommended. It provides insight into the behavior of the full-precision model (prior to quantization) when processing the target dataset. This preliminary analysis helps identify suitable model hyperparameters more efficiently and establishes a performance baseline for comparison with the corresponding 2QANN model. This stage is typically iterated until satisfactory performance is achieved and can be summarized by the following steps:
    • Selection of model hyperparameters, including target architecture and model size.
    • Training of the baseline ANN model in PyTorch, for example by applying equations in Figure 2 for GRU layers.
    • Evaluation of performance and convergence.
  • Stage 2: 2QANN model exploration and training.
    This stage focuses on configuring and training the quantized 2QANN model, guided by both the baseline ANN behavior from Stage 1 and early hardware-aware estimations. It can be summarized by the following steps:
    • Analysis of the learned weight distributions of the baseline ANN model. Based on this analysis, the user defines the quantization hyperparameters $n_\sigma$ and $N_{p2}$ (see Section 3.1 and the example in Section 3.1.3). Note: Since $N_{p2}$ primarily determines the final weight precision, it is recommended to start with a conservative value and, once convergence is verified, gradually reduce it to achieve an acceptable trade-off between model performance and bit-width reduction.
    • Selection of model hyperparameters. In addition to the standard ANN hyperparameters (normally defined in Stage 1 and refined in Stage 2 when required), the user must define the decimal bit-width $F_b^{l}$ per layer $l$, which determines the precision of the input data, inter-layer activations, and output values (see Section 3.2 and Figure 5). Note: As a general recommendation, the same $F_b^{l}$ can be initially assigned to all layers and later reduced on a per-layer basis.
    • Early design-space exploration using analytical estimations (see Step 2 in Section 2.2). Depending on the estimations and the target hardware, the user can decide whether to keep the current hyperparameters, modify them, or enable the implementation reduction strategy by selecting $f_r > 1$ (see Section 5.1).
    • Automated quantization of the dataset according to the selected value of $F_b^{l=1}$.
    • Automated generation of training parameters for dual-weight training, where $T_{up}$, $T_{low}$, and $\tilde{W}$ (Equations (3)–(5)) are derived from $n_\sigma$ and $N_{p2}$, and $V_{min}$ and $V_{max}$ (Equations (7) and (8)) are derived from $F_b^{l}$.
    • Training of the 2QANN model in PyTorch, for example by applying the equations in Figure 6 for 2QGRU layers.
    • Evaluation of performance and convergence of the 2QANN model, including deviation analysis with respect to the baseline ANN model from Stage 1.
  • Stage 3: FPGA implementation.
    • Automated generation of implementation parameters, including the integer bit-width $I_b^{l}$ (11) and all remaining parameters defined in Figure 5 (derived from $n_\sigma$, $N_{p2}$, $F_b^{l}$, and $I_b^{l}$). This step also includes the determination of $W_{p2}$ (see Section 5.2) and the selection of the BRAM type according to the target FPGA (17) and (18).
    • Automated extraction of weights and biases from the trained 2QANN model (see Section 5.2). The extracted parameters are organized into sets of yml files, with one file per BRAM instance to be instantiated in the hardware.
    • Automated VHDL code generation based on the selected parameters. Using predefined neuron cells and 2QANN architectural templates (see Section 4), and based on the value of $f_r$ (see Section 5.1), this step configures each 2QANN layer instantiation, bit-width allocation, and interconnections accordingly.
    • Hardware synthesis, implementation, and evaluation using Vivado. The user should specify the synthesis and implementation strategies to be applied.
In summary, apart from the usual ANN hyperparameters, the user only needs to specify three quantization-related hyperparameters ($n_\sigma$, $N_{p2}$, and $F_b^{l}$) and may optionally enable the reduction strategy by setting $f_r$. Once these parameters are defined, training proceeds as in a conventional ANN workflow. After satisfactory accuracy and convergence are achieved, the process advances to Stage 3 for hardware implementation. This sequence is the intended general workflow of the proposed framework.
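As an illustration of the dataset quantization step in Stage 2, the following Python sketch rounds real-valued inputs onto a fixed-point grid. It assumes that the decimal bit-width of the first layer denotes the number of fractional bits and that round-to-nearest is used; the function name `quantize_fixed_point` is hypothetical, and the saturation bounds from Equations (7) and (8) are not modelled here.

```python
def quantize_fixed_point(values, frac_bits):
    """Round each value to the nearest multiple of 2**-frac_bits.

    Assumption: frac_bits plays the role of the decimal bit-width;
    no clipping to a minimum or maximum value is applied.
    """
    scale = 1 << frac_bits  # number of quantization steps per unit
    # Python's round() resolves exact .5 ties to the even neighbour.
    return [round(v * scale) / scale for v in values]


if __name__ == "__main__":
    samples = [0.3, -0.7, 1.0]
    for bits in (8, 6, 4):
        print(bits, quantize_fixed_point(samples, bits))
```

Lowering `frac_bits` coarsens the grid, which is the precision trade-off explored when the decimal bit-width is reduced on a per-layer basis.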

References

  1. Wang, B. Nvidia Increased Compute Power 1000× in 8 Years to 20 Petaflops in the Blackwell GPU. 2024. Available online: https://www.nextbigfuture.com/2024/03/nvidia-increased-compute-power-1000x-in-8-years-to-20-petaflops-in-the-blackwell-gpu.html (accessed on 10 December 2025).
  2. Ma, S.; Wang, H.; Ma, L.; Wang, L.; Wang, W.; Huang, S.; Dong, L.; Wang, R.; Xue, J.; Wei, F. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv 2024, arXiv:2402.17764.
  3. Molina, M.; Mendez, J.; Morales, D.P.; Castillo, E.; Vallejo, M.L.; Pegalajar, M. Power-Efficient Implementation of Ternary Neural Networks in Edge Devices. IEEE Internet Things J. 2022, 9, 20111–20121.
  4. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
  5. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2.
  6. Frean, M. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comput. 1990, 2, 198–209.
  7. Silfa, F.; Dot, G.; Arnau, J.M.; Gonzàlez, A. E-PUR: An energy-efficient processing unit for recurrent neural networks. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, 1–4 November 2018; pp. 1–12.
  8. Rybalkin, V.; Wehn, N.; Yousefi, M.R.; Stricker, D. Hardware architecture of Bidirectional Long Short-Term Memory Neural Network for Optical Character Recognition. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017; pp. 1390–1395.
  9. Lee, M.; Hwang, K.; Park, J.; Choi, S.; Shin, S.; Sung, W. FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks. In Proceedings of the 2016 IEEE International Workshop on Signal Processing Systems (SiPS), Dallas, TX, USA, 26–28 October 2016; pp. 230–235.
  10. Wang, Z.; Lin, J.; Wang, Z. Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference. IEEE Signal Process. Lett. 2018, 25, 984–988.
  11. Wang, M.; Wang, Z.; Lu, J.; Lin, J.; Wang, Z. E-LSTM: An Efficient Hardware Architecture for Long Short-Term Memory. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 280–291.
  12. Yin, P.; Lyu, J.; Zhang, S.; Osher, S.; Qi, Y.; Xin, J. Understanding straight-through estimator in training activation quantized neural nets. arXiv 2019, arXiv:1903.05662.
  13. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM; NIST speech disc 1-1.1; NIST Interagency/Internal Report (NISTIR); National Institute of Standards and Technology: Gaithersburg, MD, USA, 1993; Volume 93, p. 27403.
  14. Mittal, S.; Umesh, S. A survey on hardware accelerators and optimization techniques for RNNs. J. Syst. Archit. 2021, 112, 101839.
  15. AMD (Xilinx). Zynq-7000 SoC Data Sheet: Overview (DS190, v1.11.1), July 2018. Available online: https://docs.xilinx.com/v/u/en-US/ds190-Zynq-7000-Overview (accessed on 8 December 2025).
  16. Ravanelli, M.; Parcollet, T.; Bengio, Y. The pytorch-kaldi speech recognition toolkit. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6465–6469.
  17. Gowdy, J.N.; Tufekci, Z. Mel-scaled discrete wavelet coefficients for speech recognition. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No. 00CH37100), Istanbul, Turkey, 5–9 June 2000; Volume 3, pp. 1351–1354.
  18. Logan, B. Mel frequency cepstral coefficients for music modeling. In Proceedings of the International Symposium on Music Information Retrieval, Plymouth, MA, USA, 23–25 October 2000; Volume 270, p. 11.
  19. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
  20. Gao, C.; Delbruck, T.; Liu, S.C. Spartus: A 9.4 TOp/s FPGA-Based LSTM Accelerator Exploiting Spatio-Temporal Sparsity. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 1098–1112.
  21. Gao, C.; Rios-Navarro, A.; Chen, X.; Liu, S.; Delbruck, T. EdgeDRNN: Recurrent neural network accelerator for edge inference. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 419–432.
  22. Han, S.; Kang, J.; Mao, H.; Hu, Y.; Li, X.; Li, Y.; Xie, D.; Luo, H.; Yao, S.; Wang, Y. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 75–84.
  23. Wang, S.; Li, Z.; Ding, C.; Yuan, B.; Qiu, Q.; Wang, Y.; Liang, Y. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 25–27 February 2018; pp. 11–20.
  24. Li, Z.; Ding, C.; Wang, S.; Wen, W.; Zhuo, Y.; Liu, C.; Qiu, Q.; Xu, W.; Lin, X.; Qian, X.; et al. E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 16–20 February 2019; pp. 69–80.
  25. Cao, S.; Zhang, C.; Yao, Z.; Xiao, W.; Nie, L.; Zhan, D.; Liu, Y.; Wu, M.; Zhang, L. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 63–72.
Figure 1. Equations of the MLP (top) and the RNN (bottom).
Figure 2. Equations of the GRU.
Figure 3. ANN framework’s flow graph based on [3]. It illustrates the steps followed to achieve the deployment of an ANN.
Figure 4. Example of weight updates during training with $N_{p2} = 3$ and $n_\sigma = 2$.
Figure 5. Bit precision representation (above) and dataflow schema example (below) for layer l.
Figure 6. Equations of the 2QGRU.
Figure 7. 2QGRU architecture. Blue area: the proposed ANN architecture. Green area: 2QGRU neuron level. Purple area: 2QGRU layer level.
Figure 8. System schedule: the equation from Figure 6 performed in each execution cycle, $T_X$, by each component in Figure 7.
Figure 9. System schedule detail: the operation performed in each clock cycle during the execution cycle $T_7$, synchronized using a counter, $C_1$, from the Controller (Ctrl). To simplify the nomenclature, $\dim(H^{l-1}) = n_1$ and $\dim(H^{l}) = n_2$. Note that $n_2 > n_1$.
Figure 10. (a) 2QMLP architecture. (b) 2QRNN architecture.
Figure 11. BRAM input weight distribution types for $n_2 = \dim(X^{l})$.
Figure 12. Phoneme use case setup.
Table 1. Example of weight bit-width compression in memory.
$N_{p2}$ * = 7, $n_\sigma$ = 1 | | | | $N_{p2}$ = 3, $n_\sigma$ = 2 | | |
Powers of Two | Binary Bits | $W_{p2}$ Bits † | Shift | Powers of Two | Binary Bits | $W_{p2}$ Bits † | Shift
0 | S.0000000 | S.111 | Null | 0 | S.000 | S.11 | Null
$2^{-7}$ | S.0000001 | S.110 | 6 | $2^{-4}$ | S.001 | S.10 | 2
$2^{-6}$ | S.0000010 | S.101 | 5 | $2^{-3}$ | S.010 | S.01 | 1
$2^{-5}$ | S.0000100 | S.100 | 4 | $2^{-2}$ | S.100 | S.00 | 0
$2^{-4}$ | S.0001000 | S.011 | 3 | - | - | - | -
$2^{-3}$ | S.0010000 | S.010 | 2 | - | - | - | -
$2^{-2}$ | S.0100000 | S.001 | 1 | - | - | - | -
$2^{-1}$ | S.1000000 | S.000 | 0 | - | - | - | -
* Number of possible weight values. † Compressed weight bit-width.
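The mapping in Table 1 can be sketched in a few lines of Python. This is an illustrative reading of the table, not the authors' implementation: it assumes that the tabulated powers of two are negative exponents (as the binary patterns suggest), that the Shift column is the offset from the largest magnitude $2^{-n_\sigma}$, and that the all-ones code is reserved for a zero weight. The function names are hypothetical, and the sign bit S is handled outside the sketch.

```python
def code_width(n_p2):
    """Bits needed for n_p2 shift codes plus one reserved zero code."""
    return n_p2.bit_length()


def encode_magnitude(exponent, n_p2, n_sigma):
    """Return the compressed code (bit string) for |w| = 2**-exponent.

    Pass exponent=None for a zero weight, which maps to the all-ones code.
    """
    width = code_width(n_p2)
    if exponent is None:
        code = (1 << width) - 1  # reserved "Null" entry
    else:
        shift = exponent - n_sigma  # offset from the largest magnitude
        if not 0 <= shift < n_p2:
            raise ValueError("exponent outside the representable range")
        code = shift
    return format(code, f"0{width}b")


def decode_magnitude(bits, n_sigma):
    """Inverse of encode_magnitude: bit string back to a magnitude."""
    if set(bits) == {"1"}:
        return 0.0
    return 2.0 ** -(n_sigma + int(bits, 2))


if __name__ == "__main__":
    # Left half of Table 1: seven magnitudes, largest is 2**-1.
    print([encode_magnitude(e, 7, 1) for e in (None, 7, 6, 5, 4, 3, 2, 1)])
    # Right half of Table 1: three magnitudes, largest is 2**-2.
    print([encode_magnitude(e, 3, 2) for e in (None, 4, 3, 2)])
```

Storing the shift amount rather than the one-hot binary pattern is what reduces seven magnitude bits to three, and three to two, in the two examples of the table.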
Table 2. Training quantization vs. PER comparison.
$W_{p2}^{l}$ * | $F_b^{l}$ † | $I_b^{l}$ ‡ (Max) | $I_b^{l}$ ‡ (Mean) | $\tilde{T}_b^{l}$ § | PER (%) | Dev.
3-3-3 | 8 | 8-8-9 | 6-6-8 | 14-14-15 | 23.45 | 0.0
3-3-3 | 6 | 8-8-9 | 6-6-8 | 12-12-13 | 23.97 | +0.52
3-3-3 | 5 | 8-8-9 | 6-6-8 | 11-11-12 | 24.62 | +1.17
3-3-3 | 4 | 8-8-9 | 6-6-8 | 10-10-11 | 36.80 | +13.35
2-2-3 | 8 | 10-10-9 | 8-8-8 | 14-14-15 | 24.37 | +0.92
2-2-3 | 6 | 10-10-9 | 8-8-8 | 12-12-13 | 24.27 | +0.82
2-2-3 | 5 | 10-10-9 | 8-8-8 | 11-11-12 | 29.25 | +5.80
2-2-2 | 8 | 10-10-10 | 8-8-10 | 14-14-14 | 24.31 | +0.86
2-2-2 | 6 | 10-10-10 | 8-8-10 | 12-12-12 | 24.88 | +1.43
2-2-2 | 5 | 10-10-10 | 8-8-10 | 11-11-11 | 30.17 | +6.72
* Weight bit-width per layer. † Neuron output bit-width per layer. ‡ Integer bit-width, the mean and maximum values per layer. § Mean bit-width after addition per layer. The highlighted configuration ($W_{p2}^{l}$ = 2-2-3, $F_b^{l}$ = 6) offers the best trade-off between accuracy and implementation efficiency.
Table 3. 2QGRU implementation for different layer sizes (with $W_{p2}^{l}$ = 2-2-3 and $F_b^{l}$ = 6).
Neurons | 64-64 | 128-128 | 256-256 | 512-512 | 1024-1024
Parameters (M) | 0.02 | 0.10 | 0.39 | 1.57 | 6.29
$I_b^{l}$ (Mean/Max) | 4/5 | 5/6 | 5/7 | 6/9 | 8/10
BRAMs | 4 | 8 | 48 | 90 | 342
FF (×1000) | 2 | 4 | 9 | 19 | 41
LUT (×1000) | 1 | 3 | 6 | 14 | 29
Power (W) | 0.3 | 0.5 | 0.7 | 1.2 | 2.4
Latency (μs) | 0.98 | 1.94 | 3.86 | 7.70 | 15.38
Throughput (kFPS) | 1020 | 515 | 259 | 130 | 65
Power Eff. (kFPS/W) | 3000 | 1105 | 395 | 111 | 27
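The throughput row of Table 3 appears consistent with one frame being processed per latency period. The short Python check below illustrates that relationship under this assumption; the function name `throughput_kfps` is hypothetical, and the power-efficiency row is not reproduced because it depends on power values that are rounded in the table.

```python
def throughput_kfps(latency_us):
    """Thousands of frames per second for one frame every latency_us."""
    return 1000.0 / latency_us


if __name__ == "__main__":
    # Latencies (in microseconds) as listed in Table 3.
    latencies_us = [0.98, 1.94, 3.86, 7.70, 15.38]
    print([round(throughput_kfps(t)) for t in latencies_us])
```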
Table 4. 2QGRU implementations for different compression ratios ($C_{ratio}$). (Baseline: 1024-1024 hidden layers in Table 3).
Neurons | 1024-1024 * | | | |
$C_{ratio}$ | 86/12 | 128/8 | 171/6 | 256/4 | 512/2
FF (×1000) | 9 | 10 | 10 | 17 | 24
FF reduction | ×4.3 | ×3.9 | ×3.8 | ×2.4 | ×1.7
LUT (×1000) | 9 | 10 | 11 | 13 | 19
LUT reduction | ×3.3 | ×2.9 | ×2.7 | ×2.2 | ×1.55
Power (W) | 1.1 | 1.2 | 1.2 | 1.4 | 1.7
Power reduction | ×2.2 | ×2.0 | ×2.0 | ×1.7 | ×1.4
Latency increase | ×12 | ×8 | ×6 | ×4 | ×2
LUT effective reduction (%) | 28 | 36 | 48 | 55 | 78
* Parameters and BRAMs remain constant at 6.29 M and 342, respectively.
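One way to read the $C_{ratio}$ entries in Table 4 is as the number of physical neuron cells over the reduction factor, with the cell count obtained by ceiling division of the layer size. The sketch below checks this reading against the table; it is an interpretation rather than a formula stated in this section, and the function name `instantiated_cells` is hypothetical.

```python
def instantiated_cells(neurons, reduction_factor):
    """Physical cells needed when each cell is time-shared
    by up to reduction_factor logical neurons (ceiling division)."""
    return -(-neurons // reduction_factor)


if __name__ == "__main__":
    for f_r in (12, 8, 6, 4, 2):
        cells = instantiated_cells(1024, f_r)
        # The latency multiplier in Table 4 matches the reduction factor.
        print(f"{cells}/{f_r}: latency x{f_r}")
```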
Table 5. Input layer comparison on TIMIT.
 | ESE [22] | CLSTM [23] | E-RNN [24] | E-LSTM [11] | BBS [25] | Spartus [20] | 2QGRU | 2QGRU
Year | 2017 | 2018 | 2019 | 2019 | 2019 | 2024 | 2026 | 2026
Model | Google LSTM | Google LSTM | GRU | LSTM | LSTM | LSTM | GRU | GRU
Input Layer | 153 × 1024 | 153 × 1024 | 153 × 1024 | 153 × 1024 | 153 × 1024 | 123 × 1024 | 123 × 1024 | 123 × 1024
Language Model | Yes | - | - | No | - | No | No | No
PER on TIMIT (%) | 20.7 | 24.6 | 20.4 | 23.2 | 23.6 | 21.8 | 23.45 | 24.27
FPGA Platform | XCKU060 | 7V3 | 7V3 | SX660 | GX1150 | XC7Z100 | XC7Z100 | XC7Z100
Frequency (MHz) | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200
Bit Precision (A/W/I) * | 16/12/4 | 16/16/0 | 12/12/0 | 8/8/4 | 16/16/0 | 16/8/8 | 8/3/0 | 6/2/0
Parameters (M) | 3.25 | 3.25 | ∼3.66 | 4.82 | 3.25 | 4.70 | 3.52 | 3.52
After Compression (M) | 0.36 | 0.20 | 0.23 | 0.60 | 0.41 | 0.29 | - | -
MACs | 32 | 128 | 128 | 128 | 4096 | 512 | 1024 | 1024
Weights On-chip | No | Yes | Yes | No | Hybrid | Yes | Yes | Yes
DSPs | 1504 | 2787 | 2315 | 24 | 1518 | 520 | 0 | 0
BRAMs | 947 | 931 | 1167 | 430 | 2509 | 250 | 301 | 200
FF (×1000) | 453 | 207 | 259 | 156 | N/A | 108 | 49 | 40
LUT/ALM (×1000) | 364 | 475 | 579 | 220 | 289 | 136 | 39 | 27
Batch size (B) | 32 | 1 | 1 | 8 | 1 | 1 | 1 | 1
Power (W) | 41 | 23 | 29 | 15.9 | 19.1 | 8.4 | 2.7 | 2.0
Latency (μs) | 82.7 | 9.1 | 6.5 | 23.9 | 2.4 | 1 | 15.38 | 15.38
Throughput (kFPS) | 387 | 330 | 465 | 335 | ∼417 | 1001 | 65 | 65
Throughput, B = 1 (kFPS) | ∼12 | 330 | 465 | ∼42 | ∼417 | 1001 | 65 | 65
Power Eff. (kFPS/W) | ∼9.44 | 14.36 | 16.02 | 21.07 | ∼21.82 | 119.17 | 24.18 | 32.03
Power Eff., B = 1 (kFPS/W) | ∼0.29 | 14.36 | 16.02 | 2.63 | ∼21.82 | 119.17 | 24.18 | 32.03
* Activation function/weights/index. Values with ∼ are estimated by the authors from the metrics reported in the original works when not directly available. In particular, throughput for batch size B = 1 is estimated assuming linear scaling. Most of the works publish utilization percentages for each hardware component. To allow a clear comparison among the different FPGA platforms, absolute values have been calculated.
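The linear-scaling estimate mentioned in the footnote of Table 5 can be reproduced with simple arithmetic. The Python sketch below divides the reported throughput by the batch size and then by the power, using the ESE [22] and E-LSTM [11] figures from the table as inputs; the function names are hypothetical and the rounding is chosen only to match the precision shown in the table.

```python
def per_sample_throughput(throughput_kfps, batch_size):
    """Batch-1 throughput estimate, assuming linear scaling with batch size."""
    return throughput_kfps / batch_size


def power_efficiency(throughput_kfps, power_w):
    """Throughput per watt in kFPS/W."""
    return throughput_kfps / power_w


if __name__ == "__main__":
    # (name, throughput in kFPS, batch size, power in W) as listed in Table 5.
    for name, kfps, batch, power in (("ESE", 387, 32, 41), ("E-LSTM", 335, 8, 15.9)):
        b1 = per_sample_throughput(kfps, batch)
        print(name, round(b1), round(power_efficiency(b1, power), 2))
```

Works that already report batch size 1 are unaffected by this normalization, which is why most columns repeat the same value in both throughput rows.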
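Table 6. Complete model implementation comparison with other works.
 | EdgeDRNN [21] | Edge-Spartus [20] | 2QGRU | 2QGRU | 2QGRU | 2QGRU | 2QGRU | 2QGRU
Year | 2020 | 2024 | 2026 | 2026 | 2026 | 2026 | 2026 | 2026
Model | DeltaGRU | DeltaLSTM | GRU | GRU | GRU | GRU | GRU | GRU
Neurons | 40-768-768-9 | 123-1024 * | 123-1024-1024-62 | 123-1024-1024-62 | 123-1024-1024-62 | 40-768-768-9 | 40-768-768-9 | 40-768-768-9
FPGA Platform | xc7z0707s | xc7z0707s | XC7Z100 | XC7Z100 | XC7Z100 | XC7Z100 | XC7Z100 | XC7Z100
Bit Precision (A/W/I) | 16/8/0 | 16/8/10 | 6/2-2-3/0 | 6/2-2-3/0 | 6/2-2-3/0 | 6/2-2-3/0 | 6/2-2-3/0 | 6/2-2-3/0
Parameters (M) | 5.4 | 4.7 | 9.9 | 9.9 | 9.9 | 5.4 | 5.4 | 5.4
Sparsity Compression (M) | - | 0.29 | - | - | - | - | - | -
$C_{ratio}$ | - | - | 1/1 | 171/6 | 86/12 | 1/1 | 128/6 | 64/12
Frequency (MHz) | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200
MACs | 8 | 4 | 2110 | 353 | 178 | 1545 | 258 | 130
DSPs | 9 | 5 | 0 | 0 | 0 | 0 | 0 | 0
BRAMs | 38 | 33 | 548 | 548 | 548 | 412 | 412 | 412
FF (×1000) | 19 | 12 | 82 | 17 | 14 | 59 | 14 | 10
LUT/ALM (×1000) | 10 | 12 | 57 | 19 | 15 | 41 | 15 | 11
Weights On-chip | No | No | Yes | Yes | Yes | Yes | Yes | Yes
On-Chip Power (W) | 1.4 | 1.4 | 2.4 | 1.3 | 1.3 | 2.4 | 1.5 | 1.2
Latency (μs) | 536 | 122 | 36 | 215 | 431 | 27 | 162 | 323
Throughput (kFPS) | 2 | 8 | 65 | 11 | 5 | 87 | 14 | 7
Power Eff. (kFPS/W) | 1.4 | 5.7 | 27.3 | 8.0 | 4.0 | 36.2 | 9.9 | 5.9
* Edge-Spartus provides implementation results exclusively for the input layer, whereas this work presents results for the complete three-layer model (see number of parameters).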
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
