6.1. Experimental Settings
After the successful implementation of the
architecture, the next step was to validate it on a well-known benchmark. TIMIT [13] was found to be the most widely used dataset for initial Automatic Speech Recognition (ASR) research and RNN FPGA implementations. It consists of recordings of 630 speakers of eight dialects of American English, each reading 10 phonetically rich sentences.
The task performed on this dataset is to detect and classify the sequences of phonemes in these recordings. Different combinations of preprocessing methods have been proposed for phoneme classification tasks. The most common steps are [16]: preemphasis, windowing (with a window size of 25 ms and a step of 10 ms between consecutive windows), feature extraction (such as Mel-scaled Filter-banks (FBANKs) [17] or Mel-Frequency Cepstral Coefficients (MFCCs) [18]), normalization, and quantization.
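The first two preprocessing steps can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the preemphasis coefficient of 0.97 and the Hamming window are assumptions (the text only names "preemphasis" and "windowing"), and the function names are hypothetical. The 16 kHz sampling rate is that of the TIMIT recordings.

```python
import math


def preemphasize(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]. coeff=0.97 is an assumed, commonly used value."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_signal(signal, sample_rate=16000, win_ms=25, step_ms=10):
    """Split a signal into overlapping frames (25 ms window, 10 ms step by default).

    Each frame is multiplied by a Hamming window (an assumption; the text does
    not name the window function). Trailing samples that do not fill a whole
    frame are dropped.
    """
    win = sample_rate * win_ms // 1000    # 400 samples at 16 kHz
    step = sample_rate * step_ms // 1000  # 160 samples at 16 kHz
    if len(signal) < win:
        return []
    n_frames = 1 + (len(signal) - win) // step
    hamming = [
        0.54 - 0.46 * math.cos(2 * math.pi * i / (win - 1)) for i in range(win)
    ]
    return [
        [signal[f * step + i] * hamming[i] for i in range(win)]
        for f in range(n_frames)
    ]
```

With these settings, a new frame is produced every 10 ms, which is the input rate the classification model has to sustain.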
After preprocessing, each data sample is fed into a classification model (the GRU model in this work). This model must provide its predicted phoneme among the 62 covered by TIMIT. A training criterion such as Connectionist Temporal Classification (CTC) allows the model to learn the correspondence between an input data sequence and the expected output sequence [19]. Then, techniques such as beam search or a greedy decoder can be used in the decoding phase. Finally, the phoneme error rate (PER) can be used to evaluate the accuracy of the model [16]. It is the number of phoneme errors (insertions, deletions, and substitutions) divided by the total number of phonemes in the reference transcription.
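The decoding and scoring steps described above can be illustrated with a short sketch. It assumes a greedy CTC decoder (collapse repeated labels, then drop blanks) and computes the PER as the Levenshtein distance between the reference and the hypothesis divided by the reference length. The function names and the choice of label 0 as the CTC blank are assumptions made for this example.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse consecutive repeated labels, then remove blanks (blank index is assumed)."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(
                min(
                    prev_row[j] + 1,             # deletion
                    row[j - 1] + 1,              # insertion
                    prev_row[j - 1] + (r != h),  # substitution or match
                )
            )
        prev_row = row
    return prev_row[-1]


def phoneme_error_rate(ref, hyp):
    """PER = (insertions + deletions + substitutions) / number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `phoneme_error_rate(["sil", "k", "ae", "t", "sil"], ["sil", "k", "eh", "sil"])` counts one substitution and one deletion over five reference phonemes, giving 0.4.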
6.2. Results and Discussion
This approach was developed with two main tools: PyTorch for exploring and training the quantized models and Vivado Design Suite 2021.2 for architectural development, component integration, and implementation analysis. To enable a fair comparison in
Section 6.3 with the work developed in Spartus [
20], the proposed
architecture was implemented on the same Xilinx Zynq 7100 FPGA (xc7z100ffg900-1) and evaluated at an operating frequency of 200 MHz. Post-place-and-route worst-case timing was verified under this target frequency to ensure sufficient timing margin. Synthesis was carried out using the Flow_AreaOptimized strategy, while implementation used the Area_ExploreWithRemap strategy. The reported metrics in the experiments are derived from Vivado post-place-and-route estimations.
The features extracted from the audio signal in this work consist of 40 FBANK coefficients, the log-energy term, and their first and second temporal derivatives, following [
16]. The data were normalized to the range [0, 1] and quantized according to
. The network was optimized using the RMSprop algorithm. To improve generalization, regularization was introduced by injecting white noise into both the input data, with
, and the weights, using
. The training configuration used a learning rate of
, a batch size of 32, a weight decay of
, and 30 training epochs.
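As a sanity check on the feature layout, 40 FBANK coefficients plus the log-energy term give 41 static features per frame, and appending their first and second temporal derivatives yields 3 × 41 = 123 values, which matches the input size of the 123–1024–1024–62 model. The sketch below illustrates this stacking and a [0, 1] normalization. The central-difference derivative and the per-dimension min-max scaling are simplifying assumptions, not necessarily the procedure followed in this work, and all function names are hypothetical.

```python
def simple_delta(frames):
    """Temporal derivative as a central difference with edge replication (assumed form)."""
    n = len(frames)
    out = []
    for t in range(n):
        nxt = frames[min(t + 1, n - 1)]
        prv = frames[max(t - 1, 0)]
        out.append([(a - b) / 2.0 for a, b in zip(nxt, prv)])
    return out


def stack_features(static):
    """Concatenate static features with their first and second derivatives per frame."""
    d1 = simple_delta(static)
    d2 = simple_delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


def minmax_normalize(frames):
    """Scale every feature dimension to [0, 1] (per-dimension min-max is an assumption)."""
    dims = len(frames[0])
    lo = [min(f[d] for f in frames) for d in range(dims)]
    hi = [max(f[d] for f in frames) for d in range(dims)]
    return [
        [(f[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0 for d in range(dims)]
        for f in frames
    ]
```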
Figure 12 includes a schema of the process using our model, from the input audio data to the final detected phonemes.
Unless stated otherwise, all results reported in
Section 6 refer to a fixed model architecture comprising three layers: using a model shape of 123–1024–1024–62, the first two layers are 2
QGRUs and the final one is an
.
6.2.1. Training Phoneme Error Rate (PER)
The first step is to validate the weight quantization technique explained in
Section 3.1. After testing various combinations and as anticipated in
Section 3.1.3, the quantization for TIMIT was reduced to three values (
,
, and
), plus zero and the sign. Therefore, 3 bits were enough to encode these weights in the implementation, as explained in
Section 5.2. Based on this combination, a more extreme quantization was studied: removing the highest-weight candidates and keeping
, plus zero and the sign. This combination not only reduces the implementation to 2 bits, but also converts the shift operations (M1 and M2) into multiplexers ruled by the weights plus an XOR gate for the sign. As a result, this configuration is functionally equivalent to a ternary neural network [
3].
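The bit counts quoted above follow from a simple encoding argument: with one sign bit, the remaining bits must distinguish zero from each allowed magnitude. The sketch below reproduces this count and illustrates why power-of-two weights need no multiplier. The shift amounts are hypothetical placeholders, not the candidates selected in this work, and the conditional negation stands in for the sign logic of a two's-complement integer model.

```python
import math


def weight_bits(num_magnitudes):
    """One sign bit plus enough bits to select zero or one of the magnitudes."""
    return 1 + math.ceil(math.log2(num_magnitudes + 1))


def shift_multiply(x_fixed, sign, shift, is_zero=False):
    """Multiply a fixed-point integer by sign * 2**(-shift) using only a shift.

    `shift` is a hypothetical exponent. Python's >> is an arithmetic shift,
    so negative inputs are rounded toward minus infinity.
    """
    if is_zero:
        return 0
    shifted = x_fixed >> shift
    return -shifted if sign else shifted


def ternary_multiply(x_fixed, sign, is_zero):
    """With a single magnitude, the shifter degenerates into a 2-to-1 multiplexer."""
    selected = 0 if is_zero else x_fixed  # multiplexer controlled by the weight
    return -selected if sign else selected  # sign handling (negation assumed here)
```

Here `weight_bits(3)` returns 3 and `weight_bits(1)` returns 2, in line with the 3-bit and 2-bit encodings discussed above.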
Both combinations of weight candidates were tested for different levels of quantization at the operation level. As explained in
Section 3.2.1, this is managed by the number of decimal bits,
, and Equations (
9) and (
10). The integer bit representation,
, is calculated with (
11) at the neuron and layer level. Thus,
at the layer level is the maximum
result between the neurons of the layer.
Table 2 presents the phoneme error rate (PER) for the aforementioned model configuration (123–1024–1024–62). It shows how the choice of the weight bit-width,
, and the neuron output bit-width,
, for each layer l affects performance. For ease of comparison,
is the same for all the layers. Priority is given to the reduction in
, as it has a greater impact in the final implementation than the reduction in
.
The best result achieved by applying equations in
Figure 6 is located in the first row, using
and
. Below these values, the PER starts to increase at different rates depending on the value of
. While
, the model provides a deviation below 1%, which increases considerably for lower values. The configuration highlighted in row 6, with
for the
layers,
for the output
layer, and
, offers the best trade-off between accuracy and implementation efficiency. For that reason, this configuration is selected as the baseline for subsequent evaluations.
6.2.2. Hardware Implementation Results
After selecting the desired configuration, an exploration of the hardware implementation can be performed to obtain insight into its attributes. In addition to and , the size of the layer and the selected compression ratio are the main hyperparameters that affect the final results.
Regarding the size of the layer,
Table 3 shows the attributes of a
layer when it has the same number of inputs and outputs for different layer sizes, and therefore (
18) yields the same value for both input and hidden weights. For layer sizes 64 and 128, type 2 BRAMs are implemented, while the rest use type 1 BRAMs (see
Section 5.2). While the number of BRAMs can be precalculated with (
18), it can be observed that the FF and LUT counts increase roughly in proportion to the size of the layer. This is the expected behavior when neurons are independent of each other and the impact of the resulting interfaces and fanout remains low. The same applies to the power, latency, and throughput values, which roughly keep the same ratio. It is important to note that, due to the reduced bit-widths of both weights and data, DSPs are not necessary in the architecture. For large values of
, a DSP could start to be beneficial for the implementation of
(see
Section 4.2.1). If this option is chosen, it implies the addition of one DSP per layer, which would have a minimal impact on area.
How does the framework’s implementation reduction feature affect these results? Taking as reference the implementation of 1024 neurons,
Table 4 evaluates the implementation reduction strategy explained in
Section 5.1 across different compression ratios,
. This strategy enables a reduction in the number of physically implemented neurons without altering either the layer-level behavior of the default implementation or the parameter usage, which remains constant at 6.29 M, as in the baseline. However, the effectiveness of this approach is limited by two main factors: the static nature of the default layer implementation and the additional overhead introduced by multiplexers.
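The constant 6.29 M parameter figure can be checked with a short calculation. A GRU layer has three gates, each with one input and one hidden weight matrix, so a layer with 1024 inputs and 1024 neurons holds 3 × 1024 × (1024 + 1024) weights. The sketch below ignores bias terms, which is an assumption. The second helper assumes that the compression ratio simply divides the number of physically implemented neurons, which is an illustrative reading rather than a statement of the exact mapping used by the framework.

```python
import math


def gru_weight_count(n_inputs, n_hidden):
    """Weights of one GRU layer: 3 gates x (input matrix + hidden matrix), biases ignored."""
    return 3 * n_hidden * (n_inputs + n_hidden)


def implemented_neurons(n_hidden, ratio):
    """Physically instantiated neurons for a given compression ratio (assumed mapping)."""
    return math.ceil(n_hidden / ratio)


if __name__ == "__main__":
    weights = gru_weight_count(1024, 1024)
    print(f"{weights} weights ({weights / 1e6:.2f} M)")
    for ratio in (1, 2, 4, 8):
        print(ratio, implemented_neurons(1024, ratio))
```

Under this reading, the weight count is independent of the ratio because every logical neuron keeps its own weights; only the number of processing elements that consume them changes.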
It can be observed that, up to a reduction ratio of 6, the effective reduction in LUTs and FFs is notable, halving the resources of the default implementation. However, above this value, the reduction slows down. With an initial latency of 15 μs, the design offers a valuable improvement for applications with strict area constraints where ultra-low latency is not essential. For instance, many sensors operate at or below 100 Hz, as in speech processing tasks such as TIMIT where phonemes are typically analyzed using a 25 ms window with a 10 ms step. In this context, the baseline implementation from
Table 4 is approximately 650 times faster than the data acquisition rate, allowing significant headroom for design compression or clock frequency reduction, both of which contribute to lower power consumption.
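The headroom argument can be made concrete with a rough calculation. Using the rounded 15 μs latency and the 10 ms frame step, the design is about 667 times faster than the input rate, the same order as the factor of approximately 650 quoted above (the small difference presumably comes from rounding the latency). The sketch also estimates the lowest clock frequency that would still meet the frame rate, under the assumption that the latency in clock cycles does not change with the clock frequency. The function names are hypothetical.

```python
def realtime_headroom(frame_step_s, latency_s):
    """How many times faster the design is than the data acquisition rate."""
    return frame_step_s / latency_s


def min_clock_hz(latency_s, clock_hz, frame_step_s):
    """Lowest clock that still finishes one inference per frame step.

    Assumes the latency expressed in clock cycles is independent of the clock.
    """
    latency_cycles = round(latency_s * clock_hz)
    return latency_cycles / frame_step_s


if __name__ == "__main__":
    print(realtime_headroom(10e-3, 15e-6))     # about 667
    print(min_clock_hz(15e-6, 200e6, 10e-3))   # about 300 kHz
```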
6.3. Comparison to Related Work
Most of the related works have focused their research on the efficient management of zero weights [
11,
20,
21]. This practice usually requires indices to organize and read the final weights of the model, and includes techniques such as driving small weights toward zero, pruning non-relevant weights, and exploiting weight and temporal sparsity to reduce the number of operations per processed input. As a result, these techniques reduce memory utilization and increase the throughput of the architecture. In contrast, our work focuses on extreme quantization, which, instead of increasing throughput, reduces hardware resources and power.
Table 5 contains the main attributes of the most relevant architectures for Recurrent Neural Networks on an FPGA that have been validated on the TIMIT core test. These works are sorted by year of publication, and their results are based on the implementation of the models’ input layer. Although the attributes under study may vary, these works share a layer size of 1024 neurons. From 2017 with ESE [
22] to 2024 with Spartus [
20], they have explored and proposed different techniques for the efficient management of zero weights. The trend over the years has been to improve the sparsity of the model and to increase the model’s quantization, thereby reducing the latency and power consumption. The works in
Table 5 implement MACs for matrix–vector multiplications (MxV) and, except for ESE and E-LSTM [
11], these works process the input data with a batch size of 1. The best results in most of the fields have recently been achieved by Spartus, which combines spatio-temporal sparsity with a new column-balanced targeted dropout (CBTD) that helps balance the workload.
We have selected two implementations from
Table 2 to be included in
Table 5, which are the extremes between the different possibilities: the smallest model with a minor PER deviation (first row in
Table 2) and the smallest model with a reasonable deviation (sixth row in
Table 2). To the best of our knowledge, our work
is the only approach in the literature that has achieved similar PER results on TIMIT by using extreme quantization instead of sparsity compression, standing out above all for its reduced use of bits.
techniques result in models with 2–3 bits for their weights and up to 5–6 bits for the activation function. Compared to the works in
Table 5,
prioritizes the reduction of resource utilization and power consumption and outperforms them in both, being the only one that avoids the use of DSPs. In addition,
stands out by prioritizing robustness: it implements one MAC per neuron and performs vector–scalar shift operations. The E-LSTM model [
11] also uses weight quantization based on powers of two, achieving the lowest bit usage after
in
Table 5 (8 bits for the weights and operations). However, the E-LSTM model works with a batch size of 8, prioritizing the throughput of the model.
There are two additional differences to highlight:
Predictability: While sparsity compression depends heavily on the training process to achieve good latency and resource performance, our quantization technique depends only on the number of neurons and the bit-width used. Therefore, the final hardware requirements of the
architecture are known without additional effort after selecting the model hyperparameters, enabling quick and early design space exploration (step 2 in
Figure 3).
Resource utilization: Except for
, none of the works in
Table 5 could implement more than one layer on their respective FPGA due to their resource-consuming approaches. Notably, their reported results are limited to the input layer (123–153 × 1024 neurons), which is typically smaller than subsequent hidden layers (1024 × 1024 neurons), further emphasizing the scalability limitations of BRAM- and DSP-constrained approaches. In contrast,
achieves the lowest BRAM usage and is entirely DSP-free.
EdgeDRNN is the only relevant work found in the literature that presents the results of a full model implementation on an FPGA [
21]. EdgeDRNN focuses on edge applications, enabling large reductions in resources and power. However, EdgeDRNN does not validate its performance on TIMIT and uses a smaller model shape, which makes a fair comparison with the works in
Table 5 difficult. Inspired by EdgeDRNN, Spartus also presents a special architecture for edge purposes called Edge-Spartus, though its optimizations are limited to the input layer.
These works are compared in
Table 6 with the full three-layer model implementation of
, using the reduced implementation of the
model as an edge alternative (see
Section 5.1). The reduced
solution provides FF and LUT figures similar to those of EdgeDRNN and Edge-Spartus, without using DSPs and without compromising accuracy, as parameters remain unchanged (see row 6). Furthermore, this reduced implementation gives the model the flexibility to adapt its features to the requirements of the final application, achieving higher throughput when needed.
A key distinction seen in
Table 6 is that both EdgeDRNN and Edge-Spartus rely on external DRAM for weight storage. This choice suits low-cost FPGAs with limited BRAM, but it introduces a memory bottleneck bounded by DRAM bandwidth. While the XC7Z100 device is not ideal for edge applications due to its size, power consumption, and cost, the proposed
architecture can be readily adapted for deployment on low-density FPGAs by sourcing weights from external DRAM. This adaptation requires two key changes:
Similarly to the approach in
Section 5.1, transition from a parallel to a sequential execution model, implementing only one layer at a time to reduce the model’s bit bandwidth requirements.
Adjust the number of neurons in this single layer with (
16) to match the available DRAM input bandwidth, thereby optimizing resource usage and eliminating the need for BRAMs.
This architectural strategy, executing layers sequentially and tailoring the implementation for minimal hardware overhead, enables the integration of larger models on resource-constrained devices. Under this configuration, the primary design constraint shifts from on-chip BRAM availability to throughput requirements. Consequently, external DRAM-based weight storage is fully compatible with the proposed approach and can be selectively employed based on the specific constraints of the target application. Similarly, efficient zero-weight management is also compatible, although its integration would require further research.