Article

Real-Time Bird Audio Detection with a CNN-RNN Model on a SoC-FPGA

by Rodrigo Lopes da Silva 1, Gustavo Jacinto 2, Mário Véstias 1,3,* and Rui Policarpo Duarte 1,3
1 ISEL-IPL, 1959-007 Lisboa, Portugal
2 INESC-ID/IST-ULisboa, 1000-029 Lisbon, Portugal
3 INESC INOV, 1000-029 Lisboa, Portugal
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 354; https://doi.org/10.3390/electronics15020354
Submission received: 23 November 2025 / Revised: 5 January 2026 / Accepted: 5 January 2026 / Published: 13 January 2026

Abstract

Monitoring wildlife has become increasingly important for understanding the evolution of species and ecosystem health. Acoustic monitoring offers several advantages over video-based approaches, enabling continuous 24/7 observation and robust detection under challenging environmental conditions. Deep learning models have demonstrated strong performance in audio classification. However, their computational complexity poses significant challenges for deployment on low-power embedded platforms. This paper presents a low-power embedded system for real-time bird audio detection. A hybrid CNN–RNN architecture is adopted, redesigned, and quantized to significantly reduce model complexity while preserving classification accuracy. To support efficient execution, a custom hardware accelerator was developed and integrated into a Zynq UltraScale+ ZU3CG FPGA. The proposed system achieves an accuracy of 87.4%, processes up to 5 audio samples per second, and operates at only 1.4 W, demonstrating its suitability for autonomous, energy-efficient wildlife monitoring applications.

1. Introduction

Birds are often used as an example of animal behavior for monitoring biodiversity, population surveillance, and analyzing migratory patterns [1]. However, conducting visual surveys in remote or ecologically sensitive areas, such as forests, can be challenging and disruptive. Sound-based wildlife detection, on the other hand, offers a non-intrusive method for gaining insights into avian behaviors and trends while minimizing disruption to natural habitats.
Small, low-power embedded systems are necessary for continuously monitoring bird species. Their limited memory capacity does not permit storing continuous audio recordings for the several days that elapse before the recordings are collected. Local detection is therefore fundamental, since only information about the presence or absence of the species needs to be stored. The audio detection of the presence or absence of an animal can then be complemented by a more complex deep learning model that identifies the particular species whenever a bird is detected in the zone.
A few works have already been proposed for detecting or classifying bird species [2]. However, only a few consider the problem of running the models on low-cost and efficient platforms with good accuracy [3].
The target computing platform is a crucial choice for these applications, as it must meet the timing, power, and cost requirements. Computation based on general-purpose processors (CPUs) may fail to guarantee the real-time requirements of the application and consumes more power than custom-designed architectures. The two most common alternatives are embedded GPUs (Graphics Processing Units) and FPGAs (Field-Programmable Gate Arrays). While GPUs offer high throughput and easy programmability, they are less power-efficient than FPGAs [4]. The performance and energy efficiency of FPGAs are typically higher than those of GPUs, since the dataflow can be customized and optimized, and activations and weights can be quantized to a more efficient data representation [5].
In this work, an embedded system for bird audio detection is designed, implemented, and tested in a System-on-Chip (SoC) FPGA. A hybrid model with a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) is optimized and mapped on a SoC-FPGA to acquire and process bird audio locally, resulting in a compact system with improved energy efficiency. The work can be used to detect other animal species with a proper adaptation of the model.
This paper is organized as follows: Section 2 presents the related work. Section 3 explains the project workflow, model selection, and quantization. Section 4 describes the hardware/software (Hw/Sw) system architecture. Section 5 describes the implementation results of the final system. Section 6 concludes the paper.

2. Related Work

The process of Bird Audio Detection (BAD) can be split into two categories: bird sound detection and bird species identification through their characteristic sounds. The first method simply detects any bird sound in an audio recording, helping to identify bird activity in an area. The second method focuses on recognizing different bird species by their unique sounds. The advantage of using the first model to detect the presence of a bird is that the second, more complex model is only executed when necessary and only has to manage sounds related to birds. The first model therefore filters out audio unrelated to birds.
The first automatic bird sound recognition solutions were based on machine learning algorithms. Most approaches classify bird audio using features processed by classical machine learning algorithms, such as hidden Markov models [6], support vector machines [7], and random forests [8]. While valid, these algorithms cannot achieve high recognition rates when complex bird audio is present.
More robust methods were achieved with deep learning (DL) methods. These are better than traditional machine learning algorithms because of their ability to extract features from complex data, even when noise is present [9]. Additionally, to mitigate the negative effect of noise in the recognition process, some approaches include a preprocessing step to filter noise [10].
Novel algorithms based on DL [11,12,13] started to be researched and developed, driven by the Bird Audio Detection Challenge (BADC) as part of the Detection and Classification of Acoustic Scenes and Events [14] that took place between 2016 and 2017, with a second edition in 2018. The main goal of this challenge was to advance the field of automatic bird audio detection and classification by developing machine learning models. The algorithms that entered this challenge can be found on the Bird Audio Detection Challenge Results (2018) page [15].
DL models such as convolutional neural networks [16,17] and recurrent neural networks [18,19] obtained improved results compared to machine learning methods.
In [20], an automatic bird sound recognition system uses the ResNet50 model to classify spectrogram images generated from the audio wave of sounds from 46 distinct bird species with an accuracy of 72%.
The work proposed in [21] addresses the challenge of detecting and classifying bird vocalizations by developing a solution that learns from a weakly labeled dataset. This solution is robust against background sounds such as airplanes, weather, etc. It entered the BirdCLEF 2021 challenge and ranked 10th out of 816 teams. The authors aimed to optimize the pipeline and explore techniques to reduce the hardware requirements and inference time, with a view to future deployment on smartphones and online platforms.
In [22], the accuracy and efficiency of bird species identification in images are improved using a system that combines the YOLOv5 object detection algorithm for birds with classification networks such as VGG19, Inception V3, and EfficientNetB3. This work focuses on image-based bird identification, but it can be combined with an audio bird detection algorithm.
Transformers have also been introduced for bird audio recognition. In [23], the STFT-Transformer model was proposed for bird sound recognition. The model receives the spectrogram of the audio as input to obtain a bird classification output with better results when compared to solutions based on CNN. The visual transformer model with a multihead attention mechanism proposed in [24] improves the extraction of features and the final classification.
Besides considering the attention mechanism, some works combine multiple sources of features using different feature extraction mechanisms [25,26], achieving an accuracy of 90.1% on the Birdsdata dataset, higher than that achieved with a single feature set. In [27], features are extracted by applying the Fisher Ratio method to the mel-frequency cepstral coefficient map. A careful selection of the dimensions used is required, after which a convolutional neural network is applied for bird audio classification. The algorithm focuses on accuracy only and does not consider optimizations for embedded real-time computing.
A limitation of using CNNs in bird audio detection is that these models do not treat the input spectrogram as time-series data, and bird vocalizations are, in most cases, discontinuous [28], which makes them difficult to detect with CNNs. To overcome these limitations, some works consider a hybrid model consisting of a CNN and a transformer encoder to capture context information from sequential features [29].
The work proposed in [30] explores a new approach to bird sound recognition. It enhances classification accuracy by combining multiple acoustic features and a transformer encoder. Traditional methods often rely on single features, while this new approach explores spatial relationships and noise-reduction techniques for cleaner input data. The work combines two CNN-based networks, EfficientNetB3 and ResNet50, and a transformer encoder, incorporating various acoustic features of bird sounds. The CNN networks are very efficient in extracting the log-mel spectrogram features, while the transformer encoder is used to capture the long-term contextual information contained in feature sequences. The method achieved an accuracy of over 90% on multiple datasets, reaching 97.99% on one of them.
Previous works on bird audio detection or classification do not target embedded systems; therefore, their focus is on achieving the highest accuracy. The complexity of these models strongly conditions their execution in an embedded system. Considering the execution of deep learning models for bird audio analysis in embedded systems, a few works consider small models and performance-oriented optimization of the models. In [31], a small model with around 1K parameters is proposed to detect bird sounds and estimate their proximity. Since the model was considerably reduced to be executed on a microcontroller, the accuracy achieved is below 70%. The work in [32] proposes optimization techniques, such as quantization and pruning, to reduce the complexity of models for sound classification in embedded systems. However, the accuracy of the models after optimization suffers a major drop, making the solution inapplicable to bird audio detection.
The work proposed in this paper focuses on the design of an embedded solution for bird audio detection deployed on a SoC FPGA. The model explored in this work combines a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The CNN is used to extract features from a log-mel spectrogram, and the RNN captures the information contained in sequences of features. The model was optimized, quantized, and mapped to a custom hardware accelerator implemented and tested on a low-density ZU3CG FPGA for embedded computing.
The computation of the spectrogram introduces a limited computational overhead in embedded audio analysis pipelines. Spectrograms are obtained using a short-time Fourier transform (STFT), whose complexity scales linearly with the number of time frames and quasi-linearly with the FFT size. For typical bioacoustic settings (sampling rates between 16 and 44.1 kHz, FFT sizes of 256–1024, and hop sizes of 128–512 samples), the resulting number of FFT operations remains modest, even for multi-second audio segments.
On ARM-based processors, spectrogram computation can be performed efficiently using optimized FFT libraries and vectorized implementations. The computation of a spectrogram for a one-second audio segment typically requires only a few milliseconds, increasing linearly with audio duration and remaining well below the inference time of convolutional neural networks used for bird sound detection. As a result, the spectrogram generation stage does not represent a computational bottleneck in the proposed system and is suitable for real-time detection on resource-constrained platforms.
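As a rough illustration of this cost, the following NumPy sketch counts the FFTs needed for a one-second segment and computes a log-magnitude spectrogram. The FFT size of 512 and hop of 256 are assumed values within the ranges quoted above, not the exact parameters of the proposed system.

```python
import numpy as np

def stft_frame_count(n_samples, n_fft=512, hop=256):
    """Number of complete FFT frames in a segment."""
    return 1 + (n_samples - n_fft) // hop

def log_spectrogram(x, n_fft=512, hop=256):
    """Log-magnitude STFT using NumPy's real FFT, one row per frame."""
    n_frames = stft_frame_count(len(x), n_fft, hop)
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    window = np.hanning(n_fft)
    spec = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(spec + 1e-6)

# One second at 44.1 kHz needs only 171 FFTs of size 512.
x = np.random.randn(44100)
S = log_spectrogram(x)
print(S.shape)  # (171, 257)
```

Even this naive frame-by-frame formulation stays in the hundreds of FFTs per second of audio, which is why the stage is not a bottleneck.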

3. System Design for Bird Audio Detection

A workflow with four main steps was developed to implement the system (see Figure 1).
  • Model design: design and optimization of the model for bird audio detection, trading off accuracy and complexity under the established accuracy goal;
  • Model quantization: the model is quantized so that it can be implemented more efficiently in hardware;
  • Hardware design: a hardware accelerator is designed to run the inference of the neural model;
  • Hw/Sw integration: the custom accelerator is integrated into a SoC architecture and validated against the original model.
The workflow starts by selecting a bird audio detection model that achieves the required accuracy with the lowest complexity (number of parameters and operations).
The subsequent step is the quantization of the model to reduce the hardware complexity and the memory requirements. To design deep neural models in an embedded system with limited resources, it is important to optimize the neural model to make it more hardware-friendly. Quantization [33] is the process of mapping floating-point values to a smaller set of discrete finite values. It is often used in various fields such as signal processing, data compression, and machine learning. In machine learning, quantization can be applied to model parameters and activations to reduce memory usage and arithmetic complexity, while preserving accuracy to some extent.
In this work, mixed quantization was considered, where different layers can use different data representations. This permits further reduction of memory and resources to run the model when compared to a fixed-size quantization. The quantization is done with QKeras [34], considering a Quantization-Aware Training approach. Multiple quantizations with different data bitwidths are explored to determine the minimum number of bits required for network operation with minimal accuracy reduction.
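The effect of choosing a bitwidth can be illustrated with a small NumPy sketch of symmetric fixed-point fake-quantization, a simplified stand-in for what QKeras' quantized_bits applies during quantization-aware training. The fractional-bit splits below are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def fake_quantize(w, bits, frac_bits):
    """Round w onto a signed fixed-point grid with `bits` total bits,
    `frac_bits` of them fractional (simplified QKeras-style behavior)."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale

w = np.array([0.71, -0.32, 0.05, -1.4])
# 4-bit grid (step 0.25) vs. 8-bit grid (step 1/64): the coarser grid
# halves the memory but loses more precision per weight.
print(fake_quantize(w, bits=4, frac_bits=2))
print(fake_quantize(w, bits=8, frac_bits=6))
```

Mixed quantization simply applies this mapping with different `bits` per layer (e.g., 4 for the CNN weights, 8 for the GRU and dense weights).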
After model quantization, the parameters of all layers are extracted using a custom Python (3.12) script. The output maps are also extracted for each layer to validate the hardware design.
A custom hardware accelerator was then designed using High-Level Synthesis (HLS). The quantized model is described in C++ and converted to a Register-Transfer Level (RTL) description of the circuit to be mapped on the FPGA. Hardware optimizations, such as pipelining, loop unrolling, and array partitioning, are explored and applied at this level to guarantee the required performance constraints.
The hardware accelerator was integrated into a hardware/software architecture of the complete bird audio detection system and tested in the SoC FPGA. The application is programmed in C++ for faster execution.
All training and testing of the model in this work consider two datasets. Both contain 10 s long WAV files (44.1 kHz sampling rate, mono PCM):
  • Field recordings (freefield1010)—a collection of 7690 recordings gathered around the world in very diverse environments, including city sounds, wildlife, and birds. This dataset was gathered by the Freesound [35] project and then standardized for research.
  • Crowdsourced dataset (warblrb10k)—this dataset comes from a UK bird-sound crowdsourcing project called Warblr [36]. It contains 8000 recordings from around the UK, gathered by users with their smartphones through the Warblr recognition app. The audio covers multiple types of environments, including weather noise, traffic noise, human speech, and even human bird imitations.
The datasets were split into 80:20 for training and testing, respectively.
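A minimal sketch of such a file-level 80:20 partition is shown below. The shuffling helper and the fixed seed are illustrative assumptions, not the authors' actual split script.

```python
import random

def split_dataset(files, train_frac=0.8, seed=42):
    """Shuffle the recording list and split it train_frac : (1 - train_frac)."""
    files = list(files)
    random.Random(seed).shuffle(files)
    cut = int(len(files) * train_frac)
    return files[:cut], files[cut:]

# 7690 freefield1010 + 8000 warblrb10k recordings, split 80:20.
train, test = split_dataset([f"rec_{i}.wav" for i in range(7690 + 8000)])
print(len(train), len(test))  # 12552 3138
```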

3.1. Model Exploration and Design

This work presents a hybrid model that utilizes a CNN for feature extraction, followed by an RNN to capture context information from the sequential features of the spectrogram, based on the work from [37].
The original model is composed of a CNN with convolutional and downsampling (max pooling) layers, an RNN with two bidirectional Gated Recurrent Units (GRUs), and two Time-Distributed Dense layers to classify the sequences generated by the RNN layers. The output is a value between 0 and 1 that represents the certainty about the presence of a bird vocalization.
The CNN section comprises a series of blocks, each with one convolutional layer followed by a batch normalization layer and the ReLU activation function. This block is repeated six times. After each pair of blocks, a maxpool layer halves the number of columns of the input, down to a final size of five. The dense layer reduces the five columns to a single one, producing an output map of size 431 × 64 × 1. This map then enters the RNN model.
The RNN model considered in this work is based on the Gated Recurrent Unit (GRU) [38]. The GRU is similar to a long short-term memory (LSTM) model but includes a gating mechanism to input or forget particular features. However, it lacks a context vector or output gate, requiring fewer parameters when compared to the LSTM model. The RNN section of the original model consists of two bidirectional GRU layers, both producing an output of 431 × 128 .
The classification section uses two Time-Distributed (TD) dense layers followed by a ReduceMax. The first TD layer reduces the input to an output of 431 × 64, and the second to an output of 431 × 1. The ReduceMax then returns a single output, the maximum over the 431 lines. The model has a total of 316,481 parameters.
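The classification head can be sketched in NumPy as follows. The weights here are random placeholders, and the sigmoid activation is assumed (it matches the activation the Dense IP applies in Algorithm 3); this is an illustration of the data flow, not the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classification_head(rnn_out, W1, b1, W2, b2):
    """Two time-distributed dense layers followed by ReduceMax.
    rnn_out: (431, 128) sequence from the bidirectional GRUs.
    Returns a single global score in (0, 1)."""
    h = sigmoid(rnn_out @ W1 + b1)       # (431, 64): first TD dense
    local = sigmoid(h @ W2 + b2)[:, 0]   # (431,):   second TD dense (local scores)
    return local.max()                   # ReduceMax over the 431 time steps

rng = np.random.default_rng(0)
score = classification_head(rng.normal(size=(431, 128)),
                            rng.normal(size=(128, 64)), np.zeros(64),
                            rng.normal(size=(64, 1)), np.zeros(1))
print(0.0 < score < 1.0)  # True
```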
The model was trained with the freefield1010 and warblrb10k datasets, achieving an accuracy (AUC) of 90.5%. Figure 2 shows the model training progression with 100 epochs and 100 steps per epoch.
Looking to reduce the number of parameters and operations of the model, a series of model modifications was explored and tested.
The number of layers and filters of the CNN was progressively reduced, and the accuracy was verified. It was observed that the accuracy drops by more than 3% even for minimal modifications. Therefore, the structure of the CNN layers was left unmodified.
The RNN part was tested with one and two bidirectional GRUs and with a variable number of cells. It was observed that using a single bidirectional GRU reduces the accuracy by 2.5% compared to the model with two bidirectional GRUs. On the other hand, reducing the number of GRU cells had a minor impact on the accuracy of the model. Compared to the original model with 64 GRU cells, the reduced model with a single GRU cell has an accuracy of 89.8%, only 0.7% lower than the original model (See Table 1).
The reduction in the number of GRU cells has a major positive impact on the reduction of weights. With a bidirectional GRU layer with a single cell, the number of parameters reduces from a total of 124,416 to 432, a reduction by a factor of 288× (see Table 2).
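These counts can be reproduced with a short calculation, assuming the standard Keras GRU parameterization (kernel, recurrent kernel, and two bias vectors per gate):

```python
def bigru_params(input_dim, units):
    """Parameter count of one bidirectional GRU layer under the Keras
    layout: 3 gates x (input kernel + recurrent kernel + 2 biases),
    doubled for the two directions."""
    per_direction = 3 * (input_dim * units + units * units + 2 * units)
    return 2 * per_direction

# Original model: two bidirectional layers with 64 cells each
# (the second layer sees 128 inputs: 2 directions x 64 cells).
original = bigru_params(64, 64) + bigru_params(128, 64)
# Reduced model: one cell per layer (second layer input is 2).
reduced = bigru_params(64, 1) + bigru_params(2, 1)
print(original, reduced, original // reduced)  # 124416 432 288
```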

3.2. Model Quantization and Weights Extraction

TensorFlow model optimization allows converting a deep learning model into a TFLite (TensorFlow Lite model: https://ai.google.dev/edge/litert/conversion/tensorflow/overview (accessed on 20 November 2025)) model. TFLite is designed to optimize models for running on small devices, such as mobile phones, using a lightweight version of TensorFlow intended solely for inference. The optimized (TensorFlow model: https://ai.google.dev/edge/litert/conversion/tensorflow/quantization/model_optimization (accessed on 20 November 2025)) model uses less processing time and RAM. However, this approach offers limited control over the number of bits used for model quantization, supporting only pre-defined types: Int8, Int16, Float16, and other byte-aligned types.
Instead, a fine-grained quantization was considered in this work using the QKeras (QKeras GitHub repository: https://github.com/google/qkeras (accessed on 10 November 2025)) library, a quantization extension to Keras that provides a drop-in replacement for some of the Keras layers to quickly create a deep quantized version of the Keras network [34].
Different quantization configurations were tested with QKeras by varying the number of bits across the layers of the model to find the best-balanced option between model size reduction and accuracy. Quantizations of 2, 4, and 8 bits were considered for both parameters and activations.
The homogeneous quantization with 8 bits has an accuracy of 87.85%. The hybrid quantization, with 8 bits for the GRU cells and the dense layers and 4 bits for the CNN layers, was considered the most efficient solution, with an accuracy of 87.46%, 3% less than the original model and only 0.4% less than the homogeneous 8-bit quantization. Considering that the 4-bit quantization requires half the memory and close to three times fewer computing resources than the 8-bit quantization, it was decided to proceed with the hybrid solution.

4. Hardware/Software System for Bird Audio Detection

Figure 3 presents an overview of the proposed hardware/software architecture to run the inference of the BAD detection model.
The system consists of a processor and a hardware accelerator. The accelerator includes three hardware IP blocks (CONV, GRU, Dense) to run the neural network model and two DMA (Direct Memory Access) modules to transfer data between external memory and the cores. The processor controls the model’s execution flow and configures the DMAs and IP blocks.

4.1. Hardware Accelerator

The architecture of the accelerator has three main blocks (IPs): one that implements the convolutional layers with optional max pooling (CONV), another that implements the Bidirectional GRU layers (GRU), and another that implements the time distributed dense layers (Dense). The blocks are configurable to support different configurations of the layers.
The original microfaune_ai model returns two outputs: the Local Score and the Global Score. The Local Score is the score given to each line of the input, while the Global Score is the highest of the Local Scores. The system implemented in this work only returns the Global Score, a number between 0 and 1. If it is below 0.5, no bird vocalization was detected; if it is 0.5 or higher, a bird vocalization was detected. The closer the score is to 1, the more confident the model is in its assessment.
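A minimal sketch of this decision rule (the helper name is illustrative):

```python
def detect_bird(local_scores, threshold=0.5):
    """Reduce the per-line local scores to the global score and
    apply the 0.5 decision threshold used by the system."""
    global_score = max(local_scores)
    return global_score, global_score >= threshold

score, detected = detect_bird([0.12, 0.83, 0.40])
print(score, detected)  # 0.83 True
```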

4.1.1. CONV Block

The CONV block implements the convolution, max pooling, and output quantization. Figure 4 gives an overview of the Conv3D IP, where five main sub-blocks can be seen.
The Input Buffer sub-block stores data from the input stream. It has enough storage for three lines of the input map. The Conv3D sub-block is responsible for processing the input according to the convolution algorithm, rounding the output value, and sending it to the next sub-block: either the MaxPool2D sub-block or the Output Buffer. The MaxPool2D sub-block executes the max pooling layer on the output of the Conv3D sub-block. The Output Buffer sub-block temporarily stores lines of the output map before they are sent over the AXI-Stream to external memory. The last output map is sent to the GRU block. The Weights block represents the local memories used to store the parameters of all convolutional layers.
The weights and biases were embedded into the IP to reduce the number of transactions over the AXI-Stream, improving performance and energy efficiency at the cost of increased memory usage. Table 3 shows the memory usage of the weights in bytes. The quantized weights use half a byte per value. Even though the bias was quantized to 4 bits, it uses 16 bits per value in memory to guarantee a statically aligned fixed-point value that initializes the accumulator.
The CONV block is configurable according to the following parameters: layer ID, input size, and the indication whether the downsampling layer is executed or not.
There are 64 biases per layer, with each value taking 2 bytes. The kernel weights take half a byte per value. Each convolutional layer has 64 filters, each with a size of 64 × 3 × 3, except the first layer, which has only one input channel. Additionally, there is one fixed-point scale per filter, each occupying half a byte.
After the calculation of a full convolution, the value must be adjusted to match the 4-bit fixed-point representation of the output activations. This is achieved by shifting the accumulator value (scale-1) times to the right. If the value is negative, then it is set to 0 (ReLU activation function). The resulting value is rounded to the nearest even, the same as in QKeras. Algorithm 1 represents a pseudo-code of the HLS description of the CONV3D IP.
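A software sketch of this requantization step is shown below. Python's built-in round() implements round-half-to-even, matching the behavior described; the unsigned 4-bit output ceiling of 15 is an assumption for illustration.

```python
def requantize(acc, scale, out_max=15):
    """Adjust a full-precision accumulator to the 4-bit output format:
    shift right (scale - 1) bits, apply ReLU, round half to even,
    and saturate at the assumed unsigned 4-bit maximum."""
    shifted = acc / (1 << (scale - 1))
    if shifted < 0:              # ReLU: negative values become 0
        return 0
    return min(round(shifted), out_max)

print(requantize(40, scale=3), requantize(-17, scale=3), requantize(10, scale=3))
# 40/4 = 10 -> 10; negative -> 0; 10/4 = 2.5 -> 2 (rounds to nearest even)
```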
Initially, the block reads the first lines of the input feature map. Reading the whole map into an input buffer is not possible due to insufficient on-chip RAM. The module then runs the full convolution for all channels and filters while reading the input map in parallel. Pooling is applied conditionally, since the same block is used for convolutional layers whether or not they are followed by a pooling layer. The final activation is packed before being sent to external memory or to the local memory read by the GRU IP.
Algorithm 1 HLS pseudo-code for the CONV3D IP
  1: function Conv3D(streamIn, streamOut, config)
  2:     ReadInitMap()
  3:     for line ← 1 to (O_HEIGHT − 2) do
  4:         for column ← 1 to (O_WIDTH − 2) do
  5:             for filter ← 0 to (FILTERS − 1) do
  6:                 acc ← bias[filter]
  7:                 input ← streamIn
  8:                 for kline ← 0 to (3 − 1) do
  9:                     for kcol ← 0 to (3 − 1) do
10:                         for channel ← 0 to (64 − 1) do
11:                             k ← kernel[filter][channel][kline][kcol]
12:                             i ← input[channel][kline][kcol]
13:                             acc += k × i
14:                         end for
15:                     end for
16:                 end for
17:                 if acc < 0 then
18:                     acc ← 0
19:                 end if
20:                 output[filter][line − 1][column − 1] ← acc
21:                 if pool then
22:                     poolOutput ← max(poolOutput, output)
23:                 else
24:                     poolOutput ← output
25:                 end if
26:             end for
27:         end for
28:     end for
29:     streamOut ← output
30: end function
Pipelining, unrolling, and data partitioning pragmas are applied to the code to improve the performance of the circuit by exploiting the available parallelism. The optimization strategy explored the parallelism available in the convolution operation by unrolling the loops of the convolutional algorithm. The unroll factors selected for the kernel loops were 1 and 3, representing no unrolling and full unrolling, respectively, appropriate for a 3 × 3 kernel. Factors 1, 2, and 4 were explored for the accumulation loop, each of whose iterations handles 64 calculations in parallel.
Table 4 shows the explored unroll settings for each cycle, as well as the expected total number of cycles and FPGA resources usage reported by HLS C synthesis.
The K. lines and K. cols represent the selected unroll factor for kernel iteration, and Par. filters is the number of independent filters calculated in parallel. The Cycles metric is the estimated number of cycles needed to execute the layer with the highest computational complexity. This is the second layer, which has an input map of size 431 × 64 × 40 , 64 filters of size 3 × 3 × 3 , and a MaxPool with a window of size 1 × 2 .
The Baseline design is the solution without unrolling and with a parallelism factor of 64 in the accumulation cycle. The cases where the resource usage exceeds the resources available in the target platform are marked in bold.
The architecture selected to implement the accelerator was the one obtained by unrolling the number of filters by 4.

4.1.2. GRU Block

The GRU block implements both bidirectional GRU layers. A bidirectional GRU is obtained by applying the GRU layer to the input in the forward direction and then to the same input in the backward direction, merging both outputs at the end.
Figure 5 presents an overview of the GRU block, where four main sub-blocks can be identified.
The Input Buffer sub-block stores the entire feature map returned by the previous layer, with a size of 431 × 64 on the first bidirectional GRU layer and 431 × 2 on the second. The bidirectional GRU sub-block is responsible for iterating the input according to the GRU algorithm and managing the GRU Cell sub-block. The GRU Cell sub-block receives the input line from its manager, processes it, and then returns the output back to its manager, the bidirectional GRU sub-block.
The Weights block represents the memory of parameters of both Bidirectional GRU layers.
Both the Sigmoid and Tanh functions are used in the calculation of the GRU cell. Before returning the output, the module updates the GRU state with that output. The output produced by the GRU Cell sub-block is received by the Bidirectional GRU sub-block, which places it in the Output Buffer to be passed to the dense layer block. The GRU cell uses custom Sigmoid and Tanh hardware blocks implemented with lookup tables for 8-bit input data (the size of the activations).
Algorithm 2 represents a pseudo-code of the HLS description of the GRU IP.
Algorithm 2 HLS pseudo-code for the GRU IP
  1: function gru_cell(input, output, config)
  2:     for i ← 0 to (3 − 1) do
  3:         for column ← 0 to (64 − 1) do
  4:             inVal ← input[column]
  5:             krVal ← kernel[config.idx][i][column]
  6:             matrix_x[i] += (inVal × krVal)
  7:         end for
  8:         matrix_x[i] += bias[config.idx][i]
  9:     end for
10:
11:     for i ← 0 to (3 − 1) do
12:         for column ← 0 to (64 − 1) do
13:             inVal ← state[column]
14:             krVal ← rec_kernel[config.idx][i][column]
15:             matrix_inner[i] += (inVal × krVal)
16:         end for
17:     end for
18:     for i ← 0 to (3 − 1) do
19:         matrix_inner[i] += rec_bias[idx][i]
20:     end for
21:
22:     z ← SIGMOID(matrix_x[0] + matrix_inner[0])
23:     r ← SIGMOID(matrix_x[1] + matrix_inner[1])
24:     hh ← TANH(matrix_x[2] + (r × matrix_inner[2]))
25:     output ← (z × state[idx]) + ((1 − z) × hh)
26:     state[idx] ← output
27: end function
The Sigmoid and Tanh operations are implemented with lookup tables that assume 8-bit inputs. Therefore, the arguments of these operations are first quantized to 8 bits.
Parallelism was not explored in this block because the implemented model uses only one GRU cell. In addition, the GRU block was not subjected to the same optimization step as the CONV block, since its expected cycle count is much lower. Table 5 shows the estimated resource usage reported by the HLS synthesis tool.
The GRU block is configurable only through the layer ID parameter.
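As a cross-check of the datapath, the gate computations of Algorithm 2 can be sketched in NumPy. This is a functional model only: the hidden state is treated as a scalar because the deployed model has a single GRU cell, and the fixed-point scale of the 8-bit lookup tables (1/16 per step) is an illustrative assumption, since the text only states that the activation arguments are quantized to 8 bits.

```python
import numpy as np

# 256-entry lookup tables for the activations, indexed by a signed 8-bit
# quantized argument. The scale factor below is an illustrative assumption.
SCALE = 1.0 / 16.0
_x = np.arange(-128, 128) * SCALE
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-_x))
TANH_LUT = np.tanh(_x)

def lut_sigmoid(v):
    # Quantize the argument to 8 bits, then look up the result.
    idx = int(np.clip(round(v / SCALE), -128, 127)) + 128
    return SIGMOID_LUT[idx]

def lut_tanh(v):
    idx = int(np.clip(round(v / SCALE), -128, 127)) + 128
    return TANH_LUT[idx]

def gru_cell(x, state, kernel, rec_kernel, bias, rec_bias):
    """One step of the single-cell GRU of Algorithm 2.
    x:          input features (length 64 for the first BGRU layer),
    state:      scalar hidden state of the single cell,
    kernel:     (3, len(x)) input weights for gates z, r, h,
    rec_kernel, bias, rec_bias: length-3 recurrent weights and biases."""
    matrix_x = kernel @ x + bias                   # input contribution per gate
    matrix_inner = rec_kernel * state + rec_bias   # recurrent contribution
    z = lut_sigmoid(matrix_x[0] + matrix_inner[0])     # update gate
    r = lut_sigmoid(matrix_x[1] + matrix_inner[1])     # reset gate
    hh = lut_tanh(matrix_x[2] + r * matrix_inner[2])   # candidate state
    return z * state + (1.0 - z) * hh              # new state / output
```

The gate ordering (z, r, h) and the final interpolation between the previous state and the candidate state follow lines 22–26 of Algorithm 2.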

4.1.3. Dense Block

The dense block computes dot products between the vectors produced by the GRU block and weight vectors. Algorithm 3 shows the pseudo-code of the Dense IP.
The first dense layer operates on a map with 431 lines and 128 input columns and produces 64 output columns; the second produces a single output column. For faster execution, 128 multiplications are performed in parallel to produce each inner product. The result of the block is sent to the processor to determine the global score.
Algorithm 3 HLS Pseudo-Code for the Dense IP
  1: function Dense(input, streamOut, config)
  2:       for line ← 0 to 430 do
  3:             for col ← 0 to (config.outCols − 1) do
  4:                   acc ← bias[col]
  5:                   for kcol ← 0 to (config.inCols − 1) do
  6:                         k ← kernel[col][kcol]
  7:                         i ← input[line][kcol]
  8:                         acc += k × i
  9:                   end for
10:                   output[line][col] ← SIGMOID(acc)
11:             end for
12:       end for
13:       streamOut ← output
14: end function
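Functionally, Algorithm 3 is a time-distributed dense layer with a sigmoid activation. A minimal NumPy sketch is given below; the matrix product stands in for the 128 parallel multiplications of the hardware, and the array shapes follow the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dense_ip(inp, kernel, bias):
    """Time-distributed dense layer of Algorithm 3 (functional model).
    inp:    (431, inCols) map produced by the GRU block,
    kernel: (outCols, inCols) weight matrix,
    bias:   (outCols,) bias vector.
    The same dense layer is applied to each of the 431 time steps; the
    hardware evaluates each inner product with parallel multiplications,
    which the matrix product below models functionally."""
    return sigmoid(inp @ kernel.T + bias)
```

For the second dense layer, `kernel` has a single row, so the result is a 431 × 1 map of per-frame scores that the processor reduces to the global score.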

4.2. Software Design

Figure 6 illustrates the order of execution of the layers of the model in the hardware accelerator controlled by the microprocessor of the hardware/software system.
The microprocessor dynamically configures the hardware blocks implemented in the FPGA and the DMA. Before running a block, the software configures it according to the characteristics of each layer.
The Model Predict function accepts a padded input of 431 × 64 × 40, with only the first of the 64 channels filled with real data. The Conv3D IP is then called once per layer with the respective layer configuration. The input and output transferred between these layers are always read from and written to external memory: the input is sent through the DMA, and the PS waits for the transaction to complete by polling its status. This interaction happens at every Conv3D_X call, ending with Conv3D_5, after which the output is retrieved from external memory.
The output of the last convolutional layer is written in a local buffer to be accessed by the GRU block, avoiding a write and read access to external memory.
The GRU IP is first called for a forward pass and then called again with the same input for a backward pass. These passes execute in sequence because the architecture contains only one GRU IP, as explained previously. The forward and backward outputs are stored in local buffers for the next bidirectional GRU layer. The GRU IP is then called two more times, forward and backward, to execute BGRU_1.
The output of the second bidirectional GRU layer is buffered to be used by the Dense block. The Dense block runs the time-distributed layers in sequence, with the intermediate map stored in external memory. The output of the last time-distributed layer is written to external memory to be read by the processor, which produces the final classification.
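The schedule described above can be summarized with an illustrative sketch. The function names are stand-in stubs rather than the actual driver API; only the ordering and the buffering scheme follow the text.

```python
# Illustrative sketch of the layer schedule executed by the PS. The IP
# calls are stubs (the real driver configures the FPGA blocks and DMA);
# the recorded trace reproduces only the execution order from the text.
trace = []

def run_conv3d(layer_id):
    # Input/output exchanged with external memory via DMA (polled).
    trace.append(f"conv3d_{layer_id}")

def run_gru(layer_id, direction):
    # One GRU IP, so forward and backward passes run in sequence;
    # intermediate results stay in local buffers.
    trace.append(f"bgru_{layer_id}_{direction}")

def run_dense(layer_id):
    trace.append(f"dense_{layer_id}")

def model_predict():
    for i in range(6):                    # Conv3D_0 ... Conv3D_5
        run_conv3d(i)
    for layer in range(2):                # two bidirectional GRU layers
        for direction in ("fwd", "bwd"):
            run_gru(layer, direction)
    for layer in range(2):                # two time-distributed dense layers
        run_dense(layer)

model_predict()
```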

5. Results

This section reports the accuracy and performance results of the proposed bird audio detection model running on the proposed hardware/software architecture. The system was implemented on a Xilinx Zynq UltraScale+ ZU3CG MPSoC on the Avnet Ultra96-V2 board (Avnet Ultra96-V2: https://www.avnet.com/americas/products/avnet-boards/avnet-board-families/ultra96-v2/ (accessed on 20 November 2025); AMD product page: https://www.xilinx.com/products/boards-and-kits/1-vad4rl.html (accessed on 20 November 2025)).

5.1. System Accuracy

The proposed bird audio detection model was compared with the original microfaune_ai model. All tests used 400 samples from the freefield1010 dataset: 200 with positive bird detection and 200 with negative bird detection. The evaluated models are as follows:
  • Original: the original microfaune_ai model without any modification, using floating-point arithmetic;
  • Modified: the modified model with a single GRU cell, using floating-point arithmetic;
  • FPGA: the quantized modified model with a single GRU cell, implemented in the FPGA.
As shown in Table 6, the reduction in the number of GRU cells has a low impact on the accuracy of the model: less than 1%. Comparing the FPGA implementation with the original model, the loss in accuracy is about 3 percentage points, indicating that the hardware implementation works as expected with only a small accuracy degradation.
The reduction in GRU cells also reduced the memory required to store the model while retaining the capability to detect bird vocalizations: the weight memory usage drops by a factor of 1.7, from 302 KBytes to 178 KBytes (see Table 7). This reduction is important because it makes smaller, more cost-effective FPGAs viable targets.

5.2. Resource Utilization of the System

Table 8 shows the FPGA resource usage. The CONV block consumes most of the BRAM resources, followed by the GRU block, while the Dense block accounts for most of the DSP usage. The total consumption of resources is below 50%. Resource usage can be further reduced by decreasing the level of parallelism of the blocks, at the cost of performance.

5.3. CNN Accelerator Performance

Table 9 shows the time, in milliseconds, of each CNN layer executed on the ARM processor and in the programmable logic of the FPGA. Notice that Conv_1 takes longer than Conv_0, which is expected because a MaxPool is executed after every two convolution layers: each MaxPool halves the number of columns, except on the last layer, Conv_5, which runs a 1 × 10 MaxPool.
Comparing the collected results, the proposed optimizations successfully reduce the CNN inference time from about 53 s to 0.187 s, a speedup of 285.2×.
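The reported total times and speedup follow directly from the per-layer values in Table 9:

```python
# Per-layer CNN execution times (ms) as listed in Table 9.
software = [535.0, 20902.1, 10673.4, 10661.5, 5332.8, 5289.5]
hardware = [52.7, 54.1, 26.9, 27.1, 13.2, 13.2]

sw_total = sum(software)     # 53,394.3 ms, about 53 s
hw_total = sum(hardware)     # 187.2 ms
speedup = sw_total / hw_total  # about 285.2x

print(round(sw_total, 1), round(hw_total, 1), round(speedup, 1))
```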
The RNN and time-distributed dense cores did not undergo the same level of optimization exploration as the CNN core, since their execution times are already low compared to those of the convolutional layers. Furthermore, the exploration of parallelism across GRU cells was limited by the presence of a single GRU cell instead of the original 64.
The RNN takes around 10.5 ms in software against 4.7 ms in the hardware accelerator. The time-distributed layer takes 7.4 ms in software and 1.1 ms in hardware.

5.4. Overall Performance

The performance results were obtained by averaging the execution time across 400 samples. Table 10 shows the execution time of the SW-only and HW/SW implementations, as well as the speed-up factor achieved with the proposed HW/SW system.
The software implementation takes around 53 s per input evaluation, while the HW/SW implementation on the FPGA takes around 197 milliseconds and consumes 1.4 W.
This massive reduction in execution time (272×) increases the chances of identifying a bird vocalization when the system is deployed in the field. Since each input is a 10 s audio recording, the software implementation can evaluate only about one input per minute, while the proposed architecture achieves real-time processing capable of analyzing five audio samples per second.
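The speed-up and throughput figures can be verified from the Table 10 times:

```python
sw_ms = 53495.3   # full-model software inference time (Table 10)
hw_ms = 196.7     # HW/SW inference time (Table 10)

speedup = sw_ms / hw_ms        # about 272x
throughput = 1000.0 / hw_ms    # inferences per second, about 5.1

print(round(speedup), round(throughput, 1))
```

Since each inference covers a 10 s recording, a throughput of about five inferences per second is well above the one-inference-per-10 s rate required for real-time operation.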
The resource usage of the proposed system can be further reduced: with less restrictive timing constraints, the level of parallelism can be lowered, which reduces the occupied resources. This is important if very low-density FPGAs are targeted.

5.5. Comparison with the State of the Art

The proposed work was compared with relevant previous works on bird audio detection (see Table 11).
The proposed work is highly competitive in terms of accuracy and features a reduced model requiring only 92 KBytes of storage. It not only reduces the required memory by about 6× compared to the previous work with the lowest memory requirement [12], but also runs on a low-power system, while the other works target desktop solutions.
The proposed work has competitive accuracy compared to previous works, except against [11], which achieves 5% better accuracy; however, that model requires about 1000× more memory, which is impractical for an embedded system. The work in [27] reports an accuracy of 98.7% (AUC is not reported), but the method does not consider embedded computing; it also requires the calculation of RF and a careful selection of data dimensions for a particular dataset, and the complexity of its CNN is not reported.
Previous works on bird audio detection focus mostly on the accuracy of the system; only a few consider the design of bird audio analysis for embedded systems. A previous study [31] develops a deep learning model to detect the sounds of a particular pest bird and estimate its proximity, deployed on an RP2040 microcontroller. That project uses a small model with around 1K parameters, and for this reason the accuracy drops below 70%; the execution time is not reported.
The analysis of different embedded computing platforms for environmental sound recognition was considered in [32]. That work does not focus on bird audio detection but analyzes the implementation of sound classification on embedded platforms, stressing the importance of quantization-aware training and pruning when optimizing models for embedded deployment. Their 1D-CNN models achieved accuracies from 44% to 78%, depending on the target dataset, and their 2D-CNN models achieved an average accuracy close to 64% when implemented on an FPGA, far from the accuracy necessary for a useful system.

6. Conclusions

This work presents a novel approach for bird audio detection on a SoC-FPGA, combining both algorithmic and hardware innovations. A hybrid CNN + RNN model was specifically designed and optimized to maximize the accuracy-to-complexity ratio while minimizing energy consumption on the embedded platform.
An end-to-end workflow was developed, covering model selection, training, and hardware/software co-design. The neural network was redesigned and quantized to improve both hardware efficiency and real-time performance, demonstrating that careful integration of the model and the accelerator is crucial for achieving high-performance, energy-efficient embedded systems.
The fully implemented system achieves 87.4% accuracy and an execution time of 196.7 ms per bird sound classification. Given that audio recordings typically span several seconds, this leaves sufficient margin to deploy more complex models if higher accuracy is desired, without compromising real-time constraints.
Overall, the novelty of this work lies in the co-optimization of a hybrid CNN+RNN model and its FPGA implementation, resulting in a practical, energy-efficient system for real-time bioacoustic detection.

Author Contributions

Conceptualization, R.L.d.S.; Methodology, R.L.d.S.; Software, R.L.d.S.; Validation, G.J.; Investigation, G.J.; Resources, R.P.D.; Data curation, G.J.; Writing—original draft, R.L.d.S.; Writing—review & editing, R.P.D.; Visualization, G.J.; Supervision, M.V. and R.P.D.; Project administration, M.V. and R.P.D.; Funding acquisition, M.V. and R.P.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by national funds through FCT—Fundação para a Ciência e a Tecnologia, I.P., under projects/supports UID/6486/2025 and UID/PRR/6486/2025, and under projects UID/50021/2025 and UID/PRR/50021/2025, project 2023.15325.PEX and project LISBOA2030-FEDER-00692100 with DOI https://doi.org/10.54499/2023.16747.ICDT (accessed on 4 January 2026).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tóth, B.; Czeba, B. Convolutional Neural Networks for Large-Scale Bird Song Classification in Noisy Environment. In Proceedings of the Conference and Labs of the Evaluation Forum, Évora, Portugal, 5–8 September 2016.
  2. Liu, D.; Xiao, H.; Chen, K. Research progress in bird sounds recognition based on acoustic monitoring technology: A systematic review. Appl. Acoust. 2025, 228, 110285.
  3. Anusha, P.; ManiSai, K. Bird Species Classification Using Deep Learning. In Proceedings of the 2022 International Conference on Intelligent Controller and Computing for Smart Power (ICICCSP), Hyderabad, India, 21–23 July 2022; pp. 1–5.
  4. Véstias, M.; Neto, H. Trends of CPU, GPU and FPGA for high-performance computing. In Proceedings of the 2014 24th International Conference on Field Programmable Logic and Applications (FPL), Munich, Germany, 2–4 September 2014; pp. 1–6.
  5. Véstias, M.P. A Survey of Convolutional Neural Networks on Edge with Reconfigurable Computing. Algorithms 2019, 12, 154.
  6. Lee, C.H.; Hsu, S.B.; Shih, J.L.; Chou, C.H. Continuous Birdsong Recognition Using Gaussian Mixture Modeling of Image Shape Features. Trans. Multi. 2013, 15, 454–464.
  7. Zhao, Z.; Zhang, S.; Xu, Z.; Bellisario, K.; Dai, N.; Omrani, H.; Pijanowski, B. Automated bird acoustic event detection and robust species classification. Ecol. Inform. 2017, 39, 99–108.
  8. Swaminathan, B.; Jagadeesh, M.; Vairavasundaram, S. Multi-label classification for acoustic bird species detection using transfer learning approach. Ecol. Inform. 2024, 80, 102471.
  9. Zhang, H.; McLoughlin, I.; Song, Y. Robust sound event recognition using convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 559–563.
  10. Bardeli, R.; Wolff, D.; Kurth, F.; Koch, M.; Tauchert, K.H.; Frommolt, K.H. Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring. Pattern Recognit. Lett. 2010, 31, 1524–1534.
  11. Mario, L. Acoustic Bird Detection with Deep Convolutional Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018); Technical Report, DCASE2018 Challenge; Tampere University of Technology: Tampere, Finland, 2018.
  12. Liaqat, S.; Bozorg, N.; Jose, N.; Conrey, P.; Tamasi, A.; Johnson, M.T. Domain Tuning Methods for Bird Audio Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018); Technical Report, DCASE2018 Challenge; Tampere University of Technology: Tampere, Finland, 2018.
  13. Vesperini, F.; Gabrielli, L.; Principi, E.; Squartini, S. A Capsule Neural Networks Based Approach for Bird Audio Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018); Technical Report, DCASE2018 Challenge; Tampere University of Technology: Tampere, Finland, 2018.
  14. DCASE2018. Bird Audio Detection Challenge. 2018. Available online: https://dcase.community/challenge2018/task-bird-audio-detection (accessed on 20 November 2025).
  15. DCASE2018. Bird Audio Detection Challenge—Results. 2018. Available online: https://dcase.community/challenge2018/task-bird-audio-detection-results (accessed on 20 November 2025).
  16. Xie, J.; Hu, K.; Zhu, M.; Yu, J.; Zhu, Q. Investigation of Different CNN-Based Models for Improved Bird Sound Classification. IEEE Access 2019, 7, 175353–175361.
  17. Xie, J.; Zhu, M. Handcrafted features and late fusion with deep learning for bird sound classification. Ecol. Inform. 2019, 52, 74–81.
  18. Himawan, I.; Towsey, M.; Roe, P. 3D convolution recurrent neural networks for bird sound detection. In Proceedings of the 3rd Workshop on Detection and Classification of Acoustic Scenes and Events; Wood, M., Glotin, H., Stowell, D., Stylianou, Y., Eds.; Tampere University of Technology: Tampere, Finland, 2018; pp. 1–4.
  19. Cakir, E.; Adavanne, S.; Parascandolo, G.; Drossos, K.; Virtanen, T. Convolutional recurrent neural networks for bird audio detection. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos Island, Greece, 28 August–2 September 2017; pp. 1744–1748.
  20. Sankupellay, M.; Konovalov, D. Bird Call Recognition using Deep Convolutional Neural Network, ResNet-50. In Proceedings of the Acoustics 2018; James Cook University: Brisbane, QLD, Australia, 2018.
  21. Conde, M.; Shubham, K.; Agnihotri, P.; Movva, N.D.; Bessenyei, S. Weakly-Supervised Classification and Detection of Bird Sounds in the Wild. A BirdCLEF 2021 Solution. arXiv 2021, arXiv:2107.04878.
  22. Vo, H.T.; Thien, N.; Mui, K. Bird Detection and Species Classification: Using YOLOv5 and Deep Transfer Learning Models. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 01407102.
  23. Puget, J.F. STFT Transformers for Bird Song Recognition. In Proceedings of the Conference and Labs of the Evaluation Forum, Bucharest, Romania, 21–24 September 2021.
  24. Tang, Q.; Xu, L.; Zheng, B.; He, C. Transound: Hyper-head attention transformer for birds sound recognition. Ecol. Inform. 2023, 75, 102001.
  25. Su, Y.; Zhang, K.; Wang, J.; Madani, K. Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion. Sensors 2019, 19, 1733.
  26. Xiao, H.; Liu, D.; Chen, K.; Zhu, M. AMResNet: An automatic recognition model of bird sounds in real environment. Appl. Acoust. 2022, 201, 109121.
  27. Han, X.; Peng, J. Bird sound detection based on sub-band features and the perceptron model. Appl. Acoust. 2024, 217, 109833.
  28. Adavanne, S.; Drossos, K.; Çakir, E.; Virtanen, T. Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection. arXiv 2017.
  29. Zhang, X.; Chen, A.; Zhou, G.; Zhang, Z.; Huang, X.; Qiang, X. Spectrogram-frame linear network and continuous frame sequence for bird sound classification. Ecol. Inform. 2019, 54, 101009.
  30. Zhang, S.; Gao, Y.; Cai, J.; Yang, H.; Zhao, Q.; Pan, F. A Novel Bird Sound Recognition Method Based on Multi-feature Fusion and a Transformer Encoder. Sensors 2023, 23, 8099.
  31. Aman, E.; Wang, H.C. A deep learning-based embedded system for pest bird sound detection and proximity estimation. Eur. J. Eng. Technol. Res. 2024, 9, 53–59.
  32. Vandendriessche, J.; Wouters, N.; da Silva, B.; Lamrini, M.; Chkouri, M.Y.; Touhafi, A. Environmental Sound Recognition on Embedded Systems: From FPGAs to TPUs. Electronics 2021, 10, 2622.
  33. MathWorks. "What Is Quantization?". 2024. Available online: https://www.mathworks.com/discovery/quantization.html (accessed on 20 November 2025).
  34. Google. QKeras: A Quantization Deep Learning Library for TensorFlow Keras. 2026. Available online: https://github.com/google/qkeras (accessed on 4 January 2026).
  35. Stowell, D.; Plumbley, M.D. Freefield1010: An open dataset for research on audio field recording archives. In Proceedings of the 53rd Audio Engineering Society Conference on Semantic Audio (AES 53), London, UK, 26–29 January 2014.
  36. Stowell, D.; Wood, M.; Pamuła, H.; Stylianou, Y.; Glotin, H. Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge. Methods Ecol. Evol. 2018, 10, 368–380.
  37. microfaune. "microfaune_ai (updated fork)". 2020. Available online: https://github.com/W-Alphonse/microfaune_ai (accessed on 20 November 2025).
  38. Cho, K.; Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
Figure 1. Project workflow, from model selection to FPGA deployment.
Figure 2. Original model accuracy (AUC) training progression plot, 90.5%.
Figure 3. Proposed system architecture designed for a ZYNQ FPGA with a processing system and programmable logic.
Figure 4. Conv3D IP overview.
Figure 5. GRU block overview.
Figure 6. State diagram of the execution of the proposed BAD model.
Table 1. Accuracy comparison between different numbers of GRU cells in the bidirectional GRU layers.

| GRU Cells | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|---|---|---|---|---|---|---|---|
| Accuracy (AUC) | 90.50% | 90.25% | 90.01% | 89.89% | 89.81% | 89.72% | 89.80% |
Table 2. Comparison of the number of parameters used by the Bidirectional GRUs with 64 cells vs. 1 cell.

| Number Params | 64 Cells, BGRU_0 | 64 Cells, BGRU_1 | 1 Cell, BGRU_0 | 1 Cell, BGRU_1 |
|---|---|---|---|---|
| Forward Bias | 192 | 192 | 3 | 3 |
| F. Recurrent Bias | 192 | 192 | 3 | 3 |
| F. Kernel | 12,288 | 24,576 | 192 | 6 |
| F. Recurrent Kernel | 12,288 | 12,288 | 3 | 3 |
| Backward Bias | 192 | 192 | 3 | 3 |
| B. Recurrent Bias | 192 | 192 | 3 | 3 |
| B. Kernel | 12,288 | 24,576 | 192 | 6 |
| B. Recurrent Kernel | 12,288 | 12,288 | 3 | 3 |
| Total | 124,416 (both layers) | | 432 (both layers) | |
| Reduction Factor | 288 | | | |
Table 3. Bytes used by Conv3D weights.

| Bytes | Conv3D_0 | Conv3D_1 | Conv3D_2 | Conv3D_3 | Conv3D_4 | Conv3D_5 | All |
|---|---|---|---|---|---|---|---|
| Bias | 128 | 128 | 128 | 128 | 128 | 128 | |
| Kernel | 576 | 18,432 | 18,432 | 18,432 | 18,432 | 18,432 | |
| Scale | 32 | 32 | 32 | 32 | 32 | 32 | |
| Total | 768 | 18,592 | 18,592 | 18,592 | 18,592 | 18,592 | 93,728 |
Table 4. CONV unrolling cycles' exploration.

| Conv3D Unroll | K. Lines | K. Cols | Par. Filters | Cycles | BRAM | DSP | FF | LUT |
|---|---|---|---|---|---|---|---|---|
| Ultra96-V2 (available) | - | - | - | - | 216 | 360 | 141,120 | 70,560 |
| Baseline | 1 | 1 | 1 | 13,240,401 | 40 | 0 | 881 | 9761 |
| | 1 | 1 | 2 | 8,275,281 | 40 | 0 | 885 | 12,917 |
| Selected | 1 | 1 | 4 | 5,516,881 | 75 | 0 | 1138 | 18,492 |
| | 1 | 3 | 4 | 3,861,841 | 229 | 0 | 1618 | 34,678 |
| Fastest | 3 | 3 | 4 | 1,103,696 | 711 | 0 | 2778 | 64,537 |
Table 5. BGRU IP expected FPGA resource usage.

| | Cycles | BRAM | DSP | FF | LUT |
|---|---|---|---|---|---|
| ZU3CG FPGA (available) | - | 216 | 360 | 141,120 | 70,560 |
| BGRU IP | 186,633 | 24 | 0 | 1438 | 10,045 |
Table 6. Accuracy between the original, modified (original with a single GRU cell), and full quantized model with a single GRU cell.

| | Correct (Count) | Incorrect (Count) | Correct (%) | Incorrect (%) |
|---|---|---|---|---|
| Original | 362 | 38 | 90.50% | 9.50% |
| Modified | 359 | 41 | 89.80% | 10.20% |
| FPGA | 350 | 50 | 87.46% | 12.54% |
Table 7. Weights memory usage (in bytes) comparison between a model with 64 GRU cells and 1 GRU cell, using the previously mentioned weights calculations.

| Weights Memory | Convolutions | 1st GRU | 2nd GRU | Non-Quantized | Total |
|---|---|---|---|---|---|
| Model w/ 64 cells | 111,552 | 49,920 | 74,496 | 66,052 | 302,020 |
| Model w/ 1 cell | 111,552 | 402 | 30 | 66,052 | 178,036 |
| Reduction Factor | 1 | 124.2 | 2483.2 | 1 | 1.7 |
Table 8. FPGA resource usage of the hardware/software architecture.

| | BRAM | DSP | FF | LUT |
|---|---|---|---|---|
| Conv block | 75 | 0 | 1138 | 18,492 |
| GRU block | 24 | 0 | 1438 | 10,045 |
| Dense block | 2 | 64 | 2188 | 286 |
| Others | 7 | 0 | 1213 | 5138 |
| ZU3CG FPGA (available) | 216 | 360 | 141,120 | 70,560 |
| Usage % | 50.0% | 18.1% | 4.3% | 48.1% |
Table 9. Comparison of the milliseconds used in the CNN portion of the model, between software and hardware.

| Milliseconds | Conv 0 | Conv 1 | Conv 2 | Conv 3 | Conv 4 | Conv 5 | Total |
|---|---|---|---|---|---|---|---|
| Software | 535.0 | 20,902.1 | 10,673.4 | 10,661.5 | 5332.8 | 5289.5 | 53,394.3 |
| Hardware | 52.7 | 54.1 | 26.9 | 27.1 | 13.2 | 13.2 | 187.2 |
| Speed-Up Factor | 10.2 | 386.0 | 396.8 | 393.4 | 404.0 | 400.7 | 285.2 |
Table 10. Execution time of the inference of the model in the CPU and HW/SW architecture.

| Architecture | Execution Time (ms) |
|---|---|
| SW | 53,495.3 |
| HW/SW | 196.7 |
| Speed-Up Factor | 272× |
Table 11. Comparison of the proposed work with previous works.

| Work | Model | Params | Memory | AUC |
|---|---|---|---|---|
| [11] | CNN | 23.8M | 95 MBytes | 92.0% |
| [12] | CNN | 130K | 520 KBytes | 86.8% |
| [13] | CNN | 424K | 1.7 MBytes | 83.7% |
| [27] | FR + CNN | - | - | 98.7% * |
| [18] | CNN + RNN | 380K | 1.5 MBytes | 87.1% |
| [19] | CNN + RNN | 806K | 3.2 MBytes | 88.5% |
| Ours | CNN + RNN | 190K | 92 KBytes | 87.4% |

* Only accuracy is reported, not AUC.