FPGA Implementation of Blue Whale Calls Classifier Using High-Level Programming Tool

In this paper, we propose a hardware-based architecture for automatic blue whale calls classification based on short-time Fourier transform and multilayer perceptron neural network. The proposed architecture is implemented on field programmable gate array (FPGA) using Xilinx System Generator (XSG) and the Nexys-4 Artix-7 FPGA board. This high-level programming tool allows us to design, simulate and execute the compiled design in Matlab/Simulink environment quickly and easily. Intermediate signals obtained at various steps of the proposed system are presented for typical blue whale calls. Classification performances based on the fixed-point XSG/FPGA implementation are compared to those obtained by the floating-point Matlab simulation, using a representative database of the blue whale calls.


Introduction
In the last decades, many pattern recognition systems have been proposed in various signal processing fields such as speech recognition, speaker identification/verification, musical instrument identification, biomedical signal classification, bioacoustic signal recognition, etc.A typical pattern recognition system is composed of two main blocks: feature extraction and classification.The feature extraction is the process that transform the pattern signal from original high-dimensional representation into lower-dimensional space, while the classification task operates to associate the signal pattern to one of predefined classes.The literature review shows that discrete Fourier transform (DFT), mel frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), discrete wavelet transform (DWT), and time-frequency contours have been used for feature extraction, while artificial neural networks (ANN), support vector machine (SVM), Gaussian mixture models (GMM), vector quantization (VQ), Fisher's discriminant analysis (FD), and k-nearest neighbor (k-NN), classification and regression tree (CART), dynamic time warping (DTW), and hidden Markov models (HMM) have been used for the recognition/classification task.
The pattern recognition algorithms were generally developed and optimized using software platform such as Matlab (R2012b, MathWorks, Natick, MA, USA, 2012).However, many applications require hardware implementation of these algorithms into real-time embedded systems.Hardware implementation remains a great challenge that requires tradeoff between accuracy, resource, and computation speed [18,19].Most of the hardware-based architectures are proposed for speech and speaker recognition using digital signal processor (DSP) [20] or field programmable gate array (FPGA) [21][22][23][24][25][26].FPGAs are generally preferred to the other hardware approaches because they combine the high-performance of the application-specific integrated circuits (ASICs) and the flexibility of the DSPs.
However, to the best of our knowledge, no hardware architecture has been proposed in the literature to classify/recognize bioacoustic signals.In this paper, we propose a hardware implementation of an automatic blue whale calls recognition based on the short-time Fourier transform (STFT) and the multilayer perceptron (MLP) neural network.Based on our previous algorithm developed using Matlab software [10], the proposed architecture has been optimized and implemented on FPGA using Xilinx System Generator (XSG) and Nexys-4 Artix-7 FPGA board.This architecture takes advantage of the native parallelism of the FPGA chips, instead of sequential processors implemented on these chips [27].The main contributions are: (1) Optimization of the MLP-based classifier by reducing the number of its hidden neurons; (2) Development of XSG-based models of the mathematical equations describing the characterisation/classification algorithm to be implemented on FPGA; (3) Optimization of the required resources by reducing the fixed-point data format.

Feature Extraction
The implemented technique is based on the short-time Fourier transform (STFT) to extract a reduced number of features from two frequency subbands corresponding to the AB phrase and vocalization D, respectively [10,11].The vocalization signal is split into successive frames before applying the STFT to each frame in order to extract 12-dimensional feature vector x m (Figure 3).Feature extraction process based on the short-time Fourier transform (STFT).Power spectrum density P s (m, k) of each signal frame s(m, k) is computed using the STFT.The feature vector x m is constructed by extracting six components from lower subband (15.1375-20.508Hz) and six components from upper subband (38.574-84.961Hz).The discrete frequencies are obtained with a sampling frequency of 250 Hz and a frame length of 512 samples.

Signal Windowing
The vocalization signal is divided into consecutive frames of N samples, s(m, n), where m is the frame index and n is the time index within the analyzed frame.
where w(n) is a window function of N samples.This function is located at mL, where L is the shift time step in samples.Hamming window is usually selected to reduce signal discontinuities during the segmentation and their resulting spectral artifacts.

Short-Time Fourier Transform
The short-time Fourier transform (STFT) of a given frame s(m, n) is defined by: where N is the number of discrete frequencies that is usually chosen to be a power-of-2 in order to use the fast Fourier transform (FFT) algorithm, j = √ −1, and k is the frequency index (k = 0, . . ., N − 1).

Power Spectrum
The power spectrum density (PSD) is computed from the complex Fourier spectrum using At the sampling frequency f S , each signal frame is represented by N-points PSD covering the frequency range [− f S /2, f S /2].As the power spectrum is symmetric, it can be described by only N/2 discrete frequencies.Each frequency index k represents a discrete frequencies f k = k f S /N, where 0 ≤ k < N/2.

Feature Vector Calculation
This method extracts features from two subbands (15.137-20.508Hz) and (38.574-84.961Hz) corresponding to the AB and D calls frequency ranges, respectively [10,11] (Figure 3).The first six components of the feature vector x m are calculated by averaging PSD points between P s (m, 31) and P s (m, 42) by bins of 2 points.The last six features are similarly calculated for the PSD interval from P s (m, 79) to P s (m, 174) but with bins of 16 points.The feature vector components, of a given frame m, are defined by Thus, the analyzed frame, s(m, n), will be characterized by the extracted feature vector, x m = [x m,1 , x m,2 , . . ., x m,12 ] T , where T represents the transpose operation.

Classification
Based on the extracted features vectors from a given vocalization, the classification task operates to associate this vocalization to one of predefined classes.The multi-layer Perceptron (MLP) is probably the most popular and the simplest artificial neural network that can be used for this kind of classification problems [10,11].However, other techniques such as vector quantization (VQ), k-nearest neighbour (k-NN) and Gaussian mixture models (GMM) can also be employed.

Multilayer Perceptron
The MLP neural network is a set of several interconnected neurons arranged in layers: an input layer that acts as a data buffer, one or more hidden layers, and an output layer.Figure 4 represents an example of a MLP network characterized by N 0 = 12 inputs, one hidden layer of N 1 neurons, and N 2 = 3 outputs.Each hidden neuron j receives the output of each node i from the input layer through a connection of weight w h j,i and then produce a corresponding response y h j which is forwarded to the output layer.In fact, each neuron j performs a weighted sum that is transferred by a nonlinear function ϕ h according to where w h j,0 represents the bias of the jth hidden neuron.Similarly, the output of neuron k, in output layer, is defined by: where ϕ o is the activation function, w o k,j is synaptic weight relating the output of the j th neuron in the hidden layer to the k th neuron of the output layer, and w o k,0 is the bias of the k th output neuron.The activation functions of the hidden neurons are typically hyperbolic tangent or logistic sigmoid.However, those of the output neurons can be linear or non-linear depending on the task performed by the network: a function approximation or a classification, respectively.In the proposed architecture, we used a logistic sigmoid function ϕ(x) = (1 + e −x ) −1 for neurons of both layers.The connection weights and biases are determined in the training phase using a set of inputs for which the desired outputs are known {(x 1 , d 1 ), (x 2 , d 2 ) , ..., (x p , d p )}.In the pattern recognition problems, the desired or target outputs are vectors of N 2 components, where N 2 = 3 is the number of the reference classes.For each desired vector d i , only one component, corresponding to the presented input pattern x i , is set to 1 and the others are set to 0. To accomplish the training task, the backpropagation (BP) algorithm is commonly used [28].For the implemented architecture, the training phase is performed on Matlab and the resulting weights and biases are transferred to the XSG constant blocks.

Class Recognition
To classify an unknown vocalization, the observation sequence X = {x 1 , x 2 , . . ., x M }, corresponding to the feature vectors obtained from M adjacent segments of this vocalization, is presented to the MLP neural network that responds by the actual output sequence Y = {y 1 , y 2 , . . ., y M }.For each output k, the network provides a set of M values {y m,k }, m = 1, . . ., M. To classify the entire vocalization, the means of the values obtained over frames at the neural network outputs are commonly used.The unknown vocalization is associated to the reference call that corresponds to the largest mean value of the individual outputs: where the mean values y k are computed over the M frames.
To avoid the use of a counter block to calculate the number of frames M in the tested vocalization and a divider block at each MLP output, the mean values Equation (10), are replaced by sum values Equation ( 11)

FPGA Implementation
Our objective is to propose a hardware-based architecture for automatic blue whale calls classification.The FPGAs have been chosen over the other hardware approaches because they combine the high-performance of the ASICs and the flexibility of DSPs.FPGAs are programmed using Hardware Description Languages (HDL) such as VHDL and Verilog, which require much experience with the hardware design and remain time consuming to perform the low-level description and debugging of the circuit to be mapped on FPGAs.In our case, we opted for a high-level programming tool (XSG) that allows us, with a high-level of abstraction, to design, simulate and execute the compiled design in the Matlab/Simulink environment quickly and easily.Once the XSG-based architecture is verified in Simulink, it can automatically be mapped to hardware implementation by Xilinx ISE Design Suite.The above techniques for feature extraction and classification were implemented on FPGA using the architecture presented in Figures 5-8, where the main parts are described in the following subsections.

Signal Windowing
The windowed signal s(m, n) was obtained by multiplying the input signal s(n) with a Hamming window w(n) of N = 512 samples, stored in a read only memory (ROM).The Xilinx ROM block is driven by a modulo-N cyclic counter [29] (Figure 5).

Short-Time Fourier Transform
The short-time Fourier transform (STFT) of a given input frame, s(m, n), is computed using a Xilinx FFT (fast Fourier transform) block.Pipelined streaming option has been chosen to achieve continuous computation of the STFT [29].As shown in Figure 5, the Xilinx FFT block provides two outputs xk_re and xk_im corresponding to real, S r (m, k), and imaginary, S i (m, k), parts of S(m, k).The output signals xn_index and xk_index mark the time index n of the input data and the frequency index k of the output data, respectively.The output signal done toggles high when the FFT block is ready to output the processed data, while the edone output signal toggles one sample period before.

Power Spectrum
The real and imaginary parts of the Fourier transform provided by the XSG FFT block are used to calculate the power spectrum defined in Equation ( 3).The corresponding subsystem (Figure 5) requires two multipliers and one adder [29].It can be noted that division by N = 512 is replaced with 9-bit right shift.

Feature Extraction
The feature extraction block is constructed using twelve subblocks that compute six feature components from the lower subband and six others from the upper subband according to Equation (4).The first component of the lower subband (x m,1 ) and the first component of the upper subband (x m,7 ) are described in Figure 5.Each feature component subblock uses adders and delays to compute the sum of successive samples of the power spectrum.Division by 2 and 16 are replaced with 1-bit and 4-bit right shifts, respectively.Each subblock saves its corresponding feature component using a register enabled by the frequency index (k).Twelve other registers are placed at the output of these subblocks to update them at the end of every frame.They are enabled by the done signal provided by the FFT block [29].

MLP Neural Network
The implemented MLP architecture is presented in Figure 6, where each hidden neuron is constructed using adders and multipliers to compute the weighted sum.The XSG constant blocks correspond to the synoptic weights and biases that are computed in the training phase using Matlab language.The logistic sigmoid function is implemented using XSG ROM block of 2 NB samples, where NB = 14 is the size of the bus address.To convert a given input value, v h j (m), to its corresponding address, this input is 6-bit left shifted rather than multiplied by 2 NB /AM = 64, where AM is the maximum amplitude of the input values that is found to be less than 256.This implementation takes advantage from the symmetrical characteristic of this activation function (ϕ h (−t) = 1 − ϕ h (t)).On the other hand, Figure 7 presents the output neuron structure and its logistic sigmoid function.

One Value Per Frame
For a given input vocalization, the neural network outputs remain constant during each frame; so only one value per frame can be used to compute the sums of the neural network outputs according to Equation (11).As shown in Figure 5, the "one sample per frame" block provides the ctrl signal that toggles high at the end of each of the proceeded M frames, by taking into account the computation delays.This block is based on the signals edone and eof provided by the blocks FFT and "From multimedia", respectively.

Class Recognizer
For a vocalization of M frames presented to the system input, the class recognizer block compares the resulting neural network outputs to associate it to one of the predefined classes.As shown in Figure 8, the class recognizer uses an accumulator at each of the neural network outputs y 1 , y 2 , and y 3 that correspond to A, B, and D vocalization classes, respectively.Enabled by the ctrl signal, these accumulators compute the sums of the values obtained at these outputs during the M frames, according to Equation (11).Comparator and multiplexer blocks are used to find the index k of the largest cumulated output among y 1 , y 2 , and y 3 , according to Equation (9).The recognized class is represented by integer values k = 1, . . ., 3 that correspond to A, B, and D vocalization classes, respectively.

Implementation Characteristics
Starting with 36-bit fixed-point format in various blocks of the XSG-based model in order to ensure the same classification performance as the floating-point format of Matlab simulation, the data width was progressively reduced using a test-error approach.The fixed-point formats (generally of 24 bits) used in different blocks, that guarantee the same performance as Matlab, are shown in Figures 5-8.For example, FIX_24_19 represents a 2's complement signed 24-bit fixed-point format having 19 fractional bits.UFIX_24_10 represents an unsigned 24-bit fixed-point format having 10 fractional bits.
Once completed and verified, the XSG-based architecture can be automatically mapped to hardware implementation.Two FPGA chips are targeted in this study: Artix-7 XC7A100T and Virtex-6 XC6VLX240T available on the Nexys-4 and ML605 boards, respectively.Table 1 gives the resource utilization, the total power consumption, and the maximum operating frequency, as provided by the Xilinx ISE 14.7 tool.These values are based on the fixed-point data format and the ROMs size indicated on Figures 5-8.The proposed architecture uses 219 DSP48E1s blocks.This is justified by the parallel architecture of the MLP networks that requires 12 multipliers for each hidden neuron and 7 multipliers for each output neuron.Thus, the MLP network requires 7 × 12 + 3 × 7 = 105 multipliers of numbers coded with 24 bits.The power spectrum block needs 2 multipliers of numbers coded with 26 bits, while the windowing block needs 1 multiplier of numbers coded with 16 bits.Considering the fact that each DSP48E1 slice contains one 25 × 18 multiplier, a multiplication of two numbers coded with 26-bit, 24-bit and 16-bit fixed-point format requires 4, 2 and 1 DSP48E1s, respectively.Therefore, the whole architecture requires 105 × 2 + 2 × 4 + 1 = 219 DSP48E1s.Except for DSP48E1s, this architecture requires only a small part of resources available on these FPGA (Artix-7 and Virtex-6).Difference in the maximum operating frequency is probably due to the speed grades of these FPGAs.
The implemented architecture on the Artix-7 XC7A100T FPGA operates at more than 25 MHz and consumes less than 123 mW.It can be an interesting solution for autonomous real-time classification system.The blue whale calls are sampled at 2 kHz.

Database
The recordings were collected in the Saguenay-St.Lawrence Marine Park at the head of the 300-m deep Laurentian channel in the Lower St. Lawrence Estuary.The used hydrophones were AURAL autonomous hydrophones (Multi-Electronique Inc., Rimouski, QC, Canada) moored at intermediate depths in the water column, as in standard oceanographic mooring.The 16-bit recording data were acquired at the 2000 Hz optional sampling rate of the AURALs, which includes an appropriate antialiasing low-pass filter of 1000 Hz [30].For the blue whale calls classification, the selected recordings were down-sampled to 250 Hz [10,11].

Protocol
Blue whale vocalizations were extracted manually from records and categorized into the three classes (A, B and D) by the visualization of their spectrogram using Adobe Audition (1.5, Adobe, San Jose, CA, USA, 2004) software.Each class of the constructed database contains 100 calls.To use the whole dataset, the "k-fold cross-validation" method is employed to evaluate the performance of the implemented characterization/classification architecture.The principle of this method consists in dividing each class into 10 groups, each time we use 9 groups for training and the last one for the test [10,11].

Classification Performance
The performance of the proposed classifier is evaluated by the ratio of the number of the correctly classified vocalization (NCCV) to the number of tested vocalizations (NTV) [10,11].

. Simulation Using XSG Blockset
This section presents the simulation results obtained with the proposed architecture implemented with XSG blockset (Figures 5-8).These blocks provide bit and cycle accurate modeling for arithmetic and logic functions, memories, and DSP functions.
Figure 9 presents an A call input and its corresponding response signals obtained at different levels of the implemented characterization/classification architecture.The input signal s(n) is first split into non-overlapped frames s(m, n) using Hamming windows.The FFT block has an output latency of 1150 samples that corresponds to the delay between the first sample of a given input frame s(m, n) and the first sample of its Fourier transform S(m, k).The power spectrum P s (m, k) is computed for all the frequency components k of the frame m.As shown for x m (3), the feature vector components of a given frame m are updated at the end of this frame.The neural network outputs y m (1), y m (2) and y m (3) are calculated for each frame.The ctrl signal toggles high at the end of each of the proceeded M frames.There are only eight pulses because the length of this signal is greater than 8N but less than 9N, where N is the frame length of 512 samples.The class recognizer provides a provisory class index that is based on the neural network outputs of the present and the previous frames.The actual class recognition will be available only when the last frame is proceeded.The A call is classified as B during the first three frames, but is truly classified as A at the last frame (Figure 9).Figures 10 and 11 present the same response signals but for B and D calls, respectively.Note that the B call (Figure 10) is confounded with the A call during the first four frames.However, the D call (Figure 11) is clearly separated from the A and B ones during all frames.
On the other hand, Figure 12 shows the feature vectors, the MLP neural network outputs, and the class recognizer output obtained for an A call.It can be seen that the first neural network output is globally predominant, but during the first three frames, the A call is confounded with the B call.Figures 13 and 14 present the same response signals but for the B and D calls, respectively.Figure 13 shows that the second neural network output is globally predominant, but the B call is confounded with the A call during the first four frames.Figure 14 shows that the D call is clearly discriminated from the A and B calls during all frames.Note that A and B calls are principally represented by the first half of the feature components {x m (1), . . ., x m (6)}, while the D call is represented by the second half ones {x m (7), . . ., x m (12)} (Figures 12-14).The ML605 evaluation kit is pre-configured in the Simulink/XSG environment for hardware co-simulation.System Generator block allows us to choose a clock frequency target design of 33.3, 50, 66.7 or 100 MHz.However, the Nexys-4 board must be configured by the customer to achieve the hardware co-simulation.Once the designed model is verified by Simulink simulation using XSG blockset, it can be executed on actual FPGA using the hardware co-simulation compilation provided by the System Generator.Depending on the targeted board, this compilation automatically creates a bitstream and associated it to a Simulink block (Figure 15).When the design is simulated, the compiled model is executed in hardware with the flexible simulation environment of Simulink.System Generator reads the input data from wav file in Simulink environment and send them to the design on the board using the JTAG connection.It then reads the classification result (output) back from JTAG and sends it to Simulink for display.
Table 2 shows that the fixed-point XSG implementation gives the same performance (85.67%) as the floating-point Matlab one using the described database.Table 3 gives the confusion matrix of the true classes versus assigned classes for the fixed-point and floating-point implementations of the STFT/MLP based classifier.These results show the high performance of the implemented method to separate D class from A and B, but its difficulty to discriminate A and B classes [10].The obtained classification result (85.67%) with only 7 hidden neurons is comparable to that (86.25%) obtained with 25 hidden neurons [10].This slight difference can also be explained by other experimental test conditions : (a) the feature vectors are extracted with 0% overlap instead of 50%; (b) the use of logistic sigmoid in hidden layer instead of hyperbolic tangent; and (c) performance is obtained from one test instead of averaged values of 50 repeated tests [10].
It can be noted that the classification error between the fixed-point XSG implementation and the floating-point Matlab simulation will not be zero if the data width is not sufficient.This error will increase if the number of bits decreases under the values indicated in Figures 5-8.

Conclusions
A blue whale calls classification technique based on short-time Fourier transform and MLP neural network has been successfully implemented on FPGA using a high-level programming language (XSG).The fixed-point XSG implementation, on the low-cost Nexys-4 Artix-7 FPGA board, provides the same classification performance as the floating-point Matlab simulation using a reduced number of bits.The maximum operating frequency and the total power consumption obtained for the Artix-7 XC7A100T FPGA show that the proposed architecture can be used to build an autonomous real-time classification system.
As future work, a vocalization activity detector (VAD) will be added to automatically extract the calls before there classification.

Figure 1 .
Figure 1.Spectrogram of an A call (a), a B call (b), and a D call (c).

Figure 2
Figure 2 represents the block diagram of a typical classification system, which is composed of the two main blocks: feature extraction and classification.

Figure 2 .
Figure 2. Block diagram of a typical classification system based on short-time Fourier transform (STFT) and multilayer perceptron (MLP) neural network.

Figure 3 .
Figure 3.Feature extraction process based on the short-time Fourier transform (STFT).Power spectrum density P s (m, k) of each signal frame s(m, k) is computed using the STFT.The feature vector x m is constructed by extracting six components from lower subband(15.1375-20.508Hz) and six components from upper subband (38.574-84.961Hz).The discrete frequencies are obtained with a sampling frequency of 250 Hz and a frame length of 512 samples.

Figure 5 .
Figure 5. Implementation of the proposed classification technique on field programmable gate array (FPGA) using Xilinx System Generator (XSG).The top-level Simulink diagram (top panel) with details of different subsystems (bottom panels).The green blocks are designed using the XSG blocks (blue).The white blocks are the standard Simulink blocks.

Figure 6 .
Figure 6.Architecture of the implemented MLP neural network.The hidden neuron structure, and its sum and hyperbolic tangent blocks are also detailed.

Figure 7 .
Figure 7. Implementation of the output neuron and its sum and logistic sigmoid blocks.

Figure 8 .
Figure 8. Implementation of the class recognizer.

Figure 9 .
Figure 9. Response signals obtained during the characterization/classification of an A call.(a) input signal s(n); (b) time index n that allows to delimit successive frames s(m, n); (c) windowed signal s(m, n); (d) frequency index k that allows to delimit successive power spectra; (e) power spectra P s (m, k); (f) done signal; (g) third component of the feature vector x m (3); (h) first MLP output y m,1 ; (i) second MLP output y m,2 ; (j) third MLP output y m,3 ; (k) ctrl signal; and (l) recognized class.

Figure 10 .
Figure 10.Response signals obtained during the characterization/classification of a B call.(a) input signal s(n); (b) time index n that allows to delimit successive frames s(m, n); (c) windowed signal s(m, n); (d) frequency index k that allows to delimit successive power spectra; (e) power spectra P s (m, k); (f) done signal; (g) third component of the feature vector x m (3); (h) first MLP output y m,1 ; (i) second MLP output y m,2 ; (j) third MLP output y m,3 ; (k) ctrl signal; and (l) recognized class.

Figure 11 .
Figure 11.Response signals obtained during the characterization/classification of a D call.(a) input signal s(n); (b) time index n that allows to delimit successive frames s(m, n); (c) windowed signal s(m, n); (d) frequency index k that allows to delimit successive power spectra; (e) power spectra P s (m, k); (f) done signal; (g) ninth component of the feature vector x m (9); (h) first MLP output y m,1 ; (i) second MLP output y m,2 ; (j) third MLP output y m,3 ; (k) ctrl signal; and (l) recognized class.

Figure 12 .
Figure 12.(a) feature vectors based on STFT extracted from an A call; (b) MLP neural network outputs; and (c) class recognizer output.During the first three frames, the A call is confounded with B.

Figure 13 .
Figure 13.(a) feature vectors based on STFT extracted from a B call; (b) MLP outputs; and (c) class recognizer output.During the first frames, the B call is confounded with A.

Figure 14 .
Figure 14.(a) feature vectors based on STFT extracted from a D call; (b) MLP outputs; and (c) class recognizer output.Over all frames, the D call is discriminated from the A and B calls.

Table 1 .
Resource utilization, maximum operating frequency, and total power consumption obtained for the Artix-7 XC7A100T and Virtex-6 XC6VLX240T chips, as reported by the Xilinx ISE 14.7 tool.Each FPGA slice contains four LUTs (look-up tables) and eight flip-flops.Each DSP48E1 slice includes a 25 × 18 multiplier, an adder, and an accumulator.

Table 2 .
Performances obtained with XSG and Matlab based implementations.

Table 3 .
Confusion matrix of XSG and Matlab based implementations.