# Hardware–Software Co-Design of an Audio Feature Extraction Pipeline for Machine Learning Applications


## Abstract


## 1. Introduction

## 2. Audio Preprocessing

#### 2.1. Classic MFCC Features

- We start by sampling the audio with a microphone and an analog-to-digital converter (ADC). We take 16,384 samples because this simplifies framing in the next step: $$X[n], \quad n \in [0, 16383]$$
- Next, we separate the audio into frames and apply a windowing function to reduce the effect of spectral leakage. In the illustrated example, we employ a frame size of 512, thus obtaining 32 frames ($32 \times 512 = 16384$). A commonly used windowing function is the Hamming window shown in Figure 2: $$X[n] \to X'[w, n'] \cdot win[n'] = X''[w, n'], \quad w \in [0, 31], \quad n' \in [0, 511]$$
- We perform the Short-Time Fourier Transform (STFT) [9], which is essentially a Discrete Fourier Transform (DFT) applied to each frame. Due to the symmetry of the DFT for real signals, it suffices to keep only the first half of the frequency bins plus one (512/2 + 1 = 257): $$X''[w, n'] \to STFT(X'') = Y[w, k], \quad w \in [0, 31], \quad k \in [0, 256]$$
- We filter the power spectrum with a mel-frequency filter bank matrix. The mel-frequency filter bank is a set of triangular filters whose widths increase exponentially with frequency to mimic the non-linear frequency perception of the human ear. The number of filters in the bank is a settable parameter; in our example, it is set to 6. Figure 3 shows an example mel-frequency filter bank with 6 filters: $$Y[w, k] \to |Y[w, k]| \cdot M = Y', \quad shape(M) = (257, 6), \quad shape(Y') = (32, 6)$$
- We apply the natural logarithm to each element of matrix ${Y}^{\prime}$.
- Finally, we compute the Discrete Cosine Transform (DCT) on ${Y}^{\prime}$ to obtain the MFCC features.
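The six steps above can be sketched end-to-end in NumPy. This is an illustrative reconstruction, not the paper's implementation: the mel-scale formula, the filter-bin placement, and the use of the magnitude spectrum in step 4 are common defaults filled in as assumptions.

```python
import numpy as np

def mfcc(x, frame_size=512, num_mels=6, sample_rate=16000):
    """Sketch of the classic MFCC pipeline of Section 2.1 (NumPy only)."""
    # 1) Framing: 16,384 samples -> 32 frames of 512.
    frames = x.reshape(-1, frame_size)
    # 2) Hamming window to reduce spectral leakage.
    frames = frames * np.hamming(frame_size)
    # 3) STFT: real-input DFT per frame, keeping 512/2 + 1 = 257 bins.
    Y = np.fft.rfft(frames)                          # shape (32, 257)
    # 4) Triangular mel filter bank. The mel-scale formula and bin
    #    placement here are common defaults, not the paper's exact design.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), num_mels + 2))
    bins = np.floor((frame_size + 1) * pts / sample_rate).astype(int)
    M = np.zeros((frame_size // 2 + 1, num_mels))
    for j in range(num_mels):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        M[l:c, j] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        M[c:r, j] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    Yp = np.abs(Y) @ M                               # shape (32, 6)
    # 5) Log compression (natural log, with a small guard against log(0)).
    L = np.log(Yp + 1e-10)
    # 6) DCT-II along the mel axis yields the MFCC features.
    n = np.arange(num_mels)
    D = np.cos(np.pi / num_mels * (n[:, None] + 0.5) * n[None, :])
    return L @ D                                     # shape (32, 6)
```

Calling `mfcc` on a 16,384-sample signal returns a (32, 6) feature matrix, matching the shapes in the equations above.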

#### 2.2. Simplified MFCC

- Arithmetic in floating-point representation requires complex circuits to implement multiplication and addition. We therefore use fixed-point arithmetic, which is simpler to implement and more energy-efficient.
- Instead of computing the full power spectrum, we use just the real part of the DFT result. We do this because the real part holds most of the information.
- Instead of the natural logarithm, we use a base-2 logarithm approximation. Essentially, we take the integer part of the result, which equals the position of the leading one bit.
- We noticed that, due to the logarithm approximation, low-amplitude sound intervals were indistinguishable from completely silent intervals. To mitigate this, we added the value 1.0 to each DFT frequency bin, which increased the bin amplitudes while preserving their variation.
- We skip the DCT calculation entirely.
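The base-2 log approximation and the +1.0 offset from the bullets above can be illustrated in a few lines of Python. The function names are ours; the bit-level trick is the leading-one-position idea described in the text.

```python
import numpy as np

def log2_approx(x):
    """Integer part of log2(x) for x >= 1: the position of the leading
    one bit of the integer part, as described in Section 2.2."""
    return max(int(x), 1).bit_length() - 1

def lmfe_log_stage(real_bins):
    """Offset-then-log stage of Section 2.2 (illustrative name).
    Adding 1.0 maps silent bins to log2(1) = 0 instead of a large
    negative value, so quiet intervals stay distinguishable."""
    shifted = np.abs(np.asarray(real_bins, dtype=float)) + 1.0
    return np.array([log2_approx(v) for v in shifted])
```

For example, `log2_approx(1000)` gives 9, since the leading one bit of 1000 sits at position 9 (512 <= 1000 < 1024).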

## 3. Hardware

#### 3.1. SDF-FFT

#### 3.2. Mel-Engine

## 4. Python Integration: Chisel4ml

## 5. Case Study: Keyword Spotting on Google Speech Commands

#### 5.1. Depthwise Separable Convolutional Neural Network

#### 5.2. Experimental Settings

#### 5.3. Keyword-Spotting Results

- MFCC—The full floating-point MFCC as described in Section 2.1.
- W/O DCT—Same as MFCC but without the DCT at the end.
- LOG2APPROX—Same as W/O DCT but using the logarithm base-2 approximation instead of the natural logarithm.
- LMFE—The LMFE features as described in Section 2.2.

## 6. Hardware Synthesis Results

- frame_size—The size of the frame in STFT (128, 256, 512, and 1024).
- num_frames—The number of windowed frames that together make up one input sample (8, 16, 32, and 64).
- num_mels—The number of mel filters (10, 13, 15, and 20).
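As a sanity check on the parameter space, the throughput column T in Appendix B is consistent with processing frame_size · num_frames samples per pipeline pass of the listed cycle count. The clock frequency below is inferred by fitting the table values; it is an assumption, not a figure stated in the text.

```python
# F_CLK_MHZ is inferred by fitting the Appendix B rows, not stated in the paper.
F_CLK_MHZ = 66.29

def throughput_msps(frame_size, num_frames, cycles):
    """Throughput in Msamples/s: samples per pass divided by pass time."""
    return frame_size * num_frames * F_CLK_MHZ / cycles

# Spot-check rows from Appendix B: (frame_size, num_frames, cycles, T).
rows = [
    (128, 8, 225, 301.72),
    (256, 16, 421, 645.00),
    (512, 32, 809, 1342.63),
    (1024, 64, 1581, 2748.09),
]
```

Under this model, doubling num_frames doubles throughput at a fixed cycle count, which matches the pattern visible in the table.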

## 7. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning
---|---
MFCC | Mel-Frequency Cepstrum Coefficients
LMFE | Log-Mel Filter Bank Real Energy
DCT | Discrete Cosine Transform
DFT | Discrete Fourier Transform
FFT | Fast Fourier Transform
STFT | Short-Time Fourier Transform
LUT | Look-up Table
FF | Flip-Flop
DSP | Digital-Signal Processing block
DS-CNN | Depthwise Separable Convolutional Neural Network
FPGA | Field-Programmable Gate Array
KWS | Keyword Spotting
MFROM | Mel-Filter Read Only Memory
ReLU | Rectified Linear Unit

## Appendix A. Using Chisel4ml

**Listing A1.** Using chisel4ml.

```python
import tensorflow as tf
import numpy as np
from chisel4ml import generate, FFTConfig, LMFEConfig, FFTLayer, LMFELayer

# frame_length, num_frames, num_mels, and audio_sample are assumed
# to be defined by the caller.
preproc_model = tf.keras.Sequential()
preproc_model.add(tf.keras.layers.Input(shape=(num_frames, frame_length)))
preproc_model.add(
    FFTLayer(
        FFTConfig(
            fft_size=frame_length,
            num_frames=num_frames,
            win_fn=np.hamming(frame_length),
        )
    )
)
preproc_model.add(
    LMFELayer(
        LMFEConfig(
            fft_size=frame_length,
            num_frames=num_frames,
            num_mels=num_mels,
        )
    )
)
preproc_circuit = generate.circuit(preproc_model)

sw_res = preproc_model(audio_sample)
hw_res = preproc_circuit(audio_sample)
assert np.allclose(
    sw_res.numpy().flatten(),
    hw_res.flatten(),
    atol=1,
    rtol=0.05,
)
```

## Appendix B. Synthesis Results

Parameterization (frame_size, num_frames, num_mels) | LUT | FF | DSP | Cycles | T [Msamples/s] | DP [W]
---|---|---|---|---|---|---
(128, 8, 10) | 4353 | 2557 | 40 | 225 | 301.72 | 0.105
(128, 8, 13) | 4362 | 2557 | 40 | 225 | 301.72 | 0.105
(128, 8, 15) | 4351 | 2557 | 40 | 225 | 301.72 | 0.106
(128, 8, 20) | 4353 | 2558 | 40 | 225 | 301.72 | 0.106
(128, 16, 10) | 4360 | 2558 | 40 | 225 | 603.44 | 0.105
(128, 16, 13) | 4358 | 2558 | 40 | 225 | 603.44 | 0.106
(128, 16, 15) | 4355 | 2558 | 40 | 225 | 603.44 | 0.105
(128, 16, 20) | 4362 | 2559 | 40 | 225 | 603.44 | 0.106
(128, 32, 10) | 4349 | 2559 | 40 | 225 | 1206.87 | 0.105
(128, 32, 13) | 4355 | 2559 | 40 | 225 | 1206.87 | 0.105
(128, 32, 15) | 4354 | 2559 | 40 | 225 | 1206.87 | 0.105
(128, 32, 20) | 4361 | 2560 | 40 | 225 | 1206.87 | 0.105
(128, 64, 10) | 4356 | 2560 | 40 | 225 | 2413.74 | 0.106
(128, 64, 13) | 4355 | 2560 | 40 | 225 | 2413.74 | 0.106
(128, 64, 15) | 4357 | 2560 | 40 | 225 | 2413.74 | 0.106
(128, 64, 20) | 4363 | 2561 | 40 | 225 | 2413.74 | 0.105
(256, 8, 10) | 5110 | 2870 | 46 | 421 | 322.50 | 0.117
(256, 8, 13) | 5112 | 2870 | 46 | 421 | 322.50 | 0.117
(256, 8, 15) | 5114 | 2870 | 46 | 421 | 322.50 | 0.117
(256, 8, 20) | 5115 | 2871 | 46 | 421 | 322.50 | 0.117
(256, 16, 10) | 5114 | 2871 | 46 | 421 | 645.00 | 0.117
(256, 16, 13) | 5114 | 2871 | 46 | 421 | 645.00 | 0.117
(256, 16, 15) | 5113 | 2871 | 46 | 421 | 645.00 | 0.117
(256, 16, 20) | 5114 | 2872 | 46 | 421 | 645.00 | 0.117
(256, 32, 10) | 5107 | 2872 | 46 | 421 | 1290.00 | 0.117
(256, 32, 13) | 5113 | 2872 | 46 | 421 | 1290.00 | 0.117
(256, 32, 15) | 5107 | 2872 | 46 | 421 | 1290.00 | 0.117
(256, 32, 20) | 5112 | 2873 | 46 | 421 | 1290.00 | 0.117
(256, 64, 10) | 5113 | 2873 | 46 | 421 | 2580.01 | 0.117
(256, 64, 13) | 5113 | 2873 | 46 | 421 | 2580.01 | 0.117
(256, 64, 15) | 5115 | 2873 | 46 | 421 | 2580.01 | 0.117
(256, 64, 20) | 5116 | 2874 | 46 | 421 | 2580.01 | 0.117
(512, 8, 10) | 6216 | 3269 | 52 | 809 | 335.66 | 0.140
(512, 8, 13) | 6218 | 3269 | 52 | 809 | 335.66 | 0.140
(512, 8, 15) | 6221 | 3269 | 52 | 809 | 335.66 | 0.140
(512, 8, 20) | 6224 | 3270 | 52 | 809 | 335.66 | 0.140
(512, 16, 10) | 6222 | 3270 | 52 | 809 | 671.31 | 0.140
(512, 16, 13) | 6217 | 3270 | 52 | 809 | 671.31 | 0.140
(512, 16, 15) | 6215 | 3270 | 52 | 809 | 671.31 | 0.140
(512, 16, 20) | 6222 | 3271 | 52 | 809 | 671.31 | 0.140
(512, 32, 10) | 6221 | 3271 | 52 | 809 | 1342.63 | 0.140
(512, 32, 13) | 6221 | 3271 | 52 | 809 | 1342.63 | 0.140
(512, 32, 15) | 6219 | 3271 | 52 | 809 | 1342.63 | 0.140
(512, 32, 20) | 6226 | 3272 | 52 | 809 | 1342.63 | 0.140
(512, 64, 10) | 6219 | 3272 | 52 | 809 | 2685.25 | 0.140
(512, 64, 13) | 6221 | 3272 | 52 | 809 | 2685.25 | 0.140
(512, 64, 15) | 6222 | 3272 | 52 | 809 | 2685.25 | 0.140
(512, 64, 20) | 6228 | 3273 | 52 | 809 | 2685.25 | 0.140
(1024, 8, 10) | 7802 | 3716 | 58 | 1581 | 343.51 | 0.173
(1024, 8, 13) | 7805 | 3716 | 58 | 1581 | 343.51 | 0.173
(1024, 8, 15) | 7810 | 3716 | 58 | 1581 | 343.51 | 0.173
(1024, 8, 20) | 7813 | 3717 | 58 | 1581 | 343.51 | 0.173
(1024, 16, 10) | 7806 | 3717 | 58 | 1581 | 687.02 | 0.173
(1024, 16, 13) | 7807 | 3717 | 58 | 1581 | 687.02 | 0.173
(1024, 16, 15) | 7809 | 3717 | 58 | 1581 | 687.02 | 0.173
(1024, 16, 20) | 7809 | 3718 | 58 | 1581 | 687.02 | 0.173
(1024, 32, 10) | 7806 | 3718 | 58 | 1581 | 1374.05 | 0.173
(1024, 32, 13) | 7809 | 3718 | 58 | 1581 | 1374.05 | 0.173
(1024, 32, 15) | 7805 | 3718 | 58 | 1581 | 1374.05 | 0.173
(1024, 32, 20) | 7810 | 3719 | 58 | 1581 | 1374.05 | 0.173
(1024, 64, 10) | 7807 | 3719 | 58 | 1581 | 2748.09 | 0.173
(1024, 64, 13) | 7807 | 3719 | 58 | 1581 | 2748.09 | 0.173
(1024, 64, 15) | 7808 | 3719 | 58 | 1581 | 2748.09 | 0.173
(1024, 64, 20) | 7813 | 3720 | 58 | 1581 | 2748.09 | 0.173

## References

1. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc., 2020; Volume 33, pp. 12449–12460.
2. Zhang, Y.; Suda, N.; Lai, L.; Chandra, V. Hello Edge: Keyword Spotting on Microcontrollers. arXiv **2018**, arXiv:1711.07128.
3. Fariselli, M.; Rusci, M.; Cambonie, J.; Flamand, E. Integer-Only Approximated MFCC for Ultra-Low Power Audio NN Processing on Multi-Core MCUs. In Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021; pp. 1–4.
4. Paul S, B.S.; Glittas, A.X.; Gopalakrishnan, L. A low latency modular-level deeply integrated MFCC feature extraction architecture for speech recognition. Integration **2021**, 76, 69–75.
5. Bae, S.; Kim, H.; Lee, S.; Jung, Y. FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks. Sensors **2023**, 23, 5701.
6. Zhang, Y.; Qiu, X.; Li, Q.; Qiao, F.; Wei, Q.; Luo, L.; Yang, H. Optimization and Evaluation of Energy-Efficient Mixed-Signal MFCC Feature Extraction Architecture. In Proceedings of the 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Limassol, Cyprus, 6–8 July 2020; pp. 506–511.
7. Vreča, J.; Biasizzo, A. Towards Deploying Highly Quantized Neural Networks on FPGA Using Chisel. In Proceedings of the 26th Euromicro Conference on Digital System Design (DSD), Durres, Albania, 6–8 September 2023; IEEE: Piscataway, NJ, USA, 2023.
8. Abdul, Z.K.; Al-Talabani, A.K. Mel Frequency Cepstral Coefficient and its Applications: A Review. IEEE Access **2022**, 10, 122136–122158.
9. Allen, J. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. **1977**, 25, 235–238.
10. Bachrach, J.; Vo, H.; Richards, B.; Lee, Y.; Waterman, A.; Avižienis, R.; Wawrzynek, J.; Asanović, K. Chisel: Constructing Hardware in a Scala Embedded Language. In Proceedings of the 49th Annual Design Automation Conference, New York, NY, USA, 3–7 June 2012; pp. 1216–1225.
11. Milovanović, V.M.; Petrović, M.L. A Highly Parametrizable Chisel HCL Generator of Single-Path Delay Feedback FFT Processors. In Proceedings of the 2019 IEEE 31st International Conference on Microelectronics (MIEL), Niš, Serbia, 16–18 September 2019; pp. 247–250.
12. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv **2018**, arXiv:1804.03209.
13. Chollet, F. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
14. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv **2017**, arXiv:1704.04861.

**Table 1.** Comparison of our hardware results with [5].

Resource | MPU-256 [5] | ME/FFT-256
---|---|---
LUT | 5349 | 5116
FF | 3735 | 2874
DSP | 5 | 46


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Vreča, J.; Pilipović, R.; Biasizzo, A.
Hardware–Software Co-Design of an Audio Feature Extraction Pipeline for Machine Learning Applications. *Electronics* **2024**, *13*, 875.
https://doi.org/10.3390/electronics13050875
