Spectrogram Features for Audio and Speech Analysis
Featured Application
Abstract
1. Introduction
2. Taxonomy of Spectrograms
2.1. Basic Spectrogram
2.2. Spectrograms Are Not Pictures
2.2.1. Colour and Greyscale
2.2.2. Translation Invariance and Scaling
2.2.3. Local Features
2.3. Mel-Spectrogram
2.4. Constant-Q Spectrogram
2.5. Correlogram
2.6. Stabilised Auditory Image
2.7. Patches and Regions
2.8. Scaling and Number Representation
2.9. Pooling and Downsampling
2.10. Variance Normalised Features
3. Audio Analysis
3.1. Taxonomy of Audio Analysis
3.1.1. Feature Extraction
3.1.2. Overlap and Occluded Sounds
3.1.3. Early Sound Event Classification
3.2. Sound Event Detection
3.3. Anomalous Sound Detection
3.4. Bioacoustics
4. Speech Analysis
4.1. Language and Dialect Identification
4.2. Speaker Verification
4.3. Speech Emotion Recognition
5. Conclusions
Future Directions
- Noise robustness, particularly towards overlapping sounds and reverberation.
- Model complexity and real-time operation on edge devices.
- Robust separation of intertwined sounds, particularly for multi-channel recordings.
- Timeliness—early detection before a sound has completed.
- Generalisation to unseen sounds, e.g., few- and zero-shot classification, including from multimodal prompts [38].
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| AE | Autoencoder |
| AIM | Auditory Image Model |
| ASA | Acoustic Scene Analysis |
| ASD | Anomalous Sound Detection |
| AST | Audio Spectrogram Transformer |
| CNN | Convolutional Neural Network |
| CQT | Constant-Q Transform |
| DCASE | Detection and Classification of Acoustic Scenes and Events |
| DCT | Discrete Cosine Transform |
| DFT | Discrete Fourier Transform |
| DID | Dialect Identification |
| DWT | Discrete Wavelet Transform |
| ERB | Equivalent Rectangular Bandwidth |
| FB | Filterbanks |
| FFT | Fast Fourier Transform |
| GTG | Gammatonegram |
| LID | Language Identification |
| LMS | Log-Mel Spectrogram |
| LNS | Log Non-uniform Spectrum |
| LS | Linear Spectrogram |
| LSS | Log-Scaled Spectrogram |
| LSTM | Long Short-Term Memory |
| MFCC | Mel-frequency Cepstral Coefficients |
| MS | Mel Spectrogram |
| OE | Outlier Exposure |
| PAMIR | Passive-aggressive Model for Image Retrieval |
| PLP | Perceptual Linear Prediction |
| PSDS | Polyphonic Sound Detection Score |
| RNN | Recurrent Neural Network |
| RWCP | Real World Computing Partnership |
| SAI | Stabilised Auditory Image |
| SDC | Shifted Delta Cepstra |
| SED | Sound Event Detection |
| SER | Speech Emotion Recognition |
| SID | Speaker Identification |
| SNR | Signal to Noise Ratio |
| SSA | Sound Scene Analysis |
| SSD | Sound Scene Detection |
| STFT | Short Time Fourier Transform |
| SV | Speaker Verification |
| SVM | Support Vector Machine |
| VAE | Variational Autoencoder |
| VNF | Variance Normalised Features |
References
- Koenig, W.; Dunn, H.K.; Lacy, L.Y. The Sound Spectrograph. J. Acoust. Soc. Am. 1946, 18, 19–49.
- Zue, V.W.; Cole, R.A. Experiments on spectrogram reading. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’79), Washington, DC, USA, 2–4 April 1979; Volume 4, pp. 116–119.
- McLoughlin, I.V. Speech and Audio Processing: A MATLAB-Based Approach; Cambridge University Press: Cambridge, UK, 2016.
- Gibbs, J.W. Fourier Series. Nature 1899, 59, 200, 606.
- Ifeachor, E.C.; Jervis, B.W. Digital Signal Processing: A Practical Approach; Addison-Wesley: Boston, MA, USA, 1993.
- McLoughlin, I.; Zhang, H.M.; Xie, Z.P.; Song, Y.; Xiao, W. Robust Sound Event Classification using Deep Neural Networks. IEEE Trans. Audio Speech Lang. Process. 2015, 23, 540–552.
- Wang, Z.J.; Turko, R.; Shaikh, O.; Park, H.; Das, N.; Hohman, F.; Kahng, M.; Polo Chau, D.H. CNN Explainer: Learning Convolutional Neural Networks with Interactive Visualization. IEEE Trans. Vis. Comput. Graph. 2021, 27, 1396–1406.
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech, Shanghai, China, 25 October 2020; pp. 3830–3834.
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the ICML, Honolulu, HI, USA, 23 July 2023.
- Kim, J.Y.; Lee, S.H. Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer. IEEE Access 2024, 12, 128039–128048.
- Ellis, D.P. Gammatone-Like Spectrograms. 2009. Available online: https://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/ (accessed on 1 January 2020).
- FitzGerald, D. Vocal separation using nearest neighbours and median filtering. In Proceedings of the IET Irish Signals and Systems Conference (ISSC 2012), Maynooth, Ireland, 28 June 2012; p. 98G.
- Brown, J.C. Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 1991, 89, 425–434.
- Yizhi, L.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Xiao, C.; Lin, C.; Ragni, A.; Benetos, E.; et al. MERT: Acoustic music understanding model with large-scale self-supervised training. In Proceedings of the ICLR, Kigali, Rwanda, 1 May 2023.
- Huang, H.; Man, J.; Li, L.; Zeng, R. Musical timbre style transfer with diffusion model. PeerJ Comput. Sci. 2024, 10, e2194.
- Ma, N.; Green, P.; Barker, J.; Coy, A. Exploiting correlogram structure for robust speech recognition with multiple speech sources. Speech Commun. 2007, 49, 874–891.
- Patterson, R.D.; Allerhand, M.H.; Giguere, C. Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. J. Acoust. Soc. Am. 1995, 98, 1890–1894.
- Walters, T.C. Auditory-Based Processing of Communication Sounds. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2011.
- Lyon, R.F.; Rehn, M.; Bengio, S.; Walters, T.C.; Chechik, G. Sound retrieval and ranking using sparse auditory representations. Neural Comput. 2010, 22, 2390–2416.
- Lyon, R.F.; Rehn, M.; Walters, T.; Bengio, S.; Chechik, G. Audio Classification for Information Retrieval Using Sparse Features. U.S. Patent 8,463,719, 11 June 2013.
- Lyon, R.F.; Ponte, J.; Chechik, G. Sparse coding of auditory features for machine hearing in interference. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22 May 2011; pp. 5876–5879.
- Lyon, R.F. Machine hearing: Audio analysis by emulation of human hearing. In Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 16 October 2011; p. viii.
- Zhang, H.; McLoughlin, I.; Song, Y. Robust Sound Event Recognition using Convolutional Neural Networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19 April 2015; pp. 559–563.
- Nowak, E.; Jurie, F.; Triggs, B. Sampling Strategies for Bag-of-Features Image Classification. In Proceedings of the Computer Vision—ECCV 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 490–503.
- Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August 2021; pp. 571–575.
- Dennis, J.; Tran, H.D.; Chng, E.S. Image feature representation of the subband power distribution for robust sound event classification. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 367–377.
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15 April 2018; pp. 5329–5333.
- Xie, Z.; McLoughlin, I.; Zhang, H.; Song, Y.; Xiao, W. A new variance-based approach for discriminative feature extraction in machine hearing classification using spectrogram features. Digit. Signal Process. 2016, 54, 119–128.
- Miao, X.; McLoughlin, I.; Song, Y. Variance Normalised Features for Language and Dialect Discrimination. Circuits Syst. Signal Process. 2021, 40, 3621–3638.
- Jin, M.; Song, Y.; McLoughlin, I. LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification. In Proceedings of the Odyssey, Bilbao, Spain, 21–24 June 2016.
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
- Malina, W. On an Extended Fisher Criterion for Feature Selection. IEEE Trans. Pattern Anal. Mach. Intell. 1981, PAMI-3, 611–614.
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.
- Rafsanjani, M.A.H.; Mawalim, C.O.; Lestari, D.P.; Sakti, S.; Unoki, M. Unsupervised Anomalous Sound Detection Using Timbral and Human Voice Disorder-Related Acoustic Features. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Macao, China, 3 December 2024; pp. 1–6.
- Miao, X.; McLoughlin, I.; Yan, Y. A New Time-Frequency Attention Mechanism for TDNN and CNN-LSTM-TDNN, with Application to Language Identification. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019.
- Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; Widmer, G. Efficient Training of Audio Transformers with Patchout. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18 September 2022; pp. 2753–2757.
- Chen, K.; Du, X.; Zhu, B.; Ma, Z.; Berg-Kirkpatrick, T.; Dubnov, S. HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. In Proceedings of the ICASSP, Singapore, 22 May 2022.
- Cai, P.; Song, Y.; Gu, Q.; Jiang, N.; Song, H.; McLoughlin, I. Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27 October 2025; pp. 582–591.
- Chen, T.; Yang, Y.; Qiu, C.; Fan, X.; Guo, X.; Shangguan, L. Enabling Hands-Free Voice Assistant Activation on Earphones. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (MOBISYS), Tokyo, Japan, 3–7 June 2024; pp. 155–168.
- Bregman, A.S. Auditory Scene Analysis: The Perceptual Organization of Sound; MIT Press: Cambridge, MA, USA, 1994.
- Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound Event Detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83.
- Zeng, X.M.; Song, Y.; Zhuo, Z.; Zhou, Y.; Li, Y.H.; Xue, H.; Dai, L.R.; McLoughlin, I. Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4 June 2023; pp. 1–5.
- Miao, X.; McLoughlin, I. LSTM-TDNN with convolutional front-end for Dialect Identification in the 2019 Multi-Genre Broadcast Challenge. arXiv 2019, arXiv:1912.09003.
- Jiang, Y.; Song, Y.; McLoughlin, I.; Gao, Z.; Dai, L.R. An Effective Deep Embedding Learning Architecture for Speaker Verification. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 4040–4044.
- Sun, L.; Du, J.; Jiang, C.; Zhang, X.; He, S.; Yin, B.; Lee, C.H. Speaker Diarization with Enhancing Speech for the First DIHARD Challenge. In Proceedings of the Interspeech, Hyderabad, India, 2 September 2018; pp. 2793–2797.
- Gao, Z.; Song, Y.; McLoughlin, I.; Li, P.; Jiang, Y.; Dai, L.R. Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 361–365.
- Pham, L.; Phan, H.; Palaniappan, R.; Mertins, A.; McLoughlin, I. CNN-MoE Based Framework for Classification of Respiratory Anomalies and Lung Disease Detection. IEEE J. Biomed. Health Inform. 2021, 25, 2938–2947.
- Nguyen, T.; Pernkopf, F. Lung Sound Classification Using Co-Tuning and Stochastic Normalization. IEEE Trans. Biomed. Eng. 2022, 69, 2872–2882.
- Milling, M.; Pokorny, F.B.; Bartl-Pokorny, K.D.; Schuller, B.W. Is Speech the New Blood? Recent Progress in AI-Based Disease Detection From Audio in a Nutshell. Front. Digit. Health 2022, 4, 886615.
- Rashid, M.M.; Li, G.; Du, C. Nonspeech7k dataset: Classification and analysis of human non-speech sound. IET Signal Process. 2023, 17, e12233.
- Kim, S.Y.; Lee, H.M.; Lim, C.Y.; Kim, H.W. Detection of Abnormal Symptoms Using Acoustic-Spectrogram-Based Deep Learning. Appl. Sci. 2025, 15, 4679.
- Moysis, L.; Iliadis, L.A.; Sotiroudis, S.P.; Boursianis, A.D.; Papadopoulou, M.S.; Kokkinidis, K.I.D.; Volos, C.; Sarigiannidis, P.; Nikolaidis, S.; Goudos, S.K. Music Deep Learning: Deep Learning Methods for Music Signal Processing—A Review of the State-of-the-Art. IEEE Access 2023, 11, 17031–17052.
- Chen, R.; Ghobakhlou, A.; Narayanan, A. Hierarchical Residual Attention Network for Musical Instrument Recognition Using Scaled Multi-Spectrogram. Appl. Sci. 2024, 14, 10837.
- Buisson, M.; McFee, B.; Essid, S.; Crayencour, H.C. Self-Supervised Learning of Multi-Level Audio Representations for Music Segmentation. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2141–2152.
- Thapa, N.; Lee, J. Dual-Path Beat Tracking: Combining Temporal Convolutional Networks and Transformers in Parallel. Appl. Sci. 2024, 14, 11777.
- Verma, P.; Berger, J. Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions. arXiv 2021, arXiv:2105.00335.
- Grzeszick, R.; Plinge, A.; Fink, G.A. Bag-of-features methods for acoustic event detection and classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1242–1252.
- Dennis, J.; Tran, H.D.; Chng, E.S. Overlapping sound event recognition using local spectrogram features and the generalised hough transform. Pattern Recognit. Lett. 2013, 34, 1085–1093.
- Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks. IEEE J. Sel. Top. Signal Process. 2019, 13, 34–48.
- Xia, W.; Koishida, K. Sound Event Detection in Multichannel Audio using Convolutional Time-Frequency-Channel Squeeze and Excitation. arXiv 2019, arXiv:1908.01399.
- Alcázar, J.N.; Zuccarello, P.; Cobos, M. Classification of Sound Scenes and Events in Real-World Scenarios with Deep Learning Techniques. Ph.D. Thesis, Universitat de Valencia, Valencia, Spain, 2020. Available online: https://roderic.uv.es/rest/api/core/bitstreams/d50a3d5d-8481-4e33-8196-ff5af972771f/content (accessed on 1 January 2026).
- Wisdom, S.; Tzinis, E.; Erdogan, H.; Weiss, R.; Wilson, K.; Hershey, J. Unsupervised Sound Separation Using Mixture Invariant Training. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 3846–3857.
- Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inform. 2021, 61, 101236.
- Nath, K.; Sarma, K.K. Separation of overlapping audio signals: A review on current trends and evolving approaches. Signal Process. 2024, 221, 109487.
- Sudo, Y.; Itoyama, K.; Nishida, K.; Nakadai, K. Environmental sound segmentation utilizing Mask U-Net. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4 November 2019; pp. 5340–5345.
- Sudo, Y.; Itoyama, K.; Nishida, K.; Nakadai, K. Multi-channel Environmental sound segmentation. In Proceedings of the IEEE/SICE International Symposium on System Integration (SII), Honolulu, HI, USA, 12 January 2020; pp. 820–825.
- Baelde, M.; Biernacki, C.; Greff, R. A mixture model-based real-time audio sources classification method. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5 March 2017; pp. 2427–2431.
- Xu, K.; Feng, D.; Mi, H.; Zhu, B.; Wang, D.; Zhang, L.; Cai, H.; Liu, S. Mixup-Based Acoustic Scene Classification Using Multi-channel Convolutional Neural Network. In Proceedings of the Advances in Multimedia Information Processing—PCM 2018; Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 14–23.
- Phan, H.; Maass, M.; Mazur, R.; Mertins, A. Early event detection in audio streams. In Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), Torino, Italy, 29 June 2015; pp. 1–6.
- Zhao, X.; Zhang, X.; Zhao, C.; Cho, J.H.; Kaplan, L.; Jeong, D.H. Multi-Label Temporal Evidential Neural Networks for Early Event Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4 June 2023.
- Zhao, X.; Zhang, X.; Cheng, W.; Yu, W.; Chen, Y.; Chen, H.; Chen, F. Seed: Sound event early detection via evidential uncertainty. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22 May 2022; pp. 3618–3622.
- Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August 2016; pp. 1128–1132.
- Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M.; Plumbley, M.D. Detection and Classification of Acoustic Scenes and Events. IEEE Trans. Multimed. 2015, 17, 1733–1746.
- Çakir, E.; Virtanen, T. End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8 July 2018; pp. 1–7.
- Bittner, R.; McFee, B.; Salamon, J.; Li, P.; Bello, J. Deep Salience Representations for F0 Estimation in Polyphonic Music. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 23–27 October 2017.
- Leng, Y.R.; Tran, H.D.; Kitaoka, N.; Li, H. Selective gammatone filterbank feature for robust sound event recognition. In Proceedings of the Interspeech, Makuhari, Chiba, Japan, 26 September 2010; pp. 2246–2249.
- Pham, L.; Phan, H.; Nguyen, T.; Palaniappan, R.; Mertins, A.; McLoughlin, I. Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework. Digit. Signal Process. 2021, 110, 102943.
- Wang, Y.; Getreuer, P.; Hughes, T.; Lyon, R.F.; Saurous, R.A. Trainable frontend for robust and far-field keyword spotting. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5 March 2017; pp. 5670–5674.
- Yin, H.; Bai, J.; Xiao, Y.; Wang, H.; Zheng, S.; Chen, Y.; Das, R.K.; Deng, C.; Chen, J. Exploring Text-Queried Sound Event Detection with Audio Source Separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6 April 2025; pp. 1–5.
- Adavanne, S.; Virtanen, T. A Report on Sound Event Detection with Different Binaural Features; Technical Report, DCASE2017 Challenge; Munich, Germany, 2017; Available online: https://arxiv.org/abs/1710.02997 (accessed on 1 January 2025).
- Heittola, T.; Mesaros, A. DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System; Technical Report, DCASE2017 Challenge; Munich, Germany, 2017; Available online: https://inria.hal.science/hal-01627981/ (accessed on 1 January 2025).
- Lin, L.; Wang, X. Guided Learning Convolution System For Dcase 2019 Task 4; Technical Report; Institute of Computing Technology, Chinese Academy of Sciences: Beijing, China, 2019.
- Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE ICASSP 2017, New Orleans, LA, USA, 5–9 March 2017.
- Fonseca, E.; Pons, J.; Favory, X.; Font, F.; Bogdanov, D.; Ferraro, A.; Oramas, S.; Porter, A.; Serra, X. Freesound Datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 486–493.
- McLoughlin, I.; Xie, Z.; Song, Y.; Phan, H.; Palaniappan, R. Time–Frequency Feature Fusion for Noise Robust Audio Event Classification. Circuits Syst. Signal Process. 2020, 39, 1672–1687.
- Ebbers, J.; Haeb-Umbach, R. Pre-Training and Self-Training for Sound Event Detection in Domestic Environments; Technical Report, DCASE2022 Challenge; Nancy, France, 2022; Available online: https://ris.uni-paderborn.de/record/49113 (accessed on 1 January 2025).
- Turpault, N.; Serizel, R.; Parag Shah, A.; Salamon, J. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, New York, NY, USA, 25–26 October 2019.
- Kim, J.W.; Son, S.W.; Song, Y.; Kim, H.K.; Song, I.H.; Lim, J.E. Semi-Supervised Learning-Based Sound Event Detection Using Frequency Dynamic Convolution With Large Kernel Attention For DCASE Challenge 2023 Task 4; Technical Report, DCASE2023 Challenge; Tampere, Finland, 2023; Available online: https://arxiv.org/abs/2306.06461 (accessed on 1 January 2026).
- Schmid, F.; Primus, P.; Morocutti, T.; Greif, J.; Widmer, G. Improving Audio Spectrogram Transformers For Sound Event Detection Through Multi-Stage Training. arXiv 2024, arXiv:2408.00791.
- Martín-Morató, I.; Mesaros, A. Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 902–914.
- Tang, L.; Tian, H.; Huang, H.; Shi, S.; Ji, Q. A survey of mechanical fault diagnosis based on audio signal analysis. Measurement 2023, 220, 113294.
- Qurthobi, A.; Maskeliūnas, R.; Damaševičius, R. Detection of Mechanical Failures in Industrial Machines Using Overlapping Acoustic Anomalies: A Systematic Literature Review. Sensors 2022, 22, 3888.
- Zeng, X.M.; Song, Y.; McLoughlin, I.; Liu, L.; Dai, L.R. Robust Prototype Learning for Anomalous Sound Detection. In Proceedings of the Interspeech, Dublin, Ireland, 20 August 2023; pp. 261–265.
- Koizumi, Y.; Kawaguchi, Y.; Imoto, K.; Nakamura, T.; Nikaido, Y.; Tanabe, R.; Purohit, H.; Suefusa, K.; Endo, T.; Yasuda, M.; et al. Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring. In Proceedings of the DCASE, Tokyo, Japan (Virtual), 2–4 November 2020; pp. 81–85.
- Suefusa, K.; Nishida, T.; Purohit, H.; Tanabe, R.; Endo, T.; Kawaguchi, Y. Anomalous Sound Detection Based on Interpolation Deep Neural Network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4 May 2020; pp. 271–275.
- Zeng, X.M.; Song, Y.; Dai, L.R.; Liu, L. Predictive AutoEncoders Are Context-Aware Unsupervised Anomalous Sound Detectors. In Proceedings of the Man-Machine Speech Communication; Zhenhua, L., Jianqing, G., Kai, Y., Jia, J., Eds.; Springer Nature: Singapore, 2023; pp. 101–113.
- Hendrycks, D.; Mazeika, M.; Dietterich, T. Deep Anomaly Detection with Outlier Exposure. In Proceedings of the ICLR, New Orleans, LA, USA, 6 May 2019.
- Liu, Y.; Guan, J.; Zhu, Q.; Wang, W. Anomalous Sound Detection Using Spectral-Temporal Information Fusion. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22 May 2022.
- Han, B.; Lv, Z.; Jiang, A.; Huang, W.; Chen, Z.; Deng, Y.; Ding, J.; Lu, C.; Zhang, W.Q.; Fan, P.; et al. Exploring Large Scale Pre-Trained Models for Robust Machine Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14 April 2024; pp. 1326–1330.
- Jiang, A.; Han, B.; Lv, Z.; Deng, Y.; Zhang, W.Q.; Chen, X.; Qian, Y.; Liu, J.; Fan, P. AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection. In Proceedings of the Interspeech, Kos, Greece, 1 September 2024; pp. 107–111.
- Chakrabarty, D.; Elhilali, M. Abnormal sound event detection using temporal trajectories mixtures. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20 March 2016; pp. 216–220.
- Li, K.; Zaman, K.; Li, X.; Akagi, M.; Unoki, M. Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks. arXiv 2024, arXiv:2409.05319.
- Yin, J.; Gao, Y.; Zhang, W.; Wang, T.; Zhang, M. Diffusion Augmentation Sub-center Modeling for Unsupervised Anomalous Sound Detection with Partially Attribute-Unavailable Conditions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6 April 2025; pp. 1–5.
- Stowell, D. Computational bioacoustics with deep learning: A review and roadmap. PeerJ 2022, 10, e13152.
- Tosato, G.; Shehata, A.; Janssen, J.; Kamp, K.; Jati, P.; Stowell, D. Auto deep learning for bioacoustic signals. arXiv 2023, arXiv:2311.04945.
- Heinrich, R.; Sick, B.; Scholz, C. AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification. arXiv 2024, arXiv:2404.10420.
- Hershey, J.R.; Chen, Z.; Le Roux, J.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. arXiv 2015, arXiv:1508.04306.
- Dang, T.M.; Wang, T.S.; Lekhak, H.; Zhu, K.Q. EmotionalCanines: A Dataset for Analysis of Arousal and Valence in Dog Vocalization. In Proceedings of the ACM International Conference on Multimedia, Association for Computing Machinery, Dublin, Ireland, 27–31 October 2025; pp. 13281–13288.
- Suzuki, K.; Sakamoto, S.; Taniguchi, T.; Kameoka, H. Speak like a dog: Human to non-human creature voice conversion. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7 November 2022; pp. 1388–1393.
- Kang, M.; Lee, S.; Lee, C.; Cho, N. When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds. arXiv 2025, arXiv:2505.24336.
- Lostanlen, V.; Salamon, J.; Cartwright, M.; McFee, B.; Farnsworth, A.; Kelling, S.; Bello, J.P. Per-Channel Energy Normalization: Why and How. IEEE Signal Process. Lett. 2019, 26, 39–43.
- Hexeberg, S.; Leite, R.; Ewers, R.M.; Stowell, D. Semi-supervised classification of bird vocalizations using spatiotemporal features. Sci. Rep. 2023, 13, 12345.
- Bravo Sánchez, V.; Stowell, D.; Drossos, K.; Virtanen, T. Bioacoustic classification of avian calls from raw sound waveforms with an open-source deep learning architecture. Sci. Rep. 2021, 11, 15740.
- Zeghidour, N.; Luebs, F.; Synnaeve, G.; Collobert, R. LEAF: A learnable frontend for audio classification. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3 May 2021.
- Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with SincNet. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18 December 2018; pp. 1021–1028.
- Lee, J.; Park, J.; Nam, J. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In Proceedings of the ISMIR, Suzhou, China, 23 October 2017.
- Jung, J.w.; Kim, H.S.; Kim, M.J.; Yoon, S.H.; Lee, B.J.; Kim, H. RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 1268–1272.
- Allen, A.N.; Harvey, M.; Harrell, L.; Jansen, A.; Merkens, K.P.; Wall, C.C.; Cattiau, J.; Oleson, E.M. A CNN for humpback whale song detection in diverse long-term datasets. Front. Mar. Sci. 2021, 8, 653740.
- Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894.
- O’Shaughnessy, D. Spoken language identification: An overview of past and present research trends. Speech Commun. 2025, 167, 103167.
- Miao, X.; McLoughlin, I.; Wang, W.; Zhang, P. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural Netw. 2021, 139, 201–211.
- Liu, H.; Perera, L.P.G.; Khong, A.W.; Chng, E.S.; Styles, S.J.; Khudanpur, S. Efficient self-supervised learning representations for spoken language identification. IEEE J. Sel. Top. Signal Process. 2022, 16, 1296–1307.
- Dey, S.; Sahidullah, M.; Saha, G. Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 5040–5050.
- Kohler, M.A.; Kennedy, M. Language identification using shifted delta cepstra. In Proceedings of the 45th Midwest Symposium on Circuits and Systems (MWSCAS), Fukuoka, Japan, 7 August 2002; Volume 3, pp. III–69.
- Cai, W.; Cai, Z.; Zhang, X.; Wang, X.; Li, M. A novel learnable dictionary encoding layer for end-to-end language identification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15 April 2018; pp. 5189–5193.
- Alumäe, T.; Kukk, K.; Le, V.B.; Barras, C.; Messaoudi, A.; Ben Kheder, W. Exploring the impact of pretrained models and web-scraped data for the 2022 NIST language recognition evaluation. In Proceedings of the Interspeech, Dublin, Ireland, 20 August 2023; pp. 516–520.
- Jin, M.; Song, Y.; McLoughlin, I.; Dai, L.R. LID-Senones and Their Statistics for Language Identification. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 171–183.
- Kaiyr, A.; Kadyrov, S.; Bogdanchikov, A. Automatic Language Identification from Spectrogram Images. In Proceedings of the 2021 IEEE International Conference on Smart Information Systems and Technologies (SIST), Nur-Sultan, Kazakhstan, 28 April 2021; pp. 1–4.
- Tjandra, A.; Choudhury, D.G.; Zhang, F.; Singh, K.; Conneau, A.; Baevski, A.; Sela, A.; Saraf, Y.; Auli, M. Improved language identification through cross-lingual self-supervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22 May 2022; pp. 6877–6881.
- Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digit. Signal Process. 2000, 10, 19–41.
- Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798.
- Variani, E.; Lei, X.; McDermott, E.; Lopez-Moreno, I.; Gonzalez-Dominguez, J. Deep Neural Networks for Small-Footprint Text-Dependent Speaker Verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 25 May 2014; pp. 4052–4056.
- Liu, Z.L.; Song, Y.; Zeng, X.M.; Dai, L.R.; McLoughlin, I. DP-MAE: A dual-path masked autoencoder based self-supervised learning method for anomalous sound detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14 April 2024; pp. 1481–1485.
- Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Interspeech, Stockholm, Sweden, 20 August 2017; pp. 2616–2620.
- Campbell, W.M.; Campbell, J.P.; Reynolds, D.A.; Singer, E.; Torres-Carrasquillo, P.A. Support Vector Machines for Speaker and Language Recognition. Comput. Speech Lang. 2006, 20, 210–229.
- Schuller, B.; Steidl, S.; Batliner, A. The Interspeech 2009 Emotion Challenge. In Proceedings of the Interspeech 2009, Brighton, UK, 6 September 2009; pp. 312–315. Available online: https://www.isca-archive.org/interspeech_2009/schuller09_interspeech.html (accessed on 1 January 2025).
- Eyben, F.; Wöllmer, M.; Schuller, B. Opensmile: The munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Florence, Italy, 25 October 2010; pp. 1459–1462.
- Schuller, B.W.; Zhang, Z.; Weninger, F.; Rigoll, G. Using multiple databases for training in emotion recognition: To unite or to vote? In Proceedings of the Interspeech, Florence, Italy, 27 August 2011; pp. 1553–1556.
- Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 1656–1660.
- Satt, A.; Rozenberg, S.; Hoory, R. Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms. In Proceedings of the Interspeech, Stockholm, Sweden, 20 August 2017.
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20 March 2016; pp. 5200–5204.
- Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5 March 2017; pp. 2227–2231.
- Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August 2021; pp. 161–165.
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
- Diatlova, D.; Udalov, A.; Shutov, V.; Spirin, E. Adapting WavLM for Speech Emotion Recognition. arXiv 2024, arXiv:2405.04485.
- Chowdhury, S.Y.; Banik, B.; Hoque, M.T.; Banerjee, S. A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering. arXiv 2025, arXiv:2507.07046.
- Amjad, A.; Khuntia, S.; Chang, H.T.; Tai, L.C. Multi-Domain Emotion Recognition Enhancement: A Novel Domain Adaptation Technique for Speech-Emotion Recognition. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 528–541.
- Chen, Z.; Wang, J.; Hu, W.; Li, L.; Hong, Q. Unsupervised Speaker Verification Using Pre-Trained Model and Label Correction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4 June 2023; pp. 1–5.
- Ghaffari, H.; Devos, P. Robust Weakly Supervised Bird Species Detection via Peak Aggregation and PIE. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 1427–1439.



| Description | Dimensions | Element Scale | Frequency Span |
|---|---|---|---|
| Linear spectrogram (LS) | time, frequency (T, F) | scalar (0, 1) | (0, Nyquist) |
| Log-scaled spectrogram (LSS) | T, F | log dB | (0, Nyquist) |
| A-law/μ-law scaling | T, F | log | (0, Nyquist) |
| Mel-spectrogram (MS) | T, Mel-F | linear | (0, Mel(Nyquist)) |
| Log-Mel-spectrogram (LMS) | T, Mel-F | log dB | (0, Mel(Nyquist)) |
| Gammatonegram (GTG) | trapezoidal/squared-T, F | log | (ERB(0), ERB(Nyquist)) |
| Constant-Q transform (CQT) | trapezoidal/squared-T, F | log/linear | (f_min, Nyquist) |
| Stabilised auditory image (SAI) | non-linear F, lag | scaled | (0, max lag in ms by default) |
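The table above can be made concrete in a few lines of code. The minimal sketch below computes three of the variants (LS, LMS, CQT) with librosa from a synthetic tone; all parameter values (n_fft, hop, mel-band count) are illustrative rather than taken from any cited system.

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # 1 s, 440 Hz tone

# Linear spectrogram (LS): squared-magnitude STFT, linear frequency axis.
ls = np.abs(librosa.stft(y, n_fft=512, hop_length=256)) ** 2

# Log-Mel spectrogram (LMS): mel filterbank, then dB (log) scaling.
ms = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                    hop_length=256, n_mels=64)
lms = librosa.power_to_db(ms, ref=np.max)

# Constant-Q transform (CQT): geometrically spaced frequency bins.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256))

print(ls.shape, lms.shape, cqt.shape)  # (257, T), (64, T), (84, T)
```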
| Task | Details | Fixed Pooling | VNFs |
|---|---|---|---|
|---|---|---|---|
| SED | 50 class RWCP, 20 dB SNR | 94.8% accuracy | 96.3% accuracy |
| SED | 50 class RWCP, 0 dB SNR | 75.1% accuracy | 84.0% accuracy |
| LID | NIST LRE07 DNN x-vector 3s | 10.17 | 8.80 |
| LID | NIST LRE07 CLSTM 3s | 7.15 | 6.70 |
| DID | Arabic dialect challenge | 3.20 | 2.62 |
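The gains above come from re-weighting spectrogram bins by how discriminative they are. The numpy sketch below captures only the core idea of variance-based weighting; the published VNF formulation in the cited works differs in detail, and the helper names here are illustrative.

```python
import numpy as np

def variance_weights(spectrograms: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-bin weights from the variance over a stack of (N, F, T) spectrograms."""
    var = spectrograms.var(axis=0)     # (F, T) variance of each time-frequency bin
    return var / (var.max() + eps)     # normalise weights to [0, 1]

def apply_vnf(spec: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Re-weight one (F, T) spectrogram by the learned variance map."""
    return spec * weights

rng = np.random.default_rng(0)
train = rng.random((100, 64, 50))      # stand-in for a stack of 64 x 50 spectrograms
w = variance_weights(train)
print(apply_vnf(train[0], w).shape)    # (64, 50)
```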
| Stage | Options |
|---|---|
| Input | continuous, full recording, utterance, segment or frame |
| Feature Extraction | raw waveform/spectrum, named features, trained feature extractor or pre-trained feature extractor |
| Stack and Classify → Output | one-hot class per instance, posterior probabilities, vote over multiple instances, average/threshold over time or localisation in time |
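A toy end-to-end rendering of that pipeline follows: segment the waveform, extract a feature per segment, classify each instance, then vote over instances. Every function here is a hypothetical stand-in (the classifier in particular is a stub where a trained model would sit), so this shows only the control flow, not any cited method.

```python
import numpy as np

def segment(y: np.ndarray, frame_len: int, hop: int):
    """Split a waveform into overlapping frames (the Input stage options)."""
    return [y[i:i + frame_len] for i in range(0, len(y) - frame_len + 1, hop)]

def extract(frame: np.ndarray) -> np.ndarray:
    """Stand-in for any Section 2 front end (LS, LMS, CQT, ...)."""
    return np.abs(np.fft.rfft(frame))

def classify(feat: np.ndarray, n_classes: int = 3) -> int:
    """Stub classifier; a real system applies a trained model here."""
    return int(feat.argmax() % n_classes)

y = np.random.default_rng(1).standard_normal(16000)
votes = [classify(extract(f)) for f in segment(y, frame_len=400, hop=160)]
print(np.bincount(votes).argmax())     # vote over multiple instances
```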
| Year | Ref. | Task | Spectrogram Type | Resolution and Settings | Pooling |
|---|---|---|---|---|---|
| 2014 | [6] | 50 class RWCP | LS | 16 kHz | vote |
| 2017 | [80] | 6 class TUT events [81] | LMS + LS | 44.1 kHz | max |
| 2019 | [82] | 10 class from [83] + [84] | LMS | 16 kHz | median |
| 2020 | [85] | 50 class RWCP | LS, GTG, CQT | 16 kHz | mean |
| 2022 | [86] | 10 class DESED [87] | LMS | 16 kHz | mean |
| 2023 | [88] | 10 class DESED [87] | MS | 16 kHz | mean |
| 2024 | [89] | 11 class DCASE24 task 4 [90] | MS | AST + fPaSST, 16 kHz | ensemble-mean |
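The Pooling column aggregates frame-level class posteriors into a single clip-level decision. A short numpy sketch of the four strategies named above, on random stand-in posteriors:

```python
import numpy as np

rng = np.random.default_rng(2)
post = rng.random((100, 10))                 # (frames, classes) posteriors
post /= post.sum(axis=1, keepdims=True)      # rows sum to one

vote = np.bincount(post.argmax(axis=1), minlength=10).argmax()  # majority vote
mx = post.max(axis=0).argmax()               # max pooling
med = np.median(post, axis=0).argmax()       # median pooling
mean = post.mean(axis=0).argmax()            # mean pooling
print(vote, mx, med, mean)
```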
| Method | Spectrogram | Pixels | Sample Rate | Scale |
|---|---|---|---|---|
| Chakrabarty et al. [101] | LMS | 128 × T | 8 kHz | log |
| Zeng et al. [93] | LMS | 128 × T | 16 kHz | inverted log |
| Li et al. [102] | LNS | 128 × T | 16 kHz | log |
| Liu et al. [98] | LMS + Tgram | 128 × T | 16 kHz | log |
| Yin et al. [103] | LMS | 128 × T | 16 kHz | normalised |
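All five systems above feed 128-bin inputs to a model scored by how poorly it reconstructs or predicts the input. A hedged sketch of that front end and score follows; the identity "model" is a placeholder for a trained autoencoder, so the score here is zero by construction.

```python
import numpy as np
import librosa

def front_end(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """128-bin log-Mel spectrogram, as used by the methods in the table."""
    ms = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                        hop_length=512, n_mels=128)
    return librosa.power_to_db(ms, ref=np.max)          # (128, T)

def anomaly_score(lms: np.ndarray, model=lambda x: x) -> float:
    """Reconstruction error: normal sounds reconstruct well, anomalies poorly."""
    return float(np.mean((lms - model(lms)) ** 2))

y = np.random.default_rng(3).standard_normal(16000).astype(np.float32)
print(anomaly_score(front_end(y)))
```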
| Technique | Input Type | Task(s) | Taxa |
|---|---|---|---|
| SincNet [115] | Raw waveform | Species classification | Birds |
| SampleCNN [116] | Raw waveform | Music auto-tagging | Music |
| RawNet [117] | Raw waveform | Speaker verification | Humans |
| CNN/ResNet [104,113] | LMS | Species classification, SED | Birds, Frogs, Whales |
| PCEN-enhanced CNN [78,114,118] | PCEN-Mel | Low-SNR event detection | Birds, Whales |
| CNN on STFT [112] | LS | Call segmentation | Bats |
| CNN with spectrogram and stabilised auditory image input [6] | LS + SAI | Sound event classification | General sounds |
| LEAF [114] | Learned spectrogram from raw waveform | Species classification, detection in noise | Birds, Whales |
| Wavegram-Logmel-CNN [119] | Wavegram + LMS | General classification | Various |
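For the PCEN-enhanced rows, librosa provides a per-channel energy normalisation routine that can be applied directly to a mel spectrogram. A minimal sketch, using librosa's default PCEN parameters rather than those of the cited systems:

```python
import numpy as np
import librosa

sr, hop = 22050, 512
y = np.random.default_rng(4).standard_normal(sr).astype(np.float32)  # 1 s noise
mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=64)
pcen = librosa.pcen(mel, sr=sr, hop_length=hop)  # per-channel AGC + compression
print(pcen.shape)                                # (64, T)
```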
| Method | Spectrogram | Resolution | Task |
|---|---|---|---|
| Ma et al. [127] | PLP + bottleneck | 48 × 21 | 23 languages, EER 4.38% |
| Kaiyr et al. [128] | LS, CNN-RNN | 116 × 200, 5–10 s segments | 7 languages, acc. 94.3% |
| Liu et al. [122] | MFCC + delta + delta-delta | 39 + 39 + 39, 25 ms window, 10 ms hop | 14 languages, EER 3.82% |
| Miao et al. [121] | MFCC + D-MONA | 23 × 5 frames | 14 languages, EER 1.15% |
| Tjandra et al. [129] | LMS | 80 × 4, 25 ms window, 10 ms hop | 26 languages, acc. 90.3% |
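The 39 + 39 + 39 configuration in the Liu et al. row stacks MFCCs with their first- and second-order deltas at a 25 ms window and 10 ms hop. A librosa sketch of that layout; the cited system's exact extractor may differ:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.default_rng(5).standard_normal(3 * sr).astype(np.float32)  # 3 s
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39,
                            n_fft=int(0.025 * sr),       # 25 ms window
                            hop_length=int(0.010 * sr))  # 10 ms hop
d1 = librosa.feature.delta(mfcc)                 # first-order deltas
d2 = librosa.feature.delta(mfcc, order=2)        # second-order deltas
feats = np.vstack([mfcc, d1, d2])                # (117, T) = 39 + 39 + 39
print(feats.shape)
```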
| Method | Spectrogram | Resolution |
|---|---|---|
| Reynolds et al. [130] | MFCC + context | 13 × T |
| Dehak et al. [131] | i-vector from 60 d MFCC | 200 WCCN |
| Variani et al. [132] | trained from 40 d FB + context | 256 d-vector |
| Snyder et al. [27] | 60 d MFCC + delta + delta-delta | 150 x-vector |
| Desplanques et al. [8] | LMS | 80 × 80 |
| Liu et al. [133] | LMS | 128 × 304 |
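All of these systems reduce an utterance to a fixed-length embedding that is compared against an enrolment embedding, typically by cosine similarity. The sketch below uses the 80-bin LMS input of the Desplanques et al. row but a mean-pooling placeholder where the trained network would sit; only the cosine comparison is the standard recipe.

```python
import numpy as np
import librosa

def lms80(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """80-bin log-Mel front end, as in the Desplanques et al. row."""
    ms = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    return librosa.power_to_db(ms, ref=np.max)

def embed(lms: np.ndarray) -> np.ndarray:
    """Placeholder for a trained embedding network (x-vector, ECAPA, ...)."""
    return lms.mean(axis=1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(6)
enrol = embed(lms80(rng.standard_normal(16000).astype(np.float32)))
test = embed(lms80(rng.standard_normal(16000).astype(np.float32)))
print(cosine(enrol, test))       # threshold this score to accept or reject
```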
| Method | Feature Type | Representation |
|---|---|---|
| Schuller et al. [138] | MFCC + prosody + energy | 1582-d openSMILE |
| Satt et al. [140] | LMS | ∼40–64 Mel bands, 25 ms window, 10 ms hop |
| Mirsamadi et al. [142] | FB with frame attention | 40 dim FB, 25 ms window, 10 ms hop |
| Trigeorgis et al. [141] | LMS + channel attention | 40 dim FB with 40 ms frame, 5 ms hop |
| Pepino et al. [143] | wav2vec 2.0 SSL embedding | 768 dim contextual frames |
| Chen et al. [144] | WavLM SSL embedding | 1024 dim contextual frames |
| Chowdhury et al. [146] | LMS + 5 other features | 64 dim LMS 20–30 ms, and 126 dim other features |
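openSMILE-style systems such as the Schuller et al. row summarise frame-level features with statistical functionals into one fixed-length vector per utterance. The numpy stand-in below computes a few such functionals over an LMS; it is far smaller and simpler than the real 1582-dimension openSMILE set:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.default_rng(7).standard_normal(2 * sr).astype(np.float32)  # 2 s
lms = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64), ref=np.max)

functionals = np.concatenate([
    lms.mean(axis=1), lms.std(axis=1),             # per-band level and spread
    np.percentile(lms, [10, 90], axis=1).ravel(),  # per-band range behaviour
])
print(functionals.shape)           # (256,) utterance-level feature vector
```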