

AI in Audio Analysis: Spectrogram-Based Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 November 2025) | Viewed by 20,593

Special Issue Editor


Prof. Dr. Ian McLoughlin
Guest Editor
Information Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
Interests: audio and speech AI; sound event detection; sound scene analysis; language identification; dialect identification; speech enhancement

Special Issue Information

Dear Colleagues,

The general field of machine hearing [1] involves algorithms capable of interpreting and extracting meaning from auditory information, much as humans recognize sounds, voices, environments, and activities from what they hear. This research field encompasses sound event detection and classification, auditory scene analysis, forensic audio analysis, medical diagnosis from sound, and more. It includes speech-related tasks such as language and dialect identification, speaker identification, and emotion recognition. Automatic and machine learning approaches, collectively classified as AI, have achieved remarkable performance gains in recent years by representing one-dimensional single-channel audio as a two-dimensional spectrogram [2], whether linear, logarithmic, mel-scaled, constant-Q, stacked filterbanks, or encoded in other ways. Adding a dimension has allowed researchers to unlock the considerable power of image-processing AI techniques, now routinely exploited in systems such as the audio spectrogram transformer (AST) [3].
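As a concrete illustration of the one-dimensional-to-two-dimensional conversion described above, the following minimal NumPy sketch frames a waveform, applies a Hann window, and stacks FFT magnitudes into a log-magnitude spectrogram. The parameter values (512-sample frames, 128-sample hop, 16 kHz) are illustrative only, not prescribed by any of the cited systems:

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=128):
    """Short-time Fourier transform magnitude, in decibels.

    Frames the 1-D signal with a Hann window, takes the FFT of each
    frame, and stacks the magnitudes into a (freq x time) matrix.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # (time, freq)
    return 20.0 * np.log10(spectrum.T + 1e-10)       # (freq, time), in dB

# A 440 Hz tone sampled at 16 kHz becomes a 2-D, image-like array.
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (257, 122): n_fft // 2 + 1 frequency bins by n_frames
```

The resulting matrix can be fed directly to image-oriented models such as convolutional networks, which is the key enabler discussed in this Special Issue.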

In this Special Issue, we explore and extend the field of spectrogram-based recognition. High-quality original research papers are sought in areas including (but not limited to) the following:

  • Applications of spectrogram-based audio classification;
  • Audio spectrogram-based regression;
  • Spectrogram-like representations for deep learning;
  • Audio feature transformation of spectrograms;
  • Efficient spectrogram-based recognition;
  • Spectrograms in speech analysis, enhancement, and coding;
  • Anomaly detection from spectral representations;
  • Speech and medical applications of audio spectrogram analysis.

[1] R. F. Lyon, "Machine hearing: An emerging field," IEEE Signal Processing Magazine, vol. 27, no. 5, pp. 131–139, 2010.

[2] H. Zhang, I. McLoughlin and Y. Song, "Robust sound event recognition using convolutional neural networks," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 559-563, doi: 10.1109/ICASSP.2015.7178031.

[3] Y. Gong, Y.-A. Chung, and J. Glass, "AST: Audio Spectrogram Transformer," arXiv preprint arXiv:2104.01778, 2021.

Prof. Dr. Ian McLoughlin
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • sound event detection and classification
  • sound scene detection
  • acoustic scene analysis
  • machine hearing
  • spectrograms
  • spectral estimation
  • audio spectrogram transformer
  • acoustic feature maps
  • bioacoustics

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

Jump to: Review

20 pages, 23952 KB  
Article
Deepfake Speech Detection Using Perceptual Pathological Features Related to Timbral Attributes and Deep Learning
by Anuwat Chaiwongyen, Khalid Zaman, Kai Li, Suradej Duangpummet, Jessada Karnjana, Waree Kongprawechnon and Masashi Unoki
Appl. Sci. 2026, 16(4), 2077; https://doi.org/10.3390/app16042077 - 20 Feb 2026
Viewed by 522
Abstract
The detection of deepfake speech has become a significant research area due to rapid advancements in generative AI for speech synthesis. These technologies pose significant security risks in applications such as biometric authentication, voice-controlled systems, and automatic speaker verification (ASV) systems. Therefore, enhancing the detection capabilities of such applications is essential to mitigate potential threats. This study investigates perceptual speech-pathological features, which are commonly used to evaluate the unnaturalness of voice disorders in clinical settings, as potential indicators for detecting deepfake speech. Specifically, the timbral attributes of hardness, depth, brightness, roughness, sharpness, warmth, boominess, and reverberation are examined. The analysis reveals that these attributes provide meaningful distinctions between genuine and synthetic speech. Furthermore, the detection performance is enhanced by extending the dimensional representation of timbral attributes, enabling a more comprehensive characterization of the speech signal. This paper proposes a method that combines two models: one utilizing the different dimensions of speech-pathological features with a deep neural network (DNN), and another employing a gammatone filterbank model that simulates the auditory processing mechanism of the human cochlea with ResNet-18 architecture, improving deepfake speech detection. The proposed method is evaluated on the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) 2019 dataset. Experimental results demonstrate that the proposed approach outperforms baseline models in terms of Equal Error Rate (EER), achieving an EER of 5.93%.
(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)
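The abstract above reports results as Equal Error Rate (EER), the standard metric in spoofing detection. As background, the following sketch computes EER by sweeping a decision threshold until false acceptances of spoofed trials balance false rejections of genuine ones. The Gaussian scores are synthetic stand-ins, not outputs of the authors' system, and the convention that higher scores mean "more likely genuine" is an assumption:

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """EER: the operating point where the false-acceptance rate (spoofed
    trials scored above threshold) equals the false-rejection rate
    (genuine trials scored below threshold)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))     # closest crossing point
    return (far[idx] + frr[idx]) / 2.0

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # synthetic genuine-trial scores
spoof = rng.normal(-2.0, 1.0, 1000)    # synthetic spoofed-trial scores
eer = equal_error_rate(genuine, spoof)
print(f"EER = {eer:.3f}")
```

With these well-separated synthetic distributions the EER lands near 2%; a real detector's EER (such as the 5.93% reported above) is computed the same way from its trial scores.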

15 pages, 1493 KB  
Article
Benchmarking Automated and Semi-Automated Vocal Clustering Methods
by Kanghwi Lee, Maris Basha, Anja T. Zai and Richard H. R. Hahnloser
Appl. Sci. 2026, 16(2), 810; https://doi.org/10.3390/app16020810 - 13 Jan 2026
Viewed by 705
Abstract
Analyzing large datasets of animal vocalizations requires efficient bioacoustic methods for categorizing the vocalization types. This study evaluates the effectiveness of different vocalization clustering methods, comparing fully automated and semi-automated methods against the gold standard of manual expert annotations. Effective methods achieve good clustering performance whilst minimizing human effort. We release a new dataset of 1454 zebra finch vocalizations manually clustered by experts, on which we evaluate (i) fully automated clustering using off-the-shelf methods based on sound embeddings and (ii) a semi-automated workflow relying on refining the embedding-derived clusters. Clustering performance is assessed using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). Results indicate that while fully automated methods provide a useful baseline, they generally fall short of human-level consistency. In contrast, the semi-automated workflow achieved agreement scores comparable to inter-expert reliability, approaching the levels of expert manual clustering. This demonstrates that refining embedding-derived clusters reduces annotation time while maintaining gold standard accuracy. We conclude that semi-automated workflows offer an optimal strategy for bioacoustics, enabling the scalable analysis of large datasets without compromising the precision required for robust behavioral insights.
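The Adjusted Rand Index used in the evaluation above measures agreement between two partitions of the same items, corrected for chance. A minimal standard-library sketch (equivalent in intent to library implementations such as scikit-learn's `adjusted_rand_score`, though this one omits degenerate-case handling):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings of the same items.
    1.0 = identical partitions (up to relabelling); ~0.0 = chance-level."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))          # contingency table
    sum_ij = sum(comb(c, 2) for c in pairs.values())  # agreeing pairs
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)             # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Relabelled but structurally identical clusterings score a perfect 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because ARI is invariant to cluster labels, it is well suited to comparing automated cluster assignments against expert annotations, as done in this paper.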

23 pages, 7047 KB  
Article
UaVirBASE: A Public-Access Unmanned Aerial Vehicle Sound Source Localization Dataset
by Gabriel Jekateryńczuk, Rafał Szadkowski and Zbigniew Piotrowski
Appl. Sci. 2025, 15(10), 5378; https://doi.org/10.3390/app15105378 - 12 May 2025
Cited by 5 | Viewed by 4529
Abstract
This article presents UaVirBASE, a publicly available dataset for the sound source localization (SSL) of unmanned aerial vehicles (UAVs). The dataset contains synchronized multi-microphone recordings captured under controlled conditions, featuring variations in UAV distances, altitudes, azimuths, and orientations relative to a fixed microphone array. UAV orientations include front, back, left, and right-facing configurations. UaVirBASE addresses the growing need for standardized SSL datasets tailored for UAV applications, filling a gap left by existing databases that often lack such specific variations. Additionally, we describe the software and hardware employed for data acquisition and annotation alongside an analysis of the dataset’s structure. With its well-annotated and diverse data, UaVirBASE is ideally suited for applications in artificial intelligence, particularly in developing and benchmarking machine learning and deep learning models for SSL. Controlling the dataset’s variations enables the training of AI systems capable of adapting to complex UAV-based scenarios. We also demonstrate the architecture and results of the deep neural network (DNN) trained on this dataset, evaluating model performance across different features. Our results show an average Mean Absolute Error (MAE) of 0.5 m for distance and height, an average azimuth error of around 1 degree, and side errors under 10 degrees. UaVirBASE serves as a valuable resource to support reproducible research and foster innovation in UAV-based acoustic signal processing by addressing the need for a standardized and versatile UAV SSL dataset.
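One subtlety in the azimuth errors reported above is that angular quantities wrap around: a prediction of 359° against a ground truth of 1° is a 2° error, not 358°. A hedged sketch of a wrap-aware MAE (illustrative only, not the dataset's evaluation code):

```python
import numpy as np

def azimuth_mae(pred_deg, true_deg):
    """Mean absolute azimuth error in degrees, always taking the
    shorter way around the circle."""
    diff = np.abs(np.asarray(pred_deg) - np.asarray(true_deg)) % 360.0
    return np.minimum(diff, 360.0 - diff).mean()

# 359 deg vs 1 deg counts as 2 deg; the second prediction is exact.
print(azimuth_mae([359.0, 10.0], [1.0, 10.0]))  # 1.0
```

Distance and height MAE need no such wrapping and can use a plain absolute difference.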

13 pages, 8836 KB  
Article
Detection of Abnormal Symptoms Using Acoustic-Spectrogram-Based Deep Learning
by Seong-Yoon Kim, Hyun-Min Lee, Chae-Young Lim and Hyun-Woo Kim
Appl. Sci. 2025, 15(9), 4679; https://doi.org/10.3390/app15094679 - 23 Apr 2025
Cited by 4 | Viewed by 3152
Abstract
Acoustic data inherently contain a variety of information, including indicators of abnormal symptoms. In this study, we propose a method for detecting abnormal symptoms by converting acoustic data into spectrogram representations and applying a deep learning model. Spectrograms effectively capture the temporal and frequency characteristics of acoustic signals. In this work, we extract key features such as spectrograms, Mel-spectrograms, and MFCCs from raw acoustic data and use them as input for training a convolutional neural network. The proposed model is based on a custom ResNet architecture that incorporates Bottleneck Residual Blocks to improve training stability and computational efficiency. The experimental results show that the model trained with Mel-spectrogram data achieved the highest classification accuracy at 97.13%. The models trained with spectrogram and MFCC data achieved 95.22% and 93.78% accuracy, respectively. The superior performance of the Mel-spectrogram model is attributed to its ability to emphasize critical acoustic features through Mel-filter banks, which enhances learning performance. These findings demonstrate the effectiveness of spectrogram-based deep learning models in identifying latent patterns within acoustic data and detecting abnormal symptoms. Future research will focus on applying this approach to a wider range of acoustic domains and environments. The results of this study are expected to contribute to the development of disease surveillance systems by integrating acoustic data analysis with artificial intelligence techniques. Full article
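The Mel-filter banks credited above for the best results are triangular filters spaced evenly on the perceptual mel scale, so that low frequencies get finer resolution than high ones. A minimal NumPy sketch of the standard construction (parameter defaults are illustrative, not those used in the paper):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale. Multiplying a
    (n_mels x freq) filterbank by a (freq x time) power spectrogram
    yields a mel-spectrogram."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):              # rising slope
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):              # falling slope
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (40, 257)
```

Taking the log of the mel-spectrogram, then a discrete cosine transform across the mel axis, gives the MFCC features also compared in this paper.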

17 pages, 3902 KB  
Article
Dual-Path Beat Tracking: Combining Temporal Convolutional Networks and Transformers in Parallel
by Nikhil Thapa and Joonwhoan Lee
Appl. Sci. 2024, 14(24), 11777; https://doi.org/10.3390/app142411777 - 17 Dec 2024
Cited by 4 | Viewed by 4024
Abstract
The Transformer, a deep learning architecture, has shown exceptional adaptability across fields, including music information retrieval (MIR). Transformers excel at capturing global, long-range dependencies in sequences, which is valuable for tracking rhythmic patterns over time. Temporal Convolutional Networks (TCNs), with their dilated convolutions, are effective at processing local, temporal patterns with reduced complexity. Combining these complementary characteristics, global sequence modeling from Transformers and local temporal detail from TCNs enhances beat tracking while reducing the model’s overall complexity. To capture beat intervals of varying lengths and ensure optimal alignment of beat predictions, the model employs a Dynamic Bayesian Network (DBN), followed by Viterbi decoding for effective post-processing. This system is evaluated across diverse public datasets spanning various music genres and styles, achieving performance on par with current state-of-the-art methods yet with fewer trainable parameters. Additionally, we explore the interpretability of the model using Grad-CAM to visualize the model’s learned features, offering insights into how the TCN-Transformer hybrid captures rhythmic patterns in the data.
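The Viterbi decoding used for post-processing above finds the single most likely hidden-state path given per-frame likelihoods and transition probabilities. As background, a generic NumPy sketch over an ordinary HMM; the two-state toy example is illustrative and much simpler than a beat-tracking DBN's state space:

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state path. log_obs: (T x S) per-frame log-likelihoods,
    log_trans: (S x S) log transition matrix, log_init: (S,) log priors."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):           # trace backpointers
        path[t] = back[t + 1, path[t + 1]]
    return path

# Two sticky states; observations favour state 0 then state 1.
obs = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
init = np.log(np.array([0.5, 0.5]))
print(viterbi(obs, trans, init))  # [0 0 1 1]
```

In beat tracking, the hidden states encode position within the beat period and tempo, so the decoded path yields globally consistent beat times rather than frame-by-frame guesses.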

16 pages, 10466 KB  
Article
Hierarchical Residual Attention Network for Musical Instrument Recognition Using Scaled Multi-Spectrogram
by Rujia Chen, Akbar Ghobakhlou and Ajit Narayanan
Appl. Sci. 2024, 14(23), 10837; https://doi.org/10.3390/app142310837 - 22 Nov 2024
Cited by 5 | Viewed by 2388
Abstract
Musical instrument recognition is a relatively unexplored area of machine learning due to the need to analyze complex spatial–temporal audio features. Traditional methods using individual spectrograms, like STFT, Log-Mel, and MFCC, often miss the full range of features. Here, we propose a hierarchical residual attention network using a scaled combination of multiple spectrograms, including STFT, Log-Mel, MFCC, and CST features (Chroma, Spectral contrast, and Tonnetz), to create a comprehensive sound representation. This model enhances the focus on relevant spectrogram parts through attention mechanisms. Experimental results with the OpenMIC-2018 dataset show significant improvement in classification accuracy, especially with the “Magnified 1/4 Size” configuration. Future work will optimize CST feature scaling, explore advanced attention mechanisms, and apply the model to other audio tasks to assess its generalizability.

Review

Jump to: Research

29 pages, 808 KB  
Review
Spectrogram Features for Audio and Speech Analysis
by Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song and Donny Soh
Appl. Sci. 2026, 16(2), 572; https://doi.org/10.3390/app16020572 - 6 Jan 2026
Viewed by 2721
Abstract
Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis as well. Initially, the primary motivation behind spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time–frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a range of machine learning techniques such as convolutional neural networks, which had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.
