Review

Spectrogram Features for Audio and Speech Analysis

1 ICT Cluster, Singapore Institute of Technology, 1 Punggol Coast Road, Singapore 828608, Singapore
2 Austrian Institute of Technology, 1210 Vienna, Austria
3 National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP), The University of Science and Technology of China, Hefei 230026, China
4 Computer Science, Duke Kunshan University, Kunshan, Suzhou 215316, China
5 Meta Inc., Reality Labs, 75002 Paris, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 572; https://doi.org/10.3390/app16020572
Submission received: 27 November 2025 / Revised: 29 December 2025 / Accepted: 2 January 2026 / Published: 6 January 2026
(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)

Featured Application

Spectrogram-based input features have become the most popular choice for deep learning models that classify audio and speech, yet there are many settings related to resolution and representation type. This article surveys those choices and discusses their suitability for different application areas.

Abstract

Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivation behind spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time–frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a range of machine learning techniques such as convolutional neural networks, which had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.

1. Introduction

The spectrogram, considered to have been invented at Bell Labs in the 1940s [1], was initially generated by a sound spectrograph machine as a stylus-on-paper plot to visualise the distribution of sound energy in a time–frequency plane. Then, and now, it transforms a one-dimensional sound waveform into a two-dimensional image.
Originally popular for ease of visualisation, allowing for identification of important structures within a sound signal by eye, the spectrogram became useful in phonetics and various branches of acoustics. Well established by the late 1970s, its information-carrying abilities were highlighted by Victor Zue and Ron Cole, who demonstrated that it could be used for speech recognition [2].

2. Taxonomy of Spectrograms

At its heart, a spectrogram is a two-dimensional picture of sound, usually with one axis representing frequency and the other axis representing time. Individual pixel intensity represents, in some way, the strength of each frequency element at a particular time instant. In the earliest systems [1], the frequency axis and intensity were non-linear. Advances in sensors, high-quality analogue-to-digital converters, and improvements in signal processing led to the ability to form linear spectrograms [3]. More recently, non-linear representations have again become prevalent, tuned for different tasks, as we will explore.

2.1. Basic Spectrogram

Spectrograms are typically formed as a matrix of stacked frequency vectors, each of which represents the frequency magnitude over a short duration of time, referred to as an analysis frame. The frequency magnitude vectors are obtained from an orthogonal time–frequency transform such as a discrete Fourier transform (DFT), a fast Fourier transform (FFT), or filterbank (FB). Discrete cosine transform (DCT), modified DCT, discrete wavelet transform (DWT) and many other transforms have been employed [3]. Fixed-size analysis frames are typically slices of an input audio waveform of between 10 and 30 ms for audible signals. Hence, an auditory spectrogram is formed from a stack of frequency vectors obtained from successive frames, as illustrated in Figure 1. In almost all cases, to avoid issues with spectral leakage and edge effects (e.g., Gibbs phenomenon [4]), the slices of input audio are windowed [5] prior to the time–frequency transform, and because the window functions usually taper to zero or near-zero at the edges of each frame, the frames are overlapped to ensure that frequencies from all regions in the input audio waveform (i.e., including the tapered edge regions) will contribute to the frequency representation [3]. The overlap between frames is specified either as a percentage (often 50% overlap), or as a step between frames, e.g., 256 sample windows advanced by 128 steps between frames or a 30 ms window advancing 10 ms each step. The step can be referred to as a hop, advance, or shift. The maximum frequency resolution is limited by the number of samples in the analysis window, and the spectrogram time axis resolution is defined by the step size.
Let us denote an input audio waveform as x(n) and set the frame length to be $w_s$ samples for the current analysis frame f. With 50% overlap between frames, the analysis frame is $x_f(n) = x[f \cdot w_s/2 : f \cdot w_s/2 + w_s]$. Given a length-$w_s$ window function w(n), the spectral magnitude representation $X_f(k)$ is then,
$$X_f(k) = \left| \sum_{n=0}^{w_s - 1} w(n)\, x_f(n)\, e^{-j 2 \pi n k / w_s} \right| \qquad \text{for } k = 0 \ldots w_s - 1$$
Spectrogram S is obtained by stacking the frequency vectors directly into a rectangular matrix, i.e., for a time duration of F frames,
$$S_{F, w_s} = \left[ X_0 \;\; X_1 \;\; X_2 \;\; \cdots \;\; X_F \right]$$
When used as an input feature to a deep learning system, it is also common that frequency downsampling or pooling happens at this point [6] to reduce the frequency dimension. Pooling is discussed further in Section 2.9.
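To make the construction concrete, the following minimal NumPy sketch implements Equations (1) and (2) for 50% overlap and a Hann window; the window choice, frame length and test signal are illustrative assumptions rather than recommendations.

import numpy as np

def linear_spectrogram(x, ws=512, overlap=0.5):
    """Magnitude spectrogram from windowed, overlapping frames (cf. Equations (1) and (2))."""
    step = int(ws * (1 - overlap))                  # 50% overlap -> hop of ws/2 samples
    window = np.hanning(ws)                         # tapered window to limit spectral leakage
    frames = []
    for start in range(0, len(x) - ws + 1, step):
        frame = window * x[start:start + ws]
        frames.append(np.abs(np.fft.rfft(frame)))   # magnitude of the one-sided spectrum
    return np.stack(frames, axis=1)                 # (frequency bins, time frames)

# Example: 1 s of a 1 kHz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
S = linear_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(S.shape)                                      # (257, 61)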
Numerous alternative methods of forming spectrograms exist, with the main variants shown in Table 1 along with their dimension, element scaling, and frequency span. The top three are the linear spectrogram (LS) as described above, followed by variants in which each element of the matrix has been scaled using log, A- or μ -law. The next two variants use Mel and log-Mel scaling, discussed in Section 2.3, while the derivations of the bottom three are explored subsequently.

2.2. Spectrograms Are Not Pictures

While spectrograms allow audio and speech to be processed in a deep learning system using techniques that have originally been developed for image processing, caution should be observed for the following three aspects in which spectrograms and picture images differ significantly:

2.2.1. Colour and Greyscale

Basic linear spectrograms are greyscale with pixel values that are typically scaled to the range [0, 1], but are often colourised for ease of viewing. Colourisation maps scalar pixel values to RGB values [3]. In MATLAB® (The MathWorks Inc., Natick, MA, USA), which is often used for visualisation of spectrograms, each pixel is scaled from 0 to 1. Prior to MATLAB release R2014b, the ‘Jet’ colourmap was used by default to scale a spectrum through blue–green–yellow–orange–red across the range 0 to 1. More recent versions of MATLAB use the ‘parula’ colourmap, which scales blue–green–yellow. The popular audio-handling tool Audacity maps from −100 dB in black, through purple–magenta–light orange, to white for pixel values above −20 dB. Both can be modified to display in greyscale or using other colourmaps. Python-based tools (version 3.0 onward) also impose a colourmap, which may differ based on the library used. In matplotlib, the pcolormesh and pcolor functions both default to the ‘viridis’ colourmap, which scales blue–green–yellow across the range 0 to 1.
While colour scaling produces pretty plots, many researchers simply input a spectrogram, or a spectrogram patch, into a convolutional neural network (CNN) that has been designed for image processing, and thus assumes three input channels to handle RGB components separately. Since the mapping from spectral magnitude to RGB depends arbitrarily on the kind of spectrogram used and the version of the tool used to produce it, there is no logical justification for processing using colour spectrograms. Networks such as CNNs can learn a mapping from any scaling, but it may be at the cost of three times the front-end complexity compared to a greyscale spectrogram input.
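The difference in input size is easy to demonstrate: assuming matplotlib, the short sketch below shows that applying a colourmap expands a single-channel spectrogram into three (plus alpha) channels of derived data, whereas the greyscale array can be fed to a CNN directly as one channel. The array contents are placeholders.

import numpy as np
import matplotlib.pyplot as plt

S = np.random.rand(257, 61)                 # placeholder spectrogram scaled to [0, 1]
grey = S[np.newaxis, :, :]                  # (1, H, W): single-channel CNN input
rgba = plt.cm.viridis(S)                    # (H, W, 4): colourmapped RGBA, as a plotting tool produces
rgb = rgba[:, :, :3].transpose(2, 0, 1)     # (3, H, W): three channels carrying the same information
print(grey.shape, rgb.shape)                # (1, 257, 61) (3, 257, 61)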

2.2.2. Translation Invariance and Scaling

Structures or objects in pictures can very easily be translated to different locations in the image, while remaining the same object. So classification is usually invariant to the absolute location within the image. In a spectrogram, translation of structures along the time axis does not alter the fundamental nature of what is represented, but a significant shift of sound structures along the frequency axis can result in something that resembles a completely different sound. Unlike in a picture, relationships in the X axis and the Y axis have very different meanings in a spectrogram.
Furthermore, scaling an object in a picture does not change the nature of the object; it only changes its size, i.e., making it appear closer or further away. In a spectrogram, scaling a structure that represents a sound event yields a very different result. It adjusts both the time duration of the event and its frequency span, and has the potential to result in something that resembles a completely different sound. Importantly, audio deep learning systems need careful matching of scaling between training and inference.

2.2.3. Local Features

Advanced image processing techniques can exploit both local and global regional characteristics to interpret the content of an image. To do this, systems perform neighbourhood correlations, as well as global texture correlations across an image, and this is part of the motivation behind the use of CNNs [7]. While both local and global correlations are also important in audio tasks such as sound event detection, the nature of those correlations will be very different. For example similar ‘textures’ in frequency ranges of 0–50 Hz and 16–18 kHz of a spectrogram are unlikely to be significant to understanding the content, whereas in a picture, similar regions of the same texture might be patches of grass at the bottom left and top right of an image—which also relates to the translation invariance noted above.
Local correlations across the time axis of spectrograms may be more akin to the frame-to-frame difference in video frames than they are to physically proximate points in a picture.

2.3. Mel-Spectrogram

Mel-spectrograms were inspired by the Mel scale, which utilises human pitch-perception data to map frequencies in Hertz to a non-linear scale corresponding to human auditory perception. The mapping from linear frequency $f_{hz}$ (in Hertz) to Mel frequency $f_{mel}$ [3] is generally computed by:
$$f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f_{hz}}{700}\right)$$
A short-time Fourier transform (STFT) output vector (i.e., a vector of instantaneous power values for uniformly sampled frequency bins) is transformed into a Mel scale representation vector via a set of bandpass filters, which usually have a triangular shape. The bandpass filters are centred at Mel scale frequencies based on Equation (3). Each triangular filter accumulates a weighted sum of the power spectrum along the frequency dimension [3].
Just as a linear spectrogram is constructed from a stack of linear frequency vectors, a Mel-spectrogram is constructed from a stack of Mel-frequency vectors obtained from successive analysis frames.
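As a concrete illustration, the hedged sketch below computes the Mel mapping of Equation (3) and builds a log-Mel-spectrogram using librosa; the file path and the n_fft, hop_length and n_mels values are arbitrary example settings.

import numpy as np
import librosa

def hz_to_mel(f_hz):
    """Mel mapping of Equation (3)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

y, sr = librosa.load("speech_clip.wav", sr=16000)        # placeholder path to any mono recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel)                        # log scaling of each element
print(log_mel.shape)                                      # (64 Mel bands, time frames)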
Since the Mel-spectrogram is based on Mel filters developed from human auditory perception experiments, both individual Mel-frequency feature vectors and their stacking into a Mel-spectrogram have proven effective for various tasks related to human speech analysis. State-of-the-art systems proposed for Speaker Identification [8], Speech-to-Text [9], Speech Emotion Detection [10], and so on, have used Mel-based spectrograms for the pre-processing feature engineering.
Given that applying Mel filterbanks to the STFT spectrogram across the frequency dimension is effective for capturing distinct features in an audio signal (i.e., Mel filters are widely used in human speech analysis), several similarly inspired filterbank representations have also been proposed. These include the Gammatone filter [11], inspired by cochlea simulation, and the Nearest Neighbour filter [12], inspired by image pre-processing. Several are illustrated in Figure 2.

2.4. Constant-Q Spectrogram

The Constant-Q spectrogram is generated by applying a constant-Q transform (CQT), which was first introduced in [13] and is closely related to the Fourier Transform. Like the Fourier Transform, the CQT is formed from a bank of filters, but with the difference that the centre frequencies of each CQT element are spaced in a geometrical tonal space as
$$f_k = f_{min} \cdot 2^{\frac{k}{b}} \qquad \text{for } 1 \le k \le K$$
where $f_k$ denotes the centre frequency of the $k$-th filter, $f_{min}$ is the minimum frequency, and $b$ is the number of filters per octave. As the name suggests, the Q value, which is the ratio of centre frequency to bandwidth, is constant. It is computed as follows:
$$Q = \frac{f_k}{\Delta f_k} = \frac{f_k}{f_{k+1} - f_k} = \left(2^{\frac{1}{b}} - 1\right)^{-1}$$
In musical analysis, by setting $f_{min}$ and $b$ to directly correspond to musical notes (i.e., choosing $b = 12$ and $f_{min}$ as the frequency of, for example, MIDI note 0 or C1), the centre frequencies in the CQT will correspond to musical note frequencies, making it effective at capturing musical tones. As a result, the Constant-Q spectrogram has been widely used for musical analysis [14,15], but has also been applied to more general sound event detection. Since it is effectively a triangular representation, it is often transformed into a rectangular matrix prior to use.
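The constant-Q property of Equations (4) and (5) can be checked numerically; in the sketch below, f_min is set to C1 (approximately 32.7 Hz) and b = 12, purely as example values for a musical analysis setting.

import numpy as np

f_min = 32.70                                  # C1 in Hz (example choice)
b = 12                                         # 12 filters per octave -> semitone spacing
K = 48                                         # four octaves
k = np.arange(1, K + 1)
f_k = f_min * 2.0 ** (k / b)                   # geometrically spaced centre frequencies
Q = f_k[:-1] / np.diff(f_k)                    # centre frequency divided by bandwidth
print(np.allclose(Q, 1.0 / (2.0 ** (1.0 / b) - 1.0)))   # True: Q is constant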

2.5. Correlogram

The Correlogram utilises autocorrelation to capture the similarity between an audio signal and itself at a given time lag. Autocorrelation vectors are computed for a range of different time lags [16]. Given a long audio signal, it is first separated into short audio segments. A correlogram (or auto-correlation vector) is obtained for each audio segment from the auto-correlation coefficients of frequency components along the time axis. As a result, the long audio signal is represented by a matrix of auto-correlation variables, with each matrix column being an auto-correlation vector representing a given time lag. In structure, it presents similarities to the stabilised audio image (SAI) of Section 2.6.
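One plausible reading of this construction is sketched below: a magnitude spectrogram is divided into segments, and within each segment the autocorrelation of every frequency bin is computed along the time axis. This is a simplified, hedged interpretation rather than the exact formulation of [16].

import numpy as np

def correlogram(spec, seg_frames=64, max_lag=32):
    """Per-bin autocorrelation over time, computed per segment of a (bins, frames) spectrogram."""
    bins, frames = spec.shape
    segments = []
    for start in range(0, frames - seg_frames + 1, seg_frames):
        seg = spec[:, start:start + seg_frames]
        seg = seg - seg.mean(axis=1, keepdims=True)
        ac = np.zeros((bins, max_lag))
        for lag in range(max_lag):                         # correlate each bin with itself at each lag
            ac[:, lag] = np.sum(seg[:, :seg_frames - lag] * seg[:, lag:], axis=1)
        segments.append(ac / (ac[:, :1] + 1e-12))          # normalise by the zero-lag energy
    return np.stack(segments, axis=0)                      # (segments, bins, lags)

C = correlogram(np.abs(np.random.randn(64, 256)))
print(C.shape)                                             # (4, 64, 32)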

2.6. Stabilised Auditory Image

Patterson et al. [17] proposed the auditory image model (AIM) in 1995, aiming to simulate the frequency discrimination and amplitude sensitivity of neural activity patterns from hearing. Essentially, the AIM models the function of the basilar membrane, which is part of the organ of hearing within the human cochlea [3], when exposed to a pure tone. Walters [18] integrated this in time (strobed temporal integration, essentially a type of correlogram) to yield the stabilised auditory image (SAI), aiming to improve noise robustness and enhance the detection of periodicity compared to the AIM.
A single SAI is a two-dimensional representation similar to a classical linear spectrogram, but where the y-axis is frequency and the x-axis represents lag or periodicity. As such, it captures the characteristics of that sound in a fixed time window, e.g., 35 ms in Figure 2a.
SAIs were used as input features in several of Google’s early audio recall systems, as developed by Lyon et al. [19,20,21,22], specifically employing PAMIR (passive-aggressive model for image retrieval), a pre-deep learning ordering algorithm based on statistics obtained from regions within the SAI. Early attempts at using SAIs with deep learning architectures for sound event detection [23] were outperformed by linear spectrogram equivalents, probably because of the limited short-time window represented in single SAIs.

2.7. Patches and Regions

Object detection from images, where the characteristic shape of an item can be in any location within the image, as well as any size from small to large, often benefits from techniques where the image is divided into randomly scaled and located patches, each of which is processed independently by the deep learning model [24]. When the image is a spectrogram, such a process no longer has a physical justification (see Section 2.2 above). The exception is slicing a spectrogram in the time domain to yield shorter windows, maintaining the original frequency range. However, there seems to be limited benefit in allowing the time windows to have different durations, so in practice, fixed-sized regions are usually independently inferenced, as in audio spectrum transformer (AST) [25].

2.8. Scaling and Number Representation

Audio samples, such as in the WAVE file format, are typically represented as 16-bit signed linear fixed-point numbers with range [−32,768, 32,767]. During computation, they would generally be converted to 32-bit floating point and then divided by $2^{15}$ so they are scaled to a range of [−1, 1). An FFT of those samples will, by default, also retain the same 32-bit floating point number format. Although a true Fourier Transform yields a complex spectrum, spectrograms are usually formed from the magnitude spectrum; hence, each pixel value is always positive, and can be scaled to a range of [0, 1]. As mentioned above, log encoding is often used to provide a perceptually relevant emphasis to the samples. For lower complexity, μ- or A-law encoding is used to convert samples to 8-bit fixed-point scaled values in the range [0, 255]. This can help to substantially reduce downstream computational complexity, at the cost of higher quantisation noise.
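A hedged sketch of this scaling chain is shown below: 16-bit PCM samples are converted to floats in [−1, 1), then μ-law companded and quantised to 8-bit values in [0, 255] (μ = 255 is the conventional constant).

import numpy as np

def pcm16_to_float(samples):
    """Scale 16-bit signed PCM to 32-bit float in [-1, 1)."""
    return samples.astype(np.float32) / 2 ** 15

def mu_law_encode(x, mu=255):
    """Compress float samples in [-1, 1] to 8-bit values in [0, 255]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # companded to [-1, 1]
    return np.round((y + 1) / 2 * mu).astype(np.uint8)          # quantise to [0, 255]

pcm = np.array([-32768, -1024, 0, 1024, 32767], dtype=np.int16)
print(mu_law_encode(pcm16_to_float(pcm)))                        # [  0  77 128 178 255]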

2.9. Pooling and Downsampling

State-of-the-art deep learning architectures for performing sound event detection, audio analysis and related tasks, including language identification, speaker verification and speech emotion recognition, generally utilise front-end learned layers to compute a one-dimensional representation vector (e.g., an embedding) from a raw feature input. Thus, whatever input feature is ingested, an intermediate representation—a fixed dimension embedding—is produced. A time-stepped series of features yields a time-stepped series of embeddings for analysis. This stack of embeddings obtained over time can reveal statistics of how the underlying feature, and hence the underlying audio signal, varies over time (e.g., over an utterance, or a sound event). The two-dimensional block of embeddings from a well-trained front-end is usually amenable for classification. The general process is illustrated in Figure 3, with the feature extraction on the left, the pooling function in the centre and the back-end classification on the right.
As we have seen, spectrograms are computed from overlapping frames of speech that are windowed and then transformed to a magnitude spectrum, with magnitude spectra from successive frames stacked into a two-dimensional spectrogram image. This was discussed in Section 2.1, and shown in Equations (1) and (2). In early machine hearing systems that pre-dated deep learning approaches, meta-features were extracted from the two-dimensional spectrogram and those features were classified. For example, Dennis et al. [26] divided a spectrogram into nine equal-sized regions, and classified the zero- and first-order statistics from each of the nine regions using an SVM. Lyon et al. [22] classified the marginal statistics from rows and columns of an SAI.
The advent of deep learning allowed neural networks to become capable of classifying raw spectrograms directly, but not at full resolution. Thus, downsampling (of samples prior to forming the spectrogram) or pooling of the frequency representation vector (i.e., combining frequency bins) have been common approaches since the very first DNN spectrogram classifier [6].
In fact, a very similar process happens in many deep learning systems that classify non-spectrogram features too. Examples include MFCC, perceptual linear prediction (PLP) coefficients and filterbank coefficients that are used in tasks such as LID [27,28,29], discussed further in Section 4.1. In almost all cases, one-dimensional features are extracted from overlapping input audio frames and then stacked into a two-dimensional time–frequency block for classification. The frequency-domain features can be pooled at that point, or may have been already downsampled.
Pooling or downsampling along the frequency axis involves taking the mean of typically 2, 4 or 8 neighbouring spectral magnitudes to reduce dimensionality by the same factor. In some cases, particularly for MFCC features, in addition to averaging, either max-pooling or standard deviation is computed.
Delta-coefficients are derived in the time-domain to capture changes from one frame to the next. MFCC features are then concatenated with delta-MFCC, and even delta-delta-MFCC features to capture acceleration characteristics [30]. Shifted delta cepstral (SDC) coefficients are commonly used in speech analysis to expand the context window of a classifier. These are formed from a few sequential cepstral delta coefficients per block. For example [31] concatenated coefficients over 7 blocks, with a shift of 3 between them (called a 7-1-3-7 arrangement). The aim is to capture statistics in a way that mean-pooling in time would not. The same kind of delta computation, shift, and concatenation has also been used with other features like filterbanks and PLPs. The same functionality could be learned within a neural network, particularly a recurrent neural network for time-based changes, but at the cost of additional parameters and training time.
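The following hedged sketch illustrates simple delta computation and SDC stacking in the N-d-P-k convention (7-1-3-7, as above); it is a simplified illustration rather than the exact implementation of [31], and edge frames are handled naively.

import numpy as np

def deltas(c, spread=1):
    """Simple delta coefficients: difference across +/- spread frames."""
    padded = np.pad(c, ((spread, spread), (0, 0)), mode="edge")
    return padded[2 * spread:] - padded[:-2 * spread]

def sdc(cepstra, n=7, d=1, p=3, k=7):
    """Shifted delta cepstra: k delta blocks of n coefficients, shifted by p frames."""
    dc = deltas(cepstra[:, :n], spread=d)
    out = np.zeros((dc.shape[0], n * k))
    for i in range(k):                               # stack delta blocks from increasingly later frames
        out[:, i * n:(i + 1) * n] = np.roll(dc, -i * p, axis=0)
    return out                                       # final p*(k-1) frames wrap around (naive edges)

mfcc = np.random.randn(200, 13)                      # placeholder: 200 frames of 13 MFCCs
print(sdc(mfcc).shape)                               # (200, 49)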

2.10. Variance Normalised Features

Intuitively speaking, when attempting to classify features, the more their statistics differ between two classes, the more discriminative the feature is likely to be, compared to the within-class difference. This is essentially Fisher’s criterion [32] restated. Applying this viewpoint to the downsampling or feature pooling operations used in almost all neural network classifiers (noted above), three of the current authors sought a data-driven approach to maximise Fisher’s criterion—using between-class and within-class variance difference over a development dataset—to identify optimal spectral pooling rules.
Instead of mean pooling fixed blocks of spectral bins (e.g., 8) to reduce the frequency dimension (e.g., from 2048 to 256), the size of the pool is varied across the spectral range based on the variance difference between/within classes. The aim is to normalise the variance contribution of each downsampled feature point. Thus, the technique is called variance normalised features (VNF).
Both standard pooling and VNF begin with an identical high-resolution spectrum and aim to reduce the dimensionality before stacking into a spectrogram. For standard pooling, the low-resolution N′-point spectrum X′(k) is obtained from the high-resolution N-point spectrum X(k), where the downsampling factor is $D_s = N / N'$. As noted, it is usually accomplished via mean-pooling;
$$X'(k) = \frac{1}{D_s} \sum_{n = k D_s}^{(k+1) D_s - 1} X(n) \qquad \text{for } k = 0 \ldots N' - 1$$
Alternatively, max-pooling would be $X'(k) = \max\{X(k D_s), \ldots, X((k+1) D_s - 1)\}$ for $k = 0 \ldots N' - 1$.
To obtain VNFs, a pre-processing step is required. In that step, the spectrum X is computed over every analysis frame f from all examples in each of the C classes in the development dataset. The bin-wise spectral mean $\bar{S}_c$ and variance $\tilde{S}_c$ are obtained for each class, c, in that set,
$$\bar{S}_c(k) = \frac{1}{F_c} \sum_{f=0}^{F_c - 1} X_f(k)$$
$$\tilde{S}_c(k) = \frac{1}{F_c - 1} \sum_{f=0}^{F_c - 1} \left( X_f(k) - \bar{S}_c(k) \right)^2$$
for all N spectral bins, $0 \le k < N$, and for every class $c \in C$, where $F_c$ denotes the number of frames available for class c.
Given the variance and mean spectral characteristics of each class, the per-bin variance, normalised by the per-bin mean, is accumulated across all bins for each of the C classes. This is referred to as the total variance budget.
$$V_c = \sum_{k=0}^{N-1} \frac{\tilde{S}_c(k)}{\bar{S}_c(k)}$$
In standard downsampling, the variance contribution of each downsampled point differs depending upon the variance difference across the underlying region. VNF attempts to normalise it so that each downsampled point contributes approximately equal variance difference. This is done by changing from fixed-size pooling regions, with equally spaced partitions, to different sized pooling regions defined by data-driven partition rules.
Those partition rules are computed iteratively from the development set data to specify pooling regions with near-equal amounts of variance contribution. The sum of the variance contributions equals the total budget. One possible partition-setting heuristic is outlined in [29]. Once all partitions are defined, the pre-processing stage has completed.
During operation (i.e., model training or inference) using the VNF pooled features, pooling is applied by obtaining the mean spectral magnitude within each of the pooling partitions. The difference between VNF and standard features is that the former uses data-driven pooling regions computed as discussed, whereas the latter uses a fixed pooling size to compute all downsized elements.
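A hedged sketch of the overall idea is given below: per-bin contributions (here standing in for the development-set variance statistics) are accumulated, the spectrum is partitioned so that each region carries a near-equal share of the total budget, and pooling then takes the mean within each data-driven region. The partitioning here is a simple cumulative split and the variable names are illustrative; the actual heuristic is described in [29].

import numpy as np

def vnf_partitions(per_bin_contribution, n_regions):
    """Split N bins into n_regions whose cumulative contributions are near-equal."""
    csum = np.cumsum(per_bin_contribution)
    targets = csum[-1] * np.arange(1, n_regions) / n_regions
    cuts = np.searchsorted(csum, targets) + 1         # bin index at which each share of the budget is reached
    edges = np.concatenate(([0], cuts, [len(per_bin_contribution)]))
    return list(zip(edges[:-1], edges[1:]))

def vnf_pool(spectrum, partitions):
    """Mean-pool a high-resolution spectrum within the data-driven partitions."""
    return np.array([spectrum[a:b].mean() for a, b in partitions])

rng = np.random.default_rng(0)
contribution = rng.random(2048)                       # placeholder for development-set variance statistics
parts = vnf_partitions(contribution, 256)             # pre-processing: fix the pooling regions once
pooled = vnf_pool(rng.random(2048), parts)            # training/inference: pool every frame the same way
print(len(parts), pooled.shape)                       # 256 (256,)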
The performance of VNFs for three different tasks is shown in Table 2. Sound event detection (SED) on Real World Computing Partnership (RWCP) test data [28], language identification (LID) on NIST Language Recognition Evaluation 2007 challenge data, and dialect identification (DID) for spoken Arabic [29], are performed by models trained from standard fixed pooling inputs, as well as identical models trained with VNF pooled inputs. The aim is for a higher accuracy score or a lower Cavg score. For the tested tasks, VNF-based pooling tended to outperform mean or max pooling, but it cannot compensate for architectural deficiencies, i.e., it is more important to employ a good classifier architecture than to optimise the features. Having found a good classifier architecture, VNF has potential to improve results compared to fixed pooling.
Essentially, any system where spectral bins are mean or max-pooled before classification could potentially benefit from a data-driven approach such as VNF, as long as a representative development dataset exists from which a one-time pre-processing step can infer suitable partition rules.

3. Audio Analysis

Audio analysis refers to the detection and classification of sounds that lie within the range of human hearing (approx. 20 Hz to 20 kHz) [3]. It is related to the field of machine hearing [22], which involves endowing computers with the ability to detect and interpret sound in ways analogous to humans. Generally, we use the term ‘audio analysis’ to refer to non-speech sounds, since speech analysis involves additional techniques which will be considered separately in Section 4—although there is considerable overlap in the types of features used.
This section will first present a taxonomy of audio analysis, with emphasis on original published sources, before briefly describing three application areas of audio event detection, anomalous sound detection and the related area of bioacoustics. In each area, the original spectrogram-based publications are cited, as well as more recent notable advances, including current state-of-the-art approaches, particularly methods employing spectrograms.

3.1. Taxonomy of Audio Analysis

Audio-based classification systems tend to follow a sequential taxonomy as shown in Table 3, although much depends upon the task being performed. For example, in clip detection or acoustic scene analysis (ASA), a short recording may be analysed, whereas animal call detection in bioacoustics, where segmenting the input into separate animal calls may be difficult, could involve analysing an entire recording.

3.1.1. Feature Extraction

Input audio needs to be processed and transformed into features suitable for classification. These could be raw waveform segments [33], spectra (including spectrograms), statistical or timbral [34] features. As we have seen in Section 2, there are many variants of stacked spectra, which could be clipped, segmented, or pooled. Features such as MFCC or perceptual linear prediction (PLP) coefficients can also be stacked to form time–frequency features, as shown in Table 3 as ‘named features’. There may be a natural affinity of certain classification models to particular feature types, but a common alternative is to train a data-driven feature extractor. This is a front-end feature extraction network, such as a few CNN layers, that produces features suitable for classification by a back-end classifier [35]. The front-end and back-end networks can be trained separately, or end-to-end, if appropriate loss functions can be defined.
Authors also increasingly make use of feature extractors that have been effectively pre-trained by other authors to extract discriminative features for related tasks. Prominent examples include the AST [25] as mentioned in Section 2.7, PaSST [36] or HTS-AT [37]. These can be fine-tuned to be used in different tasks, or coupled with domain adaptor layers/blocks [38].
The stacked features are then classified by a back-end classifier, which typically outputs a one-hot class prediction per instance (e.g., in a detector system) or posterior probabilities in a multi-sound classifier. As shown in Table 3, many other possibilities exist for output processing. For example, where a clip of audio to be classified has been split into multiple classification instances, majority voting, or some kind of weighted averaging, provides a single per-class score over multiple classifications for the whole clip. In continuous audio, thresholding of posterior probabilities over a sliding window can yield an activation signal (e.g., for a wake word system [39]). Some tasks are not concerned with clip-level classifications, but require precise detection of the timing of events, in terms of start and end timestamps [38].
A very wide variety of tasks use this basic method of audio analysis based on time–frequency spectrogram features. These include sound scene detection (SSD) and auditory scene analysis (ASA) [40], clip recall and recognition [19], sound event detection (SED) [41], anomalous sound detection (ASD) [42], and acoustic classification of speech for purposes such as language identification (LID) [30], dialect identification (DID) [43], speaker identification (SID) [44], diarization [45], speaker verification (SV) [46]. There are also medical uses for spectrogram-based auditory analysis, which include lung auscultation (stethoscope signal) analysis [47,48], disease diagnosis from speech [49], breathing, and non-speech vocalisation. This involves data from humans [50] as well as animals [51]. Music classification or retrieval [52], analysis [53,54] and even beat tracking [55] utilise the same basic steps. While spectrogram-based methods either predominate or show excellent performance in most of these fields, alternative approaches exist. Most prominent are those based on direct time-domain waveform analysis [56] as well as bag-of-features approaches using statistical indicators [19,54,57].

3.1.2. Overlap and Occluded Sounds

In simple environments, targeted sound events may occur in isolation, so that at most one event occurs at any time. This is the basis of clip-level recognition systems, which assume that an audio clip contains one type of sound, such as the song of a single bird, the sound of a gunshot, or part of a music recording. However, in complex real-world environments, sound events often coincide with other sounds. These may be target sounds (i.e., known sound classes to be recognised) or non-target sounds (i.e., out-of-set sound classes and background acoustic noise). The former situation is sometimes referred to as ‘polyphonic’, meaning ‘many sounds’; however, the term polyphonic is already used in audio literature to refer to the existence of multiple audio channels—something that we are not considering here. Almost all sound analysis tasks assume a single channel of audio, but this could potentially have recorded many sounds. It is thus best described as having “overlapping or occluded sounds” [58,59] to avoid confusion with the terminology of multi-channel audio (if multiple audio channels are available for analysis, this can improve noise removal [3] as well as enhance localisation and classification performance [60]). Real-world sounds never occur in isolation, and always have at least some acoustic background noise, so in a sense, there are always ‘many sounds’ present in audio, especially when collected in the wild. Research has shown that the performance of audio classification systems in even very low levels of noise (i.e., real-world scenarios, even in a quiet environment) can be very different from the performance with sounds recorded under anechoic conditions of almost zero background noise [6,23]. Hence, real-world deployments of sound classifiers require careful attention to several different techniques that may not always be found in challenge competitions [61].
The temporal occlusion and overlap could be partial or in full. Co-occurring sound events have their frequency-temporal content mixed together, i.e., unlike image occlusion, which implies masking of one object by another, co-incidental audio events are recorded as the linear complex sum of the two events. A visual inspection of sound mixture spectrograms can sometimes reveal indications of coinciding sounds that are unlike each other (e.g., a long, slow, low-frequency background during which several short, high-frequency squeaks or wideband snapping sounds appear). However, sounds with similar characteristics can be difficult to discern visually as separate instances, and hence machine learning classifiers or detectors likewise have difficulty in being able to detect similar coinciding sounds [58,61].
In general, classification/detection of occluded or overlapping sound events is more challenging than that of isolated ones. In the research literature, three alternative approaches can be taken:
  • Recognise from mixed sounds by implicitly learning from non-occluded sounds as well as all kinds of overlapping sounds. It is also necessary to reframe the classification problem from multi-class to multi-label or 1-vs-rest [6,62,63].
  • Separate first using a source separation framework and then classify/detect on the separate channels [64,65,66].
  • Introduce mixed classes where potential mixtures are tagged as new combined classes [38,67].
Separation-based methods are a natural choice for multi-channel audio recordings, which can leverage spatial localisation to improve separation. However, these methods have shown limited success for single-channel audio. Introducing mixed classes can be useful for commonly mixed sounds, but where there are different degrees and sequences of overlap (e.g., sound 1 occurs first and sound 2 occurs during sound 1, or sound 1 occurs after sound 2 starts, plus more variations), the complexity of possibilities can negate any benefits of learning new combined classes. Hence, most current approaches are trained in the presence of random mixtures in order to improve robustness and generality. Interestingly, this may be similar to the robustness benefits gained from the widely used mixup technique for training classifiers [68].
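To illustrate the multi-label reframing and training on random mixtures, the hedged PyTorch sketch below trains a classifier with independent sigmoid outputs on examples mixed according to the mixup recipe [68]. The network, class count and batch contents are arbitrary placeholders.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 128, 256), nn.ReLU(),
                      nn.Linear(256, 10))             # 10 sound classes; sigmoids applied inside the loss
criterion = nn.BCEWithLogitsLoss()                    # multi-label: each class scored independently

def mixup(x, y, alpha=0.2):
    """Mix two random examples and their multi-hot labels (mixup [68])."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

x = torch.randn(8, 64, 128)                           # batch of log-Mel patches (batch, bands, frames)
y = (torch.rand(8, 10) > 0.8).float()                 # several classes may be active simultaneously
x_mix, y_mix = mixup(x, y)
loss = criterion(model(x_mix), y_mix)                 # one training step
loss.backward()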

3.1.3. Early Sound Event Classification

Sound events have a time duration that can range from around a hundred milliseconds (e.g., transient events like door knocking or hand clapping) to dozens of seconds (like a car passing by or a baby crying), or may even be continual (e.g., power line hum). Sound classification systems, initially driven by datasets of individually labelled sound clips, were first trained to classify at the clip level [19]. Yet many deployment scenarios involve monitoring of live feeds. Hence, the sound event detection task (the identification of what sound is present as well as when the sound is present—see Section 3.2 below) acknowledges that reality. The nature of the task also means that a definitive SED classification may only be possible after the sound has ended.
However, it is easy to see that sometimes classification needs to occur before a sound event has ended. This is obviously the case for continuous sounds, but also for applications such as raising an alarm, or turning off music when someone starts to speak. Some sounds can be classified at the frame level, but where the characteristic frequency patterns used to identify a sound have a time span extending beyond a frame (e.g., beyond 10 to 30 ms), a different paradigm is necessary.
This was the motivation behind investigations into the timeliness perspective of sound event detection and classification systems; specifically, the question of how early a system can reliably detect an ongoing sound event from a partial observation of its initial section. So-called “early detection” systems [69,70] require a classifier to fulfil a monotonicity property on continual input. Early sound event detection systems that scan a sliding window of spectrogram features to dynamically classify segments of input [71] are particularly useful in surveillance and safety-related applications, which require a low-latency response from a continuous feed.

3.2. Sound Event Detection

Sound Event Detection (SED) aims to identify and temporally localise sound events in audio recordings. It outputs either onset–offset pairs or frame-level activity probabilities for each class [72], and is essential for applications like environmental monitoring, surveillance and multimedia analysis [38].
Early SED systems used handcrafted features (e.g., MFCC), while recent approaches rely on deep learning with spectrogram inputs. Spectrograms, especially log-Mel representations, are widely adopted due to their alignment with human auditory perception and their ability to capture the time–frequency evolution of sound events [6,73]. CNNs and CRNNs [35,74] model local correlations effectively, whereas transformer-based models such as PaSST [36] and HTS-AT [37] better capture long-range dependencies.
Different spectrogram variants have been explored for task-specific benefits. Mel-spectrograms are compact and perceptually motivated; CQT spectrograms suit tonal event detection [75]; gammatonegrams offer robustness in low SNR [76,77]; and PCEN [78] improves invariance to background noise through dynamic compression. Overlapping sound events remain a major challenge. As we have noted in Section 2.2, spectrograms are not translation-invariant, and co-occurring sounds may occupy similar frequency regions. Thus, multi-label classification strategies, attention mechanisms, and source separation methods are often employed [79].
In summary, spectrograms are central to recent SED research due to their compatibility with modern deep learning models and their descriptive time–frequency structure, but challenges persist, including overlap and domain generalisation.
Table 4 samples approximately a decade of advances in the SED field since the first published use of spectrograms with deep learning [6]. Numerous spectrogram variants have been applied to this field, with many recent systems favouring Mel spectrograms or log-Mel spectrograms. The common evaluation tasks have largely been driven by DCASE (Detection and Classification of Acoustic Scenes and Events) challenges and workshops (https://dcase.community/, accessed on 1 January 2025), which has also resulted in a tendency towards datasets with a relatively small number of classes. Finally, we note that embeddings from pre-trained transformers operating on spectrogram inputs, as mentioned above, can be utilised to improve performance, while effectively reducing the overall training resource required through model re-use and adaptation.

3.3. Anomalous Sound Detection

Anomalous Sound Detection (ASD) is the task of determining whether a given audio signal contains abnormal or anomalous sounds that deviate from patterns typically observed under normal conditions [42,51,91]. This task is of significant importance in scenarios such as industrial monitoring and security surveillance. For instance, in factory environments, the early detection of machine failures, system anomalies, or unexpected environmental events can effectively prevent accidents, reduce downtime, and enhance overall safety [92].
Unlike conventional audio classification tasks, ASD is typically conducted in an unsupervised manner, as anomalous events are rare and diverse, making them difficult to define and annotate in advance during training. As a result, most methods rely on modelling the distribution of normal sounds and identifying anomalies by evaluating reconstruction errors, likelihood scores, or deviations in feature embedding space [93]. For example, in DCASE 2025 Challenge Task 2 (and similar to 2020–2024 task 2), only normal audio data is provided for training. Models are required to learn the feature distribution of normal audio samples and perform classification during testing by comparing the characteristics of normal and abnormal audio samples.
Current ASD systems can be broadly categorised into two main approaches: generative and discriminative. Generative methods associated with self-supervised learning tend to employ autoencoder-based models, such as AE [94], VAE [95], and PAE [96], to learn the feature distribution of normal audio. Anomalies are identified by computing the reconstruction error between generated and original samples. The underlying assumption is that normal samples result in low reconstruction error, while anomalies yield significantly higher error. However, the strong generalisation ability of generative models, even in mismatched domains, means that they can be capable of reconstructing some anomalous samples, leading to false negatives [93].
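A minimal sketch of the reconstruction-error approach is shown below, assuming PyTorch and per-frame log-Mel inputs; the architecture, training loop and scoring are illustrative and do not correspond to any specific published system.

import torch
import torch.nn as nn

n_mels = 64
auto_encoder = nn.Sequential(                         # small per-frame autoencoder
    nn.Linear(n_mels, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),                      # bottleneck learned from normal sounds only
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, n_mels),
)
optimiser = torch.optim.Adam(auto_encoder.parameters(), lr=1e-3)
mse = nn.MSELoss()

normal_frames = torch.randn(1000, n_mels)             # placeholder: log-Mel frames of normal machine sound
for _ in range(10):                                    # train to reconstruct normal data
    optimiser.zero_grad()
    loss = mse(auto_encoder(normal_frames), normal_frames)
    loss.backward()
    optimiser.step()

def anomaly_score(clip_frames):
    """Mean reconstruction error over a clip: high error suggests an anomalous sound."""
    with torch.no_grad():
        recon = auto_encoder(clip_frames)
    return torch.mean((recon - clip_frames) ** 2).item()

print(anomaly_score(torch.randn(100, n_mels)))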
Recent discriminative learning methods use an Outlier Exposure (OE) strategy [97]. Additional meta-information, obtained during the data collection process (e.g., machine ID and attributes, such as operating condition), is utilised to train a classifier on the acoustic features of normal samples. Normal samples from different categories are treated as pseudo-anomalies relative to the target category. A compact normal feature space is then constructed using both this meta-information and a feature extractor, such as ResNet [42], MobileFaceNet [98], or Transformer-based models [99,100]. During the inference phase, the feature distance between a test sample and the normal training samples is regarded as a proxy indicator of the degree of anomaly.
Table 5 summarises representative spectrogram-based approaches used in ASD, where LNS denotes log non-uniform spectrum. Chakrabarty et al. [101] were the first to apply spectrograms for anomalous detection, utilising log-mel spectrograms (LMS) with 10-frame concatenation as input to Restricted Boltzmann Machines. Zeng et al. [93] employed log-mel spectrograms with transposed filters, where filters are sparse in the low-frequency region and dense in the high-frequency region. For machine sounds, high-frequency components often contain richer and more discriminative information, while the low-frequency part tends to be more noise-prone. Li et al. [102] introduced the use of F-ratios to analyse how information distributes across the spectrum, and designed machine-specific non-uniform filterbanks. Recently, more ASD studies have begun combining spectrograms with time-domain information as input features. For example, Liu et al. [98] proposed the STgram-MFN method, which concatenates temporal features with a log-Mel spectrogram for classification. The time-domain features (called a ‘Tgram’) are derived from a trained CNN network, with an ArcFace-derived loss. Interestingly, this revealed that Tgram features were, in part, complementary to the log-Mel spectrogram information. Taking a different approach, Yin et al. [103] applied a diffusion model to synthesise log-mel spectrograms for data augmentation, achieving state-of-the-art results with an official score of 67.12% in DCASE 2024 Challenge Task 2.
Despite the promising performance of spectrogram-based methods in ASD, several challenges remain. First, spectrograms are sensitive to noise and machine type, requiring tuning. This includes different parameter settings for different machine categories, which limits model generalisation. Secondly, under domain shift scenarios, reconstruction-based spectrogram methods may fail to detect anomalies due to misleadingly low reconstruction errors. Finally, in the absence of machine metadata, discriminative models struggle to construct a compact normal sound space, resulting in significant performance degradation. There is thus significant potential for ongoing research in this area.

3.4. Bioacoustics

Bioacoustic analysis is primarily applied to three core tasks: species classification, call segmentation, and sound event detection. The purpose is to automate the analysis of animal vocalisations recorded in natural environments. Species classification involves identifying a species from audio recordings [104,105,106]. Call segmentation aims to isolate or count individual vocalisations (e.g., bird syllables, frog calls, whale units) within recorded audio streams [104,107].
We also note recent research on animal vocalisations for purposes such as health monitoring [51], emotion recognition [108] and potentially communications. This includes human-to-animal voice conversion techniques such as “Speak like a dog” [109] or wider species-to-species conversion using feature fusion that includes spectrograms [110].
Animal call segmentation identifies when and which biological sounds occur and is a subset of sound event detection (SED). It is characterised by long, noisy recordings, often with multiple overlapping species [104]. Each task has distinct challenges. For classification, models must differentiate highly similar calls across species, sometimes with very few labelled examples [105]. Segmentation is complicated by overlapping sounds, variable call durations, and background noise [104,107] (sometimes with the background noise inextricably correlated to the species). Detectors and classifiers must deal with complex real-world soundscapes, requiring highly robust models to generalise across time and different environments. Despite these difficulties, recent spectrogram-based approaches have shown strong performance across all three main tasks [104,106].
Table 6 summarises recent bioacoustic deep learning tasks, highlighting a strong preference for spectrograms that are either linear (LS) or log-Mel spectrograms (LMS). The PCEN (per-channel energy normalisation) enhanced Mel spectrograms address background noise by essentially performing scaling and auto-gain control on each frequency bin [111]. However, the baseline linear spectrogram remains relevant in segmentation tasks—particularly in high-frequency domains such as bat echolocation—due to its simplicity and fine temporal resolution [112]. It is also true that the Mel scale, based on human hearing, would be inappropriate for spectrograms encompassing the ultrasonic region. At audible frequencies, log-Mel spectrograms remain widely adopted for species classification, given their well-demonstrated predictive performance when paired with CNNs. For example, on a bird classification task, a ResNet50 trained on Mel spectrograms achieved an accuracy of 0.77 and ROC AUC of 0.80, outperforming a raw waveform CNN baseline, which yielded 0.71 and 0.76, respectively [113]. PCEN-enhanced spectrograms have gained traction for their robustness in noisy settings, enabling better detection of faint or distant calls across taxa [104,114]. Although hybrid models such as LEAF attempt to combine the strengths of raw waveforms and spectral representations, real-world applications still show a clear advantage for spectrogram-based features, which have been shown to be robust, interpretable, and scalable across taxa and datasets.
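PCEN as described above can be sketched directly from its standard formulation [78,111]: a per-band smoothed energy acts as an automatic gain reference, followed by root compression. The constants below are common defaults rather than tuned values, and librosa offers an equivalent librosa.pcen function.

import numpy as np

def pcen(mel_spec, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalisation of a (bands, frames) Mel energy spectrogram."""
    M = np.zeros_like(mel_spec)
    M[:, 0] = mel_spec[:, 0]
    for t in range(1, mel_spec.shape[1]):              # smoothed energy per band (auto-gain reference)
        M[:, t] = (1 - s) * M[:, t - 1] + s * mel_spec[:, t]
    gain = mel_spec / (eps + M) ** alpha                # per-band automatic gain control
    return (gain + delta) ** r - delta ** r             # dynamic range compression

E = np.abs(np.random.randn(64, 200)) ** 2               # placeholder Mel energy spectrogram
print(pcen(E).shape)                                     # (64, 200)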
Spectrogram-based methods are not without limitations, however. In species classification, they often struggle with fine-grained distinctions between species that produce acoustic calls in overlapping frequency bands [104,112]. Variability in vocal structure, such as regional ‘dialects’ and age-related changes, can also reduce accuracy. One structural limitation may be the use of fixed-size spectrogram windows that can truncate short calls, or blend closely spaced vocalisations in recordings with many overlapping signals [104,107]. This is also problematic for insect chirps and brief bat calls. Although preprocessing using PCEN improves robustness to loudness variation and background noise [78,114], it does not resolve the segmentation and overlap challenges. Significant training challenges also exist for all tasks due to class imbalance, sparse labels for rare species, and taxonomic bias in training data. These issues limit generalisation and deployment, especially when extending models to new ecosystems or to poorly studied taxa [104] for which, ironically, they may be most needed.

4. Speech Analysis

Speech analysis differs from pure auditory analysis due to the linguistic and semantic nature of the speech signal. It not only conveys different information, but our understanding of it (i.e., labels) has greater complexity and allows more resources to be applied for speech analysis, compared to general audio analysis. Speech analysed at the frame level primarily captures acoustic properties of the human vocal system during the production of the current senone or phonetic unit [30] or non-verbal vocalisation. It provides a snapshot view both of how the speech is being produced, which reveals information about the speaker, and of the nature of what is being produced, which reveals information about the current pronunciation unit or sound.
Speech analysed at the utterance level captures linguistic content, which reveals information about the semantic meaning conveyed by the speech. Dynamics of frame-level changes also reveal information about the speaker, including their identity, their mood, gender, and age, as well as potentially reflecting several physical and mental conditions. The analysis often makes use of the statistical variation in time as the frame-level features evolve [30]. Stacking frequency-domain frame-level features creates a time–frequency image, which is a type of spectrogram.
Considering the audio analysis framework taxonomy of Table 3, front-end features tend to be at the frame level, whereas the back-end classification tends to be at the utterance level. Many variants to this simple understanding exist, such as front-end features spanning several frames, or word-level, chunk-level and entire recording analysis.
The following subsections survey three specific speech tasks that encompass that range, albeit with different objectives: language and dialect identification, speaker verification and speech emotion recognition.

4.1. Language and Dialect Identification

Language and dialect identification (LID/DID) methods need to extract information from recorded speech utterances, which may have different linguistic content, spoken by different speakers, and captured in varying acoustic environments. The aim is to reliably determine which language or dialect was spoken, typically from among a closed set of known alternatives. While many approaches have been proposed over the years, state-of-the-art systems rely on the evidence that acoustic features carry robust language-specific cues that are suitable for front-end feature extraction [120]. Extracted embeddings are then typically stacked and classified as noted above, using deep neural network-based back-end architectures.
The well-established MFCC features [31] represent the short-time power cepstrum of speech mapped onto a Mel scale—essentially the discrete cosine transform (DCT) of log-Mel filterbank features. They involve weighted pooling across overlapping spectral regions (i.e., Mel coefficients). Because MFCCs are extracted from short (e.g., 25 ms) frames, they are limited in their ability to capture longer-term temporal dependencies in speech; hence, they are typically stacked with their delta and delta-delta (framewise difference, and difference between framewise differences) coefficients across an utterance. This captures first- and second-order temporal derivatives to improve the modelling of speech dynamics [35,121,122,123]. Shifted delta coefficients [124], as discussed in Section 2.9, similarly help to capture patterns that extend beyond individual frames, which is important for modelling the sequential nature of speech.
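A hedged librosa sketch of this stacking is shown below; the file path is a placeholder, and the frame settings (25 ms windows, 10 ms hop at 16 kHz) and 13 coefficients are example values.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)               # placeholder path to a speech recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)        # 25 ms frames, 10 ms hop
d1 = librosa.feature.delta(mfcc, order=1)                     # first-order temporal derivative
d2 = librosa.feature.delta(mfcc, order=2)                     # second-order (acceleration)
features = np.vstack([mfcc, d1, d2])                          # (39, frames) stacked time-frequency feature
print(features.shape)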
More recently, research has moved beyond handcrafted features and instead demonstrated the effectiveness of directly using raw log-Mel spectrograms as input [125,126]. The fact that spectrograms better preserve the time–frequency structures of speech enables convolutional or recurrent neural networks to learn discriminative representations from the data, without relying heavily on engineered features.
Table 7 presents some representative works for LID. Evaluation tasks vary widely in terms of the number of languages, the utterance length and the variety of speakers. Features also vary between approaches, and clearly, there are trade-offs between feature size and resolution. Performance is generally measured by Equal Error Rate (EER), or accuracy (and by Cavg in newer works). In general, the more languages and the shorter the evaluation clip, the more difficult the task becomes. Confusion matrices [127] reveal that inclusion of similar languages can significantly reduce average performance scores. All of the systems in Table 7 contain deep recurrent networks, hence even input features with a short context length are able to benefit from time-domain context within the network to perform well.

4.2. Speaker Verification

Speaker verification (SV) is the task of determining whether an input speech signal matches a claimed speaker identity from a set of enrolled speakers. SV is useful for biometric authentication, forensic analysis, and secure access control systems—for example, enabling user verification for banking transactions and for unlocking mobile devices [130,131]. The related ‘speaker validation’ task computes the probability of a given speaker being who they claim to be, while ‘forensic speaker identification’ aims to discern as much information as possible, including identity, from recordings of an unknown speaker.
Unlike most conventional automatic speech classification tasks, SV operates in an open-set setting because speakers being analysed may not have been seen during training. However, training can be conducted in a supervised manner, since labelled speaker data are readily available for enrolment. During training, models learn discriminative representations of target and non-target speakers, while during inference, they compare embeddings of test utterances against enrolled models to make acceptance or rejection decisions [27,132].
In benchmarks such as the NIST Speaker Recognition Evaluations (SRE) and the VoxCeleb challenges, thousands of labelled utterances recorded over telephone (8 kHz) and “in the wild” (16 kHz) conditions are provided for system development. Evaluation is performed by scoring the similarity between enrolment and test embeddings, often using cosine distance or probabilistic linear discriminant analysis (PLDA) [133]. Performance is generally measured by Equal Error Rate (EER) and Detection Cost Function (DCF) [134].
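Scoring and EER computation can be sketched as follows (a hedged example with random placeholder embeddings, using scikit-learn's ROC routine); a deployed system would of course use embeddings from a trained front-end and, optionally, PLDA instead of cosine scoring.

import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(enrol, test):
    """Cosine similarity between an enrolment embedding and a test embedding."""
    return np.dot(enrol, test) / (np.linalg.norm(enrol) * np.linalg.norm(test))

rng = np.random.default_rng(0)
enrol = rng.standard_normal((50, 192))                 # 50 enrolled speakers, 192-dim embeddings
tests = rng.standard_normal((200, 192))                # 200 test utterances
labels = rng.integers(0, 2, 200)                       # 1 = target trial, 0 = impostor trial
claims = rng.integers(0, 50, 200)                      # claimed speaker identity per trial

scores = np.array([cosine_score(enrol[c], t) for c, t in zip(claims, tests)])
fpr, tpr, _ = roc_curve(labels, scores)
eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]       # EER: where false accept rate = false reject rate
print(f"EER = {eer:.3f}")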
Current SV systems can be broadly categorised into generative embedding methods and discriminative embedding methods. Generative embedding methods, such as GMM-UBM and total variability (i-vector) frameworks, model speaker and channel variability via statistical supervectors and employ PLDA for scoring [130,131]. Discriminative embedding methods leverage deep neural networks to directly learn fixed-dimensional speaker embeddings: the d-vector approach averages frame-level DNN activations [132], the x-vector architecture uses TDNNs with statistics pooling [27], and enhanced variants like ECAPA-TDNN incorporate channel attention and hierarchical feature aggregation to further improve robustness [8].
Table 8 summarises several representative works for SV, including recent spectrogram-based feature representations. Despite the promising performance of these methods, several challenges remain. First, channel and domain mismatch between training and test recordings leads to performance degradation under cross-corpus and cross-device conditions [131]. Second, short-duration utterances often yield unreliable embeddings, increasing error rates. Finally, spectrogram parameter tuning for different languages, noise environments, and recording devices remains a manual and time-consuming process. The need to re-tune systems for new tasks, coupled with limitations in both robustness and generalisation, limits the widespread deployment of this technology [135]. It also presents opportunities for future research.

4.3. Speech Emotion Recognition

Speech Emotion Recognition (SER) seeks to infer affective states from analysis of speech. It generally models prosodic, spectral, and temporal variations correlated with arousal and valence, as well as with discrete emotion categories [136]. Unlike speaker verification, which benefits from representations that are stable across a speaker’s vocal space, SER requires speaker-invariant embeddings that encode emotion-related acoustic fluctuations but ignore (as much as possible) speaker identity. Much past research has indicated that emotional expressions often correlate with changes in fundamental frequency, harmonic-to-noise ratio, spectral tilt, and formant bandwidth. Spectrograms have seen widespread adoption in recent years, given their time–frequency viewpoint of evolving speech signals. Unlike raw waveform inputs, spectrograms reveal clear energy distributions across frequency bands over time, and these can be used to describe affective correlates.
Early SER systems generally employed features such as MFCC, pitch- and energy-based prosodic information, as well as jitter/shimmer measures. These would often be combined into utterance-level statistics, e.g., using openSMILE [137,138]. While these features have been shown to be effective for acted emotional speech (i.e., databases of actors portraying emotions on demand), they inherently compress some spectral detail. For example, MFCC decorrelates and smooths spectral envelopes using a DCT. High-frequency cues, which can be associated with emotional arousal, can be lost or misaligned in the process. Utterance-level pooling further removes useful temporal evolution information, which limits the ability to capture brief spectral features and may hide discriminative distributions. This probably contributes to a widely observed performance degradation in which models trained on one corpus perform much less well when evaluated on another, i.e., under cross-corpus evaluation or generalisation testing [139].
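To illustrate this compression, the snippet below is a minimal sketch using librosa on a synthetic tone; the parameter values are our own and chosen only for illustration. It computes a log-Mel spectrogram and then obtains MFCC-like coefficients by applying a DCT and truncating to 13 coefficients, which discards much of the finer spectral detail retained by the full log-Mel representation.
```python
import librosa
import scipy.fftpack

sr = 16000
y = librosa.tone(440, sr=sr, duration=2.0)           # synthetic test signal

# Log-Mel spectrogram (64 bands, 25 ms window, 10 ms hop).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)

# MFCCs are a DCT of the log-Mel bands; keeping only the first 13
# coefficients decorrelates and smooths the spectral envelope,
# discarding much of the fine high-frequency detail.
mfcc = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:13, :]
print(log_mel.shape, mfcc.shape)   # (64, T) versus (13, T)
```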
Spectrogram usage has enabled richer modelling of time–frequency features compared to earlier approaches. Linear, log-Mel and Mel spectrograms preserve local spectral–temporal patterns, allowing convolutional neural networks (CNNs) to learn filters sensitive to emotion-associated frequency characteristics, as noted above [140]. Time-domain cues that unfold over several successive frames, such as the evolution of pitch and intensity profiles and of spectral modulation over time, can be effectively captured by recurrent architectures like BLSTMs [141]. Attention mechanisms further refine the time-domain sensitivity by weighting frames that contribute more strongly to emotional perception, while down-weighting linguistically dominant or neutral segments [142]. Importantly, spectrogram configuration choices such as window length, hop size, and number of Mel bands affect the granularity with which emotional cues are represented in both the time and frequency dimensions. Smaller hop sizes, allied with a recurrent network, increase sensitivity to rapid prosodic change, whereas higher Mel-band resolution allows finer modelling of high-frequency structures. As with other spectrogram-based analysis tasks, trade-offs must be made between granularity and context, especially in the time domain.
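The short sketch below illustrates this trade-off; it is our own example with arbitrary settings, not a recommendation from any cited work. It extracts two Mel spectrograms from the same signal, one configured for fine time resolution and one for fine frequency resolution, and prints the resulting feature shapes.
```python
import librosa

sr = 16000
y = librosa.tone(440, sr=sr, duration=2.0)

# Fine time resolution: short window, small hop, fewer Mel bands.
fine_time = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=256,
                                           hop_length=64, n_mels=40)

# Fine frequency resolution: long window, larger hop, more Mel bands.
fine_freq = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                           hop_length=256, n_mels=128)

# The first favours rapid prosodic change; the second favours detailed
# spectral structure at the cost of temporal smearing.
print(fine_time.shape, fine_freq.shape)   # (40, ~501) versus (128, ~126)
```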
As with SV and, to some extent, LID, recent SER systems integrate high-resolution spectrograms with deep sequence models such as CNN–Transformer hybrids. Transformers capture long-range dependencies and global contextual structures. This can complement the ability of CNNs to extract localised patterns. Meanwhile, self-supervised learning (SSL) models such as wav2vec 2.0, HuBERT, and WavLM provide contextualised frame-level embeddings that have been learned from large unlabelled speech corpora [143,144]. Although WavLM operates on raw waveforms, there is some evidence that intermediate layers can encode spectrally relevant information, such as fundamental frequency trajectories and amplitude modulation patterns. These attributes are observable in spectrograms [145]. As a result, SSL embeddings can serve as either an alternative to, or a complement for, spectrogram features.
Table 9 summarises representative SER approaches and highlights the shift from coarse statistical descriptors to high-resolution spectrograms, and to spectrogram-informed latent embeddings from pre-trained models.
Despite many recent advances, several challenges persist in SER research using spectrograms. First, emotional correlates vary across speakers, languages, and speaking styles, as well as recording conditions. Spectral tilt, harmonic structure, and prosodic patterns are inconsistent across corpora, which remain highly influenced by their recording conditions, task (e.g., spontaneous vs. scripted speech) and labelling methodology. This contributes to domain mismatch [147]. Second, many emotion-relevant cues occur at short temporal scales that are sensitive to spectrogram settings, where inappropriate frame sizes, windowing or Mel resolution may obscure rapid spectral transitions. Other emotion cues evolve over a long timescale that is sensitive to the recurrence length or context size of the features. Third, while SSL features appear to offer robustness, their lack of explicit frequency structure makes it difficult to model multi-resolution emotional cues or to interpret how spectral information influences predictions. Addressing these issues may require frameworks that can integrate interpretable time–frequency structures obtained from spectrograms with the robustness and abstraction provided by waveform-based analysis.

5. Conclusions

This paper has surveyed the nature and application of spectrogram time–frequency features for audio and speech analysis. Beginning with the definition of a spectrogram, we considered element scalings such as Mel, log-Mel, A-law, and μ-law, as well as alternative transforms including Gammatonegrams, stabilised auditory images (SAI) and constant-Q transforms. We also considered spectrogram-like representations formed by stacking other kinds of frequency vectors in time, such as MFCC, PLP, filterbanks, and embeddings from pre-trained models. Settings including frequency resolution, time span, range, frame size and hop were examined alongside the related task of pooling or downsampling, and we addressed the need to vote over, or otherwise process, individual frames to obtain a per-utterance, per-chunk or per-recording classification, as well as to produce start and end timestamps where detection of events is required.
In each of the analysis domains sampled within this paper (namely SED, ASD, bioacoustics, LID/DID, SV, and SER), the past decade has seen a shift away from statistical features (e.g., variance, shimmer, skewness), through handcrafted features, such as MFCC and filterbanks, to spectrograms. Early spectrogram-based deep learning classifiers [6,23] used small rectangular patches. This was because limited training datasets restricted the model complexity that could be effectively trained, which in turn limited the size of input features. As more training data became available, larger models were possible, and spectrogram patch sizes tended to increase. Recurrent networks enabled deep neural networks to exploit time-domain statistics, allowing the size of spectrogram patches to be reduced, or to be reduced in time span but increased in frequency resolution (often through larger analysis windows with smaller hop sizes).
While spectrogram features have been shown to be effective in many audio analysis domains, pooling to downsample tends to obscure fine detail. Variance normalised features (VNF) were proposed to define more nuanced pooling rules to maximise between-class variance compared to within-class variance. However, the complexity of optimising front-end features and the need for a large representative dataset have led researchers to increasingly adopt pre-trained foundation models. These have often been well trained for a different but related task, such as automatic speech recognition, where large, high-quality datasets are readily available. For speech systems, models are often pre-trained as ASR phone detectors. For audio, models such as AST [25], PaSST [36], and HTS-AT [37] are trained as classifiers using large-scale audio datasets. Adaptation or fine-tuning methods are then used to harness these models for target tasks: pre-trained speech models for tasks such as SER [37] and SV [148], and pre-trained audio models for tasks such as SED [38] and bioacoustics [149]. In each situation, the performance of the adaptation techniques is crucial to the resulting system performance. Apart from reasons of training efficiency, the use of pre-trained models is supported through the example of the human auditory system—a fixed external capture system (pinna, outer and middle ear) and a fixed feature transformer (inner ear, auditory nerve) [3], both of which handle all auditory tasks. However, there are specialised back-end processes operating within physically separate areas of the brain for different target tasks.
Auditory analysis foundation models continue to grow in capability, yielding greater opportunities for adaptation and fine-tuning methods. Since the time duration of sounds and events has been cited as a problematic issue for balancing feature size and resolution, it seems likely that advances will be made in the area of more effective multi-scale analysis methods.

Future Directions

As we have seen, deep learning methods that make use of spectrogram features have gained prominence across the field of audio analysis, as well as for several speech analysis tasks. However, there are general aspects in which performance needs to improve substantially before widespread deployment becomes possible. These include the following:
  • Noise robustness, particularly towards overlapping sounds and reverberation.
  • Model complexity and real-time operation on edge devices.
  • Robust separation of intertwined sounds, particularly for multi-channel recordings.
  • Timeliness—early detection before a sound has completed.
  • Generalisation to unseen sounds, e.g., few- and zero-shot classification, including from multimodal prompts [38].
Beyond this, determining optimal spectrogram settings for a particular back-end architecture and task is currently a largely empirical process. These settings include the dimensions of spectrogram patches, their resolution in the time and frequency domains, and whether the frequency dimension is linear or non-linear. In addition, each element (pixel) in a spectrogram is a scaled value, using a scheme such as log, A-law or another mapping. Data-driven methods of determining the optimal settings are required, but this conflicts with the aim of adopting generalised pre-trained foundation models, unless those can be multi-resolution and multi-scale, or make use of data fusion techniques. We also note that many researchers rely on Python libraries such as librosa (https://librosa.org, accessed on 1 January 2025) with default settings, so any future techniques should ideally be just as stable and easy to use.
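As an example of what those default settings typically imply, the call below writes out the commonly used librosa Mel spectrogram defaults explicitly (values as documented for recent 0.10-era releases; they may differ between library versions, so this should be read as an assumption rather than a specification).
```python
import numpy as np
import librosa

y = librosa.tone(440, sr=22050, duration=1.0)

# Writing the usual defaults out explicitly makes the implicit resolution
# choices visible: roughly a 93 ms analysis window, 23 ms hop, 128 Mel
# bands spanning 0 Hz to Nyquist, on a power (magnitude-squared) scale.
mel = librosa.feature.melspectrogram(
    y=y, sr=22050,
    n_fft=2048,            # window length in samples
    hop_length=512,        # frame advance in samples
    n_mels=128,            # number of Mel bands
    fmin=0.0, fmax=None,   # fmax=None means sr/2 (Nyquist)
    power=2.0)             # power spectrogram before Mel pooling
log_mel = librosa.power_to_db(mel, ref=np.max)
```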

Author Contributions

Conceptualisation, I.M. and Y.S.; methodology, investigation and writing—original draft preparation, I.M., L.P., X.M., H.P., P.C., Q.G., J.N., H.S. and D.S.; writing—review and editing, I.M.; supervision, I.M. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analysed in this review. Data sharing is not applicable to this article.

Conflicts of Interest

Author Huy Phan is currently employed by Meta Inc. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AE: Autoencoder
AIM: Auditory Image Model
ASA: Acoustic Scene Analysis
ASD: Anomalous Sound Detection
AST: Audio Spectrogram Transformer
CNN: Convolutional Neural Network
CQT: Constant-Q Transform
DCASE: Detection and Classification of Acoustic Scenes and Events
DCT: Discrete Cosine Transform
DFT: Discrete Fourier Transform
DID: Dialect Identification
DWT: Discrete Wavelet Transform
ERB: Equivalent Rectangular Bandwidth
FB: Filterbanks
FFT: Fast Fourier Transform
GTG: Gammatonegram
LID: Language Identification
LMS: Log-Mel Spectrogram
LNS: Log Non-uniform Spectrum
LS: Linear Spectrogram
LSS: Log-Scaled Spectrogram
LSTM: Long Short-Term Memory
MFCC: Mel-frequency Cepstral Coefficients
MS: Mel Spectrogram
OE: Outlier Exposure
PAMIR: Passive-aggressive Model for Image Retrieval
PLP: Perceptual Linear Prediction
PSDS: Polyphonic Sound Detection Score
RWCP: Real World Computing Partnership
RNN: Recurrent Neural Network
SAI: Stabilised Auditory Image
SDC: Shifted Delta Cepstra
SED: Sound Event Detection
SID: Speaker Identification
SSA: Sound Scene Analysis
SSD: Sound Scene Detection
SER: Speech Emotion Recognition
SNR: Signal to Noise Ratio
SV: Speaker Verification
SVM: Support Vector Machine
STFT: Short Time Fourier Transform
VAE: Variational Autoencoder
VNF: Variance Normalised Features

References

  1. Koenig, W.; Dunn, H.K.; Lacy, L.Y. The Sound Spectrograph. J. Acoust. Soc. Am. 1946, 18, 19–49. [Google Scholar] [CrossRef]
  2. Zue, V.W.; Cole, R.A. Experiments on spectrogram reading. In Proceedings of the Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’79, Washington, DC, USA, 2–4 April 1979; Volume 4, pp. 116–119. [Google Scholar]
  3. McLoughlin, I.V. Speech and Audio Processing: A MATLAB-Based Approach; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
  4. Gibbs, J.W. Fourier Series. Nature 1899, 59, 200, 606. [Google Scholar] [CrossRef]
  5. Ifeachor, E.C.; Jervis, B.W. Digital Signal Processing: A Practical Approach; Addison-Wesley: Boston, MA, USA, 1993. [Google Scholar]
  6. McLoughlin, I.; Zhang, H.M.; Xie, Z.P.; Song, Y.; Xiao, W. Robust Sound Event Classification using Deep Neural Networks. IEEE Trans. Audio Speech Lang. Process. 2015, 23, 540–552. [Google Scholar] [CrossRef]
  7. Wang, Z.J.; Turko, R.; Shaikh, O.; Park, H.; Das, N.; Hohman, F.; Kahng, M.; Polo Chau, D.H. CNN Explainer: Learning Convolutional Neural Networks with Interactive Visualization. IEEE Trans. Vis. Comput. Graph. 2021, 27, 1396–1406. [Google Scholar] [CrossRef]
  8. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech, Shanghai, China, 25 October 2020; pp. 3830–3834. [Google Scholar]
  9. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the ICML, Honolulu, HI, USA, 23 July 2023. [Google Scholar]
  10. Kim, J.Y.; Lee, S.H. Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer. IEEE Access 2024, 12, 128039–128048. [Google Scholar] [CrossRef]
  11. Ellis, D.P. Gammatone-Like Spectrograms. 2009. Available online: https://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/ (accessed on 1 January 2020).
  12. FitzGerald, D. Vocal separation using nearest neighbours and median filtering. In Proceedings of the IET Irish Signals and Systems Conference (ISSC 2012), Maynooth, Ireland, 28 June 2012; p. 98G. [Google Scholar]
  13. Brown, J.C. Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 1991, 89, 425–434. [Google Scholar] [CrossRef]
  14. Yizhi, L.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Xiao, C.; Lin, C.; Ragni, A.; Benetos, E.; et al. MERT: Acoustic music understanding model with large-scale self-supervised training. In Proceedings of the ICLR, Kigali, Rwanda, 1 May 2023. [Google Scholar]
  15. Huang, H.; Man, J.; Li, L.; Zeng, R. Musical timbre style transfer with diffusion model. PeerJ Comput. Sci. 2024, 10, e2194. [Google Scholar] [CrossRef]
  16. Ma, N.; Green, P.; Barker, J.; Coy, A. Exploiting correlogram structure for robust speech recognition with multiple speech sources. Speech Commun. 2007, 49, 874–891. [Google Scholar] [CrossRef]
  17. Patterson, R.D.; Allerhand, M.H.; Giguere, C. Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. J. Acoust. Soc. Am. 1995, 98, 1890–1894. [Google Scholar] [CrossRef]
  18. Walters, T.C. Auditory-Based Processing of Communication Sounds. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2011. [Google Scholar]
  19. Lyon, R.F.; Rehn, M.; Bengio, S.; Walters, T.C.; Chechik, G. Sound retrieval and ranking using sparse auditory representations. Neural Comput. 2010, 22, 2390–2416. [Google Scholar] [CrossRef]
  20. Lyon, R.F.; Rehn, M.; Walters, T.; Bengio, S.; Chechik, G. Audio Classification for Information Retrieval Using Sparse Features. U.S. Patent 8,463,719, 11 June 2013. [Google Scholar]
  21. Lyon, R.F.; Ponte, J.; Chechik, G. Sparse coding of auditory features for machine hearing in interference. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22 May 2011; pp. 5876–5879. [Google Scholar]
  22. Lyon, R.F. Machine hearing: Audio analysis by emulation of human hearing. In Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Palz, NY, USA, 16 October 2011; p. viii. [Google Scholar]
  23. Zhang, H.; McLoughlin, I.; Song, Y. Robust Sound Event Recognition using Convolutional Neural Networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19 April 2015; pp. 559–563. [Google Scholar]
  24. Nowak, E.; Jurie, F.; Triggs, B. Sampling Strategies for Bag-of-Features Image Classification. In Proceedings of the Computer Vision—ECCV 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 490–503. [Google Scholar] [CrossRef]
  25. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech 2021, Graz, Austria, 15 September 2021; pp. 571–575. [Google Scholar] [CrossRef]
  26. Dennis, J.; Tran, H.D.; Chng, E.S. Image feature representation of the subband power distribution for robust sound event classification. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 367–377. [Google Scholar] [CrossRef]
  27. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15 April 2018; pp. 5329–5333. [Google Scholar] [CrossRef]
  28. Xie, Z.; McLoughlin, I.; Zhang, H.; Song, Y.; Xiao, W. A new variance-based approach for discriminative feature extraction in machine hearing classification using spectrogram features. Digit. Signal Process. 2016, 54, 119–128. [Google Scholar] [CrossRef]
  29. Miao, X.; McLoughlin, I.; Song, Y. Variance Normalised Features for Language and Dialect Discrimination. Circuits Syst. Signal Process. 2021, 40, 3621–3638. [Google Scholar] [CrossRef]
  30. Jin, M.; Song, Y.; McLoughlin, I. LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification. In Proceedings of the Odyssey, Bilbao, Spain, 21–24 June 2016. [Google Scholar]
  31. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  32. Malina, W. On an Extended Fisher Criterion for Feature Selection. IEEE Trans. Pattern Anal. Mach. Intell. 1981, PAMI-3, 611–614. [Google Scholar] [CrossRef]
  33. Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
  34. Rafsanjani, M.A.H.; Mawalim, C.O.; Lestari, D.P.; Sakti, S.; Unoki, M. Unsupervised Anomalous Sound Detection Using Timbral and Human Voice Disorder-Related Acoustic Features. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Macao, China, 3 December 2024; pp. 1–6. [Google Scholar]
  35. Miao, X.; McLoughlin, I.; Yan, Y. A New Time-Frequency Attention Mechanism for TDNN and CNN-LSTM-TDNN, with Application to Language Identification. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019. [Google Scholar]
  36. Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; Widmer, G. Efficient Training of Audio Transformers with Patchout. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18 September 2022; pp. 2753–2757. [Google Scholar] [CrossRef]
  37. Chen, K.; Du, X.; Zhu, B.; Ma, Z.; Berg-Kirkpatrick, T.; Dubnov, S. HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. In Proceedings of the ICASSP, Singapore, 22 May 2022. [Google Scholar]
  38. Cai, P.; Song, Y.; Gu, Q.; Jiang, N.; Song, H.; McLoughlin, I. Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27 October 2025; pp. 582–591. [Google Scholar] [CrossRef]
  39. Chen, T.; Yang, Y.; Qiu, C.; Fan, X.; Guo, X.; Shangguan, L. Enabling Hands-Free Voice Assistant Activation on Earphones. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (MOBISYS), Tokyo, Japan, 3–7 June 2024; pp. 155–168. [Google Scholar] [CrossRef]
  40. Bregman, A.S. Auditory Scene Analysis: The Perceptual Organization of Sound; MIT Press: Cambridge, MA, USA, 1994. [Google Scholar]
  41. Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound Event Detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83. [Google Scholar] [CrossRef]
  42. Zeng, X.M.; Song, Y.; Zhuo, Z.; Zhou, Y.; Li, Y.H.; Xue, H.; Dai, L.R.; McLoughlin, I. Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  43. Miao, X.; McLoughlin, I. LSTM-TDNN with convolutional front-end for Dialect Identification in the 2019 Multi-Genre Broadcast Challenge. arXiv 2019, arXiv:1912.09003. [Google Scholar]
  44. Jiang, Y.; Song, Y.; McLoughlin, I.; Gao, Z.; Dai, L.R. An Effective Deep Embedding Learning Architecture for Speaker Verification. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 4040–4044. [Google Scholar]
  45. Sun, L.; Du, J.; Jiang, C.; Zhang, X.; He, S.; Yin, B.; Lee, C.H. Speaker Diarization with Enhancing Speech for the First DIHARD Challenge. In Proceedings of the Interspeech, Hyderabad, India, 2 September 2018; pp. 2793–2797. [Google Scholar]
  46. Gao, Z.; Song, Y.; McLoughlin, I.; Li, P.; Jiang, Y.; Dai, L.R. Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 361–365. [Google Scholar]
  47. Pham, L.; Phan, H.; Palaniappan, R.; Mertins, A.; McLoughlin, I. CNN-MoE Based Framework for Classification of Respiratory Anomalies and Lung Disease Detection. IEEE J. Biomed. Health Inform. 2021, 25, 2938–2947. [Google Scholar] [CrossRef] [PubMed]
  48. Nguyen, T.; Pernkopf, F. Lung Sound Classification Using Co-Tuning and Stochastic Normalization. IEEE Trans. Biomed. Eng. 2022, 69, 2872–2882. [Google Scholar] [CrossRef] [PubMed]
  49. Milling, M.; Pokorny, F.B.; Bartl-Pokorny, K.D.; Schuller, B.W. Is Speech the New Blood? Recent Progress in AI-Based Disease Detection From Audio in a Nutshell. Front. Digit. Health 2022, 4, 886615. [Google Scholar] [CrossRef] [PubMed]
  50. Rashid, M.M.; Li, G.; Du, C. Nonspeech7k dataset: Classification and analysis of human non-speech sound. IET Signal Process. 2023, 17, e12233. [Google Scholar] [CrossRef]
  51. Kim, S.Y.; Lee, H.M.; Lim, C.Y.; Kim, H.W. Detection of Abnormal Symptoms Using Acoustic-Spectrogram-Based Deep Learning. Appl. Sci. 2025, 15, 4679. [Google Scholar] [CrossRef]
  52. Moysis, L.; Iliadis, L.A.; Sotiroudis, S.P.; Boursianis, A.D.; Papadopoulou, M.S.; Kokkinidis, K.I.D.; Volos, C.; Sarigiannidis, P.; Nikolaidis, S.; Goudos, S.K. Music Deep Learning: Deep Learning Methods for Music Signal Processing—A Review of the State-of-the-Art. IEEE Access 2023, 11, 17031–17052. [Google Scholar] [CrossRef]
  53. Chen, R.; Ghobakhlou, A.; Narayanan, A. Hierarchical Residual Attention Network for Musical Instrument Recognition Using Scaled Multi-Spectrogram. Appl. Sci. 2024, 14, 10837. [Google Scholar] [CrossRef]
  54. Buisson, M.; McFee, B.; Essid, S.; Crayencour, H.C. Self-Supervised Learning of Multi-Level Audio Representations for Music Segmentation. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2141–2152. [Google Scholar] [CrossRef]
  55. Thapa, N.; Lee, J. Dual-Path Beat Tracking: Combining Temporal Convolutional Networks and Transformers in Parallel. Appl. Sci. 2024, 14, 11777. [Google Scholar] [CrossRef]
  56. Verma, P.; Berger, J. Audio transformers: Transformer architectures for large scale audio understanding. adieu convolutions. arXiv 2021, arXiv:2105.00335. [Google Scholar]
  57. Grzeszick, R.; Plinge, A.; Fink, G.A. Bag-of-features methods for acoustic event detection and classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1242–1252. [Google Scholar] [CrossRef]
  58. Dennis, J.; Tran, H.D.; Chng, E.S. Overlapping sound event recognition using local spectrogram features and the generalised hough transform. Pattern Recognit. Lett. 2013, 34, 1085–1093. [Google Scholar] [CrossRef]
  59. Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks. IEEE J. Sel. Top. Signal Process. 2019, 13, 34–48. [Google Scholar] [CrossRef]
  60. Xia, W.; Koishida, K. Sound Event Detection in Multichannel Audio using Convolutional Time-Frequency-Channel Squeeze and Excitation. arXiv 2019, arXiv:1908.01399. [Google Scholar]
  61. Alcázar, J.N.; Zuccarello, P.; Cobos, M. Classification of Sound Scenes and Events in Real-World Scenarios with Deep Learning Techniques. Ph.D. Thesis, Universitat de Valencia, Valencia, Spain, 2020. Available online: https://roderic.uv.es/rest/api/core/bitstreams/d50a3d5d-8481-4e33-8196-ff5af972771f/content (accessed on 1 January 2026).
  62. Wisdom, S.; Tzinis, E.; Erdogan, H.; Weiss, R.; Wilson, K.; Hershey, J. Unsupervised Sound Separation Using Mixture Invariant Training. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 3846–3857. [Google Scholar]
  63. Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inform. 2021, 61, 101236. [Google Scholar] [CrossRef]
  64. Nath, K.; Sarma, K.K. Separation of overlapping audio signals: A review on current trends and evolving approaches. Signal Process. 2024, 221, 109487. [Google Scholar] [CrossRef]
  65. Sudo, Y.; Itoyama, K.; Nishida, K.; Nakadai, K. Environmental sound segmentation utilizing Mask U-Net. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4 November 2019; pp. 5340–5345. [Google Scholar] [CrossRef]
  66. Sudo, Y.; Itoyama, K.; Nishida, K.; Nakadai, K. Multi-channel Environmental sound segmentation. In Proceedings of the IEEE/SICE International Symposium on System Integration (SII), Honolulu, HI, USA, 12 January 2020; pp. 820–825. [Google Scholar] [CrossRef]
  67. Baelde, M.; Biernacki, C.; Greff, R. A mixture model-based real-time audio sources classification method. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5 March 2017; pp. 2427–2431. [Google Scholar] [CrossRef]
  68. Xu, K.; Feng, D.; Mi, H.; Zhu, B.; Wang, D.; Zhang, L.; Cai, H.; Liu, S. Mixup-Based Acoustic Scene Classification Using Multi-channel Convolutional Neural Network. In Proceedings of the Advances in Multimedia Information Processing—PCM 2018; Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 14–23. [Google Scholar]
  69. Phan, H.; Maass, M.; Mazur, R.; Mertins, A. Early event detection in audio streams. In Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), Torino, Italy, 29 June 2015; pp. 1–6. [Google Scholar] [CrossRef]
  70. Zhao, X.; Zhang, X.; Zhao, C.; Cho, J.H.; Kaplan, L.; Jeong, D.H. Multi-Label Temporal Evidential Neural Networks for Early Event Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4 June 2023. [Google Scholar] [CrossRef]
  71. Zhao, X.; Zhang, X.; Cheng, W.; Yu, W.; Chen, Y.; Chen, H.; Chen, F. Seed: Sound event early detection via evidential uncertainty. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22 May 2022; pp. 3618–3622. [Google Scholar]
  72. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August 2016; pp. 1128–1132. [Google Scholar] [CrossRef]
  73. Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M.; Plumbley, M.D. Detection and Classification of Acoustic Scenes and Events. IEEE Trans. Multimed. 2015, 17, 1733–1746. [Google Scholar] [CrossRef]
  74. Çakir, E.; Virtanen, T. End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8 July 2018; pp. 1–7. [Google Scholar] [CrossRef]
  75. Bittner, R.; McFee, B.; Salamon, J.; Li, P.; Bello, J. Deep Salience Representations for F0 Estimation in Polyphonic Music. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 23–27 October 2017. [Google Scholar]
  76. Leng, Y.R.; Tran, H.D.; Kitaoka, N.; Li, H. Selective gammatone filterbank feature for robust sound event recognition. In Proceedings of the Interspeech, Makuhari, Chiba, Japan, 26 September 2010; pp. 2246–2249. [Google Scholar] [CrossRef]
  77. Pham, L.; Phan, H.; Nguyen, T.; Palaniappan, R.; Mertins, A.; McLoughlin, I. Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework. Digit. Signal Process. 2021, 110, 102943. [Google Scholar] [CrossRef]
  78. Wang, Y.; Getreuer, P.; Hughes, T.; Lyon, R.F.; Saurous, R.A. Trainable frontend for robust and far-field keyword spotting. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5 March 2017; pp. 5670–5674. [Google Scholar]
  79. Yin, H.; Bai, J.; Xiao, Y.; Wang, H.; Zheng, S.; Chen, Y.; Das, R.K.; Deng, C.; Chen, J. Exploring Text-Queried Sound Event Detection with Audio Source Separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
  80. Adavanne, S.; Virtanen, T. A Report on Sound Event Detection with Different Binaural Features; Technical Report, DCASE2017 Challenge; Munich, Germany, 2017; Available online: https://arxiv.org/abs/1710.02997 (accessed on 1 January 2025).
  81. Heittola, T.; Mesaros, A. DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System; Technical Report, DCASE2017 Challenge; Munich, Germany, 2017; Available online: https://inria.hal.science/hal-01627981/ (accessed on 1 January 2025).
  82. Lin, L.; Wang, X. Guided Learning Convolution System For Dcase 2019 Task 4; Technical Report; Institute of Computing Technology, Chinese Academy of Sciences: Beijing, China, 2019. [Google Scholar]
  83. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE ICASSP 2017, New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
  84. Fonseca, E.; Pons, J.; Favory, X.; Font, F.; Bogdanov, D.; Ferraro, A.; Oramas, S.; Porter, A.; Serra, X. Freesound Datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 486–493. [Google Scholar]
  85. McLoughlin, I.; Xie, Z.; Song, Y.; Phan, H.; Palaniappan, R. Time–Frequency Feature Fusion for Noise Robust Audio Event Classification. Circuits Syst. Signal Process. 2020, 39, 1672–1687. [Google Scholar] [CrossRef]
  86. Ebbers, J.; Haeb-Umbach, R. Pre-Training and Self-Training for Sound Event Detection in Domestic Environments; Technical Report, DCASE2022 Challenge; Nancy, France, 2022; Available online: https://ris.uni-paderborn.de/record/49113 (accessed on 1 January 2025).
  87. Turpault, N.; Serizel, R.; Parag Shah, A.; Salamon, J. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, New York, NY, USA, 25–26 October 2019. [Google Scholar]
  88. Kim, J.W.; Son, S.W.; Song, Y.; Kim, H.K.; Song, I.H.; Lim, J.E. Semi-Supervised Learning-Based Sound Event Detection Using Frequency Dynamic Convolution With Large Kernel Attention For DCASE Challenge 2023 Task 4; Technical Report, DCASE2023 Challenge; Tampere, Finland, 2023; Available online: https://arxiv.org/abs/2306.06461 (accessed on 1 January 2026).
  89. Schmid, F.; Primus, P.; Morocutti, T.; Greif, J.; Widmer, G. Improving Audio Spectrogram Transformers For Sound Event Detection Through Multi-Stage Training. arXiv 2024, arXiv:2408.00791. [Google Scholar]
  90. Martín-Morató, I.; Mesaros, A. Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 902–914. [Google Scholar] [CrossRef]
  91. Tang, L.; Tian, H.; Huang, H.; Shi, S.; Ji, Q. A survey of mechanical fault diagnosis based on audio signal analysis. Measurement 2023, 220, 113294. [Google Scholar] [CrossRef]
  92. Qurthobi, A.; Maskeliūnas, R.; Damaševičius, R. Detection of Mechanical Failures in Industrial Machines Using Overlapping Acoustic Anomalies: A Systematic Literature Review. Sensors 2022, 22, 3888. [Google Scholar] [CrossRef]
  93. Zeng, X.M.; Song, Y.; McLoughlin, I.; Liu, L.; Dai, L.R. Robust Prototype Learning for Anomalous Sound Detection. In Proceedings of the Interspeech, Dublin, Ireland, 20 August 2023; pp. 261–265. [Google Scholar] [CrossRef]
  94. Koizumi, Y.; Kawaguchi, Y.; Imoto, K.; Nakamura, T.; Nikaido, Y.; Tanabe, R.; Purohit, H.; Suefusa, K.; Endo, T.; Yasuda, M.; et al. Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring. In Proceedings of the DCASE, Tokyo, Japan (Virtual), 2–4 November 2020; pp. 81–85. [Google Scholar]
  95. Suefusa, K.; Nishida, T.; Purohit, H.; Tanabe, R.; Endo, T.; Kawaguchi, Y. Anomalous Sound Detection Based on Interpolation Deep Neural Network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4 May 2020; pp. 271–275. [Google Scholar] [CrossRef]
  96. Zeng, X.M.; Song, Y.; Dai, L.R.; Liu, L. Predictive AutoEncoders Are Context-Aware Unsupervised Anomalous Sound Detectors. In Proceedings of the Man-Machine Speech Communication; Zhenhua, L., Jianqing, G., Kai, Y., Jia, J., Eds.; Springer Nature: Singapore, 2023; pp. 101–113. [Google Scholar]
  97. Hendrycks, D.; Mazeika, M.; Dietterich, T. Deep Anomaly Detection with Outlier Exposure. In Proceedings of the ICLR, New Orleans, LA, USA, 6 May 2019. [Google Scholar]
  98. Liu, Y.; Guan, J.; Zhu, Q.; Wang, W. Anomalous Sound Detection Using Spectral-Temporal Information Fusion. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22 May 2022. [Google Scholar] [CrossRef]
  99. Han, B.; Lv, Z.; Jiang, A.; Huang, W.; Chen, Z.; Deng, Y.; Ding, J.; Lu, C.; Zhang, W.Q.; Fan, P.; et al. Exploring Large Scale Pre-Trained Models for Robust Machine Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14 April 2024; pp. 1326–1330. [Google Scholar] [CrossRef]
  100. Jiang, A.; Han, B.; Lv, Z.; Deng, Y.; Zhang, W.Q.; Chen, X.; Qian, Y.; Liu, J.; Fan, P. AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection. In Proceedings of the Interspeech, Kos, Greece, 1 September 2024; pp. 107–111. [Google Scholar] [CrossRef]
  101. Chakrabarty, D.; Elhilali, M. Abnormal sound event detection using temporal trajectories mixtures. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20 March 2016; pp. 216–220. [Google Scholar] [CrossRef]
  102. Li, K.; Zaman, K.; Li, X.; Akagi, M.; Unoki, M. Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks. arXiv 2024, arXiv:2409.05319. [Google Scholar] [CrossRef]
  103. Yin, J.; Gao, Y.; Zhang, W.; Wang, T.; Zhang, M. Diffusion Augmentation Sub-center Modeling for Unsupervised Anomalous Sound Detection with Partially Attribute-Unavailable Conditions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
  104. Stowell, D. Computational bioacoustics with deep learning: A review and roadmap. PeerJ 2022, 10, e13152. [Google Scholar] [CrossRef]
  105. Tosato, G.; Shehata, A.; Janssen, J.; Kamp, K.; Jati, P.; Stowell, D. Auto deep learning for bioacoustic signals. arXiv 2023, arXiv:2311.04945. [Google Scholar] [CrossRef]
  106. Heinrich, R.; Sick, B.; Scholz, C. AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification. arXiv 2024, arXiv:2404.10420. [Google Scholar] [CrossRef]
  107. Hershey, J.R.; Chen, Z.; Le Roux, J.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. arXiv 2015, arXiv:1508.04306. [Google Scholar] [CrossRef]
  108. Dang, T.M.; Wang, T.S.; Lekhak, H.; Zhu, K.Q. EmotionalCanines: A Dataset for Analysis of Arousal and Valence in Dog Vocalization. In Proceedings of the ACM International Conference on Multimedia. Association for Computing Machinery, Dublin, Ireland, 27–31 October 2025; pp. 13281–13288. [Google Scholar] [CrossRef]
  109. Suzuki, K.; Sakamoto, S.; Taniguchi, T.; Kameoka, H. Speak like a dog: Human to non-human creature voice conversion. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7 November 2022; pp. 1388–1393. [Google Scholar]
  110. Kang, M.; Lee, S.; Lee, C.; Cho, N. When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds. arXiv 2025, arXiv:2505.24336. [Google Scholar] [CrossRef]
  111. Lostanlen, V.; Salamon, J.; Cartwright, M.; McFee, B.; Farnsworth, A.; Kelling, S.; Bello, J.P. Per-Channel Energy Normalization: Why and How. IEEE Signal Process. Lett. 2019, 26, 39–43. [Google Scholar] [CrossRef]
  112. Hexeberg, S.; Leite, R.; Ewers, R.M.; Stowell, D. Semi-supervised classification of bird vocalizations using spatiotemporal features. Sci. Rep. 2023, 13, 12345. [Google Scholar]
  113. Bravo Sánchez, V.; Stowell, D.; Drossos, K.; Virtanen, T. Bioacoustic classification of avian calls from raw sound waveforms with an open-source deep learning architecture. Sci. Rep. 2021, 11, 15740. [Google Scholar] [CrossRef]
  114. Zeghidour, N.; Luebs, F.; Synnaeve, G.; Collobert, R. LEAF: A learnable frontend for audio classification. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3 May 2021. [Google Scholar]
  115. Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with SincNet. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18 December 2018; pp. 1021–1028. [Google Scholar]
  116. Lee, J.; Park, J.; Nam, J. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In Proceedings of the ISMIR, Suzhou, China, 23 October 2017. [Google Scholar]
  117. Jung, J.w.; Kim, H.S.; Kim, M.J.; Yoon, S.H.; Lee, B.J.; Kim, H. RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 1268–1272. [Google Scholar]
  118. Allen, A.N.; Harvey, M.; Harrell, L.; Jansen, A.; Merkens, K.P.; Wall, C.C.; Cattiau, J.; Oleson, E.M. A CNN for humpback whale song detection in diverse long-term datasets. Front. Mar. Sci. 2021, 8, 653740. [Google Scholar] [CrossRef]
  119. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
  120. O’Shaughnessy, D. Spoken language identification: An overview of past and present research trends. Speech Commun. 2025, 167, 103167. [Google Scholar] [CrossRef]
  121. Miao, X.; McLoughlin, I.; Wang, W.; Zhang, P. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural Netw. 2021, 139, 201–211. [Google Scholar] [CrossRef]
  122. Liu, H.; Perera, L.P.G.; Khong, A.W.; Chng, E.S.; Styles, S.J.; Khudanpur, S. Efficient self-supervised learning representations for spoken language identification. IEEE J. Sel. Top. Signal Process. 2022, 16, 1296–1307. [Google Scholar] [CrossRef]
  123. Dey, S.; Sahidullah, M.; Saha, G. Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 5040–5050. [Google Scholar] [CrossRef]
  124. Kohler, M.A.; Kennedy, M. Language identification using shifted delta cepstra. In Proceedings of the 45th Midwest Symposium on Circuits and Systems (MWSCAS), Fukuoka, Japan, 7 August 2002; Volume 3, pp. III–69. [Google Scholar]
  125. Cai, W.; Cai, Z.; Zhang, X.; Wang, X.; Li, M. A novel learnable dictionary encoding layer for end-to-end language identification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15 April 2018; pp. 5189–5193. [Google Scholar]
  126. Alumäe, T.; Kukk, K.; Le, V.B.; Barras, C.; Messaoudi, A.; Ben Kheder, W. Exploring the impact of pretrained models and web-scraped data for the 2022 NIST language recognition evaluation. In Proceedings of the Interspeech, Dublin, Ireland, 20 August 2023; pp. 516–520. [Google Scholar]
  127. Jin, M.; Song, Y.; McLoughlin, I.; Dai, L.R. LID-Senones and Their Statistics for Language Identification. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 171–183. [Google Scholar] [CrossRef]
  128. Kaiyr, A.; Kadyrov, S.; Bogdanchikov, A. Automatic Language Identification from Spectrogram Images. In Proceedings of the 2021 IEEE International Conference on Smart Information Systems and Technologies (SIST), Nur-Sultan, Kazakhstan, 28 April 2021; pp. 1–4. [Google Scholar] [CrossRef]
  129. Tjandra, A.; Choudhury, D.G.; Zhang, F.; Singh, K.; Conneau, A.; Baevski, A.; Sela, A.; Saraf, Y.; Auli, M. Improved language identification through cross-lingual self-supervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22 May 2022; pp. 6877–6881. [Google Scholar]
  130. Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digit. Signal Process. 2000, 10, 19–41. [Google Scholar] [CrossRef]
  131. Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
  132. Variani, E.; Lei, X.; McDermott, E.; Lopez-Moreno, I.; Gonzalez-Dominguez, J. Deep Neural Networks for Small-Footprint Text-Dependent Speaker Verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 25 May 2014; pp. 4052–4056. [Google Scholar] [CrossRef]
  133. Liu, Z.L.; Song, Y.; Zeng, X.M.; Dai, L.R.; McLoughlin, I. DP-MAE: A dual-path masked autoencoder based self-supervised learning method for anomalous sound detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14 April 2024; pp. 1481–1485. [Google Scholar]
  134. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Interspeech, Stockholm, Sweden, 20 August 2017; pp. 2616–2620. [Google Scholar] [CrossRef]
  135. Campbell, W.M.; Campbell, J.P.; Reynolds, D.A.; Singer, E.; Torres-Carrasquillo, P.A. Support Vector Machines for Speaker and Language Recognition. Comput. Speech Lang. 2006, 20, 210–229. [Google Scholar] [CrossRef]
  136. Schuller, B.; Steidl, S.; Batliner, A. The Interspeech 2009 Emotion Challenge 2009. In Proceedings of the Interspeech 2009, Brighton, UK, 6 September 2009; pp. 312–315. Available online: https://www.isca-archive.org/interspeech_2009/schuller09_interspeech.html (accessed on 1 January 2025).
  137. Eyben, F.; Wöllmer, M.; Schuller, B. Opensmile: The munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Florence, Italy, 25 October 2010; pp. 1459–1462. [Google Scholar]
  138. Schuller, B.W.; Zhang, Z.; Weninger, F.; Rigoll, G. Using multiple databases for training in emotion recognition: To unite or to vote? In Proceedings of the Interspeech, Florence, Italy, 27 August 2011; pp. 1553–1556. [Google Scholar]
  139. Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech, Graz, Austria, 15 September 2019; pp. 1656–1660. [Google Scholar]
  140. Satt, A.; Rozenberg, S.; Hoory, R. Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms. In Proceedings of the Interspeech, Stockholm, Sweden, 20 August 2017. [Google Scholar] [CrossRef]
  141. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20 March 2016; pp. 5200–5204. [Google Scholar]
  142. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5 March 2017; pp. 2227–2231. [Google Scholar] [CrossRef]
  143. Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August 2021; pp. 161–165. [Google Scholar]
  144. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  145. Diatlova, D.; Udalov, A.; Shutov, V.; Spirin, E. Adapting WavLM for Speech Emotion Recognition. arXiv 2024, arXiv:2405.04485. [Google Scholar] [CrossRef]
  146. Chowdhury, S.Y.; Banik, B.; Hoque, M.T.; Banerjee, S. A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering. arXiv 2025, arXiv:2507.07046. [Google Scholar] [CrossRef]
  147. Amjad, A.; Khuntia, S.; Chang, H.T.; Tai, L.C. Multi-Domain Emotion Recognition Enhancement: A Novel Domain Adaptation Technique for Speech-Emotion Recognition. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 528–541. [Google Scholar] [CrossRef]
  148. Chen, Z.; Wang, J.; Hu, W.; Li, L.; Hong, Q. Unsupervised Speaker Verification Using Pre-Trained Model and Label Correction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  149. Ghaffari, H.; Devos, P. Robust Weakly Supervised Bird Species Detection via Peak Aggregation and PIE. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 1427–1439. [Google Scholar] [CrossRef]
Figure 1. Illustration of spectrogram creation from input audio data as a stack of frequency vectors.
Figure 2. Illustrations of time–frequency spectrograms based on (a) stabilised auditory image; (b) Constant-Q transform; (c) Mel-scaled spectrogram; (d) stacked MFCC; (e) Linear magnitude spectrogram.
Figure 3. High level system diagram showing spectrogram features (a) being extracted from an input waveform as a stack of scaled transforms from windowed speech regions; (b) features are gathered from patches, pooled regions or a downsampled spectrogram image for (c) input to a deep learning classification pipeline.
Table 1. Taxonomy of spectrograms.
Description | Dimensions | Element Scale | Frequency Span
Linear spectrogram (LS) | time, frequency (T, F) | scalar (0, 1) | (0, Nyquist)
Log-scaled spectrogram (LSS) | T, F | log (−100 dB, 0) | (0, Nyquist)
A/μ-law scaling | T, F | log (0, 255) | (0, Nyquist)
Mel-spectrogram (MS) | T, Mel-F | linear | (0, Mel(Nyquist))
Log-Mel-spectrogram (LMS) | T, Mel-F | log (−100 dB, 0) | (0, Mel(Nyquist))
Gammatonegram (GTG) | trapezoidal/squared-T, F | log | (ERB(0), ERB(Nyquist))
Constant-Q transform (CQT) | trapezoidal/squared-T, F | log/linear | (0, Nyquist)
Stabilised auditory image (SAI) | non-linear F, lag | scaled (0, 1) | (0, 35 ms by default)
Table 2. Performance of variance normalised features (VNF) compared to standard pooling for three tasks, aiming for higher accuracy and lower Cavg.
Task | Details | Fixed Pooling | VNFs
SED | 50 class RWCP, 20 dB SNR | 94.8% accuracy | 96.3% accuracy
SED | 50 class RWCP, 0 dB SNR | 75.1% accuracy | 84.0% accuracy
LID | NIST LRE07 DNN x-vector, 3 s | 10.17 Cavg | 8.80 Cavg
LID | NIST LRE07 CLSTM, 3 s | 7.15 Cavg | 6.70 Cavg
DID | Arabic dialect challenge | 3.20 Cavg | 2.62 Cavg
Table 3. Taxonomy of audio analysis. The input length (first column) can vary from a short frame to a continuous signal, with one or more features extracted from this, and an output class or timing is obtained.
Input | Feature Extraction, Stack and Classify | Output
continuous | raw waveform/spectrum | one-hot class per instance
full recording | named features | posterior probabilities
utterance | trained feature extractor or | vote over multiple instances
segment or | pre-trained feature extractor | average/threshold over time or
frame | | localisation in time
Table 4. Prominent sound event detection methods that utilise spectrogram features, including linear (LS), log-mel (LMS), mel scale (MS), constant-Q (CQT) and gammatonegram (GTG). Tasks include the Real World Computing Partnership (RWCP) sounds, the TUT sound events database, and the Domestic Environment Sound Event Detection Dataset (DESED).
Year | Ref. | Task | Spectrogram Type | Resolution and Settings | Pooling
2014 | [6] | 50 class RWCP | LS | [30 × 24], 16 kHz | vote
2017 | [80] | 6 class TUT events [81] 1 | LMS + LS | [240 × 256], 44.1 kHz | max
2019 | [82] | 10 class from [83] + [84] | LMS | [64 × 500], 16 kHz | median
2020 | [85] | 50 class RWCP | LS, GTG, CQT | [52 × 40], 16 kHz | mean
2022 | [86] | 10 class DESED [87] 2 | LMS | [128 × 960], 16 kHz | mean
2023 | [88] | 10 class DESED [87] 2 | MS | [128 × 1001], 16 kHz | mean
2024 | [89] | 11 class DCASE24 task 4 [90] | MS | [128 × 100] 3 in AST + fPaSST, 16 kHz | ensemble-mean
1 DCASE 2017 task 3; 2 DCASE 2022/3/4 task 4; 3 In 16 × 16 patches.
Table 5. Spectrograms used in ASD, primarily log-Mel (LMS), showing their resolution, sample rate and scaling.
Method | Spectrogram | Pixels | Sample Rate | Scale
Chakrabarty et al. [101] | LMS | 128 × T | 8 kHz | log
Zeng et al. [93] | LMS | 128 × T | 16 kHz | inverted log
Li et al. [102] | LNS | 128 × T | 16 kHz | log
Liu et al. [98] | LMS + Tgram | 128 × T | 16 kHz | log
Yin et al. [103] | LMS | 128 × T | 16 kHz | normalised
Table 6. Deep learning front-ends in bioacoustic analysis that make use of raw waveforms (top three), linear and log-Mel spectrograms (LS, LMS) and stabilised auditory image (SAI) (middle four) and hybrid approaches (bottom two).
Technique | Input Type | Task(s) | Taxa
SincNet [115] | Raw waveform | Species classification | Birds
SampleCNN [116] | Raw waveform | Music auto-tagging | Music
RawNet [117] | Raw waveform | Speaker verification | Humans
CNN/ResNet [104,113] | LMS | Species classification, SED | Birds, Frogs, Whales
PCEN-enhanced CNN [78,114,118] | PCEN-Mel | Low-SNR event detection | Birds, Whales
CNN on STFT [112] | LS | Call segmentation | Bats
CNN with spectrogram and stabilised auditory image input [6] | LS + SAI | Sound event classification | General sounds
LEAF [114] | Learned spectrogram from raw waveform | Species classification, detection in noise | Birds, Whales
Wavegram-Logmel-CNN [119] | Wavegram + LMS | General classification | Various
Table 7. Prominent LID research showing various kinds of spectrogram, including linear (LS) and log-Mel spectrograms (LMS). EER is equal error rate.

| Method | Spectrogram | Resolution | Task |
|---|---|---|---|
| Ma et al. [127] | PLP + bottleneck | 48 × 21 | 23 languages, EER 4.38% ¹ |
| Kaiyr et al. [128] | LS, CNN-RNN | 116 × 200, 5–10 s segments | 7 languages, acc. 94.3% ² |
| Liu et al. [122] | MFCC + delta + delta-delta | 39 + 39 + 39, 25 ms window, 10 ms hop | 14 languages, EER 3.82% ³ |
| Miao et al. [121] | MFCC + D-MONA | 23 × 5 frames | 14 languages, EER 1.15% ³ |
| Tjanda et al. [129] | LMS | 80 × 4, 25 ms window, 10 ms hop | 26 languages, acc. 90.3% ⁴ |

¹ Evaluated on 10 s utterances using NIST LRE2009; ² evaluated on 5–10 s clips; ³ evaluated on 10 s utterances using NIST LRE2017; ⁴ evaluated on 6 s utterances.
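The MFCC + delta + delta-delta stack used by several LID systems in Table 7 can be reproduced with librosa as follows. The 16 kHz sample rate and filename are assumptions for illustration; 39 base coefficients are used to match the entry for Liu et al. [122], giving a 117-dimensional frame at a 25 ms window and 10 ms hop.

```python
import librosa
import numpy as np

# Hypothetical utterance, resampled to 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39,
                            n_fft=int(0.025 * sr),       # 25 ms window
                            hop_length=int(0.010 * sr))  # 10 ms hop
features = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])  # [117 x T]
```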
Table 8. Prominent SV research showing a progression of spectrogram use, where LMS refers to log-Mel spectrogram, and FBs are filterbanks.

| Method | Spectrogram | Resolution |
|---|---|---|
| Reynolds et al. [130] | MFCC + context | 13 × T |
| Dehak et al. [131] | i-vector from 60 d MFCC | 200 WCCN |
| Variani et al. [132] | trained from 40 d FB + context | 256 d-vector |
| Snyder et al. [27] | 60 d MFCC + delta + delta-delta | 150 x-vector |
| Desplanques et al. [8] | LMS | 80 × 80 |
| Liu et al. [133] | LMS | 128 × 304 |
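Recent SV systems in Table 8 typically operate on 80-band log-Mel filterbanks with Kaldi-style framing. The sketch below extracts such features with torchaudio; the filename and the 25 ms/10 ms framing are assumptions for illustration, and exact settings vary between systems.

```python
import torchaudio

# Hypothetical enrolment or test utterance.
waveform, sr = torchaudio.load("speaker.wav")

# Kaldi-compatible 80-band log-Mel filterbank features.
fbank = torchaudio.compliance.kaldi.fbank(waveform,
                                          num_mel_bins=80,
                                          frame_length=25.0,   # ms
                                          frame_shift=10.0,    # ms
                                          sample_frequency=sr)
# fbank has shape [T, 80]; transpose it for an [80 x T] spectrogram view.
```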
Table 9. Representative SER systems illustrating the evolution from hand-crafted features to high-resolution spectrograms and SSL-based embeddings. FB refers to filterbank, LMS is log-Mel spectrogram.

| Method | Feature Type | Representation |
|---|---|---|
| Schuller et al. [138] | MFCC + prosody + energy | 1582-d openSMILE |
| Satt et al. [140] | LMS | ∼40–64 Mel bands, 25 ms window, 10 ms hop |
| Mirsamadi et al. [142] | FB with frame attention | 40 dim FB, 25 ms window, 10 ms hop |
| Trigeorgis et al. [141] | LMS + channel attention | 40 dim FB with 40 ms frame, 5 ms hop |
| Pepino et al. [143] | wav2vec 2.0 SSL embedding | 768 dim contextual frames |
| Chen et al. [144] | WavLM SSL embedding | 1024 dim contextual frames |
| Chowdhury et al. [146] | LMS + 5 other features | 64 dim LMS, 20–30 ms, and 126 dim other features |
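The SSL-based rows of Table 9 replace the spectrogram front-end with contextual frame embeddings from a pre-trained encoder. The sketch below extracts 768-dimensional wav2vec 2.0 frames with the Hugging Face transformers library; the checkpoint name is a public model chosen for illustration and is not necessarily the one used by the cited SER systems, and the random tensor stands in for real 16 kHz speech.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

speech = torch.randn(16000 * 3)   # placeholder for 3 s of 16 kHz audio
inputs = extractor(speech.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # [1, T, 768], roughly 50 frames/s
```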