Consider an instantaneous linear mixing model for noisy speech $y$ as

$$y\left(t\right)=s\left(t\right)+n\left(t\right),$$

where $t$ is the time index, $s$ denotes the clean speech acoustic waveform and $n$ denotes the time-domain interference noise signal. The goal of single-channel speech enhancement is to estimate the clean speech signal $s$, usually by modelling some feature representation $Y=d\left(y\right)$ of the observed mixture signal. In general, the feature extraction process for a finite time segment of length $L$ can be interpreted as an encoding function $d:{\mathbb{R}}^{L}\to \mathbb{A}$, where $\mathbb{A}$ denotes an arbitrary set. Our objective is to train a model ${g}_{\theta}:\mathbb{A}\to \mathbb{A}$, defined by the parameter set $\theta$, that estimates the feature representation $S=d\left(s\right)$. When the same feature representation is used for both input and target signals, relevant studies [37] suggest that better modelling can be achieved by estimating a mask vector $M\in \mathbb{A}$, instead of directly estimating $S$, and applying it to the input feature vector as

$$\widehat{S}=M\circ Y,$$

where $\circ$ denotes the Hadamard product, i.e., the elementwise multiplication, and $\widehat{S}$ is the estimated vector corresponding to $S$. Regarding the mask vector, the ideal ratio mask (IRM) provides a common target for speech enhancement methods that are based on spectral decomposition of the input signal [16, 19]. The local gain for each feature value is quantified by a soft label in the range from 0 to 1. The instantaneous IRM for each frame $j$ and channel $k$ is defined in [3] as

$$M\left(j,k\right)={\left(\frac{{S}_{xx}\left(j,k\right)}{{S}_{xx}\left(j,k\right)+{N}_{xx}\left(j,k\right)}\right)}^{\beta},$$

where ${S}_{xx}$ and ${N}_{xx}$ denote the power spectral densities of $s$ and $n$, respectively. When $\beta =0.5$, the IRM is equivalent to the square-root Wiener filter gain, the Wiener filter being the optimal filtering method for stationary noise [16]. An approximation of the IRM can therefore be defined for arbitrary feature representations as

$$M_{c}=\min\left(\left(\frac{S\circ S}{Y\circ Y+\epsilon}\right)^{\beta},\,\gamma\right),$$

where $\epsilon >0$ is a small quantity to avoid division by zero and $\beta =0.5$. Generally, ${M}_{c}$ quantifies the ideal local gain to be applied to $Y$ in order to approximate the target representation $S$. The $\min\left(\cdot\right)$ function is used to constrain ${M}_{c}$ within a pre-defined range from $0$ to $\gamma$, depending on the distribution of $S$ and $Y$. With $\gamma =1$, typical spectral energy representations can be adequately estimated, although higher values can account for phase cancellation.
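As an illustration (not part of the original formulation), the clipped mask approximation and its elementwise application can be sketched in NumPy; the feature values below are hypothetical magnitude-like vectors:

```python
import numpy as np

def approx_irm(S, Y, beta=0.5, eps=1e-8, gamma=1.0):
    """Clipped approximation of the ideal ratio mask for arbitrary
    magnitude-like feature vectors S (target) and Y (noisy mixture)."""
    M = (S * S / (Y * Y + eps)) ** beta
    return np.minimum(M, gamma)  # constrain the local gains to [0, gamma]

# Toy example: one frame with four feature channels (hypothetical values).
S = np.array([0.9, 0.1, 0.5, 0.0])   # clean-speech features
Y = np.array([1.0, 0.8, 0.5, 0.3])   # noisy-mixture features
M = approx_irm(S, Y)                 # ideal local gains in [0, 1]
S_hat = M * Y                        # Hadamard (elementwise) mask application
```

With $\gamma =1$ the gains can only attenuate, never amplify: channels dominated by noise, such as the last one above, are driven towards zero.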

Standard schemes of speech enhancement by means of deep neural networks adopt the log-magnitude spectrum as input and target features [37]. The short-time Fourier transform (STFT) is applied to each overlapping windowed frame of the acoustic waveform, and the absolute values of the STFT coefficients are logarithmically compressed. To avoid the amplification of values that are close to zero, typically below the range of interest, a small quantity $\beta$ can be added to the magnitudes before the logarithm operation. Hence, the model input vector is defined as

$$Y=\log\left(\left|\operatorname{STFT}\left(y\right)\right|+\beta \right),$$

where $\beta$ can be selected appropriately to restrict the available dynamic range, typically between 40 dB and 120 dB, depending on the application. The neural network model is trained in a supervised manner to estimate the target weighting function from noisy log-spectra. The synthesis stage combines the output vector $\widehat{S}=g\left(Y\right)\circ Y$ with the phase of the noisy mixture and recovers the time-domain signal $\widehat{s}$ via the inverse STFT and the overlap-add method.
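The analysis-synthesis loop described above can be sketched with SciPy's STFT utilities. This is a minimal oracle illustration, not the authors' implementation: an ideal magnitude-domain mask stands in for the trained network output $g\left(Y\right)$, and the signal, noise level, and frame length are assumed values:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * t)        # stand-in "clean speech": a 440 Hz tone
y = s + 0.3 * rng.standard_normal(fs)  # instantaneous mixture y(t) = s(t) + n(t)

# Analysis: STFT of the mixture and of the clean reference (oracle setting).
_, _, Yc = stft(y, fs=fs, nperseg=512)
_, _, Sc = stft(s, fs=fs, nperseg=512)

# Ideal magnitude-domain mask in place of the network output, beta = 0.5.
M = np.minimum((np.abs(Sc) ** 2 / (np.abs(Yc) ** 2 + 1e-8)) ** 0.5, 1.0)

# Synthesis: masked magnitude, noisy phase, inverse STFT with overlap-add.
S_hat = M * np.abs(Yc) * np.exp(1j * np.angle(Yc))
_, s_hat = istft(S_hat, fs=fs, nperseg=512)
```

Even though the noisy phase is retained at synthesis, the oracle mask suppresses most of the additive noise, which illustrates why the mask is an attractive training target.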

The exploitation of alternative feature transforms is largely motivated by the application of biologically plausible processes, which are inherently non-linear and irreversible, to speech enhancement frameworks [38]. The intuition is that replicating the functional properties of the human auditory system that contribute to sound source segregation and robust speech-in-noise perception can potentially enable better modelling of natural sounds. The gammatone spectrogram and the envelope modulation spectrogram are two feature paradigms considered in related work on speech recognition [39]. However, physiologically inspired feature representations can be applied to speech enhancement only if the acoustic waveform reconstruction process preserves the desired quality and intelligibility of the uttered speech.

In the following section, this approach is exploited to construct a novel framework for the analysis and synthesis of speech sounds based on auditory-motivated signal processing.