Temporal Auditory Coding Features for Causal Speech Enhancement

: Perceptually motivated audio signal processing and feature extraction have played a key role in the determination of high-level semantic processes and the development of emerging systems and applications, such as mobile phone telecommunication and hearing aids. In the era of deep learning, speech enhancement methods based on neural networks have seen great success, mainly operating on the log-power spectra. Although these approaches surpass the need for exhaustive feature extraction and selection, it is still unclear whether they target the important sound characteristics related to speech perception. In this study, we propose a novel set of auditory-motivated features for single-channel speech enhancement by fusing temporal envelope and temporal ﬁne structure information in the context of vocoder-like processing. A causal gated recurrent unit (GRU) neural network is employed to recover the low-frequency amplitude modulations of speech. Experimental results indicate that the exploited system achieves considerable gains for normal-hearing and hearing-impaired listeners, in terms of objective intelligibility and quality metrics. The proposed auditory-motivated feature set achieved better objective intelligibility results compared to the conventional log-magnitude spectrogram features, while mixed results were observed for simulated listeners with hearing loss. Finally, we demonstrate that the proposed analysis / synthesis framework provides satisfactory reconstruction accuracy of speech signals.


Introduction
Single-channel speech enhancement has attracted considerable research attention for years due to the emerging demand in various real-world applications, such as mobile phone telecommunication [1,2], automatic speech recognition [3], speech coding [4], and hearing aids [5]. The goal of speech enhancement is to improve the intelligibility and quality of degraded speech signals by suppressing the noise components that impede communication and proper analysis. These include interfering sounds, noise, reverberation, distortion, and other deficiencies [6]. One major challenge for speech enhancement systems is the ability to operate online. Real-time applications of speech enhancement, such as mobile telecommunication and hearing aids, usually cannot afford to access future observations, in favor of low-latency inference. Thus, the requisite for causal processing is apparent in real-world practice.
Numerous speech enhancement algorithms have been established in recent decades. Techniques such as spectral subtraction [7,8] and various forms of Wiener filtering [1,9] have been widely used in speech enhancement frameworks. With the surge of deep learning, supervised learning methods based on neural networks have shown promising performance in a variety of audio applications [10][11][12][13][14][15], including single-channel speech enhancement [5,16,17]. Current state-of-the-art methods exploit large amounts of training data captured under various noise and reverberation conditions [18] and can achieve high generalization performance, even in challenging acoustic environments [19]. Prevailing approaches are based on alternative types of neural networks, such as convolutional and recurrent neural networks, to formulate speech enhancement as a sequence-to-sequence mapping that leverages spectral and temporal correlations while reducing translational variance in signals [6].
Recently, Lang and Yang (2020) [20] demonstrated the effectiveness of fusing complementary features to magnitude-aware targets by separately learning phase representations. In addition, Bae et al. (2019) [21] explored a framework for disentangling speech and noise for noise-invariant speech enhancement, offering more robust noise-invariant properties. In Rao and Carney (2014) [22], a vowel enhancement strategy is proposed to restore the representation of formants at the level of the midbrain by performing formant tracking and enhancement. Causal methods that operate directly on the time domain have also been proposed for speech enhancement and text-to-speech tasks [23], although are of much greater complexity than conventional spectral masking approaches. Even though the time-frequency domain appears more convenient for exploiting the spectro-temporal structure of sound, time-domain methods recently demonstrated impressive results that minimize audible artifacts either by processing raw audio waveforms or temporal audio features, such as the temporal envelope of noise-corrupted signals [24].
While the analysis of speech sounds in mainstream audio applications emphasizes on the frequency spectrum, rate-place encoding on its own fails to account for the majority of perceptual aspects of complex sounds [25]. Nevertheless, a great interest in purely temporal properties of speech has emerged recently, due to their capability in determining the corresponding perceptual attributes. The temporal structure of speech is classified into three categories of speech cues based on dominant temporal fluctuation rates, according to [26]. They are envelope, periodicity, and fine structure cues. In this scheme, envelope cues contain modulation frequencies from 2 to 50 Hz, representing acoustical aspects of phonetic segments combined with stress and voicing information. Periodicity cues exist from 50 to 500 Hz and transmit information about voicing and intonation. Periodicities of higher frequencies, from 0.6 to 10 kHz, comprise the fine structure of the speech signal and convey information related to aspects of consonant place and vowel quality.
In recent years, a wealth of studies conducted with normal-hearing and hearing-impaired listeners showed that accurate speech identification can be obtained by preserving the low-frequency amplitude modulations, primarily the temporal envelope, even if speech is severely degraded in the spectral domain [26][27][28]. On the other hand, individuals with normal hearing show a great ability for understanding speech, even in adverse listening environments, where the acuity of envelope cues is not preserved. This requires the utilization of fine structure cues, in order to effectively facilitate modulation detection [29] and release from masking [30] processes.
To account for this ability in hearing loss, several studies have attempted to recover the temporal structure of speech from noisy environments [31][32][33], while others concentrated on the effects of auditory masking and vocoding on speech perception [4,34]. In Shetty (2016) [35], prominent envelope enhancement strategies for older adults are reviewed, including temporal envelope expansion [33] and deep band modulation [36] methods. They report improvements in speech-in-noise perception. Nevertheless, each method had its limitations. This fact combined with indications of a potential dichotomy in auditory perception regarding the processing of low-frequency and high-frequency amplitude modulations, calls the attention to delineate the potential of incorporating both temporal envelope and fine structure information into state-of-the-art speech processing systems. Hence, the main goal of this study is to determine the potential of auditory coding features on supervised learning algorithms for speech enhancement, i.e., if signal processing strategies that support findings of our auditory system should be preferred in data-driven approaches of speech processing.

Problem Formulation
Consider an instantaneous linear mixing model for noisy speech y as where t is the time index, s denotes the clean speech acoustic waveform and n denotes the time-domain interference noise signal. The goal of single-channel speech enhancement is to estimate the clean speech signal s, usually by modelling some feature representation Y = d(y) of the observed mixture signal. In a general manner, the feature extraction process for a finite time segment of length L can be interpreted as an encoding function d : R L → A , where A denotes an arbitrary set. Our objective is to train a model g θ : A → A , defined by the parameter set θ, that estimates the feature representation S = d(s). In cases when the same feature representation is used for both input and target signals, relevant studies [37] suggest that better modelling can be achieved by estimating a mask vector M ∈ A, instead of directly estimating S, and applying it to the input feature vector aŝ where • denotes the Hadamard product, i.e., the elementwise multiplication, andŜ is the estimated vector corresponding to S. Regarding the mask vector, the ideal ratio mask (IRM) provides a common target for speech enhancement methods that are based on spectral decomposition of the input signal [16,19]. The local gain for each feature value is quantified by a soft label in the range from 0 to 1. The instantaneous IRM for each frame j and channel k is defined in [3] where S xx and N xx denote the power spectral densities of s and n, respectively. When β = 0.5, the IRM is equivalent to the Wiener filter gain, which is the optimal filtering method for stationary noises [16]. An approximation of the IRM can therefore be defined for arbitrary feature representations, as where > 0 is a small quantity to avoid division by zero and β = 0.5. Generally, M c quantifies the ideal local gain to be applied to Y in order to approximate the target representation S. The min(·) function is used to constrain M c within a pre-defined range from 0 to γ, depending on the distribution of S and Y. With γ = 1, typical spectral energy representations can be adequately estimated, although higher values can account for phase cancellation.
Then, the estimated signalŝ can be derived fromŜ using a decoding function d . If the encoding function d is invertible, an intuitive way to recover the original signal from the feature representation is to use the inverse transform d −1 . Otherwise, d can be determined to be a decoding function or method that achieves perfect or near-perfect signal reconstruction (i.e., by employing iterative methods). In a more perceptually-oriented approach, d denotes a function that produces an estimate of s, having the least effect on the perceptible properties of the signal, namely minimizing the distance between the perceptual representations of s andŝ.
Standard schemes of speech enhancement by means of deep neural networks adopt the log-magnitude spectrum as input and target features [37]. The short-time Fourier transform (STFT) is applied to each overlapping windowed frame of the acoustic waveform, and the absolute values of the STFT coefficients are logarithmically compressed. To avoid the amplification of values that are close to zero, typically under the range of interest, a small quantity β can be added to the magnitudes before the logarithm operation. Hence, the model input vector is defined as where β can be selected appropriately to restrict the available dynamic range, typically between 40 dB to 120 dB, depending on the application. The neural network model is trained in a supervised manner to estimate the target weighting function from noisy log-spectra. The synthesis stage combines the output vectorŜ = g(Y) • Y with the phase of the noisy mixture and recovers the time-domain signalŝ via the inverse STFT transform and the overlap-add method. The exploitation of alternative feature transforms is largely motivated by the utilization of biologically plausible processes to speech enhancement frameworks, which are inherently non-linear and irreversible [38]. The intuition here is that by replicating the functional properties of the human auditory system that contribute to sound source segregation and robust speech-in-noise perception, better modelling of natural sounds can potentially be enabled. The gammatone spectrogram and the envelope modulation spectrogram are two feature paradigms that are under consideration in relevant works on speech recognition [39]. However, the utilization of physiologically inspired feature representations to speech enhancement can only be achieved granted that the acoustic waveform reconstruction process preserves the desired quality and intelligibility of uttered speech.
In the following section, the latter approach is exploited to construct a novel framework for the analysis and synthesis of speech sounds based on auditory-motivated signal processing.

Temporal Auditory Processing: Features and Targets
In this section, we provide details and rationale about the perceptually-motivated audio signal analysis and the proposed front-end design architecture for online speech enhancement systems. Finally, we introduce a novel feature set based on temporal envelope (ENV) and temporal fine structure cues (TFS), inspired by the temporal processing mechanisms of the human auditory periphery and midbrain. Dynamically, the proposed feature extraction process effectively encodes both slow and fast temporal modulations that fall within the capabilities of the auditory system and produces a spectrogram-like representation of a speech signal.

ERB-Scaled Gabor Filter Bank: Analysis Framework
Established models of the peripheral filtering function of the cochlea utilize the Gammatone filter bank, which attains a balance between computational complexity and physiological accuracy [40]. The bandwidth of each auditory Gammatone filter is determined as an equivalent rectangular bandwidth (ERB), based on the linear approximation of the ERB by Glasberg and Moore (1990) [41] where f and ERB( f ) are in Hz. Although Gammatone-based representations exhibit the indicated behavior, regarding the temporal and spectral aspects of auditory processing, they fail to produce reconstructed signals with non-audible distortions in a direct way. Variations on Gammatone filter banks have been proposed towards a more effective analysis-synthesis framework [42]. Moreover, a perfectly invertible constant-Q transform based on nonstationary Gabor frames has been recently constructed [43]. Despite allowing for adaptable resolution of the time-frequency plane, Constant-Q transforms mismatch the auditory spectral resolution at low frequencies [44].
In this study, we propose a direct implementation of an analysis and synthesis framework for speech enhancement, which is compatible with the auditory resolution and demonstrates satisfactory intelligibility and quality of reconstructed speech signals. In particular, frequency selectivity was accounted for by a set of Gabor filters υ k that are uniformly distributed along the ERB scale (ERBS) [38] ERBS( f ) = 9.2645·ln 1 + f 228.8455 .
The kernels of the analysis filter bank are defined in the frequency domain aŝ where k = 0, . . . , K indexes the filterbank kernels with center frequencies f k in Hz. The effective bandwidth of υ k is determined by Γ k = ERB( f k ), while the factor Γ − 1 2 k imposes each filter to have the same energy. This rate-place encoding is based on the recently proposed Audlet filterbank [42] and ERBlet transform [45]. In this study, we use a density of 4.55 filters per ERB, resulting in a total of K = 128 filters from 80 to 6000 Hz. Frequency bands below 80 Hz and over 6000 Hz are not considered in the analysis and were attenuated in the processed signals. Moreover, pre-emphasis and de-emphasis filters (c = 0.97) were applied to audio signals at the input and output stages, respectively.

Temporal Envelope Features
Let y coch (t, k) denote the sub-band signal at the output of the kth ERB filter. The full-scale envelope vector y AN (t, k) is obtained by half-wave rectification followed by low-pass filtering. This is a realistic model of the signal transduction of the inner hair cells and is widely supported by auditory modelling studies [46,47]. Moreover, this approach surpasses the limitations of the Hilbert envelope [48] extraction process, by providing a framework that is efficient and valid for short observation times. The energy of the excitation signals is preserved using a weighting vector w G = (w 1 w 2 · · · w K−1 w K ) and the final output vector Y ENV ∈ R N×K , N = L / τ, is resolved by integrating the envelope power over a short time window µ(t; τ) = e −t/τ ·u(t), with time constant τ = 8 ms [49]. This step can be interpreted as a down-sampling operation and is attributed to the loss of phase locking observed in the midbrain [47]. The mathematical formulation of this model can be summarized in Equations (9)- (12).
In the equations above, the · is the floor operator, n is the frame index, and w E is the impulse response of a zero-phase forward-backward low-pass filter with a cutoff frequency of 50 Hz. In favor of good signal reconstruction, we ignored all the non-linearities and adaptive mechanisms in the basilar membrane, that are attributable to the motion of the outer hair cells and the feedback communicated via the efferent system (for a detailed analysis see [50]).

Temporal Fine Structure Features
Even though speech intelligibility seems to depend on the acuity of slow temporal modulations, listening in the presence of noise is strongly correlated with the ability to utilize fast amplitude modulations [30], namely the perceptual cues involved in the temporal fine structure. This information is usually hindered in typical audio signal processing setups, where energy-based spectral transformations are mainly encountered, seeking a balance between time and frequency resolution. The intuition here is that spectro-temporal regions with source-dependent modulations can potentially facilitate speech enhancement systems in the identification and segregation of sound streams with similar spectral content. Hence, this stage aims to capture the temporal coding patterns associated with voicing and periodicity in a noisy speech mixture and to assess their contribution to the recovery of degraded envelope patterns.
First, a binary operator h(·) is applied to the filter bank output signals y coch . A zero-phase low-pass filter w F is then utilized to simulate the deterioration of phase-locking for frequencies above 2 kHz. Excitation signals are transformed by a lateral inhibitory network (LIN) [50] to enhance the frequency selectivity of the filter bank, mimicking the functional properties of the cochlear nucleus [51]. The LIN is simply modeled as a first-order difference with respect to the tonotopic axis, followed by a half-wave rectifier to produce y LIN [52]. The onset spikes of y LIN are obtained by time derivation and half-wave rectification. The resulting signal y SP conveys information about the inter-spike intervals between successive zero-crossings in the stimulus waveform, seeking correlates of periodicity pitch [53]. The y SP sub-band signals are finally integrated into short-time frames of τ = 8 ms, aligning to the corresponding temporal envelope features. This transformation attempts a conversion of temporal spike information into rate information, rendering a measure of TFS fluctuations in the original signal. Thus, feature values are close to zero for an unmodulated sinusoidal carrier signal, while reach higher values (close to 1) for highly modulated signals. In contrast to ENV features, this description is amplitude independent, similar to the auditory TFS coding mechanism; TFS coding is reported to be relatively level independent in the most range of above-threshold presentation levels [54,55]. The computation of the proposed temporal fine structure features can be summarized in the following sequence of operations.
where h(·) is the Heaviside step function, and Y TFS denotes the final feature representation. The scaling factor Γ k − 1 2 is used to normalize the feature distribution along the rate-place axis. In the current implementation, temporal fine structure encoding retains the same spectral density (4.55 channels per ERB) as the temporal envelope extraction process. TFS signals are obtained only over the low-frequency range, between 80 and 1000 Hz, where the effect of periodic cues on speech understanding is prominent. On this basis, Y TFS comprises a 59-dimensional feature vector for each short-time frame and complements Y ENV as the model input vector.

Synthesis Framework
The synthesis framework consists of two stages: an envelope post-processing stage and a time-domain signal reconstruction stage. First, the desired sub-band envelope gains are up-sampled to the original signal sampling rate using a causal interpolating filter to produce the vectorŜ ∈ R L×K . The dynamic range of each band is then restricted to a maximum modulation depth of D r = 60 dB by a gating function and the output envelope signals are filtered for modulation frequencies below 50 Hz. Second, the sub-band temporal fine structures of the noisy mixture y TFS are modulated by the processed envelopesŜ and a linear operator is employed to map the sub-band signals to the output audio signalŝ.

Model Architecture & Training
In this study, we consider the causal approach for speech enhancement, where the model can access information from past time-steps and the prediction of target features is not dependent on future audio segments. A sliding rectangular window with length J in the time domain is applied to the input feature representation to enable frame-based processing. The model estimates the target features corresponding to the Jth time frame of the input representation. This approach is suitable for real-time applications, such as hearing prosthesis and audio streaming. It is also promoted by numerous other studies [16,56], as it incorporates valuable information from neighbor time-frames and reduces the redundancy of the output dimensionality, compared to sequence-to-sequence modelling [24].
Based on the above premise, let g θ : R JxΛ → R K denote a neural network model parameterized by the learnable parameters θ, where J determines the available temporal receptive field of the model. The parameters Λ and K denote the number of input and output features, respectively, which are determined by the desired feature representation (see Sections 2.1 and 2.2). A two-layer gated recurrent unit (GRU) neural network with 512 units is considered as the main component of g θ . A recurrent cell processes the log-compressed input feature vector across each timestep in sequential order, and the output of the Jth step is passed to a fully-connected layer with K units. Then, the sigmoid activation function is applied to yield the final predictions. Dropout regularization with a probability of 0.4 is applied to the outputs of the recurrent layers. The prediction of the model is locally-constrained in the time domain through the parameter J, to avoid learning long-term dependencies (> 400 ms), leading to a more accurate and stable training procedure. This is technically achieved by setting the initial hidden state vector of the GRU to zero after each sample inference.
The model has a total of 2.71 million parameters and is trained using the Adam optimizer algorithm [57] in mini-batch mode with a batch size of 2048, to minimize the mean squared error between the target and the estimated values. The learning rate was set to 10 −4 , while the learning progress was monitored by an early-stopping algorithm to avoid over-fitting to training data. The proposed speech enhancement system is depicted in Figure 1. Predictions are used to modulate the unprocessed envelopes, and the enhanced audio waveformŝ is synthesized using the sub-band y coch signals.

Dataset
Experiments were conducted on the TIMIT speech corpus of read speech [58], which is widely used for various speech tasks. It contains a total of 6300 clean utterances, spoken by 630 speakers (438 male). The speech material comes subdivided into separate training and test sets, balancing factors such as subset duration, speaker gender, phoneme occurrence, and dialectal coverage. No speaker or sentence appears in both partitions to avoid overlap with the training material.
The clean speech data were corrupted by various real-world and synthetic noises to form the experimental dataset. Noises were mismatched between training and test sets to ensure a legitimate experimental setup. For the training set, we selected six noise types from the UrbanSound8k dataset [59] (air conditioner, children playing, drilling, engine, gunshot, street music) with a total duration of five hours. Furthermore, training set noises were augmented by an additional 10 common noise types (ambience, babble, cafe, restaurant, street) collected from the Freesound website (http://www.freesound.org). To assess the generalization performance of the proposed approach to novel conditions, thirteen real-world noises from the DEMAND database [60] were included in the test set, recorded over a variety of environments (domestic, office, public, transportation, street, and nature). Four noise samples of the database had minimal effect on intelligibility and quality metrics even at very low signal-to-noise ratios (SNR) and thus were substituted by two common categories (speech-shaped noise and copy machine sound). All recordings were initially converted to single-channel audio clips, and were afterwards mixed with the clean speech signals of the test set. Audio files were in WAV format and were sampled at 16 kHz with a 16-bit depth.
Each speech sample is mixed in utterance level with a randomly selected noise segment, which was adjusted in level to reach a pre-defined long-term SNR, based on the energy of active speech regions. For the training and validation set, the SNR values were sampled from a uniform distribution between −5 and 3 dB. To assist the presentation and interpretability of the results, the test set audio mixtures were obtained in discrete SNRs {−8, −6, −4, −2, 0, 2, 4, 6} dB, following a uniform distribution. Silent regions longer than one second were detected and excluded/discarded from clean speech samples before the mixing procedure, using an energy-based voice activity detector.

Objective Evaluation Criteria
We evaluated the performance of the proposed approach in terms of intelligibility and perceptual quality based on four well-known objective metrics that reflect the aspects of normal and impaired auditory perception.
The enhancement of speech intelligibility for normal-hearing listeners is evaluated through the extended short-term objective intelligibility (eSTOI) measure. The eSTOI (ranging from 0 to 1) is able to objectively quantify the intelligibility of speech obscured by temporally modulated noises and noisy signals processed with time-frequency weighting for a group of normal-hearing listeners [61]. The eSTOI algorithm incorporates spectral correlation of short-time segments (384-ms) by comparing the energy-normalized spectrograms of processed and clean speech signals. Prior to the computation of the intelligibility score, pauses between sentences are removed from both reference and processed audio signals. The temporal envelopes of one-third octave frequency bands (15 frequency bands from 150 Hz to 4.3 kHz) are then approximated by summing the corresponding STFT coefficient energies.
The eSTOI measure is complemented by the full reference algorithm of the PESQ score, as defined in the ITU-T P.862 standard [62]. PESQ estimates the subjective mean opinion score for a group of normal-hearing listeners regarding the perceived audio quality over telephone networks, when degraded by speech or noise distortions. It ranges from −0.5 (or 1.0 in most cases) to 4.5 and is widely used to assess speech processing algorithms [2,21,56,63], indicating the speech quality measurement of enhanced speech.
Moreover, the hearing-aid speech perception index (HASPI) [64] and the hearing-aid sound quality index (HASQI) [65] are employed to objectively evaluate the model performance for listeners with various degrees of hearing loss. These metrics (both ranging from 0 to 1) rely on hearing profile-dependent auditory model representations to compute the long-term correlations between reference and processed speech signals. HASPI incorporates changes in the low-, mid-and high-intensity regions of the spectral envelope and the harmonic structure to construct a viable model of consonant and vowel perception for hearing-impaired listeners. On the other hand, the HASQI metric combines aspects of linear filtering and non-linear distortions found in a hearing device to measure the sound quality of a processed signal as perceived by a hearing aid user. In this study, six generic hearing loss profiles (Figure 2) are considered to simulate the evaluation of sound quality and perception for hearing-impaired listeners [66]. For both metrics, the reference signal presentation level was set to 65 dB SPL (corresponding to signal RMS value of 1), while clean, noisy, and enhanced speech signals were imposed to equal RMS values.
Electronics 2020, 9, x FOR PEER REVIEW 9 of 16 profiles ( Figure 2) are considered to simulate the evaluation of sound quality and perception for hearing-impaired listeners [66]. For both metrics, the reference signal presentation level was set to 65 dB SPL (corresponding to signal RMS value of 1), while clean, noisy, and enhanced speech signals were imposed to equal RMS values.

Experimental Setup
The experimental setup comprises three speech enhancement systems, namely the ENV, the ENV-TFS and the reference STFT system. The ENV-TFS system is trained to predict the mask representation ( = 1) of = 128 clean speech envelope features, as described in Section 2.2.2, by processing = 187 auditory coding features of noisy speech mixture (128 ENV features and 59 TFS features). The relative contribution of TFS features to the system performance is evaluated by assessing the model when having access to solely ENV information ( = = 128). Finally, the baseline STFT method employs the conventional IRM as a target and the log-scaled magnitude spectrogram (Equation (5)) as the model input. Features are obtained by the STFT of size 512 (31.25 ms) with a 50% overlap and the Hann window function. Frequency bins between 80 and 6000 Hz ( = = 256 ) are considered to match the ENV and ENV-TFS systems. The same model architecture is employed in all setups, to ensure a fair comparison between the three systems. The only difference lies in the linear output layer; the number of output units is determined by the desired output dimensionality . The source code of the experimental procedure along with the complete set of results on the TIMIT dataset is freely available at the dedicated online repository (https://doi.org/10.5281/zenodo.4028860).

Reconstruction from Envelope Spectrogram
The first 100 speech signals of the TIMIT test set were employed for the evaluation of the proposed analysis-synthesis framework. The intelligibility and quality of reconstructed acoustic waveforms are evaluated through the eSTOI and PESQ objective metrics, and are compared to the Typical hearing loss audiograms incorporated in the objective evaluation of sound quality and intelligibility using HASPI and HASQI measures. These include two high-frequency hearing loss audiograms (HL1: blue, HL2: orange), two mild hearing loss audiograms (HL3: yellow, HL4: purple), one moderate hearing loss (HL5: green), and one low-frequency hearing loss (HL6: cyan).

Experimental Setup
The experimental setup comprises three speech enhancement systems, namely the ENV, the ENV-TFS and the reference STFT system. The ENV-TFS system is trained to predict the M c mask representation (γ = 1) of K = 128 clean speech envelope features, as described in Section 2.2.2, by processing Λ = 187 auditory coding features of noisy speech mixture (128 ENV features and 59 TFS features). The relative contribution of TFS features to the system performance is evaluated by assessing the model when having access to solely ENV information (Λ = K = 128). Finally, the baseline STFT method employs the conventional IRM as a target and the log-scaled magnitude spectrogram (Equation (5)) as the model input. Features are obtained by the STFT of size 512 (31.25 ms) with a 50% overlap and the Hann window function. Frequency bins between 80 and 6000 Hz (Λ = K = 256) are considered to match the ENV and ENV-TFS systems. The same model architecture is employed in all setups, to ensure a fair comparison between the three systems. The only difference lies in the linear output layer; the number of output units is determined by the desired output dimensionality K. The source code of the experimental procedure along with the complete set of results on the TIMIT dataset is freely available at the dedicated online repository (https://doi.org/10.5281/zenodo.4028860).

Reconstruction from Envelope Spectrogram
The first 100 speech signals of the TIMIT test set were employed for the evaluation of the proposed analysis-synthesis framework. The intelligibility and quality of reconstructed acoustic waveforms are evaluated through the eSTOI and PESQ objective metrics, and are compared to the reference STFT-based analysis framework.
Results are depicted in Table 1, where three scenarios are considered. First, clean speech signals are transformed into time-frequency representations (denoted as ENV and STFT) and are synthesized back using the corresponding synthesis function. Results indicate that the proposed approach preserves intelligibility and quality of the reconstructed speech signal, with minimal loss compared to the perfectly-reconstructed speech signals obtained via the STFT. The second and third cases consider the ideal case where the speech enhancement model unerringly estimates the target feature representation, indicating the maximum perceptual gain that can be achieved in each case. To simulate this, the soft mask M c is applied to the noisy feature representation to obtain the ideally processed speech signals in the context of masking-based speech enhancement. In detail, ENV and STFT methods have a maximum capacity of increasing eSTOI by 34% and 36%, respectively, and speech quality by 1.52 and 1.87 PESQ units. Clipping the target mask M c to be between 0 and 1, with the use of the upper bound parameter γ, results in a small decrease in the maximum capacity of both methods. The above results provide a validation of the proposed envelope feature representation for speech enhancement. An auditory-motivated envelope representation is provided which is accommodated by sufficient speech reconstruction quality and intelligibility, close to the reference STFT analysis.

Objective Evaluation for Normal-Hearing Listeners
In this section, we evaluate the performance of the utilized GRU model in a typical masking-based speech enhancement setup, in terms of objective metrics for intelligibility and perceived quality for normal-hearing listeners. Table 2 shows the predicted intelligibility and quality results averaged over all SNRs and noises. It is apparent that all methods improved the eSTOI and PESQ scores relative to unprocessed noisy speech. Mean intelligibility scores were higher for the ENV-TFS method compared to the baseline STFT method, with improvements ranging from 2.5% to 3.5%. Marginally higher (0.02) mean PESQ scores are also observed across different SNR values.
The effect of TFS features on the model performance is assessed further. The ENV-TFS system led to improvements for eSTOI (3%) and PESQ (0.05) mean scores compared to the ENV system, using the same model architecture. The performance of the ENV system matches the STFT approach in most conditions for normal-hearing listeners. Moreover, ENV and ENV-TFS methods motivate a narrower receptive field (48 ms) compared to the best performing STFT model with a temporal receptive field of 176 ms (10 time-steps). The performance of the ENV-TFS approach across different SNRs can be depicted in Figure 3. Intelligibility and quality scores vary significantly across SNR values. Similar performance to seen and unseen SNR conditions is observed, indicating a strong generalization ability of the model. The utilized processing led to increased intelligibility for the 99.1% of speech utterances. Moreover, PESQ scores were improved for 99.9% of the samples. The ENV-TFS model provided better intelligibility results (eSTOI) than the STFT-based model for the 65% of test set samples, while no significant advantage can be reported for either method in terms of perceived quality for normal hearing-listeners (PESQ).  As shown in Figure 4, these results were consistent across all noise types. The utilized GRUbased speech processing facilitated a solid increase in objective intelligibility for the majority of noise conditions. In most cases, the proposed approach performed considerably better than the STFT-based model. Performance gains for the proposed approach were evident in highly modulated acoustic environments (cafeteria, town square, living room, copy machine). At the same time, marginal improvements over the baseline system are observed in noises of a steadier nature (Subway, bus, river) and in SNRs where the original signals already feature adequate intelligibility (>0.75). As shown in Figure 4, these results were consistent across all noise types. The utilized GRU-based speech processing facilitated a solid increase in objective intelligibility for the majority of noise conditions. In most cases, the proposed approach performed considerably better than the STFT-based model. Performance gains for the proposed approach were evident in highly modulated acoustic environments (cafeteria, town square, living room, copy machine). At the same time, marginal improvements over the baseline system are observed in noises of a steadier nature (Subway, bus, river) and in SNRs where the original signals already feature adequate intelligibility (>0.75).

Objective Evaluation for Hearing-Impaired Listeners
The performance of speech enhancement approaches is evaluated for six typical hearing loss configurations (presented in Figure 2). The processed and unprocessed noisy utterances of the TIMIT test set were compared to the corresponding clean speech samples via HASPI and HASQI models. Table 3 shows the objective intelligibility (HASPI) and quality (HASQI) values between the proposed ENV-TFS system and the conventional STFT system. Mean values of both HASPI and HASQI measures were improved in all test conditions. In detail, simulation results indicate that ENV-TFS approach yields better results for HL1, HL3 and HL6 compared to the STFT approach. In contrast, results on the hearing loss configurations HL2 and HL4 indicate similar or slightly lower performance of the proposed speech enhancement system. Lower results were obtained for the HL5 configuration. The proposed algorithm improved HASPI and HASQI values for the 94.6% and 98.2% of test set samples, respectively. Predicted intelligibility results of the ENV-TFS system were higher than the STFT for the 55% of test set samples, while HASQI values were lower for the 54% of test set samples.  Moreover, the HASPI values across different SNR conditions can be depicted in Figure 5. Processed signals yield improved intelligibility scores by 14-30% across different SNR values. The total values were slightly higher at SNRs of 0, −2, and −4 dB for both ENV-TFS and STFT methods. In addition, the generalization ability of the models to unseen SNR condition is also validated, as HASPI and HASQI differences were not significant. Finally, the above results indicate that the conventional and the proposed methods were comparable at this stage and the performance of each approach is highly dependent on the degree and configuration of the simulated audiograms.

Discussion
In this study, we investigated the performance of deep learning-based speech enhancement algorithms on vocoder-based processing of speech. The method introduces a novel feature set on the basis of recent findings on peripheral auditory processing and embeds these attributes into an effective speech analysis/synthesis framework. Features are designed to capture temporal modulation information from sub-band audio signals and extract low-dimensionality attributes from short-time audio segments. The processing method applies to causal inference systems, while providing an alternative to conventional spectrum-wise speech analysis frameworks. It features sufficiently low latency on inference, even with no available hardware acceleration for the implemented digital filters. The computational processing time was slightly below T/2, where T is the duration of an audio segment, running on a standard personal computer equipped with the Intel core i7-7700 processor (3.6 GHz base frequency).
A simple two-layer GRU-based neural network architecture was employed to model the temporal envelope dynamics of degraded speech signals. The network accepts temporal envelope and fine structure features for a short-time segment. Then, it computes the mask vector for the low frequency amplitude modulations of speech. A generalization of the widely used IRM is exploited on this account, adapting to arbitrary feature representations. This target is intuitive and general, as it can be adopted by any algorithm that concerns masking-based audio source separation and enhancement. In addition, the GRU architecture offers advantages over alternative architectures, such as fully-connected deep neural networks or long short-term memory networks, due to the fewer learned parameters for the same receptive field, a more stable training procedure, and good generalization to novel noises. Nevertheless, speech enhancement performance was not substantially affected by the employed architecture, given proper network configuration. The deployed GRU model architecture consists of two hidden layers, each with a size of 512. This model provided superior performance for all employed representations, while pairwise differences of the proposed and baseline systems were not significantly affected by the model hyperparameters. Thus, a high level of consistency in the results was observed across different experimental model parameterizations.
Based on our experimental results, it can be concluded that gated recurrent unit neural network architectures have indeed potential to increase speech intelligibility and perceived quality in a wide range of real-world conditions, given a modest amount of training data and sufficient computational power. To our knowledge, this method is the first approach based on established auditory-motivated methods that provides an alternative to raw waveform or spectrum-wise processing for deep learning-based speech enhancement. The proposed temporal auditory-motivated features effectively encode both slow and fast amplitude modulations, providing thus valuable information to the speech enhancement models, which are otherwise obscured in usual spectral representations. In the future, dynamic processing of both low-frequency and high-frequency amplitude modulations of speech, i.e., the temporal fine structure of degraded speech signals could enhance the proposed methodology and provide an intuitive and complete speech analysis framework based on modulation cues.

Conclusions
In this study, we propose a novel set of auditory-motivated features for single-channel speech enhancement by fusing temporal envelope and temporal fine structure information in the context of vocoder-like processing. In order to investigate the potential of temporal auditory coding features on the enhancement of speech intelligibility, we employ a GRU neural network to recover the low-frequency amplitude modulations of speech. Experimental results showed that the proposed features achieved improvements compared to the conventional log-scaled magnitude spectrogram, in terms of objective metrics for intelligibility and perceptual quality of normal-hearing listeners. Mixed results were observed regarding simulated listeners with hearing loss. Both the proposed and standard approaches significantly increased the personalized objective metrics for intelligibility and perceived quality, while the expediency of each method is postulated to mainly depend on the audiogram type. Finally, the exploited framework applies to causal speech processing systems, as it provides valuable feature representation with sufficient intelligibility and quality of reconstructed speech signals.